cg

annotate grant-oldtext.txt @ 109:a6b99bc50476

.
author bshanks@bshanks.dyndns.org
date Thu Apr 23 03:12:01 2009 -0700
parents 395faa66383e
children

rev   line source
bshanks@15 1
bshanks@15 2 ------
bshanks@15 3
bshanks@15 4
bshanks@15 5
bshanks@15 6 Massive new datasets obtained with techniques such as in situ hybridization (ISH) and BAC-transgenics allow the expression levels of many genes at many locations to be compared. This can be used to find marker genes for specific anatomical structures, as well as to draw new anatomical maps. Our goal is to develop automated methods to relate spatial variation in gene expression to anatomy. We have five specific aims:
bshanks@15 7
bshanks@15 8 (1) develop an algorithm to screen spatial gene expression data for combinations of marker genes which selectively target individual anatomical structures
bshanks@15 9 (2) develop an algorithm to screen spatial gene expression data for combinations of marker genes which can be used to delineate most of the boundaries between a number of anatomical structures at once
bshanks@15 10 (3) develop an algorithm to suggest new ways of dividing a structure up into anatomical subregions, based on spatial patterns in gene expression
bshanks@15 11 (4) create a flat (2-D) map of the mouse cerebral cortex that contains a flattened version of the Allen Mouse Brain Atlas ISH dataset, as well as the boundaries of anatomical areas within the cortex. For each cortical layer, a layer-specific flat dataset will be created. A single combined flat dataset will be created which averages information from all of the layers. These datasets will be made available in both MATLAB and Caret formats.
bshanks@15 12 (5) validate the methods developed in (1), (2) and (3) by applying them to the cerebral cortex datasets created in (4)
bshanks@15 13
bshanks@15 14 All algorithms that we develop will be implemented in an open-source software toolkit. The toolkit, as well as the machine-readable datasets developed in aim (4) and any other intermediate dataset we produce, will be published and freely available for others to use.
bshanks@15 15
bshanks@15 16 In addition to developing generally useful methods, the application of these methods to cerebral cortex will produce immediate benefits that are only one step removed from clinical application, while also supporting the development of new neuroanatomical techniques. The method developed in aim (1) will be applied to each cortical area to find a set of marker genes. Currently, despite the distinct roles of different cortical areas in both normal functioning and disease processes, there are no known marker genes for many cortical areas. Finding marker genes will be immediately useful for drug discovery as well as for experimentation because once marker genes for an area are known, interventions can be designed which selectively target that area.
bshanks@15 17
bshanks@15 18
bshanks@15 19
bshanks@15 20
bshanks@15 21
bshanks@15 22
bshanks@15 23
bshanks@15 24 The method developed in aim (2) will be used to find a small panel of genes that can find most of the boundaries between areas in the cortex. Today, finding cortical areal boundaries in a tissue sample is a manual process that requires a skilled human to combine multiple visual cues over a large area of the cortical surface. A panel of marker genes will allow the development of an ISH protocol that will allow experimenters to more easily identify which anatomical areas are present in small samples of cortex.
bshanks@15 25
bshanks@15 26
bshanks@15 27
bshanks@15 28
bshanks@15 29
bshanks@15 30
bshanks@15 31
bshanks@15 32
bshanks@15 33
bshanks@15 34
bshanks@15 35
bshanks@15 36
bshanks@15 37 For each cortical layer, a layer-specific flat dataset will be created. A single combined flat dataset will be created which averages information from all of the layers. These datasets will be made available in both MATLAB and Caret formats.
bshanks@15 38
bshanks@15 39
bshanks@15 40
bshanks@15 41
bshanks@15 42
bshanks@15 43
bshanks@15 44
bshanks@15 45
bshanks@15 46 ----
bshanks@15 47
bshanks@15 48
bshanks@15 49
bshanks@15 50 New techniques allow the expression levels of many genes at many locations to be compared. It is thought that even neighboring anatomical structures have different gene expression profiles. We propose to develop automated methods to relate the spatial variation in gene expression to anatomy. We will develop two kinds of techniques:
bshanks@15 51
bshanks@15 52 (a) techniques to screen for combinations of marker genes which selectively target anatomical structures
bshanks@15 53 (b) techniques to suggest new ways of dividing a structure up into anatomical subregions, based on the shapes of contours in the gene expression
bshanks@15 54
bshanks@15 55 The first kind of technique will be helpful for finding marker genes associated with known anatomical features. The second kind of technique will be helpful in creating new anatomical maps, maps which reflect differences in gene expression the same way that existing maps reflect differences in histology.
bshanks@15 56
bshanks@15 57 We intend to develop our techniques using the adult mouse cerebral cortex as a testbed. The Allen Brain Atlas has collected a dataset containing the expression level of about 4000 genes* over a set of over 150000 voxels, with a spatial resolution of approximately 200 microns\cite{lein_genome-wide_2007}.
bshanks@15 58
bshanks@15 59 We expect to discover sets of marker genes that pick out specific cortical areas. This will allow the development of drugs and other interventions that selectively target individual cortical areas. Therefore our research will lead to application in drug discovery, in the development of other targeted clinical interventions, and in the development of new experimental techniques.
bshanks@15 60
bshanks@15 61 The best way to divide up rodent cortex into areas has not been completely determined, as can be seen by the differences in the recent maps given by Swanson on the one hand, and Paxinos and Franklin on the other. It is likely that our study, by showing which areal divisions naturally follow from gene expression data, as opposed to traditional histological data, will contribute to the creation of a better map. While we do not here propose to analyze human gene expression data, it is conceivable that the methods we propose to develop could be used to suggest modifications to the human cortical map as well.
bshanks@15 62
bshanks@15 63
bshanks@15 64 In the following, we will only be talking about coronal data.
bshanks@15 65
bshanks@15 66 The Allen Brain Atlas provides "Smoothed Energy Volumes", which are
bshanks@15 67
bshanks@15 68
bshanks@15 69 One type of artifact in the Allen Brain Atlas data is what we call a "slice artifact". We have noticed two types of slice artifacts in the dataset. The first type, a "missing slice artifact", occurs when the ISH procedure on a slice did not come out well. In this case, the Allen Brain investigators excluded the slice at issue from the dataset. This means that no gene expression information is available for that gene for the region of space covered by that slice. This results in an expression level of zero being assigned to voxels covered by the slice. This is partially but not completely ameliorated by the smoothing that is applied to create the Smoothed Energy Volumes. The usual end result is that a region of space which is shaped and oriented like a coronal slice is marked as having less gene expression than surrounding regions.
bshanks@15 70
bshanks@15 71 The second type of slice artifact is caused by the fact that all of the slices have a consistent orientation. Since there may be artifacts (such as how well the ISH worked) which are constant within each slice but which vary between different slices, the result is that ceteris paribus, when one compares the genetic data of a voxel to another voxel within the same coronal plane, one would expect to find more similarity than if one compared a voxel to another voxel displaced along the rostrocaudal axis.
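
As an illustration of the first type of artifact, here is a minimal sketch of how missing-slice artifacts might be flagged. It assumes the expression volume for one gene is a NumPy array whose first axis is the rostrocaudal axis. The function name `find_missing_slices` and the `rel_threshold` parameter are hypothetical and are not part of any Allen Brain Atlas tool.

```python
import numpy as np

def find_missing_slices(volume, axis=0, rel_threshold=0.25):
    """Return indices of planes whose mean expression is much lower than
    the average of their two neighbouring planes along `axis`.

    Assumption: a missing slice shows up as a plane of unusually low
    expression sandwiched between planes with normal expression.
    """
    other_axes = tuple(i for i in range(volume.ndim) if i != axis)
    plane_means = volume.mean(axis=other_axes)
    suspects = []
    # Skip the first and last planes, which have only one neighbour.
    for i in range(1, len(plane_means) - 1):
        neighbour_mean = 0.5 * (plane_means[i - 1] + plane_means[i + 1])
        if neighbour_mean > 0 and plane_means[i] < rel_threshold * neighbour_mean:
            suspects.append(i)
    return suspects
```

The threshold is a tuning choice; because smoothing spreads some signal into a missing slice, a relative rather than absolute cutoff seems safer.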
bshanks@15 72
bshanks@15 73
bshanks@15 74
bshanks@15 75
bshanks@108 76 We are enthusiastic about the sharing of methods, data, and results, and at the conclusion of the project, we will make all of our data and computer source code publicly available, either in supplemental attachments to publications, or on a website. Our goal is that replicating our results, or applying the methods we develop to other targets, will be quick and easy for other investigators. In order to aid in understanding and replicating our results, we intend to include a software program which, when run, will take as input the Allen Brain Atlas raw data, and produce as output all numbers and charts found in publications resulting from the project.
bshanks@15 77
bshanks@15 78
bshanks@15 79
bshanks@15 80
bshanks@15 81 We also expect to weigh in on the debate about how to best partition rodent cortex
bshanks@15 82
bshanks@15 83
bshanks@15 84
bshanks@15 85 be useful for drug discovery as well
bshanks@15 86
bshanks@15 87
bshanks@15 88
bshanks@15 89 * Another 16000 genes are available, but they do not cover the entire cerebral cortex with high spatial resolution.
bshanks@15 90
bshanks@15 91
bshanks@15 92 User-definable ROIs
bshanks@15 93 Combinatorial gene expression
bshanks@15 94 Negative as well as positive signal
bshanks@15 95 Use geometry
bshanks@15 96 Search for local boundaries if necessary
bshanks@15 97 Flatmapped
bshanks@15 98
bshanks@15 99
bshanks@15 100
bshanks@15 101
bshanks@15 102
bshanks@15 103
bshanks@15 104 == Specific aims ==
bshanks@15 105
bshanks@15 106 ==== Develop algorithms that find genetic markers for anatomical regions ====
bshanks@15 107 # Develop scoring measures for evaluating how good individual genes are at marking areas: we will compare pointwise, geometric, and information-theoretic measures.
bshanks@15 108 # Develop a procedure to find single marker genes for anatomical regions: for each cortical area, by using or combining the scoring measures developed, we will rank the genes by their ability to delineate each area.
bshanks@15 109 # Extend the procedure to handle difficult areas by using combinatorial coding: for areas that cannot be identified by any single gene, identify them with a handful of genes. We will consider both (a) algorithms that incrementally/greedily combine single gene markers into sets, such as forward stepwise regression and decision trees, and also (b) supervised learning techniques which use soft constraints to minimize the number of features, such as sparse support vector machines.
bshanks@15 110 # Extend the procedure to handle difficult areas by combining or redrawing the boundaries: An area may be difficult to identify because the boundaries are misdrawn, or because it does not "really" exist as a single area, at least on the genetic level. We will develop extensions to our procedure which (a) detect when a difficult area could be fit if its boundary were redrawn slightly, and (b) detect when a difficult area could be combined with adjacent areas to create a larger area which can be fit.
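
The greedy option mentioned in the list above (forward stepwise regression) can be sketched as follows. This is only an illustration under stated assumptions: `expr` is a hypothetical voxels-by-genes matrix, `in_area` is a hypothetical boolean vector marking the voxels of one area, and the selection criterion (residual sum of squares of a least-squares fit) is one of several reasonable choices.

```python
import numpy as np

def forward_stepwise(expr, in_area, max_genes=3):
    """Greedily pick genes whose linear combination best predicts
    membership in an area (forward stepwise regression)."""
    n_voxels, n_genes = expr.shape
    target = in_area.astype(float)
    selected = []
    for _ in range(min(max_genes, n_genes)):
        best_gene, best_rss = None, np.inf
        for g in range(n_genes):
            if g in selected:
                continue
            # Design matrix: intercept plus already-selected genes plus candidate.
            design = np.column_stack([np.ones(n_voxels), expr[:, selected + [g]]])
            coef, *_ = np.linalg.lstsq(design, target, rcond=None)
            rss = float(np.sum((design @ coef - target) ** 2))
            if rss < best_rss:
                best_gene, best_rss = g, rss
        selected.append(best_gene)
    return selected
```

In the usage below, neither of the first two genes covers the area alone, but together they do, so the greedy search ends up with that pair.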
bshanks@15 111
bshanks@15 112
bshanks@15 113 ==== Apply these algorithms to the cortex ====
bshanks@15 114 # Create open source format conversion tools: we will create tools to bulk download the ABA dataset and to convert between SEV, NIFTI and MATLAB formats.
bshanks@15 115 # Flatmap the ABA cortex data: map the ABA data onto a plane and draw the cortical area boundaries onto it.
bshanks@15 116 # Find layer boundaries: cluster similar voxels together in order to automatically find the cortical layer boundaries.
bshanks@15 117 # Run the procedures that we developed on the cortex: we will present, for each area, a short list of markers to identify that area; and we will also present lists of "panels" of genes that can be used to delineate many areas at once.
bshanks@15 118
bshanks@15 119 ==== Develop algorithms to suggest a division of a structure into anatomical parts ====
bshanks@15 120 # Explore dimensionality reduction algorithms applied to pixels: including TODO
bshanks@15 121 # Explore dimensionality reduction algorithms applied to genes: including TODO
bshanks@15 122 # Explore clustering algorithms applied to pixels: including TODO
bshanks@15 123 # Explore clustering algorithms applied to genes: including gene shaving, TODO
bshanks@15 124 # Develop an algorithm to use dimensionality reduction and/or hierarchical clustering to create anatomical maps
bshanks@15 125 # Run this algorithm on the cortex: present a hierarchical, genoarchitectonic map of the cortex
bshanks@15 126
bshanks@15 127
bshanks@15 128
bshanks@15 129
bshanks@15 130
bshanks@15 131
bshanks@15 132
bshanks@15 133
bshanks@15 134
bshanks@15 135 gradient similarity is calculated as:
bshanks@15 136 \sum_{pixels} \cos\left( \left\vert \angle \nabla_1 - \angle \nabla_2 \right\vert \right) \cdot \frac{\vert \nabla_1 \vert + \vert \nabla_2 \vert}{2} \cdot \frac{pixel\_value_1 + pixel\_value_2}{2}
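
A minimal sketch of the gradient similarity formula above, assuming that \nabla_1 and \nabla_2 are the spatial gradients of two 2-D images (for example, two flattened expression maps) and that pixel_value_1 and pixel_value_2 are the corresponding pixel intensities. The function name and the use of `np.gradient` for the finite-difference gradient are assumptions made for illustration.

```python
import numpy as np

def gradient_similarity(img1, img2):
    """Sum over pixels of cos(angle difference) * mean gradient magnitude
    * mean pixel value, following the formula in the text."""
    img1 = np.asarray(img1, dtype=float)
    img2 = np.asarray(img2, dtype=float)
    # np.gradient returns derivatives along axis 0 (rows) then axis 1 (columns).
    gy1, gx1 = np.gradient(img1)
    gy2, gx2 = np.gradient(img2)
    angle1 = np.arctan2(gy1, gx1)
    angle2 = np.arctan2(gy2, gx2)
    direction_term = np.cos(np.abs(angle1 - angle2))
    magnitude_term = (np.hypot(gx1, gy1) + np.hypot(gx2, gy2)) / 2.0
    intensity_term = (img1 + img2) / 2.0
    return float(np.sum(direction_term * magnitude_term * intensity_term))
```

Two images whose gradients point the same way score positively, and images whose gradients oppose each other score negatively, which is what the cosine term is for.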
bshanks@15 137
bshanks@15 138
bshanks@15 139
bshanks@15 140
bshanks@15 141
bshanks@15 142
bshanks@15 143
bshanks@15 144 (todo) Technically, we say that an anatomical structure has a fundamentally 2-D organization when there exists a commonly used, generic, anatomical structure-preserving map from 3-D space to a 2-D manifold.
bshanks@15 145
bshanks@15 146
bshanks@15 147 Related work:
bshanks@15 148
bshanks@15 149
bshanks@15 150 The Allen Brain Institute has developed an interactive web interface called AGEA which allows an investigator to (1) calculate lists of genes which are selectively overexpressed in certain anatomical regions (ABA calls this the "Gene Finder" function), (2) visualize the correlation between the genetic profiles of voxels in the dataset, and (3) visualize a hierarchical clustering of voxels in the dataset \cite{ng_anatomic_2009}. AGEA is an impressive and useful tool; however, it does not solve the same problems that we propose to solve with this project.
bshanks@15 151
bshanks@15 152 First we describe AGEA's "Gene Finder", and then compare it to our proposed method for finding marker genes. AGEA's Gene Finder first asks the investigator to select a single "seed voxel" of interest. It then uses a clustering method, combined with built-in knowledge of major anatomical structures, to select two sets of voxels: an "ROI" and a "comparator region"*. The seed voxel is always contained within the ROI, and the ROI is always contained within the comparator region. The comparator region is similar but not identical to the set of voxels making up the major anatomical region containing the ROI. Gene Finder then looks for genes which can distinguish the ROI from the comparator region. Specifically, it finds genes for which the ratio (expression energy in the ROI) / (expression energy in the comparator region) is high.
bshanks@15 153
bshanks@15 154 Informally, the Gene Finder first infers an ROI based on clustering the seed voxel with other voxels. Then, the Gene Finder finds genes which overexpress in the ROI as compared to other voxels in the major anatomical region.
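
To make the ratio concrete, here is a small sketch of an expression energy ratio ranking. It is not AGEA's implementation: the array layout (voxels by genes), the use of the mean as the summary of "expression energy" in a region, and the names `energy_ratio` and `rank_genes` are all assumptions for illustration.

```python
import numpy as np

def energy_ratio(expr, roi_mask, comparator_mask, eps=1e-9):
    """Per-gene ratio of mean expression in the ROI to mean expression
    in the comparator region. `eps` avoids division by zero."""
    roi_mean = expr[roi_mask].mean(axis=0)
    comparator_mean = expr[comparator_mask].mean(axis=0)
    return roi_mean / (comparator_mean + eps)

def rank_genes(expr, roi_mask, comparator_mask):
    """Gene indices sorted from highest to lowest ratio."""
    ratios = energy_ratio(expr, roi_mask, comparator_mask)
    return np.argsort(-ratios).tolist()
```

Note that a gene which is absent from the ROI but present everywhere else gets a ratio near zero and falls to the bottom of this ranking, even though its absence could serve as a marker.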
bshanks@15 155
bshanks@15 156 There are three major differences between our approach and Gene Finder.
bshanks@15 157
bshanks@15 158 First, Gene Finder focuses on individual genes and individual ROIs in isolation. This works well for regions which can be picked out from all other regions by a single gene, but not all regions can (todo). There are at least two ways this can miss useful genes. First, a gene might be expressed in part of a region but not throughout the whole region, while another gene is expressed in the rest of the region**. Second, a gene might be expressed in a region and in none of its neighbors, yet also be expressed in other, non-neighboring regions. To take advantage of these types of genes, we propose to find combinations of genes which, together, can identify the boundaries of all subregions within the containing region.
bshanks@15 159
bshanks@15 160 Second, Gene Finder uses a pointwise metric, namely expression energy ratio, to decide whether a gene is good for picking out a region. We have found better results by using metrics which take into account not just single voxels, but also the local geometry of neighboring voxels, such as the local gradient (todo). In addition, we have found that often the absence of gene expression can be used as a marker, which will not be caught by Gene Finder's expression energy ratio (todo).
bshanks@15 161
bshanks@15 162 Third, Gene Finder chooses the ROI based only on the seed voxel. This often does not permit the user to query the ROI that they are interested in. For example, in all of our tests of Gene Finder in cortex, the ROIs chosen tend to be cortical layers, rather than cortical areas.
bshanks@15 163
bshanks@15 164 In summary, when Gene Finder picks the ROI that you want, and when this ROI can be easily picked out from neighboring regions by single genes which selectively overexpress in the ROI compared to the entire major anatomical region, Gene Finder will work. However, Gene Finder will not pick cortical areas as ROIs, and even if it could, many cortical areas cannot be uniquely picked out by the overexpression of any single gene. By contrast, we will target cortical areas, we will explore a variety of metrics which can complement the shortcomings of expression energy ratio, and we will use the combinatorial expression of genes to pick out cortical areas even when no individual gene will do.
bshanks@15 165
bshanks@15 166
bshanks@15 167 * The terms "ROI" and "comparator region" are our own; the ABI calls them the "local region" and the "larger anatomical context". The ABI uses the term "specificity comparator" to mean the major anatomic region containing the ROI, which is not exactly identical to the comparator region.
bshanks@15 168
bshanks@15 169 ** In this case, the union of the area of expression of the two genes would suffice; one could also imagine that there could be situations in which the intersection of multiple genes would be needed, or a combination of unions and intersections.
bshanks@15 170
bshanks@15 171
bshanks@15 172 Now we describe AGEA's hierarchical clustering, and compare it to our proposal. The goal of AGEA's hierarchical clustering is to generate a binary tree of clusters, where a cluster is a collection of voxels. AGEA begins by computing the Pearson correlation between each pair of voxels. They then employ a recursive divisive (top-down) hierarchical clustering procedure on the voxels, which means that they start with all of the voxels, and then they divide them into clusters, and then within each cluster, they divide that cluster into smaller clusters, etc***. At each step, the collection of voxels is partitioned into two smaller clusters in a way that maximizes the following quantity: average correlation between all possible pairs of voxels containing one voxel from each cluster.
bshanks@15 173
bshanks@15 174 There are three major differences between our approach and AGEA's hierarchical clustering. First, AGEA's clustering method separates cortical layers before it separates cortical areas.
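
The quantity described in the paragraph above can be computed for any candidate two-way partition as sketched here. This only evaluates a given split; it does not reproduce AGEA's search over partitions or its subsampling. The voxels-by-genes layout of `expr` and the function name are assumptions.

```python
import numpy as np

def between_cluster_mean_correlation(expr, labels):
    """Average Pearson correlation over all voxel pairs that take one
    voxel from cluster 0 and one voxel from cluster 1.

    expr:   array of shape (n_voxels, n_genes)
    labels: array of 0/1 cluster assignments, one per voxel
    """
    labels = np.asarray(labels)
    corr = np.corrcoef(expr)  # rows are treated as variables (voxels)
    in_a = labels == 0
    in_b = labels == 1
    return float(corr[np.ix_(in_a, in_b)].mean())
```

Comparing this value across candidate splits is one way to see which voxels a correlation-based criterion tends to group together.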
bshanks@15 175
bshanks@15 176
bshanks@15 177
bshanks@15 178
bshanks@15 179
bshanks@15 180 following procedure is used for the purpose of dividing a collection of voxels into smaller clusters: partition the voxels into two sets, such that the following quantity is maximized:
bshanks@15 181
bshanks@15 182 *** depending on which level of the tree is being created, the voxels are subsampled in order to save time
bshanks@15 183
bshanks@15 184
bshanks@15 185
bshanks@15 186
bshanks@15 187
bshanks@15 188 does not allow the user to input anything other than a seed voxel; this means that for each seed voxel, there is only one
bshanks@15 189
bshanks@15 190
bshanks@15 191
bshanks@15 192 The role of the "local region" is to serve as a region of interest for which marker genes are desired; the role of the "larger anatomical context" is to be the structure
bshanks@15 193
bshanks@15 194
bshanks@15 195
bshanks@15 196 There are two kinds of differences between AGEA and our project; differences that relate to the treatment of the cortex, and differences in the type of generalizable methods being developed. As relates
bshanks@15 197
bshanks@15 198
bshanks@15 199 indicate an ROI
bshanks@15 200
bshanks@15 201 explore simple correlation-based relationships between voxels, genes, and clusters of voxels.
bshanks@15 202
bshanks@15 203
bshanks@15 204 There have not yet been any studies which describe the results of applying AGEA to the cerebral cortex; however, we suspect that the AGEA metrics are not optimal for the task of relating genes to cortical areas. A voxel's gene expression profile depends upon both its cortical area and its cortical layer; however, AGEA has no mechanism to distinguish these two. As a result, voxels in the same layer but different areas are often clustered together by AGEA. As part of the project, we will compare the performance of our techniques against AGEA's.
bshanks@15 205
bshanks@15 206 ---
bshanks@15 207
bshanks@15 208 The Allen Brain Institute has developed interactive tools called AGEA which allow an investigator to explore simple correlation-based relationships between voxels, genes, and clusters of voxels. There have not yet been any studies which describe the results of applying AGEA to the cerebral cortex; however, we suspect that the AGEA metrics are not optimal for the task of relating genes to cortical areas. A voxel's gene expression profile depends upon both its cortical area and its cortical layer; however, AGEA has no mechanism to distinguish these two. As a result, voxels in the same layer but different areas are often clustered together by AGEA. As part of the project, we will compare the performance of our techniques against AGEA's.
bshanks@15 209
bshanks@15 210 Another difference between our techniques and AGEA's is that AGEA allows the user to enter only a voxel location, and then to either explore the rest of the brain's relationship to that particular voxel, or explore a partitioning of the brain based on pairwise voxel correlation. If the user is interested not in a single voxel, but rather an entire anatomical structure, AGEA will only succeed to the extent that the selected voxel is a typical representative of the structure. As discussed in the previous paragraph, this poses problems for structures like cortical areas, which (because of their division into cortical layers) do not have a single "typical representative".
bshanks@15 211
bshanks@15 212 By contrast, in our system, the user will start by selecting, not a single voxel, but rather, an anatomical superstructure to be divided into pieces (for example, the cerebral cortex). We expect that our methods will take into account not just pairwise statistics between voxels, but also large-scale geometric features (for example, the rapidity of change in gene expression as regional boundaries are crossed) which optimize the discriminability of regions within the selected superstructure.
bshanks@15 213
bshanks@15 214
bshanks@15 215 -----
bshanks@15 216
bshanks@15 217 screen for combinations of marker genes which selectively target anatomical structures
bshanks@15 218 pick delineate the boundaries between neighboring anatomical structures.
bshanks@15 219 (b) techniques to screen for marker genes which pick out anatomical structures of interest
bshanks@15 220
bshanks@15 221 , techniques which: (a) screen for marker genes , and (b) suggest new anatomical maps based on
bshanks@15 222
bshanks@15 223
bshanks@15 224 whose expression partitions the region of interest into its anatomical substructures, and (b) use the natural contours of gene expression to suggest new ways of dividing an organ into
bshanks@15 225
bshanks@15 226
bshanks@15 227 The Allen Brain Atlas
bshanks@15 228
bshanks@15 229
bshanks@15 230
bshanks@15 231
bshanks@15 232 --
bshanks@15 233
bshanks@15 234 to: brooksl@mail.nih.gov
bshanks@15 235
bshanks@15 236 Hi, I'm writing to confirm the applicability of a potential research
bshanks@15 237 project to the challenge grant topic "New computational and
bshanks@15 238 statistical methods for the analysis of large
bshanks@15 239 data sets from next-generation sequencing technologies".
bshanks@15 240
bshanks@15 241 We want to develop methods for the analysis of gene expression
bshanks@15 242 datasets that can be used to uncover the relationships between gene
bshanks@15 243 expression and anatomical regions. Specifically, we want to develop
bshanks@15 244 techniques to (a) given a set of known anatomical areas, identify
bshanks@15 245 genetic markers for each of these areas, and (b) given an anatomical structure
bshanks@15 246 whose substructure is unknown, suggest a map, that is, a division of
bshanks@15 247 the space into anatomical sub-structures, that represents the
bshanks@15 248 boundaries inherent in the gene expression data.
bshanks@15 249
bshanks@15 250 We propose to develop our techniques on the Allen Brain
bshanks@15 251 Atlas mouse brain gene expression dataset by finding genetic markers
bshanks@15 252 for anatomical areas within the cerebral cortex. The Allen Brain Atlas
bshanks@15 253 contains a registered 3-D map of gene expression data with 200-micron
bshanks@15 254 voxel resolution which was created from in situ hybridization
bshanks@15 255 data. The dataset contains about 4000 genes which are available at
bshanks@15 256 this resolution across the entire cerebral cortex.
bshanks@15 257
bshanks@15 258 Despite the distinct roles of different cortical
bshanks@15 259 areas in both normal functioning and disease processes, there are no
bshanks@15 260 known marker genes for many cortical areas. This project will be
bshanks@15 261 immediately useful for both drug discovery and clinical research
bshanks@15 262 because once the markers are known, interventions can be designed
bshanks@15 263 which selectively target specific cortical areas.
bshanks@15 264
bshanks@15 265 The techniques we develop will be useful because they will be
bshanks@15 266 applicable to the analysis of other anatomical areas, both in
bshanks@15 267 terms of finding marker genes for known areas, and in terms of
bshanks@15 268 suggesting new anatomical subdivisions that are based upon the gene
bshanks@15 269 expression data.
bshanks@15 270
bshanks@15 271
bshanks@15 272
bshanks@15 273 ----
bshanks@15 274
bshanks@15 275
bshanks@15 276
bshanks@15 277
bshanks@15 278
bshanks@15 279
bshanks@15 280 It is likely that our study, by showing which areal divisions naturally follow from gene expression data, as opposed to traditional histological data, will contribute to the creation of
bshanks@15 281
bshanks@15 282 there are clear genetic or chemical markers known for only a few cortical areas. This makes it difficult to target drugs to specific
bshanks@15 283
bshanks@15 284 As part of aims (1) and (5), we will discover sets of marker genes that pick out specific cortical areas. This will allow the development of drugs and other interventions that selectively target individual cortical areas. As part of aims (2) and (5), we will also discover small panels of marker genes that can be used to delineate most of the cortical areal map.
bshanks@15 285
bshanks@15 286
bshanks@15 287
bshanks@15 288 With aims (2) and (4), we
bshanks@15 289
bshanks@15 290 There are five principles
bshanks@15 291
bshanks@15 292
bshanks@15 293
bshanks@15 294 In addition to validating the usefulness of the algorithms, the application of these methods to cerebral cortex will produce immediate benefits that are only one step removed from clinical application.
bshanks@15 295
bshanks@15 296
bshanks@15 297 todo: remember to check gensat, etc for validation (mention bias/variance)
bshanks@15 298
bshanks@15 299
bshanks@15 300
bshanks@15 301 === Why it is useful to apply these methods to cortex ===
bshanks@15 302
bshanks@15 303
bshanks@15 304 There is still room for debate as to exactly how the cortex should be parcellated into areas.
bshanks@15 305
bshanks@15 306
bshanks@15 307 The best way to divide up rodent cortex into areas has not been completely determined,
bshanks@15 308
bshanks@15 309
bshanks@15 310 not yet been accounted for in
bshanks@15 311
bshanks@15 312 that the expression of some genes will contain novel spatial patterns which are not account
bshanks@15 313
bshanks@15 314 that a genoarchitectonic map
bshanks@15 315
bshanks@15 316
bshanks@15 317 This principle is only applicable to aim 1 (marker genes). For aim 2 (partition a structure into anatomical subregions), we plan to work with many genes at once.
bshanks@15 318
bshanks@15 319
bshanks@15 320 todo: aim 2 b+s?
bshanks@15 321
bshanks@15 322
bshanks@15 323
bshanks@15 324
bshanks@15 325 ==== Principle 5: Interoperate with existing tools ====
bshanks@15 326
bshanks@15 327 In order for our software to be as useful as possible for our users, it will be able to import and export data to standard formats so that users can use our software in tandem with other software tools created by other teams. We will support the following formats: NIFTI (Neuroimaging Informatics Technology Initiative), SEV (Allen Brain Institute Smoothed Energy Volume), and MATLAB. This ensures that our users will not have to exclusively rely on our tools when analyzing data. For example, users will be able to use the data visualization and analysis capabilities of MATLAB and Caret alongside our software.
bshanks@15 328
bshanks@15 329 To our knowledge, there is no currently available software to convert between these formats, so we will also provide a format conversion tool. This may be useful even for groups that don't use any of our other software.
bshanks@15 330
bshanks@15 331
bshanks@15 332
bshanks@15 333 todo: is "marker gene" even a phrase that we should use at all?
bshanks@15 334
bshanks@15 335
bshanks@15 336
bshanks@15 337 note for aim 1 apps: combo of genes is for voxel, not within any single cell
bshanks@15 338
bshanks@15 339
bshanks@15 340
bshanks@15 341
bshanks@15 342 , as when genetic markers allow the development of selective interventions; the reason that one can be confident that the intervention is selective is that it is only turned on when a certain combination of genes is turned on and off. The result procedure is what assures us that when that combination is present, the local tissue is probably part of a certain subregion.
bshanks@15 343
bshanks@15 344
bshanks@15 345
bshanks@15 346 The basic idea is that we want to find a procedure by
bshanks@15 347
bshanks@15 348 The task of finding genes that mark anatomical areas can be phrased in terms of what the field of machine learning calls a "supervised learning" task. The goal of this task is to learn a function (the "classifier") which
bshanks@15 349
bshanks@15 350 If a person knows a combination of genes that mark an area, that implies that the person can be told how strongly those genes are expressed in any voxel, and the person can use this information to determine how
bshanks@15 351
bshanks@15 352 finding how to infer the areal identity of a voxel if given the gene expression profile of that voxel.
bshanks@15 353
bshanks@15 354
bshanks@15 355 For each voxel in the cortex, we want to start with data about the gene expression
bshanks@15 356
bshanks@15 357
bshanks@15 358
bshanks@15 371
bshanks@15 372
bshanks@15 373
bshanks@15 374
bshanks@15 375
bshanks@15 376
bshanks@15 377 it is helpful for the classifier to look at the global "shape" of gene expression patterns over the whole structure, rather than just nearby voxels.
bshanks@15 378
bshanks@15 379
bshanks@15 380
bshanks@15 381
bshanks@15 382 there is some small bit of additional information that can be gleaned from knowing the
bshanks@15 383
bshanks@15 384 ==== Design choices for a supervised learning procedure ====
bshanks@15 385
bshanks@15 386
bshanks@15 387 After all,
bshanks@15 388
bshanks@15 389 there is a small correlation between the gene expression levels from distant voxels and
bshanks@15 390
bshanks@15 391 Depending on how we intend to use the classifier, we may want to design it so that
bshanks@15 392
bshanks@15 393 It is possible for many things to
bshanks@15 394
bshanks@15 395 The choice of which data is made part of an instance
bshanks@15 396
bshanks@15 397 what we seek is a procedure
bshanks@15 398
bshanks@15 399 partition the tissue sample into subregions.
bshanks@15 400
bshanks@15 401 each part of the anatomical structure
bshanks@15 402
bshanks@15 403 One way to rephrase this task is to say that, instead of searching for the location of the subregions, we are looking to partition the tissue sample into subregions.
bshanks@15 404
bshanks@15 405
bshanks@15 406 There are various ways to look for marker genes. We will first define some terms and describe a few design choices that arise when creating a marker-gene-finding method; we will then present four principles that describe which options we have chosen.
bshanks@15 407
bshanks@15 408 In developing a procedure for finding marker genes, we are developing a procedure that takes a dataset of experimental observations and produces a result. One can think of the result as merely a list of genes, but really the result is an understanding of a predictive relationship between, on the one hand, the expression levels of genes, and, on the other hand, anatomical subregions.
bshanks@15 409
bshanks@15 410 One way to more formally define this understanding is to look at it as a procedure. In this view, the result of the learning procedure is itself a procedure. The result procedure provides a way to use the gene expression profiles of voxels in a tissue sample in order to determine where the subregions are.
bshanks@15 411
bshanks@15 412 This result procedure can be used directly, as when an experimenter has a tissue sample and needs to know which subregions are present in it and, if multiple subregions are present, where each one is. Or it can be used indirectly: imagine that the result procedure tells us that whenever a certain combination of genes is expressed, the local tissue is probably part of a certain subregion. This means that we can then confidently develop an intervention which is triggered only when that combination of genes is expressed; and to the extent that the result procedure is reliable, we know that the intervention will be triggered only in the target subregion.
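
The indirect use described above can be illustrated with a minimal sketch. It is not the method proposed here, only one way to make the idea concrete. It assumes that expression levels are scaled to the range 0 to 1, that a gene counts as "on" when its level reaches a fixed threshold, and that subregion labels are known for the voxels being checked. The gene names (geneA, geneB), the area names, and the threshold of 0.5 are hypothetical.

```python
def combination_present(profile, on_genes, off_genes, threshold=0.5):
    """True if every 'on' gene reaches the threshold and every 'off' gene stays below it."""
    return (all(profile[g] >= threshold for g in on_genes)
            and all(profile[g] < threshold for g in off_genes))


def selectivity(voxels, labels, target, on_genes, off_genes, threshold=0.5):
    """Fraction of triggered voxels that lie in the target subregion.

    Returns None when the combination is not present in any voxel.
    """
    triggered = [label for profile, label in zip(voxels, labels)
                 if combination_present(profile, on_genes, off_genes, threshold)]
    if not triggered:
        return None
    return sum(label == target for label in triggered) / len(triggered)


# Hypothetical toy data: one expression profile per voxel, plus its known subregion.
voxels = [
    {"geneA": 0.9, "geneB": 0.1},
    {"geneA": 0.8, "geneB": 0.2},
    {"geneA": 0.9, "geneB": 0.7},
    {"geneA": 0.1, "geneB": 0.1},
]
labels = ["area1", "area1", "area2", "area2"]

# "geneA on and geneB off" is triggered only inside area1 in this toy data.
score = selectivity(voxels, labels, "area1", on_genes=["geneA"], off_genes=["geneB"])
# Using geneA alone also triggers in one area2 voxel, so it is less selective.
score_single = selectivity(voxels, labels, "area1", on_genes=["geneA"], off_genes=[])
```

In this toy example the two-gene combination is more selective than the single gene, which is the situation in which an intervention keyed to the combination would be preferred.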
bshanks@15 413
bshanks@15 414 We said that the result procedure provides "a way to use the gene expression profiles of voxels in a tissue sample" in order to "determine where the subregions are".
bshanks@15 415
bshanks@15 416
bshanks@15 417 Does the result procedure take as input the gene expression profiles of all of the voxels in the entire tissue sample, and produce as output all of the subregional boundaries at once?
bshanks@15 418
bshanks@15 419
bshanks@15 420 Or are we given one voxel at a time,
bshanks@15 421
bshanks@15 422
bshanks@15 423 In the jargon of the field of machine learning, the result procedure is called a __classifier__.
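
As an illustration of the one-voxel-at-a-time reading, the following sketch learns a classifier from labeled voxels and then applies it to each voxel of a new sample independently, which yields a partition of that sample. The nearest-centroid rule is chosen here only because it is short; it is an assumption made for illustration, not the algorithm we propose. The function name, the two-gene profiles, and the area names are hypothetical.

```python
import math


def train_nearest_centroid(profiles, labels):
    """Learn one mean expression profile (centroid) per subregion and return a classifier."""
    sums, counts = {}, {}
    for profile, label in zip(profiles, labels):
        if label not in sums:
            sums[label] = [0.0] * len(profile)
            counts[label] = 0
        sums[label] = [s + x for s, x in zip(sums[label], profile)]
        counts[label] += 1
    centroids = {label: [s / counts[label] for s in sums[label]] for label in sums}

    def classifier(profile):
        # Assign the subregion whose centroid is closest in Euclidean distance.
        return min(centroids, key=lambda label: math.dist(profile, centroids[label]))

    return classifier


# Hypothetical training data: expression profiles of voxels with known subregions.
train_profiles = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.7]]
train_labels = ["area1", "area1", "area2", "area2"]
classify = train_nearest_centroid(train_profiles, train_labels)

# The classifier sees one voxel at a time; labeling every voxel partitions the sample.
new_sample = [[0.85, 0.15], [0.15, 0.8], [0.7, 0.3]]
partition = [classify(profile) for profile in new_sample]
```

A classifier of this kind ignores where a voxel sits in the tissue; a procedure that takes the whole sample as input could also use spatial context.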
bshanks@15 424
bshanks@15 425
bshanks@15 426 The task of finding genes that mark anatomical areas can be phrased in terms of what the field of machine learning calls a "supervised learning" task. The goal of this task is to learn a function (the "classifier") which
bshanks@15 427
bshanks@15 428 If a person knows a combination of genes that marks an area, that implies that the person can be told how strongly those genes are expressed in any voxel, and the person can use this information to determine how
bshanks@15 429
bshanks@15 430 finding how to infer the areal identity of a voxel if given the gene expression profile of that voxel.
bshanks@15 431
bshanks@15 432
bshanks@15 433 For each voxel in the cortex, we want to start with data about the gene expression
bshanks@15 434
bshanks@15 435
bshanks@15 436
bshanks@15 437 single voxels, but rather groups of voxels, such that the groups can be placed in some 2-D space. We will call such instances "pixels".
bshanks@15 438
bshanks@15 439 We have been speaking as if instances necessarily correspond to single voxels. But it is possible for instances to be groupings of many voxels, in which case all of the voxels in a grouping must be assigned the same label (that is, each voxel grouping must stay inside a single anatomical subregion).
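
The constraint on groupings can be checked mechanically. The sketch below assumes that each voxel already has a subregion label and that a grouping is given as a list of voxel indices; both representations, and the function name, are hypothetical.

```python
def group_labels(voxel_labels, groups):
    """Return one label per grouping; raise ValueError if a grouping spans subregions."""
    result = []
    for group in groups:
        labels_in_group = {voxel_labels[i] for i in group}
        if len(labels_in_group) != 1:
            # An empty grouping, or one that crosses a boundary, is not a valid instance.
            raise ValueError(f"grouping {group} does not lie in exactly one subregion")
        result.append(labels_in_group.pop())
    return result


voxel_labels = ["area1", "area1", "area2", "area2"]
instance_labels = group_labels(voxel_labels, [[0, 1], [2, 3]])
```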
bshanks@15 440
bshanks@15 441
bshanks@15 442
bshanks@15 443 In some but not all cases, the groups are either rows or columns of voxels. This is the case with the cerebral cortex, in which one may assume that columns of voxels which run perpendicular to the cortical surface all share the same areal identity. In the cortex, we call such an instance a "surface pixel", because such an instance represents the data associated with all voxels underneath a specific patch of the cortical surface.
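
A minimal sketch of how surface pixels might be computed for one gene is given below. It assumes that the data have already been resampled onto a grid whose third axis runs perpendicular to the cortical surface, which is a strong simplification for a curved cortex, and it assumes that the voxels in a column are combined by simple averaging. The function name and the toy values are hypothetical.

```python
def to_surface_pixels(volume):
    """Collapse volume[x][y][depth] into pixels[x][y] by averaging each column over depth."""
    return [[sum(column) / len(column) for column in row] for row in volume]


# Hypothetical 2 x 2 patch of cortical surface with two voxels of depth under each position.
volume = [
    [[1.0, 3.0], [2.0, 2.0]],
    [[0.0, 4.0], [5.0, 7.0]],
]
pixels = to_surface_pixels(volume)
```

Each entry of `pixels` then summarizes all of the voxels beneath one patch of the surface, so an areal label can be attached to the pixel as a whole.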