rev |
line source |
bshanks@0 | 1 Specific aims
|
bshanks@8 | 2 todo test3
|
bshanks@8 | 3 Massive new datasets obtained with techniques such as in situ hybridization
|
bshanks@0 | 4 (ISH) and BAC-transgenics allow the expression levels of many genes at many
|
bshanks@0 | 5 locations to be compared. Our goal is to develop automated methods to relate
|
bshanks@0 | 6 spatial variation in gene expression to anatomy. We want to find marker genes
|
bshanks@0 | 7 for specific anatomical regions, and also to draw new anatomical maps based on
|
bshanks@0 | 8 gene expression patterns. We have three specific aims:
|
bshanks@0 | 9 (1) develop an algorithm to screen spatial gene expression data for combina-
|
bshanks@0 | 10 tions of marker genes which selectively target anatomical regions
|
bshanks@0 | 11 (2) develop an algorithm to suggest new ways of carving up a structure into
|
bshanks@0 | 12 anatomical subregions, based on spatial patterns in gene expression
|
bshanks@0 | 13 (3) create a 2-D “flat map” dataset of the mouse cerebral cortex that contains
|
bshanks@0 | 14 a flattened version of the Allen Mouse Brain Atlas ISH data, as well as
|
bshanks@0 | 15 the boundaries of cortical anatomical areas. Use this dataset to validate
|
bshanks@0 | 16 the methods developed in (1) and (2).
|
bshanks@0 | 17 In addition to validating the usefulness of the algorithms, the application of
|
bshanks@0 | 18 these methods to cerebral cortex will produce immediate benefits, because there
|
bshanks@0 | 19 are currently no known genetic markers for many cortical areas. The results
|
bshanks@0 | 20 of the project will support the development of new ways to selectively target
|
bshanks@0 | 21 cortical areas, and it will support the development of a method for identifying
|
bshanks@0 | 22 the cortical areal boundaries present in small tissue samples.
|
bshanks@0 | 23 All algorithms that we develop will be implemented in an open-source soft-
|
bshanks@0 | 24 ware toolkit. The toolkit, as well as the machine-readable datasets developed
|
bshanks@0 | 25 in aim (3), will be published and freely available for others to use.
|
bshanks@0 | 26 Background and significance
|
bshanks@0 | 27 Aim 1
|
bshanks@0 | 28 Machine learning terminology
|
bshanks@0 | 29 The task of looking for marker genes for anatomical subregions means that one
|
bshanks@0 | 30 is looking for a set of genes such that, if the expression level of those genes is
|
bshanks@0 | 31 known, then the locations of the subregions can be inferred.
|
bshanks@0 | 32 If we define the subregions so that they cover the entire anatomical structure
|
bshanks@0 | 33 to be divided, then instead of saying that we are using gene expression to find
|
bshanks@0 | 34 the locations of the subregions, we may say that we are using gene expression to
|
bshanks@0 | 35 determine to which subregion each voxel within the structure belongs. We call
|
bshanks@0 | 36 this a classification task, because each voxel is being assigned to a class (namely,
|
bshanks@0 | 37 its subregion).
|
bshanks@0 | 40 Therefore, an understanding of the relationship between the combination of
|
bshanks@0 | 41 gene expression levels and the locations of the subregions may be expressed as
|
bshanks@0 | 42 a function. The input to this function is a voxel, along with the gene expression
|
bshanks@0 | 43 levels within that voxel; the output is the subregional identity of the target
|
bshanks@0 | 44 voxel, that is, the subregion to which the target voxel belongs. We call this
|
bshanks@0 | 45 function a classifier. In general, the input to a classifier is called an instance,
|
bshanks@0 | 46 and the output is called a label.
|
bshanks@0 | 47 The object of aim 1 is not to produce a single classifier, but rather to develop
|
bshanks@0 | 48 an automated method for determining a classifier for any known anatomical
|
bshanks@0 | 49 structure. Therefore, we seek a procedure by which a gene expression dataset
|
bshanks@0 | 50 may be analyzed in concert with an anatomical atlas in order to produce a
|
bshanks@0 | 51 classifier. Such a procedure is a type of machine learning procedure. The
|
bshanks@0 | 52 construction of the classifier is called training (also learning), and the initial
|
bshanks@0 | 53 gene expression dataset used in the construction of the classifier is called training
|
bshanks@0 | 54 data.
|
bshanks@0 | 55 In the machine learning literature, this sort of procedure may be thought
|
bshanks@0 | 56 of as a supervised learning task, defined as a task in which the goal is to learn
|
bshanks@0 | 57 a mapping from instances to labels, and the training data consists of a set of
|
bshanks@0 | 58 instances (voxels) for which the labels (subregions) are known.
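The terminology above can be made concrete with a minimal sketch. It is not the method proposed here: a nearest-centroid rule stands in for the learning procedure, and the expression values, the two-gene instances, and the subregion names "A" and "B" are invented for illustration.

```python
from collections import defaultdict

def train(instances, labels):
    """Training: build a classifier from instances whose labels are known."""
    sums = {}
    counts = defaultdict(int)
    for x, y in zip(instances, labels):
        if y not in sums:
            sums[y] = [0.0] * len(x)
        sums[y] = [s + v for s, v in zip(sums[y], x)]
        counts[y] += 1
    # One centroid (mean expression vector) per subregion.
    centroids = {y: [s / counts[y] for s in sums[y]] for y in sums}

    def classifier(x):
        """Map one instance (expression levels in a voxel) to a label."""
        def sq_dist(c):
            return sum((a - b) ** 2 for a, b in zip(x, c))
        return min(centroids, key=lambda y: sq_dist(centroids[y]))

    return classifier

# Hypothetical training data: each instance holds the expression levels
# of two genes in one voxel; each label names that voxel's subregion.
training_instances = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.7], [0.2, 0.9]]
training_labels = ["A", "A", "B", "B"]

classify = train(training_instances, training_labels)
predicted = classify([0.85, 0.15])  # label assigned to a new voxel
```

Here `train` plays the role of the machine learning procedure, its return value is the classifier, and the two lists are the training data.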
|
bshanks@0 | 59 Each gene expression level is called a feature, and the selection of which
|
bshanks@0 | 60 genes to include is called feature selection. Feature selection is one component
|
bshanks@0 | 61 of the task of learning a classifier. Some methods for learning classifiers start
|
bshanks@0 | 62 out with a separate feature selection phase, whereas other methods combine
|
bshanks@0 | 63 feature selection with other aspects of training.
|
bshanks@0 | 64 One class of feature selection methods assigns some sort of score to each
|
bshanks@0 | 65 candidate gene. The top-ranked genes are then chosen. Some scoring measures
|
bshanks@0 | 66 can assign a score to a set of selected genes, not just to a single gene; in this
|
bshanks@0 | 67 case, a dynamic procedure may be used in which features are added to or
|
bshanks@0 | 68 removed from the selected set depending on how much they raise the score. Such
|
bshanks@0 | 69 procedures are called “stepwise” or “greedy”.
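As an illustration of a stepwise procedure, the following sketch performs forward selection only (it adds genes but never removes them). The set score used here, the accuracy of thresholding the mean expression of the selected genes, is an assumption chosen for brevity, as are the toy expression values.

```python
def accuracy(selected, expression, labels, threshold=0.75):
    """Score a gene set: the fraction of voxels labeled correctly when a
    voxel is called in-region if its mean expression over the selected
    genes exceeds the threshold."""
    correct = 0
    for voxel, label in zip(expression, labels):
        mean = sum(voxel[g] for g in selected) / len(selected)
        correct += int((mean > threshold) == label)
    return correct / len(labels)

def forward_select(n_genes, score, max_genes):
    """Greedily add the gene that raises the set score the most;
    stop when no gene improves the score or max_genes is reached."""
    selected, best = [], float("-inf")
    while len(selected) < max_genes:
        candidates = [g for g in range(n_genes) if g not in selected]
        scored = [(score(selected + [g]), g) for g in candidates]
        if not scored:
            break
        top_score, top_gene = max(scored, key=lambda t: t[0])
        if top_score <= best:
            break
        selected.append(top_gene)
        best = top_score
    return selected, best

# Hypothetical data: rows are voxels, columns are genes 0..2.
# Gene 0 and gene 1 each cover the region but overshoot on one side;
# gene 2 is unrelated to the region.
expression = [
    [1, 0, 1],
    [1, 1, 0],
    [1, 1, 1],
    [1, 1, 0],
    [0, 1, 1],
    [0, 0, 0],
]
in_region = [False, True, True, True, False, False]

selected, best = forward_select(
    3, lambda genes: accuracy(genes, expression, in_region), max_genes=3)
```

In this toy case the procedure picks genes 0 and 1 and then stops, because adding gene 2 lowers the score.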
|
bshanks@0 | 70 Although the classifier itself may only look at the gene expression data within
|
bshanks@0 | 71 each voxel before classifying that voxel, the learning algorithm which constructs
|
bshanks@0 | 72 the classifier may look over the entire dataset. We can categorize score-based
|
bshanks@0 | 73 feature selection methods depending on how the score is calculated. Often
|
bshanks@0 | 74 the score calculation consists of assigning a sub-score to each voxel, and then
|
bshanks@0 | 75 aggregating these sub-scores into a final score (the aggregation is often a sum or
|
bshanks@0 | 76 a sum of squares). If only information from nearby voxels is used to calculate a
|
bshanks@0 | 77 voxel’s sub-score, then we say it is a local scoring method. If only information
|
bshanks@0 | 78 from the voxel itself is used to calculate a voxel’s sub-score, then we say it is a
|
bshanks@0 | 79 pointwise scoring method.
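The distinction can be sketched on a one-dimensional row of voxels. Both sub-score definitions below are assumptions made for illustration; in particular, the local score is a simple neighbor-difference comparison and is not claimed to be any specific measure used elsewhere in this document.

```python
def pointwise_score(expr, mask):
    """Sum of sub-scores that each use only one voxel's own data:
    the negated squared error between expression and region membership."""
    return -sum((e - m) ** 2 for e, m in zip(expr, mask))

def local_score(expr, mask):
    """Sum of sub-scores that each also use the neighboring voxel:
    the product of the step in expression and the step in region
    membership between adjacent voxels."""
    return sum((expr[i + 1] - expr[i]) * (mask[i + 1] - mask[i])
               for i in range(len(expr) - 1))

# Hypothetical row of 7 voxels; the region occupies voxels 2..4.
mask = [0, 0, 1, 1, 1, 0, 0]
# Gene A: graded expression, highest inside the region, no sharp border.
gene_a = [0.25, 0.5, 0.75, 0.75, 0.75, 0.5, 0.25]
# Gene B: sharp border on the left edge, but covers only part of the region.
gene_b = [0, 0, 1, 1, 0, 0, 0]

scores = {
    "A": (pointwise_score(gene_a, mask), local_score(gene_a, mask)),
    "B": (pointwise_score(gene_b, mask), local_score(gene_b, mask)),
}
```

With these toy values gene A ranks higher under the pointwise score and gene B ranks higher under the local score, so the two kinds of score can disagree about which gene is the better marker.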
|
bshanks@0 | 80 Key questions when choosing a learning method are: What are the instances?
|
bshanks@0 | 81 What are the features? How are the features chosen? Here are four principles
|
bshanks@0 | 82 that outline our answers to these questions.
|
bshanks@0 | 85 Principle 1: Combinatorial gene expression
|
bshanks@0 | 86 Above, we defined an “instance” as the combination of a voxel with the “asso-
|
bshanks@0 | 87 ciated gene expression data”. In our case this refers to the expression level of
|
bshanks@0 | 88 genes within the voxel, but should we include the expression levels of all genes,
|
bshanks@0 | 89 or only a few of them?
|
bshanks@0 | 90 It is too much to hope that every anatomical region of interest will be iden-
|
bshanks@0 | 91 tified by a single gene. For example, in the cortex, there are some areas which
|
bshanks@0 | 92 are not clearly delineated by any gene included in the Allen Brain Atlas (ABA)
|
bshanks@0 | 93 dataset. However, at least some of these areas can be delineated by looking
|
bshanks@0 | 94 at combinations of genes (an example of an area for which multiple genes are
|
bshanks@0 | 95 necessary and sufficient is provided in Preliminary Results).
|
bshanks@0 | 96 Principle 2: Only look at combinations of small numbers of genes
|
bshanks@0 | 97 When the classifier classifies a voxel, it is only allowed to look at the expression of
|
bshanks@0 | 98 the genes which have been selected as features. The more data that is available
|
bshanks@0 | 99 to a classifier, the better it can do. For example, perhaps there are weak
|
bshanks@0 | 100 correlations over many genes that add up to a strong signal. So, why not include
|
bshanks@0 | 101 every gene as a feature? The reason is that we wish to employ the classifier in
|
bshanks@0 | 102 situations in which it is not feasible to gather data about every gene. For
|
bshanks@0 | 103 example, if we want to use the expression of marker genes as a trigger for some
|
bshanks@0 | 104 regionally-targeted intervention, then our intervention must contain a molecular
|
bshanks@0 | 105 mechanism to check the expression level of each marker gene before it triggers.
|
bshanks@0 | 106 It is currently infeasible to design a molecular trigger that checks the level of
|
bshanks@0 | 107 more than a handful of genes. Similarly, if the goal is to develop a procedure to
|
bshanks@0 | 108 do ISH on tissue samples in order to label their anatomy, then it is infeasible
|
bshanks@0 | 109 to label more than a few genes. Therefore, we must select only a few genes as
|
bshanks@0 | 110 features.
|
bshanks@0 | 111 Principle 3: Use geometry in feature selection
|
bshanks@1 | 112 When doing feature selection with score-based methods, the simplest thing to do
|
bshanks@1 | 113 would be to score the performance of each voxel by itself and then combine these
|
bshanks@1 | 114 scores (pointwise scoring). A more powerful approach is to also use information
|
bshanks@1 | 115 about the geometric relations between each voxel and its neighbors; this requires
|
bshanks@1 | 116 non-pointwise, local scoring methods. See Preliminary Results for evidence of
|
bshanks@1 | 117 the complementary nature of pointwise and local scoring methods.
|
bshanks@0 | 118 Principle 4: Work in 2-D whenever possible
|
bshanks@0 | 119 There are many anatomical structures which are commonly characterized in
|
bshanks@0 | 120 terms of a two-dimensional manifold. When it is known that the structure that
|
bshanks@0 | 121 one is looking for is two-dimensional, the results may be improved by allowing
|
bshanks@0 | 122 the analysis algorithm to take advantage of this prior knowledge. In addition,
|
bshanks@0 | 123 it is easier for humans to visualize and work with 2-D data.
|
bshanks@0 | 124 Therefore, when possible, the instances should represent pixels, not voxels.
|
bshanks@1 | 127 Aim 2
|
bshanks@1 | 128 todo
|
bshanks@0 | 129 Aim 3
|
bshanks@0 | 130 Background
|
bshanks@0 | 131 The cortex is divided into areas and layers. To a first approximation, the par-
|
bshanks@0 | 132 cellation of the cortex into areas can be drawn as a 2-D map on the surface
|
bshanks@0 | 133 of the cortex. In the third dimension, the boundaries between the areas con-
|
bshanks@0 | 134 tinue downwards into the cortical depth, perpendicular to the surface. The layer
|
bshanks@0 | 135 boundaries run parallel to the surface. One can picture an area of the cortex as
|
bshanks@0 | 136 a slice of many-layered cake.
|
bshanks@0 | 137 Although it is known that different cortical areas have distinct roles in both
|
bshanks@0 | 138 normal functioning and in disease processes, there are no known marker genes
|
bshanks@0 | 139 for many cortical areas. When it is necessary to divide a tissue sample into
|
bshanks@0 | 140 cortical areas, this is a manual process that requires a skilled human to combine
|
bshanks@0 | 141 multiple visual cues and interpret them in the context of their approximate
|
bshanks@0 | 142 location upon the cortical surface.
|
bshanks@0 | 143 Even the questions of how many areas should be recognized in cortex, and
|
bshanks@0 | 144 what their arrangement is, are still not completely settled. A proposed division
|
bshanks@0 | 145 of the cortex into areas is called a cortical map. In the rodent, the lack of a
|
bshanks@0 | 146 single agreed-upon map can be seen by contrasting the recent maps given by
|
bshanks@0 | 147 Swanson?? on the one hand, and Paxinos and Franklin?? on the other. While
|
bshanks@0 | 148 the maps are certainly very similar in their general arrangement, significant
|
bshanks@0 | 149 differences remain in the details.
|
bshanks@0 | 150 Significance
|
bshanks@0 | 151 The method developed in aim (1) will be applied to each cortical area to find
|
bshanks@0 | 152 a set of marker genes such that the combinatorial expression pattern of those
|
bshanks@0 | 153 genes uniquely picks out the target area. Finding marker genes will be useful
|
bshanks@0 | 154 for drug discovery as well as for experimentation because marker genes can be
|
bshanks@0 | 155 used to design interventions which selectively target individual cortical areas.
|
bshanks@0 | 156 The application of the marker gene finding algorithm to the cortex will
|
bshanks@0 | 157 also support the development of new neuroanatomical methods. In addition to
|
bshanks@0 | 158 finding markers for each individual cortical area, we will find a small panel
|
bshanks@0 | 159 of genes that can find many of the areal boundaries at once. This panel of
|
bshanks@0 | 160 marker genes will enable the development of an ISH protocol that will allow
|
bshanks@0 | 161 experimenters to more easily identify which anatomical areas are present in
|
bshanks@0 | 162 small samples of cortex.
|
bshanks@0 | 163 The method developed in aim (3) will provide a genoarchitectonic viewpoint
|
bshanks@0 | 164 that will contribute to the creation of a better map. The development of present-
|
bshanks@0 | 165 day cortical maps was driven by the application of histological stains. It is
|
bshanks@0 | 166 conceivable that if a different set of stains had been available which identified
|
bshanks@0 | 167 a different set of features, then today’s cortical maps would have come out
|
bshanks@0 | 170 differently. Since the number of classes of stains is small compared to the number
|
bshanks@0 | 171 of genes, it is likely that there are many repeated, salient spatial patterns in
|
bshanks@0 | 172 the gene expression which have not yet been captured by any stain. Therefore,
|
bshanks@0 | 173 current ideas about cortical anatomy need to incorporate what we can learn
|
bshanks@0 | 174 from looking at the patterns of gene expression.
|
bshanks@0 | 175 While we do not here propose to analyze human gene expression data, it is
|
bshanks@0 | 176 conceivable that the methods we propose to develop could be used to suggest
|
bshanks@0 | 177 modifications to the human cortical map as well.
|
bshanks@0 | 178 Related work
|
bshanks@1 | 179 todo
|
bshanks@0 | 180 Preliminary work
|
bshanks@0 | 181 Justification of principles 1 through 3
|
bshanks@0 | 182 Principle 1: Combinatorial gene expression
|
bshanks@0 | 183 Here we give an example of a cortical area which is not marked by any single
|
bshanks@0 | 184 gene, but which can be identified combinatorially. According to logistic
|
bshanks@0 | 185 regression, gene wwc1 (footnote 1) is the best-fit single gene for predicting whether or not a pixel
|
bshanks@0 | 186 on the cortical surface belongs to the motor area (area MO). The upper-left
|
bshanks@0 | 187 picture in Figure 1 shows wwc1’s spatial expression pattern over the cortex. The
|
bshanks@0 | 188 lower-right boundary of MO is represented reasonably well by this gene; however,
|
bshanks@0 | 189 the gene overshoots the upper-left boundary. This flattened 2-D representation
|
bshanks@0 | 190 does not show it, but the area corresponding to the overshoot is the medial
|
bshanks@0 | 191 surface of the cortex. MO is only found on the lateral surface (todo).
|
bshanks@0 | 192 Gene mtif2 (footnote 2) is shown in the upper-right of Figure 1. Mtif2 captures MO’s
|
bshanks@0 | 193 upper-left boundary, but not its lower-right boundary. Mtif2 does not express
|
bshanks@0 | 194 very much on the medial surface. By adding together the values at each pixel
|
bshanks@0 | 195 in these two pictures, we get the lower-left of Figure 1. This combination captures
|
bshanks@0 | 196 area MO much better than any single gene.
|
bshanks@0 | 197 Principle 2: Only look at combinations of small numbers of genes
|
bshanks@0 | 198 In order to see how well one can do when looking at all genes at once, we ran
|
bshanks@0 | 199 a support vector machine to classify cortical surface pixels based on their gene
|
bshanks@0 | 200 expression profiles. We achieved classification accuracy of about 81% (footnote 3). As noted
|
bshanks@0 | 201 above, however, a classifier that looks at all the genes at once isn’t practically
|
bshanks@0 | 202 useful.
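The accuracy figure above is reported under 5-fold cross-validation. The sketch below shows how such an estimate is computed. It is only an illustration: a one-gene midpoint-threshold rule stands in for the support vector machine, the fold assignment by index is an arbitrary choice, and the expression values are invented.

```python
def train_midpoint(xs, ys):
    """Stand-in learner: threshold halfway between the mean expression
    inside the region and the mean expression outside it."""
    inside = [x for x, y in zip(xs, ys) if y]
    outside = [x for x, y in zip(xs, ys) if not y]
    cut = (sum(inside) / len(inside) + sum(outside) / len(outside)) / 2
    return lambda x: x > cut

def k_fold_accuracy(instances, labels, train, k=5):
    """Hold out each fold in turn, train on the remaining instances,
    and count correct predictions on the held-out fold."""
    n = len(instances)
    correct = 0
    for fold in range(k):
        test_idx = [i for i in range(n) if i % k == fold]
        train_idx = [i for i in range(n) if i % k != fold]
        classifier = train([instances[i] for i in train_idx],
                           [labels[i] for i in train_idx])
        correct += sum(classifier(instances[i]) == labels[i]
                       for i in test_idx)
    return correct / n

# Hypothetical pixels: expression of one gene, and region membership.
xs = [0.9, 0.1, 0.8, 0.2, 0.7, 0.3, 0.9, 0.1, 0.8, 0.2]
ys = [True, False, True, False, True, False, True, False, True, False]

cv_accuracy = k_fold_accuracy(xs, ys, train_midpoint, k=5)
```

Because every pixel is classified by a model that never saw it during training, the resulting accuracy estimates performance on new data rather than fit to the training data.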
|
bshanks@8 | 203 _____________________
|
bshanks@0 | 204 1 “WW, C2 and coiled-coil domain containing 1”; EntrezGene ID 211652
|
bshanks@0 | 205 2 “mitochondrial translational initiation factor 2”; EntrezGene ID 76784
|
bshanks@0 | 206 3 Using the Shogun SVM package (todo:cite), with parameters type=GMNPSVM
|
bshanks@0 | 207 (multi-class b-SVM), kernel = gaussian with sigma = 0.1, c = 10, epsilon = 1e-1; these are the
|
bshanks@0 | 208 first parameters we tried, so presumably performance would improve with different choices of
|
bshanks@0 | 209 parameters. 5-fold cross-validation.
|
bshanks@0 | 211
|
bshanks@0 | 212
|
bshanks@0 | 213
|
bshanks@0 | 214 Figure 1: Upper left: wwc1. Upper right: mtif2. Lower left: wwc1 + mtif2
|
bshanks@0 | 215 (each pixel’s value on the lower left is the sum of the corresponding pixels in
|
bshanks@0 | 216 the upper row). Within each picture, the vertical axis roughly corresponds to
|
bshanks@0 | 217 anterior at the top and posterior at the bottom, and the horizontal axis roughly
|
bshanks@0 | 218 corresponds to medial at the left and lateral at the right. The red outline is
|
bshanks@0 | 219 the boundary of region MO. Pixels are colored approximately according to the
|
bshanks@0 | 220 density of expressing cells underneath each pixel, with red meaning a lot of
|
bshanks@0 | 221 expression and blue meaning little.
|
bshanks@8 | 224 The requirement to find combinations of only a small number of genes prevents
|
bshanks@8 | 225 us from straightforwardly applying many of the simplest techniques from
|
bshanks@1 | 226 the field of supervised machine learning. In the parlance of machine learning,
|
bshanks@1 | 227 our task combines feature selection with supervised learning.
|
bshanks@0 | 228 Principle 3: Use geometry
|
bshanks@0 | 229 To show that local geometry can provide useful information that cannot be
|
bshanks@0 | 230 detected via pointwise analyses, consider Figure 2. The top row of Figure 2 displays
|
bshanks@0 | 231 the 3 genes which most match area AUD, according to a pointwise method (footnote 4). The
|
bshanks@0 | 232 bottom row displays the 3 genes which most match AUD according to a method
|
bshanks@0 | 233 which considers local geometry (footnote 5). The pointwise method in the top row identifies
|
bshanks@0 | 234 genes which express more strongly in AUD than outside of it; its weakness is that
|
bshanks@0 | 235 this includes many genes which don’t have a salient border matching the areal
|
bshanks@0 | 236 border. The geometric method identifies genes whose salient expression border
|
bshanks@0 | 237 seems to partially line up with the border of AUD; its weakness is that this
|
bshanks@0 | 238 includes genes which don’t express over the entire area. Genes which have high
|
bshanks@0 | 239 rankings using both pointwise and border criteria, such as Aph1a in the example,
|
bshanks@0 | 240 may be particularly good markers. None of these genes are, individually, a
|
bshanks@0 | 241 perfect marker for AUD; we deliberately chose a “difficult” area in order to
|
bshanks@0 | 242 better contrast pointwise with geometric methods.
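The pointwise ranking described in footnote 4 can be sketched as follows. This is a minimal illustration under stated assumptions: the one-variable logistic regression is fit by plain gradient ascent, the log-likelihood of the fit is used as the ranking score (the footnote does not say which score was used), and the gene names and expression values are invented.

```python
import math

def fit_logistic(xs, ys, steps=2000, rate=0.5):
    """Fit p(in area | x) = sigmoid(w*x + b) for one gene by gradient
    ascent on the mean log-likelihood."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))
            grad_w += (y - p) * x
            grad_b += (y - p)
        w += rate * grad_w / n
        b += rate * grad_b / n
    return w, b

def log_likelihood(xs, ys, w, b):
    """Log-likelihood of the labels under the fitted model."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = 1.0 / (1.0 + math.exp(-(w * x + b)))
        total += math.log(p) if y else math.log(1.0 - p)
    return total

# Hypothetical surface pixels: 1 = inside the target area, 0 = outside.
in_area = [1, 1, 1, 1, 0, 0, 0, 0]
# Hypothetical expression of two genes underneath each pixel.
genes = {
    "gene_a": [0.9, 0.8, 0.7, 0.4, 0.5, 0.2, 0.1, 0.0],  # tracks the area
    "gene_b": [0.5, 0.1, 0.9, 0.3, 0.6, 0.2, 0.8, 0.4],  # unrelated
}

score = {}
for name, xs in genes.items():
    w, b = fit_logistic(xs, in_area)
    score[name] = log_likelihood(xs, in_area, w, b)

# Genes ordered from best to worst predictor of the area.
ranking = sorted(genes, key=lambda name: score[name], reverse=True)
```

Each gene is scored independently and only from the value underneath each pixel, which is what makes this a pointwise method.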
|
bshanks@0 | 243 Principle 4: Work in 2-D whenever possible
|
bshanks@0 | 244 In anatomy, the manifold of interest is usually either defined by a combination
|
bshanks@0 | 245 of two relevant anatomical axes (todo), or by the surface of the structure (as is
|
bshanks@0 | 246 the case with the cortex). In the former case, the manifold of interest is a plane,
|
bshanks@0 | 247 but in the latter case it is curved. If the manifold is curved, there are various
|
bshanks@0 | 248 methods for mapping the manifold into a plane.
|
bshanks@0 | 249 The method that we will develop will begin by mapping the data into a
|
bshanks@0 | 250 2-D plane. Although the manifold that characterizes cortical areas is known
|
bshanks@0 | 251 to be the cortical surface, it remains to be seen which method of mapping the
|
bshanks@0 | 252 manifold into a plane is optimal for this application. We will compare mappings
|
bshanks@0 | 253 which attempt to preserve size (such as the one used by Caret??) with mappings
|
bshanks@0 | 254 which preserve angle (conformal maps).
|
bshanks@0 | 255 Although there is much 2-D organization in anatomy, there are also struc-
|
bshanks@0 | 256 tures whose shape is fundamentally 3-dimensional. If possible, we would like
|
bshanks@0 | 257 the method we develop to include a statistical test that warns the user if the
|
bshanks@0 | 258 assumption of 2-D structure seems to be wrong.
|
bshanks@0 | 259 ——
|
bshanks@8 | 260 ____________________
|
bshanks@8 | 261 4 For each gene, a logistic regression was fit in which the response variable was whether or not a
|
bshanks@8 | 262 surface pixel was within area AUD, and the predictor variable was the value of the expression
|
bshanks@8 | 263 of the gene underneath that pixel. The resulting scores were used to rank the genes in terms
|
bshanks@8 | 264 of how well they predict area AUD.
|
bshanks@8 | 265 5 For each gene, the gradient similarity (see section ??) between (a) a map of the expression
|
bshanks@8 | 266 of that gene on the cortical surface and (b) the shape of area AUD was calculated, and this
|
bshanks@8 | 267 was used to rank the genes.
|
bshanks@8 | 269
|
bshanks@8 | 270
|
bshanks@8 | 271
|
bshanks@8 | 272 Figure 2: The top row shows the three genes which (individually) best predict
|
bshanks@8 | 273 area AUD, according to logistic regression. The bottom row shows the three
|
bshanks@8 | 274 genes which (individually) best match area AUD, according to gradient similar-
|
bshanks@8 | 275 ity. From left to right and top to bottom, the genes are Ssr1, Efcbp1, Aph1a,
|
bshanks@8 | 276 Ptk7, Aph1a again, and Lepr.
|
bshanks@0 | 277 Massive new datasets obtained with techniques such as in situ hybridization
|
bshanks@0 | 278 (ISH) and BAC-transgenics allow the expression levels of many genes at many
|
bshanks@0 | 279 locations to be compared. This can be used to find marker genes for specific
|
bshanks@0 | 280 anatomical structures, as well as to draw new anatomical maps. Our goal is
|
bshanks@0 | 281 to develop automated methods to relate spatial variation in gene expression to
|
bshanks@0 | 282 anatomy. We have five specific aims:
|
bshanks@0 | 283 (1) develop an algorithm to screen spatial gene expression data for combi-
|
bshanks@0 | 284 nations of marker genes which selectively target individual anatomical
|
bshanks@0 | 285 structures
|
bshanks@0 | 286 (2) develop an algorithm to screen spatial gene expression data for combina-
|
bshanks@0 | 287 tions of marker genes which can be used to delineate most of the bound-
|
bshanks@0 | 288 aries between a number of anatomical structures at once
|
bshanks@0 | 289 (3) develop an algorithm to suggest new ways of dividing a structure up into
|
bshanks@0 | 290 anatomical subregions, based on spatial patterns in gene expression
|
bshanks@0 | 291 (4) create a flat (2-D) map of the mouse cerebral cortex that contains a flat-
|
bshanks@0 | 292 tened version of the Allen Mouse Brain Atlas ISH dataset, as well as the
|
bshanks@0 | 293 boundaries of anatomical areas within the cortex. For each cortical layer,
|
bshanks@0 | 294 a layer-specific flat dataset will be created. A single combined flat dataset
|
bshanks@0 | 295 will be created which averages information from all of the layers. These
|
bshanks@0 | 296 datasets will be made available in both MATLAB and Caret formats.
|
bshanks@0 | 297 (5) validate the methods developed in (1), (2) and (3) by applying them to
|
bshanks@0 | 298 the cerebral cortex datasets created in (4)
|
bshanks@0 | 301 All algorithms that we develop will be implemented in an open-source soft-
|
bshanks@0 | 302 ware toolkit. The toolkit, as well as the machine-readable datasets developed in
|
bshanks@0 | 303 aim (4) and any other intermediate dataset we produce, will be published and
|
bshanks@0 | 304 freely available for others to use.
|
bshanks@0 | 305 In addition to developing generally useful methods, the application of these
|
bshanks@0 | 306 methods to cerebral cortex will produce immediate benefits that are only one
|
bshanks@0 | 307 step removed from clinical application, while also supporting the development
|
bshanks@0 | 308 of new neuroanatomical techniques. The method developed in aim (1) will be
|
bshanks@0 | 309 applied to each cortical area to find a set of marker genes. Currently, despite
|
bshanks@0 | 310 the distinct roles of different cortical areas in both normal functioning and
|
bshanks@0 | 311 disease processes, there are no known marker genes for many cortical areas.
|
bshanks@0 | 312 Finding marker genes will be immediately useful for drug discovery as well as for
|
bshanks@0 | 313 experimentation because once marker genes for an area are known, interventions
|
bshanks@0 | 314 can be designed which selectively target that area.
|
bshanks@0 | 315 The method developed in aim (2) will be used to find a small panel of genes
|
bshanks@0 | 316 that can find most of the boundaries between areas in the cortex. Today, finding
|
bshanks@0 | 317 cortical areal boundaries in a tissue sample is a manual process that requires a
|
bshanks@0 | 318 skilled human to combine multiple visual cues over a large area of the cortical
|
bshanks@0 | 319 surface. A panel of marker genes will allow the development of an ISH protocol
|
bshanks@0 | 320 that will allow experimenters to more easily identify which anatomical areas are
|
bshanks@0 | 321 present in small samples of cortex.
|
bshanks@0 | 322 For each cortical layer, a layer-specific flat dataset will be created. A single
|
bshanks@0 | 323 combined flat dataset will be created which averages information from all of
|
bshanks@0 | 324 the layers. These datasets will be made available in both MATLAB and Caret
|
bshanks@0 | 325 formats.
|
bshanks@6 | 326 ___________________________________________________________
|
bshanks@6 | 327 New techniques allow the expression levels of many genes at many locations
|
bshanks@6 | 328 to be compared. It is thought that even neighboring anatomical structures have
|
bshanks@6 | 329 different gene expression profiles. We propose to develop automated methods
|
bshanks@6 | 330 to relate the spatial variation in gene expression to anatomy. We will develop
|
bshanks@6 | 331 two kinds of techniques:
|
bshanks@6 | 332 (a) techniques to screen for combinations of marker genes which selectively
|
bshanks@6 | 333 target anatomical structures
|
bshanks@6 | 334 (b) techniques to suggest new ways of dividing a structure up into anatomical
|
bshanks@6 | 335 subregions, based on the shapes of contours in the gene expression
|
bshanks@6 | 336 The first kind of technique will be helpful for finding marker genes associated
|
bshanks@6 | 337 with known anatomical features. The second kind of technique will be helpful in
|
bshanks@6 | 338 creating new anatomical maps, maps which reflect differences in gene expression
|
bshanks@6 | 339 the same way that existing maps reflect differences in histology.
|
bshanks@6 | 340 We intend to develop our techniques using the adult mouse cerebral cortex
|
bshanks@6 | 341 as a testbed. The Allen Brain Atlas has collected a dataset containing the
|
bshanks@6 | 342 expression level of about 4000 genes* over a set of over 150000 voxels, with a
|
bshanks@6 | 343 spatial resolution of approximately 200 microns[?].
|
bshanks@8 | 346 We expect to discover sets of marker genes that pick out specific cortical
|
bshanks@8 | 347 areas. This will allow the development of drugs and other interventions that
|
bshanks@8 | 348 selectively target individual cortical areas. Therefore our research will lead
|
bshanks@0 | 349 to application in drug discovery, in the development of other targeted clinical
|
bshanks@0 | 350 interventions, and in the development of new experimental techniques.
|
bshanks@0 | 351 The best way to divide up rodent cortex into areas has not been completely
|
bshanks@0 | 352 determined, as can be seen by the differences in the recent maps given by Swan-
|
bshanks@0 | 353 son on the one hand, and Paxinos and Franklin on the other. It is likely that our
|
bshanks@0 | 354 study, by showing which areal divisions naturally follow from gene expression
|
bshanks@0 | 355 data, as opposed to traditional histological data, will contribute to the creation
|
bshanks@0 | 356 of a better map. While we do not here propose to analyze human gene expres-
|
bshanks@0 | 357 sion data, it is conceivable that the methods we propose to develop could be
|
bshanks@0 | 358 used to suggest modifications to the human cortical map as well.
|
bshanks@0 | 359 In the following, we will only be talking about coronal data.
|
bshanks@0 | 360 The Allen Brain Atlas provides “Smoothed Energy Volumes”, which are
|
bshanks@0 | 361 One type of artifact in the Allen Brain Atlas data is what we call a “slice
|
bshanks@0 | 362 artifact”. We have noticed two types of slice artifacts in the dataset. The first
|
bshanks@0 | 363 type, a “missing slice artifact”, occurs when the ISH procedure on a slice did
|
bshanks@0 | 364 not come out well. In this case, the Allen Brain investigators excluded the slice
|
bshanks@0 | 365 at issue from the dataset. This means that no gene expression information is
|
bshanks@0 | 366 available for that gene for the region of space covered by that slice. This results
|
bshanks@0 | 367 in an expression level of zero being assigned to voxels covered by the slice. This
|
bshanks@0 | 368 is partially but not completely ameliorated by the smoothing that is applied to
|
bshanks@0 | 369 create the Smoothed Energy Volumes. The usual end result is that a region of
|
bshanks@0 | 370 space which is shaped and oriented like a coronal slice is marked as having less
|
bshanks@0 | 371 gene expression than surrounding regions.
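One simple way to screen for this kind of artifact is sketched below. It is an assumption-laden illustration, not a procedure taken from the Allen Brain Atlas or from this project: the volume is represented as a list of coronal slices ordered along the rostrocaudal axis, and a slice is flagged when its mean expression falls below a hypothetical fraction (`ratio`) of the mean of its two neighbors.

```python
def slice_means(volume):
    """Mean expression of each coronal slice; the volume is a list of
    slices, and each slice is a flat list of voxel values."""
    return [sum(s) / len(s) for s in volume]

def flag_missing_slices(volume, ratio=0.5):
    """Return indices of interior slices whose mean expression is less
    than `ratio` times the average of the two adjacent slice means."""
    means = slice_means(volume)
    flagged = []
    for i in range(1, len(means) - 1):
        neighbors = (means[i - 1] + means[i + 1]) / 2
        if neighbors > 0 and means[i] < ratio * neighbors:
            flagged.append(i)
    return flagged

# Hypothetical volume for one gene: five slices of two voxels each.
# Slice 2 shows the dip expected from a smoothed-over missing slice.
volume = [[4, 6], [5, 5], [1, 1], [5, 7], [6, 4]]

suspect = flag_missing_slices(volume)
```

A gene whose true expression happens to drop within a single coronal plane would also be flagged, so flagged slices would need to be checked against the list of slices actually excluded from the dataset.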
|
bshanks@0 | 372 The second type of slice artifact is caused by the fact that all of the slices
|
bshanks@0 | 373 have a consistent orientation. Since there may be artifacts (such as how well
|
bshanks@0 | 374 the ISH worked) which are constant within each slice but which vary between
|
bshanks@0 | 375 different slices, the result is that ceteris paribus, when one compares the genetic
|
bshanks@0 | 376 data of a voxel to another voxel within the same coronal plane, one would expect
|
bshanks@0 | 377 to find more similarity than if one compared a voxel to another voxel displaced
|
bshanks@0 | 378 along the rostrocaudal axis.
|
bshanks@0 | 379 We are enthusiastic about the sharing of methods, data, and results, and
|
bshanks@0 | 380 at the conclusion of the project, we will make all of our data and computer
|
bshanks@0 | 381 source code publicly available. Our goal is that replicating our results, or
|
bshanks@0 | 382 applying the methods we develop to other targets, will be quick and easy for
|
bshanks@0 | 383 other investigators. In order to aid in understanding and replicating our results,
|
bshanks@0 | 384 we intend to include a software program which, when run, will take as input
|
bshanks@0 | 385 the Allen Brain Atlas raw data, and produce as output all numbers and charts
|
bshanks@0 | 386 found in publications resulting from the project.
|
bshanks@0 | 387 To aid in the replication of our results, we will include a script which takes
|
bshanks@0 | 388 as input the dataset in aim (3) and provides as output all of the tables and figures
|
bshanks@0 | 389 in our publications.
|
bshanks@0 | 390 We also expect to weigh in on the debate about how to best partition rodent
|
bshanks@0 | 391 cortex
|
bshanks@0 | 394 be useful for drug discovery as well
|
bshanks@0 | 395 * Another 16000 genes are available, but they do not cover the entire cerebral
|
bshanks@0 | 396 cortex with high spatial resolution.
|
bshanks@0 | 397 User-definable ROIs; combinatorial gene expression; negative as well as
|
bshanks@0 | 398 positive signal; use geometry; search for local boundaries if necessary; flatmapped
|
bshanks@0 | 399 Specific aims
|
bshanks@0 | 400 Develop algorithms that find genetic markers for anatomical regions
|
bshanks@0 | 401 1. Develop scoring measures for evaluating how good individual genes are at
|
bshanks@0 | 402 marking areas: we will compare pointwise, geometric, and information-
|
bshanks@0 | 403 theoretic measures.
|
bshanks@0 | 404 2. Develop a procedure to find single marker genes for anatomical regions: for
|
bshanks@0 | 405 each cortical area, by using or combining the scoring measures developed,
|
bshanks@0 | 406 we will rank the genes by their ability to delineate each area.
|
bshanks@0 | 407 3. Extend the procedure to handle difficult areas by using combinatorial cod-
|
bshanks@0 | 408 ing: for areas that cannot be identified by any single gene, identify them
|
bshanks@0 | 409 with a handful of genes. We will consider both (a) algorithms that incre-
|
bshanks@0 | 410 mentally/greedily combine single gene markers into sets, such as forward
|
bshanks@0 | 411 stepwise regression and decision trees, and also (b) supervised learning
|
bshanks@0 | 412 techniques which use soft constraints to minimize the number of features,
|
bshanks@0 | 413 such as sparse support vector machines.
|
bshanks@0 | 414 4. Extend the procedure to handle difficult areas by combining or redrawing
|
bshanks@0 | 415 the boundaries: An area may be difficult to identify because the bound-
|
bshanks@0 | 416 aries are misdrawn, or because it does not “really” exist as a single area,
|
bshanks@0 | 417 at least on the genetic level. We will develop extensions to our procedure
|
bshanks@0 | 418 which (a) detect when a difficult area could be fit if its boundary were
|
bshanks@0 | 419 redrawn slightly, and (b) detect when a difficult area could be combined
|
bshanks@0 | 420 with adjacent areas to create a larger area which can be fit.
|
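The greedy combination of single-gene markers described in step 3(a) above can be sketched as a forward selection loop. This is a minimal illustration under our own assumptions, not the procedure we have committed to: the scoring function here is the accuracy of a nearest-centroid rule, used as a simple stand-in for the regression or decision-tree fits named above, and all function names are hypothetical.

```python
def nearest_centroid_accuracy(profiles, labels, genes):
    """Fraction of voxels assigned to the correct side (inside/outside the area)
    by a nearest-centroid rule that looks only at the listed genes.
    Assumes at least one voxel inside and one voxel outside the area."""
    inside = [p for p, y in zip(profiles, labels) if y]
    outside = [p for p, y in zip(profiles, labels) if not y]
    c_in = [sum(p[g] for p in inside) / len(inside) for g in genes]
    c_out = [sum(p[g] for p in outside) / len(outside) for g in genes]
    correct = 0
    for p, y in zip(profiles, labels):
        d_in = sum((p[g] - c) ** 2 for g, c in zip(genes, c_in))
        d_out = sum((p[g] - c) ** 2 for g, c in zip(genes, c_out))
        correct += (d_in < d_out) == bool(y)
    return correct / len(profiles)

def greedy_marker_set(profiles, labels, max_genes,
                      score=nearest_centroid_accuracy):
    """Forward selection: repeatedly add the gene that most improves `score`,
    stopping when no gene helps or when max_genes have been chosen.

    profiles: list of per-voxel expression profiles (lists of floats).
    labels: list of booleans, True where the voxel lies in the target area.
    """
    chosen, best = [], 0.0
    n_genes = len(profiles[0])
    while len(chosen) < max_genes:
        candidates = [g for g in range(n_genes) if g not in chosen]
        if not candidates:
            break
        # Ties are broken in favor of the lowest gene index.
        scored = [(score(profiles, labels, chosen + [g]), -g)
                  for g in candidates]
        top_score, neg_gene = max(scored)
        if top_score <= best:
            break  # no remaining gene improves the fit
        chosen.append(-neg_gene)
        best = top_score
    return chosen, best
```

Because the loop never revisits an earlier choice, it can miss a pair of genes that is only informative jointly, which is one reason to also consider the sparse supervised learning techniques in step 3(b).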
bshanks@0 | 421 Apply these algorithms to the cortex
|
bshanks@0 | 422 1. Create open source format conversion tools: we will create tools to bulk
|
bshanks@0 | 423 download the ABA dataset and to convert between SEV, NIFTI and MAT-
|
bshanks@0 | 424 LAB formats.
|
bshanks@0 | 425 2. Flatmap the ABA cortex data: map the ABA data onto a plane and draw
|
bshanks@0 | 426 the cortical area boundaries onto it.
|
bshanks@0 | 427 3. Find layer boundaries: cluster similar voxels together in order to auto-
|
bshanks@0 | 428 matically find the cortical layer boundaries.
|
bshanks@0 | 429 4. Run the procedures that we developed on the cortex: we will present, for
|
bshanks@0 | 430 each area, a short list of markers to identify that area; and we will also
|
bshanks@0 | 433 present lists of “panels” of genes that can be used to delineate many areas
|
bshanks@0 | 434 at once.
|
bshanks@0 | 435 Develop algorithms to suggest a division of a structure into anatom-
|
bshanks@0 | 436 ical parts
|
bshanks@0 | 437 1. Explore dimensionality reduction algorithms applied to pixels: including
|
bshanks@0 | 438 TODO
|
bshanks@0 | 439 2. Explore dimensionality reduction algorithms applied to genes: including
|
bshanks@0 | 440 TODO
|
bshanks@0 | 441 3. Explore clustering algorithms applied to pixels: including TODO
|
bshanks@0 | 442 4. Explore clustering algorithms applied to genes: including gene shaving,
|
bshanks@0 | 443 TODO
|
bshanks@0 | 444 5. Develop an algorithm to use dimensionality reduction and/or hierarchical
|
bshanks@0 | 445 clustering to create anatomical maps
|
bshanks@0 | 446 6. Run this algorithm on the cortex: present a hierarchical, genoarchitectonic
|
bshanks@0 | 447 map of the cortex
|
bshanks@0 | 448 gradient similarity is calculated as:
|
bshanks@0 | 449 $$\sum_{\text{pixels}} \cos\left(\left|\angle\nabla_1 - \angle\nabla_2\right|\right) \cdot \frac{|\nabla_1| + |\nabla_2|}{2} \cdot \frac{\text{pixel\_value}_1 + \text{pixel\_value}_2}{2}$$
|
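The gradient similarity measure can be sketched directly from its definition: at each pixel, take the cosine of the angle between the two images' gradients, weight it by the mean gradient magnitude and by the mean pixel value, and sum over pixels. In the sketch below, the use of central differences and the restriction to interior pixels are our own assumptions, since the note above does not say how the gradients are estimated.

```python
import math

def gradient(img, r, c):
    """Central-difference gradient (d/d_row, d/d_col) at an interior pixel."""
    return ((img[r + 1][c] - img[r - 1][c]) / 2.0,
            (img[r][c + 1] - img[r][c - 1]) / 2.0)

def gradient_similarity(img1, img2):
    """Sum over interior pixels of
    cos(|angle1 - angle2|) * (|grad1| + |grad2|) / 2 * (value1 + value2) / 2.

    img1, img2: equally sized 2-D lists of floats (rows of pixel values).
    """
    total = 0.0
    for r in range(1, len(img1) - 1):
        for c in range(1, len(img1[0]) - 1):
            g1, g2 = gradient(img1, r, c), gradient(img2, r, c)
            mag1, mag2 = math.hypot(*g1), math.hypot(*g2)
            ang1 = math.atan2(g1[1], g1[0])
            ang2 = math.atan2(g2[1], g2[0])
            mean_mag = (mag1 + mag2) / 2.0
            mean_val = (img1[r][c] + img2[r][c]) / 2.0
            total += math.cos(abs(ang1 - ang2)) * mean_mag * mean_val
    return total
```

With this definition, two images whose gradients point the same way score positively and two images whose gradients are opposed score negatively. Where one gradient is exactly zero its angle is arbitrary (`math.atan2(0, 0)` returns 0), so such pixels may need special handling in practice.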
bshanks@0 | 453 (todo) Technically, we say that an anatomical structure has a fundamen-
|
bshanks@0 | 454 tally 2-D organization when there exists a commonly used, generic, anatomical
|
bshanks@0 | 455 structure-preserving map from 3-D space to a 2-D manifold.
|
bshanks@0 | 456 Related work:
|
bshanks@0 | 457 The Allen Brain Institute has developed an interactive web interface called
|
bshanks@0 | 458 AGEA which allows an investigator to (1) calculate lists of genes which are se-
|
bshanks@0 | 459 lectively overexpressed in certain anatomical regions (ABA calls this the “Gene
|
bshanks@0 | 460 Finder” function), (2) visualize the correlation between the genetic profiles of
|
bshanks@0 | 461 voxels in the dataset, and (3) visualize a hierarchical clustering of voxels in
|
bshanks@0 | 462 the dataset [?]. AGEA is an impressive and useful tool; however, it does not
|
bshanks@0 | 463 solve the same problems that we propose to solve with this project.
|
bshanks@0 | 464 First we describe AGEA’s “Gene Finder”, and then compare it to our pro-
|
bshanks@0 | 465 posed method for finding marker genes. AGEA’s Gene Finder first asks the
|
bshanks@0 | 466 investigator to select a single “seed voxel” of interest. It then uses a clustering
|
bshanks@0 | 467 method, combined with built-in knowledge of major anatomical structures, to
|
bshanks@0 | 468 select two sets of voxels: an “ROI” and a “comparator region”*. The seed voxel
|
bshanks@0 | 469 is always contained within the ROI, and the ROI is always contained within the
|
bshanks@0 | 470 comparator region. The comparator region is similar but not identical to the
|
bshanks@0 | 471 set of voxels making up the major anatomical region containing the ROI. Gene
|
bshanks@0 | 472 Finder then looks for genes which can distinguish the ROI from the comparator
|
bshanks@0 | 473 region. Specifically, it finds genes for which the ratio (expression energy in the
|
bshanks@0 | 474 ROI) / (expression energy in the comparator region) is high.
|
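The ratio just described can be written out as follows. This is a sketch of the idea only, not of AGEA's implementation: we assume that the "expression energy" of a region is summarized by its mean over voxels, and the function names and the dictionary-based data layout are our own.

```python
def expression_energy_ratio(energy, roi, comparator):
    """(mean energy in the ROI) / (mean energy in the comparator region).

    energy: dict mapping voxel id -> expression energy for one gene.
    roi, comparator: sets of voxel ids; the ROI is a subset of the comparator.
    """
    roi_mean = sum(energy[v] for v in roi) / len(roi)
    comp_mean = sum(energy[v] for v in comparator) / len(comparator)
    if comp_mean > 0:
        return roi_mean / comp_mean
    # No expression anywhere in the comparator region.
    return float("inf") if roi_mean > 0 else 0.0

def rank_genes_by_ratio(energies, roi, comparator):
    """Gene names sorted from highest to lowest expression energy ratio.

    energies: dict mapping gene name -> per-voxel energy dict.
    """
    return sorted(
        energies,
        key=lambda g: expression_energy_ratio(energies[g], roi, comparator),
        reverse=True,
    )
```

Under this measure, a gene that is expressed everywhere in the comparator region except the ROI receives a ratio of zero and is ranked last, even though its absence picks out the ROI exactly.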
bshanks@0 | 477 Informally, the Gene Finder first infers an ROI based on clustering the seed
|
bshanks@0 | 478 voxel with other voxels. Then, the Gene Finder finds genes which overexpress
|
bshanks@0 | 479 in the ROI as compared to other voxels in the major anatomical region.
|
bshanks@0 | 480 There are three major differences between our approach and Gene Finder.
|
bshanks@0 | 481 First, Gene Finder focuses on individual genes and individual ROIs in isola-
|
bshanks@0 | 482 tion. This is great for regions which can be picked out from all other regions by a
|
bshanks@0 | 483 single gene, but not all of them can (todo). There are at least two ways this can
|
bshanks@0 | 484 miss out on useful genes. First, a gene might express in part of a region, but not
|
bshanks@0 | 485 throughout the whole region, but there may be another gene which expresses
|
bshanks@0 | 486 in the rest of the region**. Second, a gene might express in a region, but not in
|
bshanks@0 | 487 any of its neighbors, but it might express also in other non-neighboring regions.
|
bshanks@0 | 488 To take advantage of these types of genes, we propose to find combinations of
|
bshanks@0 | 489 genes which, together, can identify the boundaries of all subregions within the
|
bshanks@0 | 490 containing region.
|
bshanks@0 | 491 Second, Gene Finder uses a pointwise metric, namely expression energy ratio,
|
bshanks@0 | 492 to decide whether a gene is good for picking out a region. We have found better
|
bshanks@0 | 493 results by using metrics which take into account not just single voxels, but also
|
bshanks@0 | 494 the local geometry of neighboring voxels, such as the local gradient (todo). In
|
bshanks@0 | 495 addition, we have found that often the absence of gene expression can be used
|
bshanks@0 | 496 as a marker, which will not be caught by Gene Finder’s expression energy ratio
|
bshanks@0 | 497 (todo).
|
bshanks@0 | 498 Third, Gene Finder chooses the ROI based only on the seed voxel. This
|
bshanks@0 | 499 often does not permit the user to query the ROI that they are interested in. For
|
bshanks@0 | 500 example, in all of our tests of Gene Finder in cortex, the ROIs chosen tend to
|
bshanks@0 | 501 be cortical layers, rather than cortical areas.
|
bshanks@0 | 502 In summary, when Gene Finder picks the ROI that you want, and when this
|
bshanks@0 | 503 ROI can be easily picked out from neighboring regions by single genes which
|
bshanks@0 | 504 selectively overexpress in the ROI compared to the entire major anatomical re-
|
bshanks@0 | 505 gion, Gene Finder will work. However, Gene Finder will not pick cortical areas
|
bshanks@0 | 506 as ROIs, and even if it could, many cortical areas cannot be uniquely picked out
|
bshanks@0 | 507 by the overexpression of any single gene. By contrast, we will target cortical
|
bshanks@0 | 508 areas, we will explore a variety of metrics which can complement the shortcom-
|
bshanks@0 | 509 ings of expression energy ratio, and we will use the combinatorial expression of
|
bshanks@0 | 510 genes to pick out cortical areas even when no individual gene will do.
|
bshanks@0 | 511 * The terms “ROI” and “comparator region” are our own; the ABI calls
|
bshanks@0 | 512 them the “local region” and the “larger anatomical context”. The ABI uses the
|
bshanks@0 | 513 term “specificity comparator” to mean the major anatomic region containing
|
bshanks@0 | 514 the ROI, which is not exactly identical to the comparator region.
|
bshanks@0 | 515 ** In this case, the union of the area of expression of the two genes would
|
bshanks@0 | 516 suffice; one could also imagine that there could be situations in which the in-
|
bshanks@0 | 517 tersection of multiple genes would be needed, or a combination of unions and
|
bshanks@0 | 518 intersections.
|
bshanks@0 | 519 Now we describe AGEA’s hierarchical clustering, and compare it to our pro-
|
bshanks@0 | 520 posal. The goal of AGEA’s hierarchical clustering is to generate a binary tree of
|
bshanks@0 | 521 clusters, where a cluster is a collection of voxels. AGEA begins by computing
|
bshanks@0 | 522 the Pearson correlation between each pair of voxels. They then employ a recur-
|
bshanks@0 | 525 sive divisive (top-down) hierarchical clustering procedure on the voxels, which
|
bshanks@0 | 526 means that they start with all of the voxels, and then they divide them into clus-
|
bshanks@0 | 527 ters, and then within each cluster, they divide that cluster into smaller clusters,
|
bshanks@0 | 528 etc***. At each step, the collection of voxels is partitioned into two smaller
|
bshanks@0 | 529 clusters in a way that maximizes the following quantity: average correlation
|
bshanks@0 | 530 between all possible pairs of voxels containing one voxel from each cluster.
|
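The quantity named above, the average correlation over all pairs of voxels that take one voxel from each cluster, can be computed for any candidate split as sketched below. The sketch only evaluates the quantity and enumerates splits by brute force for very small inputs; it makes no claim about how AGEA searches over splits, and the function names are our own.

```python
import math
from itertools import combinations

def pearson(u, v):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv) if su > 0 and sv > 0 else 0.0

def mean_cross_correlation(profiles, cluster_a, cluster_b):
    """Average correlation over all pairs with one voxel from each cluster.

    profiles: list of per-voxel gene expression profiles.
    cluster_a, cluster_b: non-empty lists of voxel indices into `profiles`.
    """
    pairs = [(i, j) for i in cluster_a for j in cluster_b]
    return sum(pearson(profiles[i], profiles[j]) for i, j in pairs) / len(pairs)

def score_bipartitions(profiles):
    """Score every split of the voxels into two non-empty clusters.

    Voxel 0 is always placed in the first cluster so that each split is
    listed once. Exponential in the number of voxels: illustration only.
    """
    n = len(profiles)
    scored = []
    for k in range(n - 1):
        for rest in combinations(range(1, n), k):
            cluster_a = [0, *rest]
            cluster_b = [i for i in range(n) if i not in cluster_a]
            score = mean_cross_correlation(profiles, cluster_a, cluster_b)
            scored.append((score, cluster_a, cluster_b))
    return scored
```

For n voxels there are 2^(n-1) - 1 distinct splits, so any practical implementation needs a heuristic search or subsampling rather than enumeration.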
bshanks@0 | 531 There are three major differences between our approach and AGEA’s hier-
|
bshanks@0 | 532 archical clustering. First, AGEA’s clustering method separates cortical layers
|
bshanks@0 | 533 before it separates cortical areas.
|
bshanks@0 | 534 following procedure is used for the purpose of dividing a collection of voxels
|
bshanks@0 | 535 into smaller clusters: partition the voxels into two sets, such that the following
|
bshanks@0 | 536 quantity is maximized:
|
bshanks@0 | 537 *** depending on which level of the tree is being created, the voxels are
|
bshanks@0 | 538 subsampled in order to save time
|
bshanks@0 | 539 does not allow the user to input anything other than a seed voxel; this means
|
bshanks@0 | 540 that for each seed voxel, there is only one
|
bshanks@0 | 541 The role of the “local region” is to serve as a region of interest for which
|
bshanks@0 | 542 marker genes are desired; the role of the “larger anatomical context” is to be
|
bshanks@0 | 543 the structure
|
bshanks@0 | 544 There are two kinds of differences between AGEA and our project; differ-
|
bshanks@0 | 545 ences that relate to the treatment of the cortex, and differences in the type of
|
bshanks@0 | 546 generalizable methods being developed. As relates
|
bshanks@0 | 547 indicate an ROI
|
bshanks@0 | 558 —
|
bshanks@0 | 559 The Allen Brain Institute has developed interactive tools called AGEA which
|
bshanks@0 | 560 allow an investigator to explore simple correlation-based relationships between
|
bshanks@0 | 561 voxels, genes, and clusters of voxels. There have not yet been any studies
|
bshanks@0 | 562 which describe the results of applying AGEA to the cerebral cortex; however,
|
bshanks@0 | 563 we suspect that the AGEA metrics are not optimal for the task of relating
|
bshanks@0 | 564 genes to cortical areas. A voxel’s gene expression profile depends upon both
|
bshanks@0 | 565 its cortical area and its cortical layer; however, AGEA has no mechanism to
|
bshanks@0 | 566 distinguish these two. As a result, voxels in the same layer but different areas
|
bshanks@0 | 567 are often clustered together by AGEA. As part of the project, we will compare
|
bshanks@0 | 568 the performance of our techniques against AGEA’s.
|
bshanks@0 | 569 Another difference between our techniques and AGEA’s is that AGEA allows
|
bshanks@0 | 570 the user to enter only a voxel location, and then to either explore the rest of
|
bshanks@0 | 573 the brain’s relationship to that particular voxel, or explore a partitioning of
|
bshanks@0 | 574 the brain based on pairwise voxel correlation. If the user is interested not in a
|
bshanks@0 | 575 single voxel, but rather an entire anatomical structure, AGEA will only succeed
|
bshanks@0 | 576 to the extent that the selected voxel is a typical representative of the structure.
|
bshanks@0 | 577 As discussed in the previous paragraph, this poses problems for structures like
|
bshanks@0 | 578 cortical areas, which (because of their division into cortical layers) do not have
|
bshanks@0 | 579 a single “typical representative”.
|
bshanks@0 | 580 By contrast, in our system, the user will start by selecting, not a single voxel,
|
bshanks@0 | 581 but rather, an anatomical superstructure to be divided into pieces (for example,
|
bshanks@0 | 582 the cerebral cortex). We expect that our methods will take into account not
|
bshanks@0 | 583 just pairwise statistics between voxels, but also large-scale geometric features
|
bshanks@0 | 584 (for example, the rapidity of change in gene expression as regional boundaries
|
bshanks@0 | 585 are crossed) which optimize the discriminability of regions within the selected
|
bshanks@0 | 586 superstructure.
|
bshanks@0 | 587 —–
|
bshanks@0 | 588 screen for combinations of marker genes which selectively target anatom-
|
bshanks@0 | 589 ical structures pick delineate the boundaries between neighboring anatomical
|
bshanks@0 | 590 structures. (b) techniques to screen for marker genes which pick out anatomical
|
bshanks@0 | 591 structures of interest
|
bshanks@0 | 592 , techniques which: (a) screen for marker genes , and (b) suggest new
|
bshanks@0 | 593 anatomical maps based on
|
bshanks@0 | 594 whose expression partitions the region of interest into its anatomical sub-
|
bshanks@0 | 595 structures, and (b) use the natural contours of gene expression to suggest new
|
bshanks@0 | 596 ways of dividing an organ into
|
bshanks@0 | 597 The Allen Brain Atlas
|
bshanks@0 | 598 –
|
bshanks@0 | 599 to: brooksl@mail.nih.gov
|
bshanks@0 | 600 Hi, I’m writing to confirm the applicability of a potential research project to
|
bshanks@0 | 601 the challenge grant topic “New computational and statistical methods for the
|
bshanks@0 | 602 analysis of large data sets from next-generation sequencing technologies”.
|
bshanks@0 | 603 We want to develop methods for the analysis of gene expression datasets that
|
bshanks@0 | 604 can be used to uncover the relationships between gene expression and anatomical
|
bshanks@0 | 605 regions. Specifically, we want to develop techniques to (a) given a set of known
|
bshanks@0 | 606 anatomical areas, identify genetic markers for each of these areas, and (b) given
|
bshanks@0 | 607 an anatomical structure whose substructure is unknown, suggest a map, that
|
bshanks@0 | 608 is, a division of the space into anatomical sub-structures, that represents the
|
bshanks@0 | 609 boundaries inherent in the gene expression data.
|
bshanks@0 | 610 We propose to develop our techniques on the Allen Brain Atlas mouse brain
|
bshanks@0 | 611 gene expression dataset by finding genetic markers for anatomical areas within
|
bshanks@0 | 612 the cerebral cortex. The Allen Brain Atlas contains a registered 3-D map of
|
bshanks@0 | 613 gene expression data with 200-micron voxel resolution which was created from
|
bshanks@0 | 614 in situ hybridization data. The dataset contains about 4000 genes which are
|
bshanks@0 | 615 available at this resolution across the entire cerebral cortex.
|
bshanks@0 | 616 Despite the distinct roles of different cortical areas in both normal function-
|
bshanks@0 | 617 ing and disease processes, there are no known marker genes for many cortical
|
bshanks@0 | 618 areas. This project will be immediately useful for both drug discovery and clini-
|
bshanks@0 | 621 cal research because once the markers are known, interventions can be designed
|
bshanks@0 | 622 which selectively target specific cortical areas.
|
bshanks@0 | 623 The techniques we develop will be useful because they will be applicable to
|
bshanks@0 | 624 the analysis of other anatomical areas, both in terms of finding marker genes
|
bshanks@0 | 625 for known areas, and in terms of suggesting new anatomical subdivisions that
|
bshanks@0 | 626 are based upon the gene expression data.
|
bshanks@6 | 627 _______________________________
|
bshanks@6 | 628 It is likely that our study, by showing which areal divisions naturally fol-
|
bshanks@6 | 629 low from gene expression data, as opposed to traditional histological data, will
|
bshanks@6 | 630 contribute to the creation of
|
bshanks@6 | 631 there are clear genetic or chemical markers known for only a few cortical
|
bshanks@6 | 632 areas. This makes it difficult to target drugs to specific
|
bshanks@6 | 633 As part of aims (1) and (5), we will discover sets of marker genes that pick
|
bshanks@6 | 634 out specific cortical areas. This will allow the development of drugs and other
|
bshanks@6 | 635 interventions that selectively target individual cortical areas. As part of aims
|
bshanks@6 | 636 (2) and (5), we will also discover small panels of marker genes that can be used
|
bshanks@6 | 637 to delineate most of the cortical areal map.
|
bshanks@6 | 638 With aims (2) and (4), we
|
bshanks@6 | 639 There are five principles
|
bshanks@6 | 640 In addition to validating the usefulness of the algorithms, the application of
|
bshanks@6 | 641 these methods to cerebral cortex will produce immediate benefits that are only
|
bshanks@6 | 642 one step removed from clinical application.
|
bshanks@6 | 643 todo: remember to check gensat, etc for validation (mention bias/variance)
|
bshanks@6 | 644 Why it is useful to apply these methods to cortex
|
bshanks@6 | 645 There is still room for debate as to exactly how the cortex should be parcellated
|
bshanks@6 | 646 into areas.
|
bshanks@6 | 647 The best way to divide up rodent cortex into areas has not been completely
|
bshanks@6 | 648 determined,
|
bshanks@6 | 649 not yet been accounted for in
|
bshanks@6 | 650 that the expression of some genes will contain novel spatial patterns which
|
bshanks@6 | 651 are not account
|
bshanks@6 | 652 that a genoarchitectonic map
|
bshanks@6 | 653 This principle is only applicable to aim 1 (marker genes). For aim 2 (partition
|
bshanks@6 | 654 a structure into anatomical subregions), we plan to work with many genes at
|
bshanks@6 | 655 once.
|
bshanks@6 | 656 todo: aim 2 b+s?
|
bshanks@6 | 657 Principle 5: Interoperate with existing tools
|
bshanks@6 | 658 In order for our software to be as useful as possible for our users, it will be
|
bshanks@6 | 659 able to import and export data to standard formats so that users can use our
|
bshanks@6 | 660 software in tandem with other software tools created by other teams. We will
|
bshanks@6 | 661 support the following formats: NIFTI (Neuroimaging Informatics Technology
|
bshanks@8 | 664 Initiative), SEV (Allen Brain Institute Smoothed Energy Volume), and MAT-
|
bshanks@8 | 665 LAB. This ensures that our users will not have to exclusively rely on our tools
|
bshanks@8 | 666 when analyzing data. For example, users will be able to use the data visualiza-
|
bshanks@8 | 667 tion and analysis capabilities of MATLAB and Caret alongside our software.
|
bshanks@0 | 668 To our knowledge, there is no currently available software to convert between
|
bshanks@0 | 669 these formats, so we will also provide a format conversion tool. This may be
|
bshanks@0 | 670 useful even for groups that don’t use any of our other software.
|
bshanks@0 | 671 todo: is “marker gene” even a phrase that we should use at all?
|
bshanks@0 | 672 note for aim 1 apps: combo of genes is for voxel, not within any single cell
|
bshanks@0 | 673 , as when genetic markers allow the development of selective interventions;
|
bshanks@0 | 674 the reason that one can be confident that the intervention is selective is that it
|
bshanks@0 | 675 is only turned on when a certain combination of genes is turned on and off. The
|
bshanks@0 | 676 result procedure is what assures us that when that combination is present, the
|
bshanks@0 | 677 local tissue is probably part of a certain subregion.
|
bshanks@0 | 678 The basic idea is that we want to find a procedure by
|
bshanks@0 | 719 it is helpful for the classifier to look at the global “shape” of gene expression
|
bshanks@0 | 720 patterns over the whole structure, rather than just nearby voxels.
|
bshanks@0 | 721 there is some small bit of additional information that can be gleaned from
|
bshanks@0 | 722 knowing the
|
bshanks@0 | 723 Design choices for a supervised learning procedure
|
bshanks@0 | 724 After all,
|
bshanks@0 | 725 there is a small correlation between the gene expression levels from distant
|
bshanks@0 | 726 voxels and
|
bshanks@0 | 727 Depending on how we intend to use the classifier, we may want to design it
|
bshanks@0 | 728 so that
|
bshanks@0 | 729 It is possible for many things to
|
bshanks@0 | 730 The choice of which data is made part of an instance
|
bshanks@0 | 731 what we seek is a procedure
|
bshanks@0 | 732 partition the tissue sample into subregions.
|
bshanks@0 | 733 each part of the anatomical structure
|
bshanks@0 | 734 must be One way to rephrase this task is to say that, instead of searching
|
bshanks@0 | 735 for the location of the subregions, we are looking to partition the tissue sample
|
bshanks@0 | 736 into subregions.
|
bshanks@0 | 737 There are various ways to look for marker genes. We will define some terms,
|
bshanks@0 | 738 and along the way we will describe a few design choices encountered in the
|
bshanks@0 | 739 process of creating a marker gene finding method, and then we will present four
|
bshanks@0 | 740 principles that describe which options we have chosen.
|
bshanks@0 | 741 In developing a procedure for finding marker genes, we are developing a
|
bshanks@0 | 742 procedure that takes a dataset of experimental observations and produces a
|
bshanks@0 | 743 result. One can think of the result as merely a list of genes, but really the result
|
bshanks@0 | 744 is an understanding of a predictive relationship between, on the one hand, the
|
bshanks@0 | 745 expression levels of genes, and, on the other hand, anatomical subregions.
|
bshanks@0 | 746 One way to more formally define this understanding is to look at it as a
|
bshanks@0 | 747 procedure. In this view, the result of the learning procedure is itself a procedure.
|
bshanks@0 | 748 The result procedure provides a way to use the gene expression profiles of voxels
|
bshanks@0 | 749 in a tissue sample in order to determine where the subregions are.
|
bshanks@0 | 750 This result procedure can be used directly, as when an experimenter has
|
bshanks@0 | 751 a tissue sample and needs to know what subregions are present in it, and,
|
bshanks@0 | 752 if multiple subregions are present, where they each are. Or it can be used
|
bshanks@0 | 753 indirectly; imagine that the result procedure tells us that whenever a certain
|
bshanks@0 | 754 combination of genes is expressed, the local tissue is probably part of a certain
|
bshanks@0 | 757 subregion. This means that we can then confidently develop an intervention
|
bshanks@0 | 758 which is triggered only when that combination of genes is expressed; and to
|
bshanks@0 | 759 the extent that the result procedure is reliable, we know that the intervention
|
bshanks@0 | 760 will only be triggered in the target subregion.
|
bshanks@0 | 761 We said that the result procedure provides “a way to use the gene expression
|
bshanks@0 | 762 profiles of voxels in a tissue sample” in order to “determine where the subregions
|
bshanks@0 | 763 are”.
|
bshanks@0 | 764 Does the result procedure get as input all of the gene expression profiles
|
bshanks@0 | 765 of each voxel in the entire tissue sample, and produce as output all of the
|
bshanks@0 | 766 subregional boundaries all at once?
|
bshanks@0 | 767 Or are we given one voxel at a time,
|
bshanks@0 | 768 In the jargon of the field of machine learning, the result procedure is called
|
bshanks@0 | 769 a classifier.
|
bshanks@0 | 770 The task of finding genes that mark anatomical areas can be phrased in
|
bshanks@0 | 771 terms of what the field of machine learning calls a “supervised learning” task.
|
bshanks@0 | 772 The goal of this task is to learn a function (the “classifier”) which
|
bshanks@0 | 773 If a person knows a combination of genes that marks an area, that implies
|
bshanks@0 | 774 that the person can be told how strongly those genes express in any voxel, and
|
bshanks@0 | 775 the person can use this information to determine how
|
bshanks@0 | 776 finding how to infer the areal identity of a voxel if given the gene expression
|
bshanks@0 | 777 profile of that voxel.
|
bshanks@0 | 778 For each voxel in the cortex, we want to start with data about the gene
|
bshanks@0 | 779 expression
|
bshanks@0 | 780 single voxels, but rather groups of voxels, such that the groups can be placed
|
bshanks@0 | 781 in some 2-D space. We will call such instances “pixels”.
|
bshanks@0 | 782 We have been speaking as if instances necessarily correspond to single voxels.
|
bshanks@0 | 783 But it is possible for instances to be groupings of many voxels, in which case
|
bshanks@0 | 784 each grouping must be assigned the same label (that is, each voxel grouping
|
bshanks@0 | 785 must stay inside a single anatomical subregion).
|
bshanks@0 | 786 In some but not all cases, the groups are either rows or columns of voxels.
|
bshanks@0 | 787 This is the case with the cerebral cortex, in which one may assume that columns
|
bshanks@0 | 788 of voxels which run perpendicular to the cortical surface all share the same areal
|
bshanks@0 | 789 identity. In the cortex, we call such an instance a “surface pixel”, because such
|
bshanks@0 | 790 an instance represents the data associated with all voxels underneath a specific
|
bshanks@0 | 791 patch of the cortical surface.
|
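One simple way to build surface pixels is to pool the expression profiles of all voxels lying under the same patch of cortical surface. The sketch below assumes that the assignment of voxels to surface pixels is already known (the mapping `voxel_to_pixel` is hypothetical) and that a plain average is an acceptable way to pool the column; other summaries, such as keeping one profile per layer, would be equally consistent with the definition above.

```python
from collections import defaultdict

def surface_pixel_profiles(voxel_profiles, voxel_to_pixel):
    """Average the gene expression profiles of the voxels under each surface pixel.

    voxel_profiles: dict mapping voxel id -> list of expression levels (one per gene).
    voxel_to_pixel: dict mapping voxel id -> surface pixel id, i.e. which column
        perpendicular to the cortical surface the voxel belongs to.
    Returns a dict mapping surface pixel id -> averaged profile.
    """
    columns = defaultdict(list)
    for voxel, profile in voxel_profiles.items():
        columns[voxel_to_pixel[voxel]].append(profile)
    return {
        pixel: [sum(levels) / len(levels) for levels in zip(*profiles)]
        for pixel, profiles in columns.items()
    }
```

Each resulting instance carries a single areal label, which matches the assumption that all voxels in a column share the same areal identity.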