cg
diff grant.html @ 0:29eee29f9bc1
initial commit to hg version control repository
author | bshanks@bshanks-salk.dyndns.org |
---|---|
date | Sat Apr 11 19:12:32 2009 -0700 (16 years ago) |
parents | |
children | 7487ad7f5d8f |
line diff
1.1 --- /dev/null Thu Jan 01 00:00:00 1970 +0000
1.2 +++ b/grant.html Sat Apr 11 19:12:32 2009 -0700
1.3 @@ -0,0 +1,790 @@
1.4 +Specific aims
1.5 + Massive new datasets obtained with techniques such as in situ hybridization
1.6 + (ISH) and BAC-transgenics allow the expression levels of many genes at many
1.7 + locations to be compared. Our goal is to develop automated methods to relate
1.8 + spatial variation in gene expression to anatomy. We want to find marker genes
1.9 + for specific anatomical regions, and also to draw new anatomical maps based on
1.10 + gene expression patterns. We have three specific aims:
1.11 + (1) develop an algorithm to screen spatial gene expression data for combina-
1.12 + tions of marker genes which selectively target anatomical regions
1.13 + (2) develop an algorithm to suggest new ways of carving up a structure into
1.14 + anatomical subregions, based on spatial patterns in gene expression
1.15 + (3) create a 2-D “flat map” dataset of the mouse cerebral cortex that contains
1.16 + a flattened version of the Allen Mouse Brain Atlas ISH data, as well as
1.17 + the boundaries of cortical anatomical areas. Use this dataset to validate
1.18 + the methods developed in (1) and (2).
1.19 + In addition to validating the usefulness of the algorithms, the application of
1.20 + these methods to cerebral cortex will produce immediate benefits, because there
1.21 + are currently no known genetic markers for many cortical areas. The results
1.22 + of the project will support the development of new ways to selectively target
1.23 + cortical areas, and it will support the development of a method for identifying
1.24 + the cortical areal boundaries present in small tissue samples.
1.25 + All algorithms that we develop will be implemented in an open-source soft-
1.26 + ware toolkit. The toolkit, as well as the machine-readable datasets developed
1.27 + in aim (3), will be published and freely available for others to use.
1.28 + Background and significance
1.29 + Aim 1
1.30 + Machine learning terminology
1.31 + The task of looking for marker genes for anatomical subregions means that one
1.32 + is looking for a set of genes such that, if the expression level of those genes is
1.33 + known, then the locations of the subregions can be inferred.
1.34 + If we define the subregions so that they cover the entire anatomical structure
1.35 + to be divided, then instead of saying that we are using gene expression to find
1.36 + the locations of the subregions, we may say that we are using gene expression to
1.37 + determine to which subregion each voxel within the structure belongs. We call
1.38 + this a classification task, because each voxel is being assigned to a class (namely,
1.39 + its subregion).
1.40 + Therefore, an understanding of the relationship between the combination of
1.41 + their expression levels and the locations of the subregions may be expressed as
1.42 + 1
1.43 +
1.44 + a function. The input to this function is a voxel, along with the gene expression
1.45 + levels within that voxel; the output is the subregional identity of the target
1.46 + voxel, that is, the subregion to which the target voxel belongs. We call this
1.47 + function a classifier. In general, the input to a classifier is called an instance,
1.48 + and the output is called a label.
1.49 + The object of aim 1 is not to produce a single classifier, but rather to develop
1.50 + an automated method for determining a classifier for any known anatomical
1.51 + structure. Therefore, we seek a procedure by which a gene expression dataset
1.52 + may be analyzed in concert with an anatomical atlas in order to produce a
1.53 + classifier. Such a procedure is a type of a machine learning procedure. The
1.54 + construction of the classifier is called training (also learning), and the initial
1.55 + gene expression dataset used in the construction of the classifier is called training
1.56 + data.
1.57 + In the machine learning literature, this sort of procedure may be thought
1.58 + of as a supervised learning task, defined as a task in whcih the goal is to learn
1.59 + a mapping from instances to labels, and the training data consists of a set of
1.60 + instances (voxels) for which the labels (subregions) are known.
1.61 + Each gene expression level is called a feature, and the selection of which
1.62 + genes to include is called feature selection. Feature selection is one component
1.63 + of the task of learning a classifier. Some methods for learning classifiers start
1.64 + out with a separate feature selection phase, whereas other methods combine
1.65 + feature selection with other aspects of training.
1.66 + One class of feature selection methods assigns some sort of score to each
1.67 + candidate gene. The top-ranked genes are then chosen. Some scoring measures
1.68 + can assign a score to a set of selected genes, not just to a single gene; in this
1.69 + case, a dynamic procedure may be used in which features are added and sub-
1.70 + tracted from the selected set depending on how much they raise the score. Such
1.71 + procedures are called “stepwise” or “greedy”.
1.72 + Although the classifier itself may only look at the gene expression data within
1.73 + each voxel before classifying that voxel, the learning algorithm which constructs
1.74 + the classifier may look over the entire dataset. We can categorize score-based
1.75 + feature selection methods depending on how the score of calculated. Often
1.76 + the score calculation consists of assigning a sub-score to each voxel, and then
1.77 + aggregating these sub-scores into a final score (the aggregation is often a sum or
1.78 + a sum of squares). If only information from nearby voxels is used to calculate a
1.79 + voxel’s sub-score, then we say it is a local scoring method. If only information
1.80 + from the voxel itself is used to calculate a voxel’s sub-score, then we say it is a
1.81 + pointwise scoring method.
1.82 + Key questions when choosing a learning method are: What are the instances?
1.83 + What are the features? How are the features chosen? Here are four principles
1.84 + that outline our answers to these questions.
1.85 + Principle 1: Combinatorial gene expression
1.86 + Above, we defined an “instance” as the combination of a voxel with the “asso-
1.87 + ciated gene expression data”. In our case this refers to the expression level of
1.88 + 2
1.89 +
1.90 + genes within the voxel, but should we include the expression levels of all genes,
1.91 + or only a few of them?
1.92 + It is too much to hope that every anatomical region of interest will be iden-
1.93 + tified by a single gene. For example, in the cortex, there are some areas which
1.94 + are not clearly delineated by any gene included in the Allen Brain Atlas (ABA)
1.95 + dataset. However, at least some of these areas can be delineated by looking
1.96 + at combinations of genes (an example of an area for which multiple genes are
1.97 + necessary and sufficient is provided in Preliminary Results).
1.98 + Principle 2: Only look at combinations of small numbers of genes
1.99 + When the classifier classifies a voxel, it is only allowed to look at the expression of
1.100 + the genes which have been selected as features. The more data that is available
1.101 + to a classifier, the better that it can do. For example, perhaps there are weak
1.102 + correlations over many genes that add up to a strong signal. So, why not include
1.103 + every gene as a feature? The reason is that we wish to employ the classifier in
1.104 + situations in which it is not feasible to gather data about every gene. For
1.105 + example, if we want to use the expression of marker genes as a trigger for some
1.106 + regionally-targeted intervention, then our intervention must contain a molecular
1.107 + mechanism to check the expression level of each marker gene before it triggers.
1.108 + It is currently infeasible to design a molecular trigger that checks the level of
1.109 + more than a handful of genes. Similarly, if the goal is to develop a procedure to
1.110 + do ISH on tissue samples in order to label their anatomy, then it is infeasible
1.111 + to label more than a few genes. Therefore, we must select only a few genes as
1.112 + features.
1.113 + Principle 3: Use geometry in feature selection
1.114 + When doing feature selection with score-based methods, the simplest thing to
1.115 + do would be to score the performance of each voxel by itself and then combine
1.116 + these scores; this is pointwise scoring. A more powerful approach is to also use
1.117 + information about the geometric relations between each voxel and its neighbors;
1.118 + this requires non-pointwise, local scoring methods. See Preliminary Results for
1.119 + evidence of the complementary nature of pointwise and local scoring methods.
1.120 + Principle 4: Work in 2-D whenever possible
1.121 + There are many anatomical structures which are commonly characterized in
1.122 + terms of a two-dimensional manifold. When it is known that the structure that
1.123 + one is looking for is two-dimensional, the results may be improved by allowing
1.124 + the analysis algorithm to take advantage of this prior knowledge. In addition,
1.125 + it is easier for humans to visualize and work with 2-D data.
1.126 + Therefore, when possible, the instances should represent pixels, not voxels.
1.127 + 3
1.128 +
1.129 + Aim 3
1.130 + Background
1.131 + The cortex is divided into areas and layers. To a first approximation, the par-
1.132 + cellation of the cortex into areas can be drawn as a 2-D map on the surface
1.133 + of the cortex. In the third dimension, the boundaries between the areas con-
1.134 + tinue downwards into the cortical depth, perpendicular to the surface. The layer
1.135 + boundaries run parallel to the surface. One can picture an area of the cortex as
1.136 + a slice of many-layered cake.
1.137 + Although it is known that different cortical areas have distinct roles in both
1.138 + normal functioning and in disease processes, there are no known marker genes
1.139 + for many cortical areas. When it is necessary to divide a tissue sample into
1.140 + cortical areas, this is a manual process that requires a skilled human to combine
1.141 + multiple visual cues and interpret them in the context of their approximate
1.142 + location upon the cortical surface.
1.143 + Even the questions of how many areas should be recognized in cortex, and
1.144 + what their arrangement is, are still not completely settled. A proposed division
1.145 + of the cortex into areas is called a cortical map. In the rodent, the lack of a
1.146 + single agreed-upon map can be seen by contrasting the recent maps given by
1.147 + Swanson?? on the one hand, and Paxinos and Franklin?? on the other. While
1.148 + the maps are certainly very similar in their general arrangement, significant
1.149 + differences remain in the details.
1.150 + Significance
1.151 + The method developed in aim (1) will be applied to each cortical area to find
1.152 + a set of marker genes such that the combinatorial expression pattern of those
1.153 + genes uniquely picks out the target area. Finding marker genes will be useful
1.154 + for drug discovery as well as for experimentation because marker genes can be
1.155 + used to design interventions which selectively target individual cortical areas.
1.156 + The application of the marker gene finding algorithm to the cortex will
1.157 + also support the development of new neuroanatomical methods. In addition to
1.158 + finding markers for each individual cortical areas, we will find a small panel
1.159 + of genes that can find many of the areal boundaries at once. This panel of
1.160 + marker genes will allow the development of an ISH protocol that will allow
1.161 + experimenters to more easily identify which anatomical areas are present in
1.162 + small samples of cortex.
1.163 + The method developed in aim (3) will provide a genoarchitectonic viewpoint
1.164 + that will contribute to the creation of a better map. The development of present-
1.165 + day cortical maps was driven by the application of histological stains. It is
1.166 + conceivable that if a different set of stains had been available which identified
1.167 + a different set of features, then the today’s cortical maps would have come out
1.168 + differently. Since the number of classes of stains is small compared to the number
1.169 + of genes, it is likely that there are many repeated, salient spatial patterns in
1.170 + the gene expression which have not yet been captured by any stain. Therefore,
1.171 + 4
1.172 +
1.173 + current ideas about cortical anatomy need to incorporate what we can learn
1.174 + from looking at the patterns of gene expression.
1.175 + While we do not here propose to analyze human gene expression data, it is
1.176 + conceivable that the methods we propose to develop could be used to suggest
1.177 + modifications to the human cortical map as well.
1.178 + Related work
1.179 + Preliminary work
1.180 + Justification of principles 1 thur 3
1.181 + Principle 1: Combinatorial gene expression
1.182 + Here we give an example of a cortical area which is not marked by any single
1.183 + gene, but which can be identified combinatorially. according to logistic regres-
1.184 + sion, gene wwc11 is the best fit single gene for predicting whether or not a pixel
1.185 + on the cortical surface belongs to the motor area (area MO). The upper-left
1.186 + picture in Figure shows wwc1’s spatial expression pattern over the cortex. The
1.187 + lower-right boundary of MO is represented reasonably well by this gene, however
1.188 + the gene overshoots the upper-left boundary. This flattened 2-D representation
1.189 + does not show it, but the area corresponding to the overshoot is the medial
1.190 + surface of the cortex. MO is only found on the lateral surface (todo).
1.191 + Gnee mtif22 is shown in figure the upper-right of Fig. . Mtif2 captures MO’s
1.192 + upper-left boundary, but not its lower-right boundary. Mtif2 does not express
1.193 + very much on the medial surface. By adding together the values at each pixel
1.194 + in these two figures, we get the lower-left of Figure . This combination captures
1.195 + area MO much better than any single gene.
1.196 + Principle 2: Only look at combinations of small numbers of genes
1.197 + In order to see how well one can do when looking at all genes at once, we ran
1.198 + a support vector machine to classify cortical surface pixels based on their gene
1.199 + expression profiles. We achieved classification accuracy of about 81%3. As noted
1.200 + above, however, a classifier that looks at all the genes at once isn’t practically
1.201 + useful.
1.202 + The requirement to find combinations of only a small number of genes limits
1.203 + us from straightforwardly applying many of the most simple techniques from
1.204 + the field of supervised machine learning. In the parlance of machine learning,
1.205 + our task combines feature selection with supervised learning.
1.206 +__________________________
1.207 + 1“WW, C2 and coiled-coil domain containing 1”; EntrezGene ID 211652
1.208 + 2“mitochondrial translational initiation factor 2”; EntrezGene ID 76784
1.209 + 3Using the Shogun SVM package (todo:cite), with parameters type=GMNPSVM (multi-
1.210 +class b-SVM), kernal = gaussian with sigma = 0.1, c = 10, epsilon = 1e-1 – these are the
1.211 +first parameters we tried, so presumably performance would improve with different choices of
1.212 +parameters. 5-fold cross-validation.
1.213 + 5
1.214 +
1.215 +
1.216 +
1.217 + Figure 1: Upper left: wwc1. Upper right: mtif2. Lower left: wwc1 + mtif2
1.218 + (each pixel’s value on the lower left is the sum of the corresponding pixels in
1.219 + the upper row). Within each picture, the vertical axis roughly corresponds to
1.220 + anterior at the top and posterior at the bottom, and the horizontal axis roughly
1.221 + corresponds to medial at the left and lateral at the right. The red outline is
1.222 + the boundary of region MO. Pixels are colored approximately according to the
1.223 + density of expressing cells underneath each pixel, with red meaning a lot of
1.224 + expression and blue meaning little.
1.225 + 6
1.226 +
1.227 +
1.228 +
1.229 + Figure 2: The top row shows the three genes which (individually) best predict
1.230 + area AUD, according to logistic regression. The bottom row shows the three
1.231 + genes which (individually) best match area AUD, according to gradient similar-
1.232 + ity. From left to right and top to bottom, the genes are Ssr1, Efcbp1, Aph1a,
1.233 + Ptk7, Aph1a again, and Lepr
1.234 + Principle 3: Use geometry
1.235 + To show that local geometry can provide useful information that cannot be
1.236 + detected via pointwise analyses, consider Fig. . The top row of Fig. displays
1.237 + the 3 genes which most match area AUD, according to a pointwise method4. The
1.238 + bottom row displays the 3 genes which most match AUD according to a method
1.239 + which considers local geometry5 The pointwise method in the top row identifies
1.240 + genes which express more strongly in AUD than outside of it; its weakness is that
1.241 + this includes many areas which don’t have a salient border matching the areal
1.242 + border. The geometric method identifies genes whose salient expression border
1.243 + seems to partially line up with the border of AUD; its weakness is that this
1.244 + includes genes which don’t express over the entire area. Genes which have high
1.245 + rankings using both pointwise and border criteria, such as Aph1a in the example,
1.246 + may be particularly good markers. None of these genes are, individually, a
1.247 + perfect marker for AUD; we deliberately chose a “difficult” area in order to
1.248 + better contrast pointwise with geometric methods.
1.249 +__________________________
1.250 + 4For each gene, a logistic regression in which the response variable was whether or not a
1.251 +surface pixel was within area AUD, and the predictor variable was the value of the expression
1.252 +of the gene underneath that pixel. The resulting scores were used to rank the genes in terms
1.253 +of how well they predict area AUD.
1.254 + 5For each gene the gradient similarity (see section ??) between (a) a map of the expression
1.255 +of each gene on the cortical surface and (b) the shape of area AUD, was calculated, and this
1.256 +was used to rank the genes.
1.257 + 7
1.258 +
1.259 + Principle 4: Work in 2-D whenever possible
1.260 + In anatomy, the manifold of interest is usually either defined by a combination
1.261 + of two relevant anatomical axes (todo), or by the surface of the structure (as is
1.262 + the case with the cortex). In the former case, the manifold of interest is a plane,
1.263 + but in the latter case it is curved. If the manifold is curved, there are various
1.264 + methods for mapping the manifold into a plane.
1.265 + The method that we will develop will begin by mapping the data into a
1.266 + 2-D plane. Although the manifold that characterized cortical areas is known
1.267 + to be the cortical surface, it remains to be seen which method of mapping the
1.268 + manifold into a plane is optimal for this application. We will compare mappings
1.269 + which attempt to preserve size (such as the one used by Caret??) with mappings
1.270 + which preserve angle (conformal maps).
1.271 + Although there is much 2-D organization in anatomy, there are also struc-
1.272 + tures whose shape is fundamentally 3-dimensional. If possible, we would like
1.273 + the method we develop to include a statistical test that warns the user if the
1.274 + assumption of 2-D structure seems to be wrong.
1.275 + ——
1.276 + Massive new datasets obtained with techniques such as in situ hybridization
1.277 + (ISH) and BAC-transgenics allow the expression levels of many genes at many
1.278 + locations to be compared. This can be used to find marker genes for specific
1.279 + anatomical structures, as well as to draw new anatomical maps. Our goal is
1.280 + to develop automated methods to relate spatial variation in gene expression to
1.281 + anatomy. We have five specific aims:
1.282 + (1) develop an algorithm to screen spatial gene expression data for combi-
1.283 + nations of marker genes which selectively target individual anatomical
1.284 + structures
1.285 + (2) develop an algorithm to screen spatial gene expression data for combina-
1.286 + tions of marker genes which can be used to delineate most of the bound-
1.287 + aries between a number of anatomical structures at once
1.288 + (3) develop an algorithm to suggest new ways of dividing a structure up into
1.289 + anatomical subregions, based on spatial patterns in gene expression
1.290 + (4) create a flat (2-D) map of the mouse cerebral cortex that contains a flat-
1.291 + tened version of the Allen Mouse Brain Atlas ISH dataset, as well as the
1.292 + boundaries of anatomical areas within the cortex. For each cortical layer,
1.293 + a layer-specific flat dataset will be created. A single combined flat dataset
1.294 + will be created which averages information from all of the layers. These
1.295 + datasets will be made available in both MATLAB and Caret formats.
1.296 + (5) validate the methods developed in (1), (2) and (3) by applying them to
1.297 + the cerebral cortex datasets created in (4)
1.298 + All algorithms that we develop will be implemented in an open-source soft-
1.299 + ware toolkit. The toolkit, as well as the machine-readable datasets developed in
1.300 + 8
1.301 +
1.302 + aim (4) and any other intermediate dataset we produce, will be published and
1.303 + freely available for others to use.
1.304 + In addition to developing generally useful methods, the application of these
1.305 + methods to cerebral cortex will produce immediate benefits that are only one
1.306 + step removed from clinical application, while also supporting the development
1.307 + of new neuroanatomical techniques. The method developed in aim (1) will be
1.308 + applied to each cortical area to find a set of marker genes. Currently, despite
1.309 + the distinct roles of different cortical areas in both normal functioning and
1.310 + disease processes, there are no known marker genes for many cortical areas.
1.311 + Finding marker genes will be immediately useful for drug discovery as well as for
1.312 + experimentation because once marker genes for an area are known, interventions
1.313 + can be designed which selectively target that area.
1.314 + The method developed in aim (2) will be used to find a small panel of genes
1.315 + that can find most of the boundaries between areas in the cortex. Today, finding
1.316 + cortical areal boundaries in a tissue sample is a manual process that requires a
1.317 + skilled human to combine multiple visual cues over a large area of the cortical
1.318 + surface. A panel of marker genes will allow the development of an ISH protocol
1.319 + that will allow experimenters to more easily identify which anatomical areas are
1.320 + present in small samples of cortex.
1.321 + For each cortical layer, a layer-specific flat dataset will be created. A single
1.322 + combined flat dataset will be created which averages information from all of
1.323 + the layers. These datasets will be made available in both MATLAB and Caret
1.324 + formats.
1.325 + —-
1.326 + New techniques allow the expression levels of many genes at many locations
1.327 + to be compared. It is thought that even neighboring anatomical structures have
1.328 + different gene expression profiles. We propose to develop automated methods
1.329 + to relate the spatial variation in gene expression to anatomy. We will develop
1.330 + two kinds of techniques:
1.331 + (a) techniques to screen for combinations of marker genes which selectively
1.332 + target anatomical structures
1.333 + (b) techniques to suggest new ways of dividing a structure up into anatomical
1.334 + subregions, based on the shapes of contours in the gene expression
1.335 + The first kind of technique will be helpful for finding marker genes associated
1.336 + with known anatomical features. The second kind of technique will be helpful in
1.337 + creating new anatomical maps, maps which reflect differences in gene expression
1.338 + the same way that existing maps reflect differences in histology.
1.339 + We intend to develop our techniques using the adult mouse cerebral cortex
1.340 + as a testbed. The Allen Brain Atlas has collected a dataset containing the
1.341 + expression level of about 4000 genes* over a set of over 150000 voxels, with a
1.342 + spatial resolution of approximately 200 microns[?].
1.343 + We expect to discover sets of marker genes that pick out specific cortical
1.344 + areas. This will allow the development of drugs and other interventions that
1.345 + selectively target individual cortical areas. Therefore our research will lead
1.346 + 9
1.347 +
1.348 + to application in drug discovery, in the development of other targeted clinical
1.349 + interventions, and in the development of new experimental techniques.
1.350 + The best way to divide up rodent cortex into areas has not been completely
1.351 + determined, as can be seen by the differences in the recent maps given by Swan-
1.352 + son on the one hand, and Paxinos and Franklin on the other. It is likely that our
1.353 + study, by showing which areal divisions naturally follow from gene expression
1.354 + data, as opposed to traditional histological data, will contribute to the creation
1.355 + of a better map. While we do not here propose to analyze human gene expres-
1.356 + sion data, it is conceivable that the methods we propose to develop could be
1.357 + used to suggest modifications to the human cortical map as well.
1.358 + In the following, we will only be talking about coronal data.
1.359 + The Allen Brain Atlas provides “Smoothed Energy Volumes”, which are
1.360 + One type of artifact in the Allen Brain Atlas data is what we call a “slice
1.361 + artifact”. We have noticed two types of slice artifacts in the dataset. The first
1.362 + type, a “missing slice artifact”, occurs when the ISH procedure on a slice did
1.363 + not come out well. In this case, the Allen Brain investigators excluded the slice
1.364 + at issue from the dataset. This means that no gene expression information is
1.365 + available for that gene for the region of space covered by that slice. This results
1.366 + in an expression level of zero being assigned to voxels covered by the slice. This
1.367 + is partially but not completely ameliorated by the smoothing that is applied to
1.368 + create the Smoothed Energy Volumes. The usual end result is that a region of
1.369 + space which is shaped and oriented like a coronal slice is marked as having less
1.370 + gene expression than surrounding regions.
1.371 + The second type of slice artifact is caused by the fact that all of the slices
1.372 + have a consistent orientation. Since there may be artifacts (such as how well
1.373 + the ISH worked) which are constant within each slice but which vary between
1.374 + different slices, the result is that ceteris paribus, when one compares the genetic
1.375 + data of a voxel to another voxel within the same coronal plane, one would expect
1.376 + to find more similarity than if one compared a voxel to another voxel displaced
1.377 + along the rostrocaudal axis.
1.378 + We are enthusiastic about the sharing of methods, data, and results, and
1.379 + at the conclusion of the project, we will make all of our data and computer
1.380 + source code publically available. Our goal is that replicating our results, or
1.381 + applying the methods we develop to other targets, will be quick and easy for
1.382 + other investigators. In order to aid in understanding and replicating our results,
1.383 + we intend to include a software program which, when run, will take as input
1.384 + the Allen Brain Atlas raw data, and produce as output all numbers and charts
1.385 + found in publications resulting from the project.
1.386 + To aid in the replication of our results, we will include a script which takes
1.387 + as input the dataset in aim (3) and provides as output all of the tables in figures
1.388 + in our publications .
1.389 + We also expect to weigh in on the debate about how to best partition rodent
1.390 + cortex
1.391 + be useful for drug discovery as well
1.392 + * Another 16000 genes are available, but they do not cover the entire cerebral
1.393 + cortex with high spatial resolution.
1.394 + 10
1.395 +
1.396 + User-definable ROIs Combinatorial gene expression Negative as well as pos-
1.397 + itive signal Use geometry Search for local boundaries if necessary Flatmapped
1.398 + Specific aims
1.399 + Develop algorithms that find genetic markers for anatomical regions
1.400 + 1. Develop scoring measures for evaluating how good individual genes are at
1.401 + marking areas: we will compare pointwise, geometric, and information-
1.402 + theoretic measures.
1.403 + 2. Develop a procedure to find single marker genes for anatomical regions: for
1.404 + each cortical area, by using or combining the scoring measures developed,
1.405 + we will rank the genes by their ability to delineate each area.
1.406 + 3. Extend the procedure to handle difficult areas by using combinatorial cod-
1.407 + ing: for areas that cannot be identified by any single gene, identify them
1.408 + with a handful of genes. We will consider both (a) algorithms that incre-
1.409 + mentally/greedily combine single gene markers into sets, such as forward
1.410 + stepwise regression and decision trees, and also (b) supervised learning
1.411 + techniques which use soft constraints to minimize the number of features,
1.412 + such as sparse support vector machines.
1.413 + 4. Extend the procedure to handle difficult areas by combining or redrawing
1.414 + the boundaries: An area may be difficult to identify because the bound-
1.415 + aries are misdrawn, or because it does not “really” exist as a single area,
1.416 + at least on the genetic level. We will develop extensions to our procedure
1.417 + which (a) detect when a difficult area could be fit if its boundary were
1.418 + redrawn slightly, and (b) detect when a difficult area could be combined
1.419 + with adjacent areas to create a larger area which can be fit.
1.420 + Apply these algorithms to the cortex
1.421 + 1. Create open source format conversion tools: we will create tools to bulk
1.422 + download the ABA dataset and to convert between SEV, NIFTI and MAT-
1.423 + LAB formats.
1.424 + 2. Flatmap the ABA cortex data: map the ABA data onto a plane and draw
1.425 + the cortical area boundaries onto it.
1.426 + 3. Find layer boundaries: cluster similar voxels together in order to auto-
1.427 + matically find the cortical layer boundaries.
1.428 + 4. Run the procedures that we developed on the cortex: we will present, for
1.429 + each area, a short list of markers to identify that area; and we will also
1.430 + present lists of “panels” of genes that can be used to delineate many areas
1.431 + at once.
1.432 + 11
1.433 +
1.434 + Develop algorithms to suggest a division of a structure into anatom-
1.435 + ical parts
1.436 + 1. Explore dimensionality reduction algorithms applied to pixels: including
1.437 + TODO
1.438 + 2. Explore dimensionality reduction algorithms applied to genes: including
1.439 + TODO
1.440 + 3. Explore clustering algorithms applied to pixels: including TODO
1.441 + 4. Explore clustering algorithms applied to genes: including gene shaving,
1.442 + TODO
1.443 + 5. Develop an algorithm to use dimensionality reduction and/or hierarchial
1.444 + clustering to create anatomical maps
1.445 + 6. Run this algorithm on the cortex: present a hierarchial, genoarchitectonic
1.446 + map of the cortex
1.447 + gradient similarity is calculated as: ∑
1.448 + pixels cos(abs(∠∇1 - ∠∇2)) ⋅|∇1|+|∇2|
1.449 + 2 ⋅
1.450 + pixel_value1+pixel_value2
1.451 + 2
1.452 + (todo) Technically, we say that an anatomical structure has a fundamen-
1.453 + tally 2-D organization when there exists a commonly used, generic, anatomical
1.454 + structure-preserving map from 3-D space to a 2-D manifold.
1.455 + Related work:
1.456 + The Allen Brain Institute has developed an interactive web interface called
1.457 + AGEA which allows an investigator to (1) calculate lists of genes which are se-
1.458 + lectively overexpressed in certain anatomical regions (ABA calls this the “Gene
1.459 + Finder” function) (2) to visualize the correlation between the genetic profiles of
1.460 + voxels in the dataset, and (3) to visualize a hierarchial clustering of voxels in
1.461 + the dataset [?]. AGEA is an impressive and useful tool, however, it does not
1.462 + solve the same problems that we propose to solve with this project.
1.463 + First we describe AGEA’s “Gene Finder”, and then compare it to our pro-
1.464 + posed method for finding marker genes. AGEA’s Gene Finder first asks the
1.465 + investigator to select a single “seed voxel” of interest. It then uses a clustering
1.466 + method, combined with built-in knowledge of major anatomical structures, to
1.467 + select two sets of voxels; an “ROI” and a “comparator region”*. The seed voxel
1.468 + is always contained within the ROI, and the ROI is always contained within the
1.469 + comparator region. The comparator region is similar but not identical to the
1.470 + set of voxels making up the major anatomical region containing the ROI. Gene
1.471 + Finder then looks for genes which can distinguish the ROI from the comparator
1.472 + region. Specifically, it finds genes for which the ratio (expression energy in the
1.473 + ROI) / (expression energy in the comparator region) is high.
1.474 + Informally, the Gene Finder first infers an ROI based on clustering the seed
1.475 + voxel with other voxels. Then, the Gene Finder finds genes which overexpress
1.476 + in the ROI as compared to other voxels in the major anatomical region.
1.477 + There are three major differences between our approach and Gene Finder.
1.478 + 12
1.479 +
1.480 + First, Gene Finder focuses on individual genes and individual ROIs in isola-
1.481 + tion. This is great for regions which can be picked out from all other regions by a
1.482 + single gene, but not all of them can (todo). There are at least two ways this can
1.483 + miss out on useful genes. First, a gene might express in part of a region, but not
1.484 + throughout the whole region, but there may be another gene which expresses
1.485 + in the rest of the region*. Second, a gene might express in a region, but not in
1.486 + any of its neighbors, but it might express also in other non-neighboring regions.
1.487 + To take advantage of these types of genes, we propose to find combinations of
1.488 + genes which, together, can identify the boundaries of all subregions within the
1.489 + containing region.
1.490 + Second, Gene Finder uses a pointwise metric, namely expression energy ratio,
1.491 + to decide whether a gene is good for picking out a region. We have found better
1.492 + results by using metrics which take into account not just single voxels, but also
1.493 + the local geometry of neighboring voxels, such as the local gradient (todo). In
1.494 + addition, we have found that often the absence of gene expression can be used
1.495 + as a marker, which will not be caught by Gene Finder’s expression energy ratio
1.496 + (todo).
1.497 + Third, Gene Finder chooses the ROI based only on the seed voxel. This
1.498 + often does not permit the user to query the ROI that they are interested in. For
1.499 + example, in all of our tests of Gene Finder in cortex, the ROIs chosen tend to
1.500 + be cortical layers, rather than cortical areas.
1.501 + In summary, when Gene Finder picks the ROI that you want, and when this
1.502 + ROI can be easily picked out from neighboring regions by single genes which
1.503 + selectively overexpress in the ROI compared to the entire major anatomical re-
1.504 + gion, Gene Finder will work. However, Gene Finder will not pick cortical areas
1.505 + as ROIs, and even if it could, many cortical areas cannot be uniquely picked out
1.506 + by the overexpression of any single gene. By contrast, we will target cortical
1.507 + areas, we will explore a variety of metrics which can complement the shortcom-
1.508 + ings of expression energy ratio, and we will use the combinatorial expression of
1.509 + genes to pick out cortical areas even when no individual gene will do.
1.510 + * The terms “ROI” and “comparator region” are our own; the ABI calls
1.511 + them the “local region” and the “larger anatomical context”. The ABI uses the
1.512 + term “specificity comparator” to mean the major anatomic region containing
1.513 + the ROI, which is not exactly identical to the comparator region.
1.514 + ** In this case, the union of the area of expression of the two genes would
1.515 + suffice; one could also imagine that there could be situations in which the in-
1.516 + tersection of multiple genes would be needed, or a combination of unions and
1.517 + intersections.
1.518 + Now we describe AGEA’s hierarchial clustering, and compare it to our pro-
1.519 + posal. The goal of AGEA’s hierarchial clustering is to generate a binary tree of
1.520 + clusters, where a cluster is a collection of voxels. AGEA begins by computing
1.521 + the Pearson correlation between each pair of voxels. They then employ a recur-
1.522 + sive divisive (top-down) hierarchial clustering procedure on the voxels, which
1.523 + means that they start with all of the voxels, and then they divide them into clus-
1.524 + ters, and then within each cluster, they divide that cluster into smaller clusters,
1.525 + etc***. At each step, the collection of voxels is partitioned into two smaller
1.526 + 13
1.527 +
1.528 + clusters in a way that maximizes the following quantity: average correlation
1.529 + between all possible pairs of voxels containing one voxel from each cluster.
1.530 + There are three major differences between our approach and AGEA’s hier-
1.531 + archial clustering. First, AGEA’s clustering method separates cortical layers
1.532 + before it separates cortical areas.
1.533 + following procedure is used for the purpose of dividing a collection of voxels
1.534 + into smaller clusters: partition the voxels into two sets, such that the following
1.535 + quantity is maximized:
1.536 + *** depending on which level of the tree is being created, the voxels are
1.537 + subsampled in order to save time
1.538 + does not allow the user to input anything other than a seed voxel; this means
1.539 + that for each seed voxel, there is only one
1.540 + The role of the “local region” is to serve as a region of interest for which
1.541 + marker genes are desired; the role of the “larger anatomical context” is to be
1.542 + the structure
1.543 + There are two kinds of differences between AGEA and our project; differ-
1.544 + ences that relate to the treatment of the cortex, and differences in the type of
1.545 + generalizable methods being developed. As relates
1.546 + indicate an ROI
1.547 + explore simple correlation-based relationships between voxels, genes, and
1.548 + clusters of voxels.
1.549 + There have not yet been any studies which describe the results of applying
1.550 + AGEA to the cerebral cortex; however, we suspect that the AGEA metrics are
1.551 + not optimal for the task of relating genes to cortical areas. A voxel’s gene
1.552 + expression profile depends upon both its cortical area and its cortical layer,
1.553 + however, AGEA has no mechanism to distinguish these two. As a result, voxels
1.554 + in the same layer but different areas are often clustered together by AGEA. As
1.555 + part of the project, we will compare the performance of our techniques against
1.556 + AGEA’s.
1.557 + —
1.558 + The Allen Brain Institute has developed interactive tools called AGEA which
1.559 + allow an investigator to explore simple correlation-based relationships between
1.560 + voxels, genes, and clusters of voxels. There have not yet been any studies
1.561 + which describe the results of applying AGEA to the cerebral cortex; however,
1.562 + we suspect that the AGEA metrics are not optimal for the task of relating
1.563 + genes to cortical areas. A voxel’s gene expression profile depends upon both
1.564 + its cortical area and its cortical layer, however, AGEA has no mechanism to
1.565 + distinguish these two. As a result, voxels in the same layer but different areas
1.566 + are often clustered together by AGEA. As part of the project, we will compare
1.567 + the performance of our techniques against AGEA’s.
1.568 + Another difference between our techniques and AGEA’s is that AGEA allows
1.569 + the user to enter only a voxel location, and then to either explore the rest of
1.570 + the brain’s relationship to that particular voxel, or explore a partitioning of
1.571 + the brain based on pairwise voxel correlation. If the user is interested not in a
1.572 + single voxel, but rather an entire anatomical structure, AGEA will only succeed
1.573 + to the extent that the selected voxel is a typical representative of the structure.
1.574 + 14
1.575 +
1.576 + As discussed in the previous paragraph, this poses problems for structures like
1.577 + cortical areas, which (because of their division into cortical layers) do not have
1.578 + a single “typical representative”.
1.579 + By contrast, in our system, the user will start by selecting, not a single voxel,
1.580 + but rather, an anatomical superstructure to be divided into pieces (for example,
1.581 + the cerebral cortex). We expect that our methods will take into account not
1.582 + just pairwise statistics between voxels, but also large-scale geometric features
1.583 + (for example, the rapidity of change in gene expression as regional boundaries
1.584 + are crossed) which optimize the discriminability of regions within the selected
1.585 + superstructure.
1.586 + —–
1.587 + screen for combinations of marker genes which selectively target anatom-
1.588 + ical structures pick delineate the boundaries between neighboring anatomical
1.589 + structures. (b) techniques to screen for marker genes which pick out anatomical
1.590 + structures of interest
1.591 + , techniques which: (a) screen for marker genes , and (b) suggest new
1.592 + anatomical maps based on
1.593 + whose expression partitions the region of interest into its anatomical sub-
1.594 + structures, and (b) use the natural contours of gene expression to suggest new
1.595 + ways of dividing an organ into
1.596 + The Allen Brain Atlas
1.597 + –
1.598 + to: brooksl@mail.nih.gov
1.599 + Hi, I’m writing to confirm the applicability of a potential research project to
1.600 + the challenge grant topic ”New computational and statistical methods for the
1.601 + analysis of large data sets from next-generation sequencing technologies”.
1.602 + We want to develop methods for the analysis of gene expression datasets that
1.603 + can be used to uncover the relationships between gene expression and anatomical
1.604 + regions. Specifically, we want to develop techniques to (a) given a set of known
1.605 + anatomical areas, identify genetic markers for each of these areas, and (b) given
1.606 + an anatomical structure whose substructure is unknown, suggest a map, that
1.607 + is, a division of the space into anatomical sub-structures, that represents the
1.608 + boundaries inherent in the gene expression data.
1.609 + We propose to develop our techniques on the Allen Brain Atlas mouse brain
1.610 + gene expression dataset by finding genetic markers for anatomical areas within
1.611 + the cerebral cortex. The Allen Brain Atlas contains a registered 3-D map of
1.612 + gene expression data with 200-micron voxel resolution which was created from
1.613 + in situ hybridization data. The dataset contains about 4000 genes which are
1.614 + available at this resolution across the entire cerebral cortex.
1.615 + Despite the distinct roles of different cortical areas in both normal function-
1.616 + ing and disease processes, there are no known marker genes for many cortical
1.617 + areas. This project will be immediately useful for both drug discovery and clini-
1.618 + cal research because once the markers are known, interventions can be designed
1.619 + which selectively target specific cortical areas.
1.620 + This techniques we develop will be useful because they will be applicable to
1.621 + the analysis of other anatomical areas, both in terms of finding marker genes
1.622 + 15
1.623 +
1.624 + for known areas, and in terms of suggesting new anatomical subdivisions that
1.625 + are based upon the gene expression data.
1.626 + —-
1.627 + It is likely that our study, by showing which areal divisions naturally fol-
1.628 + low from gene expression data, as opposed to traditional histological data, will
1.629 + contribute to the creation of
1.630 + there are clear genetic or chemical markers known for only a few cortical
1.631 + areas. This makes it difficult to target drugs to specific
1.632 + As part of aims (1) and (5), we will discover sets of marker genes that pick
1.633 + out specific cortical areas. This will allow the development of drugs and other
1.634 + interventions that selectively target individual cortical areas. As part of aims
1.635 + (2) and (5), we will also discover small panels of marker genes that can be used
1.636 + to delineate most of the cortical areal map.
1.637 + With aims (2) and (4), we
1.638 + There are five principals
1.639 + In addition to validating the usefulness of the algorithms, the application of
1.640 + these methods to cerebral cortex will produce immediate benefits that are only
1.641 + one step removed from clinical application.
1.642 + todo: remember to check gensat, etc for validation (mention bias/variance)
1.643 + Why it is useful to apply these methods to cortex
1.644 + There is still room for debate as to exactly how the cortex should be parcellated
1.645 + into areas.
1.646 + The best way to divide up rodent cortex into areas has not been completely
1.647 + determined,
1.648 + not yet been accounted for in
1.649 + that the expression of some genes will contain novel spatial patterns which
1.650 + are not account
1.651 + that a genoarchitectonic map
1.652 + This principle is only applicable to aim 1 (marker genes). For aim 2 (partition
1.653 + a structure in into anatomical subregions), we plan to work with many genes at
1.654 + once.
1.655 + tood: aim 2 b+s?
1.656 + Principle 5: Interoperate with existing tools
1.657 + In order for our software to be as useful as possible for our users, it will be
1.658 + able to import and export data to standard formats so that users can use our
1.659 + software in tandem with other software tools created by other teams. We will
1.660 + support the following formats: NIFTI (Neuroimaging Informatics Technology
1.661 + Initiative), SEV (Allen Brain Institute Smoothed Energy Volume), and MAT-
1.662 + LAB. This ensures that our users will not have to exclusively rely on our tools
1.663 + when analyzing data. For example, users will be able to use the data visualiza-
1.664 + tion and analysis capabilities of MATLAB and Caret alongside our software.
1.665 + 16
1.666 +
1.667 + To our knowledge, there is no currently available software to convert between
1.668 + these formats, so we will also provide a format conversion tool. This may be
1.669 + useful even for groups that don’t use any of our other software.
1.670 + todo: is “marker gene” even a phrase that we should use at all?
1.671 + note for aim 1 apps: combo of genes is for voxel, not within any single cell
1.672 + , as when genetic markers allow the development of selective interventions;
1.673 + the reason that one can be confident that the intervention is selective is that it
1.674 + is only turned on when a certain combination of genes is turned on and off. The
1.675 + result procedure is what assures us that when that combination is present, the
1.676 + local tissue is probably part of a certain subregion.
1.677 + The basic idea is that we want to find a procedure by
1.678 + The task of finding genes that mark anatomical areas can be phrased in
1.679 + terms of what the field of machine learning calls a “supervised learning” task.
1.680 + The goal of this task is to learn a function (the “classifier”) which
1.681 + If a person knows a combination of genes that mark an area, that implies
1.682 + that the person can be told how strong those genes express in any voxel, and
1.683 + the person can use this information to determine how
1.684 + finding how to infer the areal identity of a voxel if given the gene expression
1.685 + profile of that voxel.
1.686 + For each voxel in the cortex, we want to start with data about the gene
1.687 + expression
1.688 + There are various ways to look for marker genes. We will define some terms,
1.689 + and along the way we will describe a few design choices encountered in the
1.690 + process of creating a marker gene finding method, and then we will present four
1.691 + principles that describe which options we have chosen.
1.692 + In developing a procedure for finding marker genes, we are developing a
1.693 + procedure that takes a dataset of experimental observations and produces a
1.694 + result. One can think of the result as merely a list of genes, but really the result
1.695 + is an understanding of a predictive relationship between, on the one hand, the
1.696 + expression levels of genes, and, on the other hand, anatomical subregions.
1.697 + One way to more formally define this understanding is to look at it as a
1.698 + procedure. In this view, the result of the learning procedure is itself a procedure.
1.699 + The result procedure provides a way to use the gene expression profiles of voxels
1.700 + in a tissue sample in order to determine where the subregions are.
1.701 + This result procedure can be used directly, as when an experimenter has
1.702 + a tissue sample and needs to know what subregions are present in it, and,
1.703 + if multiple subregions are present, where they each are. Or it can be used
1.704 + indirectly; imagine that the result procedure tells us that whenever a certain
1.705 + combination of genes are expressed, the local tissue is probably part of a certain
1.706 + subregion. This means that we can then confidentally develop an intervention
1.707 + which is triggered only when that combination of genes are expressed; and to
1.708 + the extent that the result procedure is reliable, we know that the intervention
1.709 + will only be triggered in the target subregion.
1.710 + We said that the result procedure provides “a way to use the gene expression
1.711 + profiles of voxels in a tissue sample” in order to “determine where the subregions
1.712 + are”.
1.713 + 17
1.714 +
1.715 + Does the result procedure get as input all of the gene expression profiles
1.716 + of each voxel in the entire tissue sample, and produce as output all of the
1.717 + subregional boundaries all at once?
1.718 + it is helpful for the classifier to look at the global “shape” of gene expression
1.719 + patterns over the whole structure, rather than just nearby voxels.
1.720 + there is some small bit of additional information that can be gleaned from
1.721 + knowing the
1.722 + Design choices for a supervised learning procedure
1.723 + After all,
1.724 + there is a small correlation between the gene expression levels from distant
1.725 + voxels and
1.726 + Depending on how we intend to use the classifier, we may want to design it
1.727 + so that
1.728 + It is possible for many things to
1.729 + The choice of which data is made part of an instance
1.730 + what we seek is a procedure
1.731 + partition the tissue sample into subregions.
1.732 + each part of the anatomical structure
1.733 + must be One way to rephrase this task is to say that, instead of searching
1.734 + for the location of the subregions, we are looking to partition the tissue sample
1.735 + into subregions.
1.736 + There are various ways to look for marker genes. We will define some terms,
1.737 + and along the way we will describe a few design choices encountered in the
1.738 + process of creating a marker gene finding method, and then we will present four
1.739 + principles that describe which options we have chosen.
1.740 + In developing a procedure for finding marker genes, we are developing a
1.741 + procedure that takes a dataset of experimental observations and produces a
1.742 + result. One can think of the result as merely a list of genes, but really the result
1.743 + is an understanding of a predictive relationship between, on the one hand, the
1.744 + expression levels of genes, and, on the other hand, anatomical subregions.
1.745 + One way to more formally define this understanding is to look at it as a
1.746 + procedure. In this view, the result of the learning procedure is itself a procedure.
1.747 + The result procedure provides a way to use the gene expression profiles of voxels
1.748 + in a tissue sample in order to determine where the subregions are.
1.749 + This result procedure can be used directly, as when an experimenter has
1.750 + a tissue sample and needs to know what subregions are present in it, and,
1.751 + if multiple subregions are present, where they each are. Or it can be used
1.752 + indirectly; imagine that the result procedure tells us that whenever a certain
1.753 + combination of genes are expressed, the local tissue is probably part of a certain
1.754 + subregion. This means that we can then confidentally develop an intervention
1.755 + which is triggered only when that combination of genes are expressed; and to
1.756 + the extent that the result procedure is reliable, we know that the intervention
1.757 + will only be triggered in the target subregion.
1.758 + 18
1.759 +
1.760 + We said that the result procedure provides “a way to use the gene expression
1.761 + profiles of voxels in a tissue sample” in order to “determine where the subregions
1.762 + are”.
1.763 + Does the result procedure get as input all of the gene expression profiles
1.764 + of each voxel in the entire tissue sample, and produce as output all of the
1.765 + subregional boundaries all at once?
1.766 + Or are we given one voxel at a time,
1.767 + In the jargon of the field of machine learning, the result procedure is called
1.768 + a classifier.
1.769 + The task of finding genes that mark anatomical areas can be phrased in
1.770 + terms of what the field of machine learning calls a “supervised learning” task.
1.771 + The goal of this task is to learn a function (the “classifier”) which
1.772 + If a person knows a combination of genes that mark an area, that implies
1.773 + that the person can be told how strong those genes express in any voxel, and
1.774 + the person can use this information to determine how
1.775 + finding how to infer the areal identity of a voxel if given the gene expression
1.776 + profile of that voxel.
1.777 + For each voxel in the cortex, we want to start with data about the gene
1.778 + expression
1.779 + single voxels, but rather groups of voxels, such that the groups can be placed
1.780 + in some 2-D space. We will call such instances “pixels”.
1.781 + We have been speaking as if instances necessarily correspond to single voxels.
1.782 + But it is possible for instances to be groupings of many voxels, in which case
1.783 + each grouping must be assigned the same label (that is, each voxel grouping
1.784 + must stay inside a single anatomical subregion).
1.785 + In some but not all cases, the groups are either rows or columns of voxels.
1.786 + This is the case with the cerebral cortex, in which one may assume that columns
1.787 + of voxels which run perpendicular to the cortical surface all share the same areal
1.788 + identity. In the cortex, we call such an instance a “surface pixel”, because such
1.789 + an instance represents the data associated with all voxels underneath a specific
1.790 + patch of the cortical surface.
1.791 + 19
1.792 +
1.793 +