cg

diff grant.html @ 0:29eee29f9bc1

initial commit to hg version control repository
author bshanks@bshanks-salk.dyndns.org
date Sat Apr 11 19:12:32 2009 -0700 (16 years ago)
parents
children 7487ad7f5d8f
line diff
1.1 --- /dev/null Thu Jan 01 00:00:00 1970 +0000 1.2 +++ b/grant.html Sat Apr 11 19:12:32 2009 -0700 1.3 @@ -0,0 +1,790 @@ 1.4 +Specific aims 1.5 + Massive new datasets obtained with techniques such as in situ hybridization 1.6 + (ISH) and BAC-transgenics allow the expression levels of many genes at many 1.7 + locations to be compared. Our goal is to develop automated methods to relate 1.8 + spatial variation in gene expression to anatomy. We want to find marker genes 1.9 + for specific anatomical regions, and also to draw new anatomical maps based on 1.10 + gene expression patterns. We have three specific aims: 1.11 + (1) develop an algorithm to screen spatial gene expression data for combina- 1.12 + tions of marker genes which selectively target anatomical regions 1.13 + (2) develop an algorithm to suggest new ways of carving up a structure into 1.14 + anatomical subregions, based on spatial patterns in gene expression 1.15 + (3) create a 2-D “flat map” dataset of the mouse cerebral cortex that contains 1.16 + a flattened version of the Allen Mouse Brain Atlas ISH data, as well as 1.17 + the boundaries of cortical anatomical areas. Use this dataset to validate 1.18 + the methods developed in (1) and (2). 1.19 + In addition to validating the usefulness of the algorithms, the application of 1.20 + these methods to cerebral cortex will produce immediate benefits, because there 1.21 + are currently no known genetic markers for many cortical areas. The results 1.22 + of the project will support the development of new ways to selectively target 1.23 + cortical areas, and it will support the development of a method for identifying 1.24 + the cortical areal boundaries present in small tissue samples. 1.25 + All algorithms that we develop will be implemented in an open-source soft- 1.26 + ware toolkit. The toolkit, as well as the machine-readable datasets developed 1.27 + in aim (3), will be published and freely available for others to use. 1.28 + Background and significance 1.29 + Aim 1 1.30 + Machine learning terminology 1.31 + The task of looking for marker genes for anatomical subregions means that one 1.32 + is looking for a set of genes such that, if the expression level of those genes is 1.33 + known, then the locations of the subregions can be inferred. 1.34 + If we define the subregions so that they cover the entire anatomical structure 1.35 + to be divided, then instead of saying that we are using gene expression to find 1.36 + the locations of the subregions, we may say that we are using gene expression to 1.37 + determine to which subregion each voxel within the structure belongs. We call 1.38 + this a classification task, because each voxel is being assigned to a class (namely, 1.39 + its subregion). 1.40 + Therefore, an understanding of the relationship between the combination of 1.41 + their expression levels and the locations of the subregions may be expressed as 1.42 + 1 1.43 + 1.44 + a function. The input to this function is a voxel, along with the gene expression 1.45 + levels within that voxel; the output is the subregional identity of the target 1.46 + voxel, that is, the subregion to which the target voxel belongs. We call this 1.47 + function a classifier. In general, the input to a classifier is called an instance, 1.48 + and the output is called a label. 1.49 + The object of aim 1 is not to produce a single classifier, but rather to develop 1.50 + an automated method for determining a classifier for any known anatomical 1.51 + structure. Therefore, we seek a procedure by which a gene expression dataset 1.52 + may be analyzed in concert with an anatomical atlas in order to produce a 1.53 + classifier. Such a procedure is a type of a machine learning procedure. The 1.54 + construction of the classifier is called training (also learning), and the initial 1.55 + gene expression dataset used in the construction of the classifier is called training 1.56 + data. 1.57 + In the machine learning literature, this sort of procedure may be thought 1.58 + of as a supervised learning task, defined as a task in whcih the goal is to learn 1.59 + a mapping from instances to labels, and the training data consists of a set of 1.60 + instances (voxels) for which the labels (subregions) are known. 1.61 + Each gene expression level is called a feature, and the selection of which 1.62 + genes to include is called feature selection. Feature selection is one component 1.63 + of the task of learning a classifier. Some methods for learning classifiers start 1.64 + out with a separate feature selection phase, whereas other methods combine 1.65 + feature selection with other aspects of training. 1.66 + One class of feature selection methods assigns some sort of score to each 1.67 + candidate gene. The top-ranked genes are then chosen. Some scoring measures 1.68 + can assign a score to a set of selected genes, not just to a single gene; in this 1.69 + case, a dynamic procedure may be used in which features are added and sub- 1.70 + tracted from the selected set depending on how much they raise the score. Such 1.71 + procedures are called “stepwise” or “greedy”. 1.72 + Although the classifier itself may only look at the gene expression data within 1.73 + each voxel before classifying that voxel, the learning algorithm which constructs 1.74 + the classifier may look over the entire dataset. We can categorize score-based 1.75 + feature selection methods depending on how the score of calculated. Often 1.76 + the score calculation consists of assigning a sub-score to each voxel, and then 1.77 + aggregating these sub-scores into a final score (the aggregation is often a sum or 1.78 + a sum of squares). If only information from nearby voxels is used to calculate a 1.79 + voxel’s sub-score, then we say it is a local scoring method. If only information 1.80 + from the voxel itself is used to calculate a voxel’s sub-score, then we say it is a 1.81 + pointwise scoring method. 1.82 + Key questions when choosing a learning method are: What are the instances? 1.83 + What are the features? How are the features chosen? Here are four principles 1.84 + that outline our answers to these questions. 1.85 + Principle 1: Combinatorial gene expression 1.86 + Above, we defined an “instance” as the combination of a voxel with the “asso- 1.87 + ciated gene expression data”. In our case this refers to the expression level of 1.88 + 2 1.89 + 1.90 + genes within the voxel, but should we include the expression levels of all genes, 1.91 + or only a few of them? 1.92 + It is too much to hope that every anatomical region of interest will be iden- 1.93 + tified by a single gene. For example, in the cortex, there are some areas which 1.94 + are not clearly delineated by any gene included in the Allen Brain Atlas (ABA) 1.95 + dataset. However, at least some of these areas can be delineated by looking 1.96 + at combinations of genes (an example of an area for which multiple genes are 1.97 + necessary and sufficient is provided in Preliminary Results). 1.98 + Principle 2: Only look at combinations of small numbers of genes 1.99 + When the classifier classifies a voxel, it is only allowed to look at the expression of 1.100 + the genes which have been selected as features. The more data that is available 1.101 + to a classifier, the better that it can do. For example, perhaps there are weak 1.102 + correlations over many genes that add up to a strong signal. So, why not include 1.103 + every gene as a feature? The reason is that we wish to employ the classifier in 1.104 + situations in which it is not feasible to gather data about every gene. For 1.105 + example, if we want to use the expression of marker genes as a trigger for some 1.106 + regionally-targeted intervention, then our intervention must contain a molecular 1.107 + mechanism to check the expression level of each marker gene before it triggers. 1.108 + It is currently infeasible to design a molecular trigger that checks the level of 1.109 + more than a handful of genes. Similarly, if the goal is to develop a procedure to 1.110 + do ISH on tissue samples in order to label their anatomy, then it is infeasible 1.111 + to label more than a few genes. Therefore, we must select only a few genes as 1.112 + features. 1.113 + Principle 3: Use geometry in feature selection 1.114 + When doing feature selection with score-based methods, the simplest thing to 1.115 + do would be to score the performance of each voxel by itself and then combine 1.116 + these scores; this is pointwise scoring. A more powerful approach is to also use 1.117 + information about the geometric relations between each voxel and its neighbors; 1.118 + this requires non-pointwise, local scoring methods. See Preliminary Results for 1.119 + evidence of the complementary nature of pointwise and local scoring methods. 1.120 + Principle 4: Work in 2-D whenever possible 1.121 + There are many anatomical structures which are commonly characterized in 1.122 + terms of a two-dimensional manifold. When it is known that the structure that 1.123 + one is looking for is two-dimensional, the results may be improved by allowing 1.124 + the analysis algorithm to take advantage of this prior knowledge. In addition, 1.125 + it is easier for humans to visualize and work with 2-D data. 1.126 + Therefore, when possible, the instances should represent pixels, not voxels. 1.127 + 3 1.128 + 1.129 + Aim 3 1.130 + Background 1.131 + The cortex is divided into areas and layers. To a first approximation, the par- 1.132 + cellation of the cortex into areas can be drawn as a 2-D map on the surface 1.133 + of the cortex. In the third dimension, the boundaries between the areas con- 1.134 + tinue downwards into the cortical depth, perpendicular to the surface. The layer 1.135 + boundaries run parallel to the surface. One can picture an area of the cortex as 1.136 + a slice of many-layered cake. 1.137 + Although it is known that different cortical areas have distinct roles in both 1.138 + normal functioning and in disease processes, there are no known marker genes 1.139 + for many cortical areas. When it is necessary to divide a tissue sample into 1.140 + cortical areas, this is a manual process that requires a skilled human to combine 1.141 + multiple visual cues and interpret them in the context of their approximate 1.142 + location upon the cortical surface. 1.143 + Even the questions of how many areas should be recognized in cortex, and 1.144 + what their arrangement is, are still not completely settled. A proposed division 1.145 + of the cortex into areas is called a cortical map. In the rodent, the lack of a 1.146 + single agreed-upon map can be seen by contrasting the recent maps given by 1.147 + Swanson?? on the one hand, and Paxinos and Franklin?? on the other. While 1.148 + the maps are certainly very similar in their general arrangement, significant 1.149 + differences remain in the details. 1.150 + Significance 1.151 + The method developed in aim (1) will be applied to each cortical area to find 1.152 + a set of marker genes such that the combinatorial expression pattern of those 1.153 + genes uniquely picks out the target area. Finding marker genes will be useful 1.154 + for drug discovery as well as for experimentation because marker genes can be 1.155 + used to design interventions which selectively target individual cortical areas. 1.156 + The application of the marker gene finding algorithm to the cortex will 1.157 + also support the development of new neuroanatomical methods. In addition to 1.158 + finding markers for each individual cortical areas, we will find a small panel 1.159 + of genes that can find many of the areal boundaries at once. This panel of 1.160 + marker genes will allow the development of an ISH protocol that will allow 1.161 + experimenters to more easily identify which anatomical areas are present in 1.162 + small samples of cortex. 1.163 + The method developed in aim (3) will provide a genoarchitectonic viewpoint 1.164 + that will contribute to the creation of a better map. The development of present- 1.165 + day cortical maps was driven by the application of histological stains. It is 1.166 + conceivable that if a different set of stains had been available which identified 1.167 + a different set of features, then the today’s cortical maps would have come out 1.168 + differently. Since the number of classes of stains is small compared to the number 1.169 + of genes, it is likely that there are many repeated, salient spatial patterns in 1.170 + the gene expression which have not yet been captured by any stain. Therefore, 1.171 + 4 1.172 + 1.173 + current ideas about cortical anatomy need to incorporate what we can learn 1.174 + from looking at the patterns of gene expression. 1.175 + While we do not here propose to analyze human gene expression data, it is 1.176 + conceivable that the methods we propose to develop could be used to suggest 1.177 + modifications to the human cortical map as well. 1.178 + Related work 1.179 + Preliminary work 1.180 + Justification of principles 1 thur 3 1.181 + Principle 1: Combinatorial gene expression 1.182 + Here we give an example of a cortical area which is not marked by any single 1.183 + gene, but which can be identified combinatorially. according to logistic regres- 1.184 + sion, gene wwc11 is the best fit single gene for predicting whether or not a pixel 1.185 + on the cortical surface belongs to the motor area (area MO). The upper-left 1.186 + picture in Figure shows wwc1’s spatial expression pattern over the cortex. The 1.187 + lower-right boundary of MO is represented reasonably well by this gene, however 1.188 + the gene overshoots the upper-left boundary. This flattened 2-D representation 1.189 + does not show it, but the area corresponding to the overshoot is the medial 1.190 + surface of the cortex. MO is only found on the lateral surface (todo). 1.191 + Gnee mtif22 is shown in figure the upper-right of Fig. . Mtif2 captures MO’s 1.192 + upper-left boundary, but not its lower-right boundary. Mtif2 does not express 1.193 + very much on the medial surface. By adding together the values at each pixel 1.194 + in these two figures, we get the lower-left of Figure . This combination captures 1.195 + area MO much better than any single gene. 1.196 + Principle 2: Only look at combinations of small numbers of genes 1.197 + In order to see how well one can do when looking at all genes at once, we ran 1.198 + a support vector machine to classify cortical surface pixels based on their gene 1.199 + expression profiles. We achieved classification accuracy of about 81%3. As noted 1.200 + above, however, a classifier that looks at all the genes at once isn’t practically 1.201 + useful. 1.202 + The requirement to find combinations of only a small number of genes limits 1.203 + us from straightforwardly applying many of the most simple techniques from 1.204 + the field of supervised machine learning. In the parlance of machine learning, 1.205 + our task combines feature selection with supervised learning. 1.206 +__________________________ 1.207 + 1“WW, C2 and coiled-coil domain containing 1”; EntrezGene ID 211652 1.208 + 2“mitochondrial translational initiation factor 2”; EntrezGene ID 76784 1.209 + 3Using the Shogun SVM package (todo:cite), with parameters type=GMNPSVM (multi- 1.210 +class b-SVM), kernal = gaussian with sigma = 0.1, c = 10, epsilon = 1e-1 – these are the 1.211 +first parameters we tried, so presumably performance would improve with different choices of 1.212 +parameters. 5-fold cross-validation. 1.213 + 5 1.214 + 1.215 + 1.216 + 1.217 + Figure 1: Upper left: wwc1. Upper right: mtif2. Lower left: wwc1 + mtif2 1.218 + (each pixel’s value on the lower left is the sum of the corresponding pixels in 1.219 + the upper row). Within each picture, the vertical axis roughly corresponds to 1.220 + anterior at the top and posterior at the bottom, and the horizontal axis roughly 1.221 + corresponds to medial at the left and lateral at the right. The red outline is 1.222 + the boundary of region MO. Pixels are colored approximately according to the 1.223 + density of expressing cells underneath each pixel, with red meaning a lot of 1.224 + expression and blue meaning little. 1.225 + 6 1.226 + 1.227 + 1.228 + 1.229 + Figure 2: The top row shows the three genes which (individually) best predict 1.230 + area AUD, according to logistic regression. The bottom row shows the three 1.231 + genes which (individually) best match area AUD, according to gradient similar- 1.232 + ity. From left to right and top to bottom, the genes are Ssr1, Efcbp1, Aph1a, 1.233 + Ptk7, Aph1a again, and Lepr 1.234 + Principle 3: Use geometry 1.235 + To show that local geometry can provide useful information that cannot be 1.236 + detected via pointwise analyses, consider Fig. . The top row of Fig. displays 1.237 + the 3 genes which most match area AUD, according to a pointwise method4. The 1.238 + bottom row displays the 3 genes which most match AUD according to a method 1.239 + which considers local geometry5 The pointwise method in the top row identifies 1.240 + genes which express more strongly in AUD than outside of it; its weakness is that 1.241 + this includes many areas which don’t have a salient border matching the areal 1.242 + border. The geometric method identifies genes whose salient expression border 1.243 + seems to partially line up with the border of AUD; its weakness is that this 1.244 + includes genes which don’t express over the entire area. Genes which have high 1.245 + rankings using both pointwise and border criteria, such as Aph1a in the example, 1.246 + may be particularly good markers. None of these genes are, individually, a 1.247 + perfect marker for AUD; we deliberately chose a “difficult” area in order to 1.248 + better contrast pointwise with geometric methods. 1.249 +__________________________ 1.250 + 4For each gene, a logistic regression in which the response variable was whether or not a 1.251 +surface pixel was within area AUD, and the predictor variable was the value of the expression 1.252 +of the gene underneath that pixel. The resulting scores were used to rank the genes in terms 1.253 +of how well they predict area AUD. 1.254 + 5For each gene the gradient similarity (see section ??) between (a) a map of the expression 1.255 +of each gene on the cortical surface and (b) the shape of area AUD, was calculated, and this 1.256 +was used to rank the genes. 1.257 + 7 1.258 + 1.259 + Principle 4: Work in 2-D whenever possible 1.260 + In anatomy, the manifold of interest is usually either defined by a combination 1.261 + of two relevant anatomical axes (todo), or by the surface of the structure (as is 1.262 + the case with the cortex). In the former case, the manifold of interest is a plane, 1.263 + but in the latter case it is curved. If the manifold is curved, there are various 1.264 + methods for mapping the manifold into a plane. 1.265 + The method that we will develop will begin by mapping the data into a 1.266 + 2-D plane. Although the manifold that characterized cortical areas is known 1.267 + to be the cortical surface, it remains to be seen which method of mapping the 1.268 + manifold into a plane is optimal for this application. We will compare mappings 1.269 + which attempt to preserve size (such as the one used by Caret??) with mappings 1.270 + which preserve angle (conformal maps). 1.271 + Although there is much 2-D organization in anatomy, there are also struc- 1.272 + tures whose shape is fundamentally 3-dimensional. If possible, we would like 1.273 + the method we develop to include a statistical test that warns the user if the 1.274 + assumption of 2-D structure seems to be wrong. 1.275 + —— 1.276 + Massive new datasets obtained with techniques such as in situ hybridization 1.277 + (ISH) and BAC-transgenics allow the expression levels of many genes at many 1.278 + locations to be compared. This can be used to find marker genes for specific 1.279 + anatomical structures, as well as to draw new anatomical maps. Our goal is 1.280 + to develop automated methods to relate spatial variation in gene expression to 1.281 + anatomy. We have five specific aims: 1.282 + (1) develop an algorithm to screen spatial gene expression data for combi- 1.283 + nations of marker genes which selectively target individual anatomical 1.284 + structures 1.285 + (2) develop an algorithm to screen spatial gene expression data for combina- 1.286 + tions of marker genes which can be used to delineate most of the bound- 1.287 + aries between a number of anatomical structures at once 1.288 + (3) develop an algorithm to suggest new ways of dividing a structure up into 1.289 + anatomical subregions, based on spatial patterns in gene expression 1.290 + (4) create a flat (2-D) map of the mouse cerebral cortex that contains a flat- 1.291 + tened version of the Allen Mouse Brain Atlas ISH dataset, as well as the 1.292 + boundaries of anatomical areas within the cortex. For each cortical layer, 1.293 + a layer-specific flat dataset will be created. A single combined flat dataset 1.294 + will be created which averages information from all of the layers. These 1.295 + datasets will be made available in both MATLAB and Caret formats. 1.296 + (5) validate the methods developed in (1), (2) and (3) by applying them to 1.297 + the cerebral cortex datasets created in (4) 1.298 + All algorithms that we develop will be implemented in an open-source soft- 1.299 + ware toolkit. The toolkit, as well as the machine-readable datasets developed in 1.300 + 8 1.301 + 1.302 + aim (4) and any other intermediate dataset we produce, will be published and 1.303 + freely available for others to use. 1.304 + In addition to developing generally useful methods, the application of these 1.305 + methods to cerebral cortex will produce immediate benefits that are only one 1.306 + step removed from clinical application, while also supporting the development 1.307 + of new neuroanatomical techniques. The method developed in aim (1) will be 1.308 + applied to each cortical area to find a set of marker genes. Currently, despite 1.309 + the distinct roles of different cortical areas in both normal functioning and 1.310 + disease processes, there are no known marker genes for many cortical areas. 1.311 + Finding marker genes will be immediately useful for drug discovery as well as for 1.312 + experimentation because once marker genes for an area are known, interventions 1.313 + can be designed which selectively target that area. 1.314 + The method developed in aim (2) will be used to find a small panel of genes 1.315 + that can find most of the boundaries between areas in the cortex. Today, finding 1.316 + cortical areal boundaries in a tissue sample is a manual process that requires a 1.317 + skilled human to combine multiple visual cues over a large area of the cortical 1.318 + surface. A panel of marker genes will allow the development of an ISH protocol 1.319 + that will allow experimenters to more easily identify which anatomical areas are 1.320 + present in small samples of cortex. 1.321 + For each cortical layer, a layer-specific flat dataset will be created. A single 1.322 + combined flat dataset will be created which averages information from all of 1.323 + the layers. These datasets will be made available in both MATLAB and Caret 1.324 + formats. 1.325 + —- 1.326 + New techniques allow the expression levels of many genes at many locations 1.327 + to be compared. It is thought that even neighboring anatomical structures have 1.328 + different gene expression profiles. We propose to develop automated methods 1.329 + to relate the spatial variation in gene expression to anatomy. We will develop 1.330 + two kinds of techniques: 1.331 + (a) techniques to screen for combinations of marker genes which selectively 1.332 + target anatomical structures 1.333 + (b) techniques to suggest new ways of dividing a structure up into anatomical 1.334 + subregions, based on the shapes of contours in the gene expression 1.335 + The first kind of technique will be helpful for finding marker genes associated 1.336 + with known anatomical features. The second kind of technique will be helpful in 1.337 + creating new anatomical maps, maps which reflect differences in gene expression 1.338 + the same way that existing maps reflect differences in histology. 1.339 + We intend to develop our techniques using the adult mouse cerebral cortex 1.340 + as a testbed. The Allen Brain Atlas has collected a dataset containing the 1.341 + expression level of about 4000 genes* over a set of over 150000 voxels, with a 1.342 + spatial resolution of approximately 200 microns[?]. 1.343 + We expect to discover sets of marker genes that pick out specific cortical 1.344 + areas. This will allow the development of drugs and other interventions that 1.345 + selectively target individual cortical areas. Therefore our research will lead 1.346 + 9 1.347 + 1.348 + to application in drug discovery, in the development of other targeted clinical 1.349 + interventions, and in the development of new experimental techniques. 1.350 + The best way to divide up rodent cortex into areas has not been completely 1.351 + determined, as can be seen by the differences in the recent maps given by Swan- 1.352 + son on the one hand, and Paxinos and Franklin on the other. It is likely that our 1.353 + study, by showing which areal divisions naturally follow from gene expression 1.354 + data, as opposed to traditional histological data, will contribute to the creation 1.355 + of a better map. While we do not here propose to analyze human gene expres- 1.356 + sion data, it is conceivable that the methods we propose to develop could be 1.357 + used to suggest modifications to the human cortical map as well. 1.358 + In the following, we will only be talking about coronal data. 1.359 + The Allen Brain Atlas provides “Smoothed Energy Volumes”, which are 1.360 + One type of artifact in the Allen Brain Atlas data is what we call a “slice 1.361 + artifact”. We have noticed two types of slice artifacts in the dataset. The first 1.362 + type, a “missing slice artifact”, occurs when the ISH procedure on a slice did 1.363 + not come out well. In this case, the Allen Brain investigators excluded the slice 1.364 + at issue from the dataset. This means that no gene expression information is 1.365 + available for that gene for the region of space covered by that slice. This results 1.366 + in an expression level of zero being assigned to voxels covered by the slice. This 1.367 + is partially but not completely ameliorated by the smoothing that is applied to 1.368 + create the Smoothed Energy Volumes. The usual end result is that a region of 1.369 + space which is shaped and oriented like a coronal slice is marked as having less 1.370 + gene expression than surrounding regions. 1.371 + The second type of slice artifact is caused by the fact that all of the slices 1.372 + have a consistent orientation. Since there may be artifacts (such as how well 1.373 + the ISH worked) which are constant within each slice but which vary between 1.374 + different slices, the result is that ceteris paribus, when one compares the genetic 1.375 + data of a voxel to another voxel within the same coronal plane, one would expect 1.376 + to find more similarity than if one compared a voxel to another voxel displaced 1.377 + along the rostrocaudal axis. 1.378 + We are enthusiastic about the sharing of methods, data, and results, and 1.379 + at the conclusion of the project, we will make all of our data and computer 1.380 + source code publically available. Our goal is that replicating our results, or 1.381 + applying the methods we develop to other targets, will be quick and easy for 1.382 + other investigators. In order to aid in understanding and replicating our results, 1.383 + we intend to include a software program which, when run, will take as input 1.384 + the Allen Brain Atlas raw data, and produce as output all numbers and charts 1.385 + found in publications resulting from the project. 1.386 + To aid in the replication of our results, we will include a script which takes 1.387 + as input the dataset in aim (3) and provides as output all of the tables in figures 1.388 + in our publications . 1.389 + We also expect to weigh in on the debate about how to best partition rodent 1.390 + cortex 1.391 + be useful for drug discovery as well 1.392 + * Another 16000 genes are available, but they do not cover the entire cerebral 1.393 + cortex with high spatial resolution. 1.394 + 10 1.395 + 1.396 + User-definable ROIs Combinatorial gene expression Negative as well as pos- 1.397 + itive signal Use geometry Search for local boundaries if necessary Flatmapped 1.398 + Specific aims 1.399 + Develop algorithms that find genetic markers for anatomical regions 1.400 + 1. Develop scoring measures for evaluating how good individual genes are at 1.401 + marking areas: we will compare pointwise, geometric, and information- 1.402 + theoretic measures. 1.403 + 2. Develop a procedure to find single marker genes for anatomical regions: for 1.404 + each cortical area, by using or combining the scoring measures developed, 1.405 + we will rank the genes by their ability to delineate each area. 1.406 + 3. Extend the procedure to handle difficult areas by using combinatorial cod- 1.407 + ing: for areas that cannot be identified by any single gene, identify them 1.408 + with a handful of genes. We will consider both (a) algorithms that incre- 1.409 + mentally/greedily combine single gene markers into sets, such as forward 1.410 + stepwise regression and decision trees, and also (b) supervised learning 1.411 + techniques which use soft constraints to minimize the number of features, 1.412 + such as sparse support vector machines. 1.413 + 4. Extend the procedure to handle difficult areas by combining or redrawing 1.414 + the boundaries: An area may be difficult to identify because the bound- 1.415 + aries are misdrawn, or because it does not “really” exist as a single area, 1.416 + at least on the genetic level. We will develop extensions to our procedure 1.417 + which (a) detect when a difficult area could be fit if its boundary were 1.418 + redrawn slightly, and (b) detect when a difficult area could be combined 1.419 + with adjacent areas to create a larger area which can be fit. 1.420 + Apply these algorithms to the cortex 1.421 + 1. Create open source format conversion tools: we will create tools to bulk 1.422 + download the ABA dataset and to convert between SEV, NIFTI and MAT- 1.423 + LAB formats. 1.424 + 2. Flatmap the ABA cortex data: map the ABA data onto a plane and draw 1.425 + the cortical area boundaries onto it. 1.426 + 3. Find layer boundaries: cluster similar voxels together in order to auto- 1.427 + matically find the cortical layer boundaries. 1.428 + 4. Run the procedures that we developed on the cortex: we will present, for 1.429 + each area, a short list of markers to identify that area; and we will also 1.430 + present lists of “panels” of genes that can be used to delineate many areas 1.431 + at once. 1.432 + 11 1.433 + 1.434 + Develop algorithms to suggest a division of a structure into anatom- 1.435 + ical parts 1.436 + 1. Explore dimensionality reduction algorithms applied to pixels: including 1.437 + TODO 1.438 + 2. Explore dimensionality reduction algorithms applied to genes: including 1.439 + TODO 1.440 + 3. Explore clustering algorithms applied to pixels: including TODO 1.441 + 4. Explore clustering algorithms applied to genes: including gene shaving, 1.442 + TODO 1.443 + 5. Develop an algorithm to use dimensionality reduction and/or hierarchial 1.444 + clustering to create anatomical maps 1.445 + 6. Run this algorithm on the cortex: present a hierarchial, genoarchitectonic 1.446 + map of the cortex 1.447 + gradient similarity is calculated as: ∑ 1.448 + pixels cos(abs(∠∇1 - ∠∇2)) ⋅|∇1|+|∇2| 1.449 + 2 ⋅ 1.450 + pixel_value1+pixel_value2 1.451 + 2 1.452 + (todo) Technically, we say that an anatomical structure has a fundamen- 1.453 + tally 2-D organization when there exists a commonly used, generic, anatomical 1.454 + structure-preserving map from 3-D space to a 2-D manifold. 1.455 + Related work: 1.456 + The Allen Brain Institute has developed an interactive web interface called 1.457 + AGEA which allows an investigator to (1) calculate lists of genes which are se- 1.458 + lectively overexpressed in certain anatomical regions (ABA calls this the “Gene 1.459 + Finder” function) (2) to visualize the correlation between the genetic profiles of 1.460 + voxels in the dataset, and (3) to visualize a hierarchial clustering of voxels in 1.461 + the dataset [?]. AGEA is an impressive and useful tool, however, it does not 1.462 + solve the same problems that we propose to solve with this project. 1.463 + First we describe AGEA’s “Gene Finder”, and then compare it to our pro- 1.464 + posed method for finding marker genes. AGEA’s Gene Finder first asks the 1.465 + investigator to select a single “seed voxel” of interest. It then uses a clustering 1.466 + method, combined with built-in knowledge of major anatomical structures, to 1.467 + select two sets of voxels; an “ROI” and a “comparator region”*. The seed voxel 1.468 + is always contained within the ROI, and the ROI is always contained within the 1.469 + comparator region. The comparator region is similar but not identical to the 1.470 + set of voxels making up the major anatomical region containing the ROI. Gene 1.471 + Finder then looks for genes which can distinguish the ROI from the comparator 1.472 + region. Specifically, it finds genes for which the ratio (expression energy in the 1.473 + ROI) / (expression energy in the comparator region) is high. 1.474 + Informally, the Gene Finder first infers an ROI based on clustering the seed 1.475 + voxel with other voxels. Then, the Gene Finder finds genes which overexpress 1.476 + in the ROI as compared to other voxels in the major anatomical region. 1.477 + There are three major differences between our approach and Gene Finder. 1.478 + 12 1.479 + 1.480 + First, Gene Finder focuses on individual genes and individual ROIs in isola- 1.481 + tion. This is great for regions which can be picked out from all other regions by a 1.482 + single gene, but not all of them can (todo). There are at least two ways this can 1.483 + miss out on useful genes. First, a gene might express in part of a region, but not 1.484 + throughout the whole region, but there may be another gene which expresses 1.485 + in the rest of the region*. Second, a gene might express in a region, but not in 1.486 + any of its neighbors, but it might express also in other non-neighboring regions. 1.487 + To take advantage of these types of genes, we propose to find combinations of 1.488 + genes which, together, can identify the boundaries of all subregions within the 1.489 + containing region. 1.490 + Second, Gene Finder uses a pointwise metric, namely expression energy ratio, 1.491 + to decide whether a gene is good for picking out a region. We have found better 1.492 + results by using metrics which take into account not just single voxels, but also 1.493 + the local geometry of neighboring voxels, such as the local gradient (todo). In 1.494 + addition, we have found that often the absence of gene expression can be used 1.495 + as a marker, which will not be caught by Gene Finder’s expression energy ratio 1.496 + (todo). 1.497 + Third, Gene Finder chooses the ROI based only on the seed voxel. This 1.498 + often does not permit the user to query the ROI that they are interested in. For 1.499 + example, in all of our tests of Gene Finder in cortex, the ROIs chosen tend to 1.500 + be cortical layers, rather than cortical areas. 1.501 + In summary, when Gene Finder picks the ROI that you want, and when this 1.502 + ROI can be easily picked out from neighboring regions by single genes which 1.503 + selectively overexpress in the ROI compared to the entire major anatomical re- 1.504 + gion, Gene Finder will work. However, Gene Finder will not pick cortical areas 1.505 + as ROIs, and even if it could, many cortical areas cannot be uniquely picked out 1.506 + by the overexpression of any single gene. By contrast, we will target cortical 1.507 + areas, we will explore a variety of metrics which can complement the shortcom- 1.508 + ings of expression energy ratio, and we will use the combinatorial expression of 1.509 + genes to pick out cortical areas even when no individual gene will do. 1.510 + * The terms “ROI” and “comparator region” are our own; the ABI calls 1.511 + them the “local region” and the “larger anatomical context”. The ABI uses the 1.512 + term “specificity comparator” to mean the major anatomic region containing 1.513 + the ROI, which is not exactly identical to the comparator region. 1.514 + ** In this case, the union of the area of expression of the two genes would 1.515 + suffice; one could also imagine that there could be situations in which the in- 1.516 + tersection of multiple genes would be needed, or a combination of unions and 1.517 + intersections. 1.518 + Now we describe AGEA’s hierarchial clustering, and compare it to our pro- 1.519 + posal. The goal of AGEA’s hierarchial clustering is to generate a binary tree of 1.520 + clusters, where a cluster is a collection of voxels. AGEA begins by computing 1.521 + the Pearson correlation between each pair of voxels. They then employ a recur- 1.522 + sive divisive (top-down) hierarchial clustering procedure on the voxels, which 1.523 + means that they start with all of the voxels, and then they divide them into clus- 1.524 + ters, and then within each cluster, they divide that cluster into smaller clusters, 1.525 + etc***. At each step, the collection of voxels is partitioned into two smaller 1.526 + 13 1.527 + 1.528 + clusters in a way that maximizes the following quantity: average correlation 1.529 + between all possible pairs of voxels containing one voxel from each cluster. 1.530 + There are three major differences between our approach and AGEA’s hier- 1.531 + archial clustering. First, AGEA’s clustering method separates cortical layers 1.532 + before it separates cortical areas. 1.533 + following procedure is used for the purpose of dividing a collection of voxels 1.534 + into smaller clusters: partition the voxels into two sets, such that the following 1.535 + quantity is maximized: 1.536 + *** depending on which level of the tree is being created, the voxels are 1.537 + subsampled in order to save time 1.538 + does not allow the user to input anything other than a seed voxel; this means 1.539 + that for each seed voxel, there is only one 1.540 + The role of the “local region” is to serve as a region of interest for which 1.541 + marker genes are desired; the role of the “larger anatomical context” is to be 1.542 + the structure 1.543 + There are two kinds of differences between AGEA and our project; differ- 1.544 + ences that relate to the treatment of the cortex, and differences in the type of 1.545 + generalizable methods being developed. As relates 1.546 + indicate an ROI 1.547 + explore simple correlation-based relationships between voxels, genes, and 1.548 + clusters of voxels. 1.549 + There have not yet been any studies which describe the results of applying 1.550 + AGEA to the cerebral cortex; however, we suspect that the AGEA metrics are 1.551 + not optimal for the task of relating genes to cortical areas. A voxel’s gene 1.552 + expression profile depends upon both its cortical area and its cortical layer, 1.553 + however, AGEA has no mechanism to distinguish these two. As a result, voxels 1.554 + in the same layer but different areas are often clustered together by AGEA. As 1.555 + part of the project, we will compare the performance of our techniques against 1.556 + AGEA’s. 1.557 + — 1.558 + The Allen Brain Institute has developed interactive tools called AGEA which 1.559 + allow an investigator to explore simple correlation-based relationships between 1.560 + voxels, genes, and clusters of voxels. There have not yet been any studies 1.561 + which describe the results of applying AGEA to the cerebral cortex; however, 1.562 + we suspect that the AGEA metrics are not optimal for the task of relating 1.563 + genes to cortical areas. A voxel’s gene expression profile depends upon both 1.564 + its cortical area and its cortical layer, however, AGEA has no mechanism to 1.565 + distinguish these two. As a result, voxels in the same layer but different areas 1.566 + are often clustered together by AGEA. As part of the project, we will compare 1.567 + the performance of our techniques against AGEA’s. 1.568 + Another difference between our techniques and AGEA’s is that AGEA allows 1.569 + the user to enter only a voxel location, and then to either explore the rest of 1.570 + the brain’s relationship to that particular voxel, or explore a partitioning of 1.571 + the brain based on pairwise voxel correlation. If the user is interested not in a 1.572 + single voxel, but rather an entire anatomical structure, AGEA will only succeed 1.573 + to the extent that the selected voxel is a typical representative of the structure. 1.574 + 14 1.575 + 1.576 + As discussed in the previous paragraph, this poses problems for structures like 1.577 + cortical areas, which (because of their division into cortical layers) do not have 1.578 + a single “typical representative”. 1.579 + By contrast, in our system, the user will start by selecting, not a single voxel, 1.580 + but rather, an anatomical superstructure to be divided into pieces (for example, 1.581 + the cerebral cortex). We expect that our methods will take into account not 1.582 + just pairwise statistics between voxels, but also large-scale geometric features 1.583 + (for example, the rapidity of change in gene expression as regional boundaries 1.584 + are crossed) which optimize the discriminability of regions within the selected 1.585 + superstructure. 1.586 + —– 1.587 + screen for combinations of marker genes which selectively target anatom- 1.588 + ical structures pick delineate the boundaries between neighboring anatomical 1.589 + structures. (b) techniques to screen for marker genes which pick out anatomical 1.590 + structures of interest 1.591 + , techniques which: (a) screen for marker genes , and (b) suggest new 1.592 + anatomical maps based on 1.593 + whose expression partitions the region of interest into its anatomical sub- 1.594 + structures, and (b) use the natural contours of gene expression to suggest new 1.595 + ways of dividing an organ into 1.596 + The Allen Brain Atlas 1.597 + – 1.598 + to: brooksl@mail.nih.gov 1.599 + Hi, I’m writing to confirm the applicability of a potential research project to 1.600 + the challenge grant topic ”New computational and statistical methods for the 1.601 + analysis of large data sets from next-generation sequencing technologies”. 1.602 + We want to develop methods for the analysis of gene expression datasets that 1.603 + can be used to uncover the relationships between gene expression and anatomical 1.604 + regions. Specifically, we want to develop techniques to (a) given a set of known 1.605 + anatomical areas, identify genetic markers for each of these areas, and (b) given 1.606 + an anatomical structure whose substructure is unknown, suggest a map, that 1.607 + is, a division of the space into anatomical sub-structures, that represents the 1.608 + boundaries inherent in the gene expression data. 1.609 + We propose to develop our techniques on the Allen Brain Atlas mouse brain 1.610 + gene expression dataset by finding genetic markers for anatomical areas within 1.611 + the cerebral cortex. The Allen Brain Atlas contains a registered 3-D map of 1.612 + gene expression data with 200-micron voxel resolution which was created from 1.613 + in situ hybridization data. The dataset contains about 4000 genes which are 1.614 + available at this resolution across the entire cerebral cortex. 1.615 + Despite the distinct roles of different cortical areas in both normal function- 1.616 + ing and disease processes, there are no known marker genes for many cortical 1.617 + areas. This project will be immediately useful for both drug discovery and clini- 1.618 + cal research because once the markers are known, interventions can be designed 1.619 + which selectively target specific cortical areas. 1.620 + This techniques we develop will be useful because they will be applicable to 1.621 + the analysis of other anatomical areas, both in terms of finding marker genes 1.622 + 15 1.623 + 1.624 + for known areas, and in terms of suggesting new anatomical subdivisions that 1.625 + are based upon the gene expression data. 1.626 + —- 1.627 + It is likely that our study, by showing which areal divisions naturally fol- 1.628 + low from gene expression data, as opposed to traditional histological data, will 1.629 + contribute to the creation of 1.630 + there are clear genetic or chemical markers known for only a few cortical 1.631 + areas. This makes it difficult to target drugs to specific 1.632 + As part of aims (1) and (5), we will discover sets of marker genes that pick 1.633 + out specific cortical areas. This will allow the development of drugs and other 1.634 + interventions that selectively target individual cortical areas. As part of aims 1.635 + (2) and (5), we will also discover small panels of marker genes that can be used 1.636 + to delineate most of the cortical areal map. 1.637 + With aims (2) and (4), we 1.638 + There are five principals 1.639 + In addition to validating the usefulness of the algorithms, the application of 1.640 + these methods to cerebral cortex will produce immediate benefits that are only 1.641 + one step removed from clinical application. 1.642 + todo: remember to check gensat, etc for validation (mention bias/variance) 1.643 + Why it is useful to apply these methods to cortex 1.644 + There is still room for debate as to exactly how the cortex should be parcellated 1.645 + into areas. 1.646 + The best way to divide up rodent cortex into areas has not been completely 1.647 + determined, 1.648 + not yet been accounted for in 1.649 + that the expression of some genes will contain novel spatial patterns which 1.650 + are not account 1.651 + that a genoarchitectonic map 1.652 + This principle is only applicable to aim 1 (marker genes). For aim 2 (partition 1.653 + a structure in into anatomical subregions), we plan to work with many genes at 1.654 + once. 1.655 + tood: aim 2 b+s? 1.656 + Principle 5: Interoperate with existing tools 1.657 + In order for our software to be as useful as possible for our users, it will be 1.658 + able to import and export data to standard formats so that users can use our 1.659 + software in tandem with other software tools created by other teams. We will 1.660 + support the following formats: NIFTI (Neuroimaging Informatics Technology 1.661 + Initiative), SEV (Allen Brain Institute Smoothed Energy Volume), and MAT- 1.662 + LAB. This ensures that our users will not have to exclusively rely on our tools 1.663 + when analyzing data. For example, users will be able to use the data visualiza- 1.664 + tion and analysis capabilities of MATLAB and Caret alongside our software. 1.665 + 16 1.666 + 1.667 + To our knowledge, there is no currently available software to convert between 1.668 + these formats, so we will also provide a format conversion tool. This may be 1.669 + useful even for groups that don’t use any of our other software. 1.670 + todo: is “marker gene” even a phrase that we should use at all? 1.671 + note for aim 1 apps: combo of genes is for voxel, not within any single cell 1.672 + , as when genetic markers allow the development of selective interventions; 1.673 + the reason that one can be confident that the intervention is selective is that it 1.674 + is only turned on when a certain combination of genes is turned on and off. The 1.675 + result procedure is what assures us that when that combination is present, the 1.676 + local tissue is probably part of a certain subregion. 1.677 + The basic idea is that we want to find a procedure by 1.678 + The task of finding genes that mark anatomical areas can be phrased in 1.679 + terms of what the field of machine learning calls a “supervised learning” task. 1.680 + The goal of this task is to learn a function (the “classifier”) which 1.681 + If a person knows a combination of genes that mark an area, that implies 1.682 + that the person can be told how strong those genes express in any voxel, and 1.683 + the person can use this information to determine how 1.684 + finding how to infer the areal identity of a voxel if given the gene expression 1.685 + profile of that voxel. 1.686 + For each voxel in the cortex, we want to start with data about the gene 1.687 + expression 1.688 + There are various ways to look for marker genes. We will define some terms, 1.689 + and along the way we will describe a few design choices encountered in the 1.690 + process of creating a marker gene finding method, and then we will present four 1.691 + principles that describe which options we have chosen. 1.692 + In developing a procedure for finding marker genes, we are developing a 1.693 + procedure that takes a dataset of experimental observations and produces a 1.694 + result. One can think of the result as merely a list of genes, but really the result 1.695 + is an understanding of a predictive relationship between, on the one hand, the 1.696 + expression levels of genes, and, on the other hand, anatomical subregions. 1.697 + One way to more formally define this understanding is to look at it as a 1.698 + procedure. In this view, the result of the learning procedure is itself a procedure. 1.699 + The result procedure provides a way to use the gene expression profiles of voxels 1.700 + in a tissue sample in order to determine where the subregions are. 1.701 + This result procedure can be used directly, as when an experimenter has 1.702 + a tissue sample and needs to know what subregions are present in it, and, 1.703 + if multiple subregions are present, where they each are. Or it can be used 1.704 + indirectly; imagine that the result procedure tells us that whenever a certain 1.705 + combination of genes are expressed, the local tissue is probably part of a certain 1.706 + subregion. This means that we can then confidentally develop an intervention 1.707 + which is triggered only when that combination of genes are expressed; and to 1.708 + the extent that the result procedure is reliable, we know that the intervention 1.709 + will only be triggered in the target subregion. 1.710 + We said that the result procedure provides “a way to use the gene expression 1.711 + profiles of voxels in a tissue sample” in order to “determine where the subregions 1.712 + are”. 1.713 + 17 1.714 + 1.715 + Does the result procedure get as input all of the gene expression profiles 1.716 + of each voxel in the entire tissue sample, and produce as output all of the 1.717 + subregional boundaries all at once? 1.718 + it is helpful for the classifier to look at the global “shape” of gene expression 1.719 + patterns over the whole structure, rather than just nearby voxels. 1.720 + there is some small bit of additional information that can be gleaned from 1.721 + knowing the 1.722 + Design choices for a supervised learning procedure 1.723 + After all, 1.724 + there is a small correlation between the gene expression levels from distant 1.725 + voxels and 1.726 + Depending on how we intend to use the classifier, we may want to design it 1.727 + so that 1.728 + It is possible for many things to 1.729 + The choice of which data is made part of an instance 1.730 + what we seek is a procedure 1.731 + partition the tissue sample into subregions. 1.732 + each part of the anatomical structure 1.733 + must be One way to rephrase this task is to say that, instead of searching 1.734 + for the location of the subregions, we are looking to partition the tissue sample 1.735 + into subregions. 1.736 + There are various ways to look for marker genes. We will define some terms, 1.737 + and along the way we will describe a few design choices encountered in the 1.738 + process of creating a marker gene finding method, and then we will present four 1.739 + principles that describe which options we have chosen. 1.740 + In developing a procedure for finding marker genes, we are developing a 1.741 + procedure that takes a dataset of experimental observations and produces a 1.742 + result. One can think of the result as merely a list of genes, but really the result 1.743 + is an understanding of a predictive relationship between, on the one hand, the 1.744 + expression levels of genes, and, on the other hand, anatomical subregions. 1.745 + One way to more formally define this understanding is to look at it as a 1.746 + procedure. In this view, the result of the learning procedure is itself a procedure. 1.747 + The result procedure provides a way to use the gene expression profiles of voxels 1.748 + in a tissue sample in order to determine where the subregions are. 1.749 + This result procedure can be used directly, as when an experimenter has 1.750 + a tissue sample and needs to know what subregions are present in it, and, 1.751 + if multiple subregions are present, where they each are. Or it can be used 1.752 + indirectly; imagine that the result procedure tells us that whenever a certain 1.753 + combination of genes are expressed, the local tissue is probably part of a certain 1.754 + subregion. This means that we can then confidentally develop an intervention 1.755 + which is triggered only when that combination of genes are expressed; and to 1.756 + the extent that the result procedure is reliable, we know that the intervention 1.757 + will only be triggered in the target subregion. 1.758 + 18 1.759 + 1.760 + We said that the result procedure provides “a way to use the gene expression 1.761 + profiles of voxels in a tissue sample” in order to “determine where the subregions 1.762 + are”. 1.763 + Does the result procedure get as input all of the gene expression profiles 1.764 + of each voxel in the entire tissue sample, and produce as output all of the 1.765 + subregional boundaries all at once? 1.766 + Or are we given one voxel at a time, 1.767 + In the jargon of the field of machine learning, the result procedure is called 1.768 + a classifier. 1.769 + The task of finding genes that mark anatomical areas can be phrased in 1.770 + terms of what the field of machine learning calls a “supervised learning” task. 1.771 + The goal of this task is to learn a function (the “classifier”) which 1.772 + If a person knows a combination of genes that mark an area, that implies 1.773 + that the person can be told how strong those genes express in any voxel, and 1.774 + the person can use this information to determine how 1.775 + finding how to infer the areal identity of a voxel if given the gene expression 1.776 + profile of that voxel. 1.777 + For each voxel in the cortex, we want to start with data about the gene 1.778 + expression 1.779 + single voxels, but rather groups of voxels, such that the groups can be placed 1.780 + in some 2-D space. We will call such instances “pixels”. 1.781 + We have been speaking as if instances necessarily correspond to single voxels. 1.782 + But it is possible for instances to be groupings of many voxels, in which case 1.783 + each grouping must be assigned the same label (that is, each voxel grouping 1.784 + must stay inside a single anatomical subregion). 1.785 + In some but not all cases, the groups are either rows or columns of voxels. 1.786 + This is the case with the cerebral cortex, in which one may assume that columns 1.787 + of voxels which run perpendicular to the cortical surface all share the same areal 1.788 + identity. In the cortex, we call such an instance a “surface pixel”, because such 1.789 + an instance represents the data associated with all voxels underneath a specific 1.790 + patch of the cortical surface. 1.791 + 19 1.792 + 1.793 +