cg
view grant.html @ 1:7487ad7f5d8f
author | bshanks@bshanks-salk.dyndns.org |
---|---|
date | Sat Apr 11 19:35:08 2009 -0700 (16 years ago) |
parents | 29eee29f9bc1 |
children | 3c874c1cd837 |
line source
1 Specific aims
2 Massive new datasets obtained with techniques such as in situ hybridization
3 (ISH) and BAC-transgenics allow the expression levels of many genes at many
4 locations to be compared. Our goal is to develop automated methods to relate
5 spatial variation in gene expression to anatomy. We want to find marker genes
6 for specific anatomical regions, and also to draw new anatomical maps based on
7 gene expression patterns. We have three specific aims:
8 (1) develop an algorithm to screen spatial gene expression data for combina-
9 tions of marker genes which selectively target anatomical regions
10 (2) develop an algorithm to suggest new ways of carving up a structure into
11 anatomical subregions, based on spatial patterns in gene expression
12 (3) create a 2-D “flat map” dataset of the mouse cerebral cortex that contains
13 a flattened version of the Allen Mouse Brain Atlas ISH data, as well as
14 the boundaries of cortical anatomical areas. Use this dataset to validate
15 the methods developed in (1) and (2).
16 In addition to validating the usefulness of the algorithms, the application of
17 these methods to cerebral cortex will produce immediate benefits, because there
18 are currently no known genetic markers for many cortical areas. The results
19 of the project will support the development of new ways to selectively target
20 cortical areas, and it will support the development of a method for identifying
21 the cortical areal boundaries present in small tissue samples.
22 All algorithms that we develop will be implemented in an open-source soft-
23 ware toolkit. The toolkit, as well as the machine-readable datasets developed
24 in aim (3), will be published and freely available for others to use.
25 Background and significance
26 Aim 1
27 Machine learning terminology
28 The task of looking for marker genes for anatomical subregions means that one
29 is looking for a set of genes such that, if the expression level of those genes is
30 known, then the locations of the subregions can be inferred.
31 If we define the subregions so that they cover the entire anatomical structure
32 to be divided, then instead of saying that we are using gene expression to find
33 the locations of the subregions, we may say that we are using gene expression to
34 determine to which subregion each voxel within the structure belongs. We call
35 this a classification task, because each voxel is being assigned to a class (namely,
36 its subregion).
37 Therefore, an understanding of the relationship between the combination of
38 gene expression levels and the locations of the subregions may be expressed as
41 a function. The input to this function is a voxel, along with the gene expression
42 levels within that voxel; the output is the subregional identity of the target
43 voxel, that is, the subregion to which the target voxel belongs. We call this
44 function a classifier. In general, the input to a classifier is called an instance,
45 and the output is called a label.
46 The object of aim 1 is not to produce a single classifier, but rather to develop
47 an automated method for determining a classifier for any known anatomical
48 structure. Therefore, we seek a procedure by which a gene expression dataset
49 may be analyzed in concert with an anatomical atlas in order to produce a
50 classifier. Such a procedure is a type of machine learning procedure. The
51 construction of the classifier is called training (also learning), and the initial
52 gene expression dataset used in the construction of the classifier is called training
53 data.
54 In the machine learning literature, this sort of procedure may be thought
55 of as a supervised learning task, defined as a task in which the goal is to learn
56 a mapping from instances to labels, and the training data consists of a set of
57 instances (voxels) for which the labels (subregions) are known.
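To make the terms instance, label, training and classifier concrete, here is a minimal sketch in Python. The nearest-centroid rule, the function names and the toy expression values are illustrative assumptions chosen for brevity; they are not the learning method proposed here.

```python
from collections import defaultdict

def train_nearest_centroid(instances, labels):
    """Training: average the expression vectors of each labeled subregion."""
    sums = {}
    counts = defaultdict(int)
    for x, y in zip(instances, labels):
        if y not in sums:
            sums[y] = [0.0] * len(x)
        sums[y] = [s + v for s, v in zip(sums[y], x)]
        counts[y] += 1
    return {y: [s / counts[y] for s in sums[y]] for y in sums}

def classify(centroids, x):
    """Classifier: map one instance (an expression vector) to a label."""
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda y: sq_dist(centroids[y]))

# Hypothetical training data: each instance is one voxel's expression of two genes.
train_x = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.7]]
train_y = ["area_A", "area_A", "area_B", "area_B"]
centroids = train_nearest_centroid(train_x, train_y)
```

Calling classify(centroids, voxel) on a new voxel then returns the subregion whose average expression profile is closest to that voxel's profile.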
58 Each gene expression level is called a feature, and the selection of which
59 genes to include is called feature selection. Feature selection is one component
60 of the task of learning a classifier. Some methods for learning classifiers start
61 out with a separate feature selection phase, whereas other methods combine
62 feature selection with other aspects of training.
63 One class of feature selection methods assigns some sort of score to each
64 candidate gene. The top-ranked genes are then chosen. Some scoring measures
65 can assign a score to a set of selected genes, not just to a single gene; in this
66 case, a dynamic procedure may be used in which features are added to or removed
67 from the selected set depending on how much they raise the score. Such
68 procedures are called “stepwise” or “greedy”.
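A greedy procedure of this kind might be sketched as follows. This version only adds features (forward selection) and never removes them; the set-level score used here (accuracy of thresholding the summed expression) and the toy genes are assumptions made purely for illustration.

```python
def greedy_forward_selection(candidates, score, max_features):
    """Add one feature at a time, keeping the one that raises the score most."""
    selected = []
    best = score(selected)
    while len(selected) < max_features:
        gains = [(score(selected + [g]), g) for g in candidates if g not in selected]
        if not gains:
            break
        top_score, top_gene = max(gains)  # ties are broken by gene name
        if top_score <= best:
            break  # no remaining gene improves the score
        selected.append(top_gene)
        best = top_score
    return selected, best

def make_accuracy_score(expression, target):
    """Hypothetical set score: accuracy of 'summed expression > 0' as a region test."""
    def score(genes):
        n = len(target)
        summed = [sum(expression[g][i] for g in genes) for i in range(n)]
        return sum((s > 0) == bool(t) for s, t in zip(summed, target)) / n
    return score

# Toy data: six voxels, the first four belong to the target region.
target = [1, 1, 1, 1, 0, 0]
expression = {
    "gene_a": [1, 1, 0, 0, 0, 0],
    "gene_b": [0, 0, 1, 1, 0, 0],
    "gene_c": [1, 0, 1, 0, 1, 1],
}
score = make_accuracy_score(expression, target)
selected, best = greedy_forward_selection(list(expression), score, max_features=2)
```

In this toy case neither gene_a nor gene_b covers the region alone, but the greedy search combines them and ignores gene_c.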
69 Although the classifier itself may only look at the gene expression data within
70 each voxel before classifying that voxel, the learning algorithm which constructs
71 the classifier may look over the entire dataset. We can categorize score-based
72 feature selection methods depending on how the score is calculated. Often
73 the score calculation consists of assigning a sub-score to each voxel, and then
74 aggregating these sub-scores into a final score (the aggregation is often a sum or
75 a sum of squares). If only information from nearby voxels is used to calculate a
76 voxel’s sub-score, then we say it is a local scoring method. If only information
77 from the voxel itself is used to calculate a voxel’s sub-score, then we say it is a
78 pointwise scoring method.
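The distinction between pointwise and local scoring can be illustrated with a short sketch. Both sub-scores below are assumptions chosen for simplicity: the pointwise sub-score is the negative squared difference between expression and region membership at a pixel, and the local sub-score is the dot product of forward-difference gradients, which also looks at each pixel's right and lower neighbors.

```python
def pointwise_score(expr, mask):
    """Sum of per-pixel sub-scores that use only the pixel itself."""
    return sum(
        -(e - m) ** 2
        for row_e, row_m in zip(expr, mask)
        for e, m in zip(row_e, row_m)
    )

def local_score(expr, mask):
    """Sum of per-pixel sub-scores that also use the right and lower neighbors."""
    total = 0.0
    rows, cols = len(expr), len(expr[0])
    for i in range(rows - 1):
        for j in range(cols - 1):
            # Forward differences approximate the local gradient.
            de = (expr[i][j + 1] - expr[i][j], expr[i + 1][j] - expr[i][j])
            dm = (mask[i][j + 1] - mask[i][j], mask[i + 1][j] - mask[i][j])
            total += de[0] * dm[0] + de[1] * dm[1]
    return total

# Toy 3x4 map: the region occupies the two right-hand columns.
mask = [[0, 0, 1, 1], [0, 0, 1, 1], [0, 0, 1, 1]]
# A hypothetical gene whose expression border is shifted one pixel to the left.
shifted = [[0, 1, 1, 1], [0, 1, 1, 1], [0, 1, 1, 1]]
```

Here the shifted gene loses only three points under the pointwise score, but its local score drops to zero because its border no longer coincides with the region's border anywhere.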
79 Key questions when choosing a learning method are: What are the instances?
80 What are the features? How are the features chosen? Here are four principles
81 that outline our answers to these questions.
82 Principle 1: Combinatorial gene expression
83 Above, we defined an “instance” as the combination of a voxel with the “asso-
84 ciated gene expression data”. In our case this refers to the expression level of
87 genes within the voxel, but should we include the expression levels of all genes,
88 or only a few of them?
89 It is too much to hope that every anatomical region of interest will be iden-
90 tified by a single gene. For example, in the cortex, there are some areas which
91 are not clearly delineated by any gene included in the Allen Brain Atlas (ABA)
92 dataset. However, at least some of these areas can be delineated by looking
93 at combinations of genes (an example of an area for which multiple genes are
94 necessary and sufficient is provided in Preliminary Results).
95 Principle 2: Only look at combinations of small numbers of genes
96 When the classifier classifies a voxel, it is only allowed to look at the expression of
97 the genes which have been selected as features. The more data that is available
98 to a classifier, the better it can do. For example, perhaps there are weak
99 correlations over many genes that add up to a strong signal. So, why not include
100 every gene as a feature? The reason is that we wish to employ the classifier in
101 situations in which it is not feasible to gather data about every gene. For
102 example, if we want to use the expression of marker genes as a trigger for some
103 regionally-targeted intervention, then our intervention must contain a molecular
104 mechanism to check the expression level of each marker gene before it triggers.
105 It is currently infeasible to design a molecular trigger that checks the level of
106 more than a handful of genes. Similarly, if the goal is to develop a procedure to
107 do ISH on tissue samples in order to label their anatomy, then it is infeasible
108 to label more than a few genes. Therefore, we must select only a few genes as
109 features.
110 Principle 3: Use geometry in feature selection
111 When doing feature selection with score-based methods, the simplest thing to do
112 would be to score the performance of each voxel by itself and then combine these
113 scores (pointwise scoring). A more powerful approach is to also use information
114 about the geometric relations between each voxel and its neighbors; this requires
115 non-pointwise, local scoring methods. See Preliminary Results for evidence of
116 the complementary nature of pointwise and local scoring methods.
117 Principle 4: Work in 2-D whenever possible
118 There are many anatomical structures which are commonly characterized in
119 terms of a two-dimensional manifold. When it is known that the structure that
120 one is looking for is two-dimensional, the results may be improved by allowing
121 the analysis algorithm to take advantage of this prior knowledge. In addition,
122 it is easier for humans to visualize and work with 2-D data.
123 Therefore, when possible, the instances should represent pixels, not voxels.
124 Aim 2
125 todo
128 Aim 3
129 Background
130 The cortex is divided into areas and layers. To a first approximation, the par-
131 cellation of the cortex into areas can be drawn as a 2-D map on the surface
132 of the cortex. In the third dimension, the boundaries between the areas con-
133 tinue downwards into the cortical depth, perpendicular to the surface. The layer
134 boundaries run parallel to the surface. One can picture an area of the cortex as
135 a slice of many-layered cake.
136 Although it is known that different cortical areas have distinct roles in both
137 normal functioning and in disease processes, there are no known marker genes
138 for many cortical areas. When it is necessary to divide a tissue sample into
139 cortical areas, this is a manual process that requires a skilled human to combine
140 multiple visual cues and interpret them in the context of their approximate
141 location upon the cortical surface.
142 Even the questions of how many areas should be recognized in cortex, and
143 what their arrangement is, are still not completely settled. A proposed division
144 of the cortex into areas is called a cortical map. In the rodent, the lack of a
145 single agreed-upon map can be seen by contrasting the recent maps given by
146 Swanson?? on the one hand, and Paxinos and Franklin?? on the other. While
147 the maps are certainly very similar in their general arrangement, significant
148 differences remain in the details.
149 Significance
150 The method developed in aim (1) will be applied to each cortical area to find
151 a set of marker genes such that the combinatorial expression pattern of those
152 genes uniquely picks out the target area. Finding marker genes will be useful
153 for drug discovery as well as for experimentation because marker genes can be
154 used to design interventions which selectively target individual cortical areas.
155 The application of the marker gene finding algorithm to the cortex will
156 also support the development of new neuroanatomical methods. In addition to
157 finding markers for each individual cortical area, we will find a small panel
158 of genes that can find many of the areal boundaries at once. This panel of
159 marker genes will allow the development of an ISH protocol that will allow
160 experimenters to more easily identify which anatomical areas are present in
161 small samples of cortex.
162 The method developed in aim (3) will provide a genoarchitectonic viewpoint
163 that will contribute to the creation of a better map. The development of present-
164 day cortical maps was driven by the application of histological stains. It is
165 conceivable that if a different set of stains had been available which identified
166 a different set of features, then today’s cortical maps would have come out
167 differently. Since the number of classes of stains is small compared to the number
168 of genes, it is likely that there are many repeated, salient spatial patterns in
169 the gene expression which have not yet been captured by any stain. Therefore,
172 current ideas about cortical anatomy need to incorporate what we can learn
173 from looking at the patterns of gene expression.
174 While we do not here propose to analyze human gene expression data, it is
175 conceivable that the methods we propose to develop could be used to suggest
176 modifications to the human cortical map as well.
177 Related work
178 todo
179 Preliminary work
180 Justification of principles 1 through 3
181 Principle 1: Combinatorial gene expression
182 Here we give an example of a cortical area which is not marked by any single
183 gene, but which can be identified combinatorially. According to logistic regression,
184 gene wwc1 (footnote 1) is the best-fit single gene for predicting whether or not a pixel
185 on the cortical surface belongs to the motor area (area MO). The upper-left
186 picture in Figure 1 shows wwc1’s spatial expression pattern over the cortex. The
187 lower-right boundary of MO is represented reasonably well by this gene; however,
188 the gene overshoots the upper-left boundary. This flattened 2-D representation
189 does not show it, but the area corresponding to the overshoot is the medial
190 surface of the cortex. MO is only found on the lateral surface (todo).
191 Gene mtif2 (footnote 2) is shown in the upper-right of Figure 1. Mtif2 captures MO’s
192 upper-left boundary, but not its lower-right boundary. Mtif2 does not express
193 very much on the medial surface. By adding together the values at each pixel
194 in these two pictures, we get the lower-left of Figure 1. This combination captures
195 area MO much better than any single gene.
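The idea can be illustrated with a one-dimensional toy example in which each of two genes overshoots a different boundary of the target area. The expression values and thresholds below are made up for illustration; they are not taken from the wwc1 or mtif2 data.

```python
def accuracy(values, region, threshold):
    """Fraction of pixels where (value > threshold) agrees with region membership."""
    hits = sum((v > threshold) == bool(r) for v, r in zip(values, region))
    return hits / len(region)

region = [0, 0, 1, 1, 1, 0, 0]   # 1 = pixel inside the target area
gene_1 = [1, 1, 1, 1, 1, 0, 0]   # overshoots the left boundary
gene_2 = [0, 0, 1, 1, 1, 1, 1]   # overshoots the right boundary
combined = [a + b for a, b in zip(gene_1, gene_2)]  # pixel-wise sum

single_best = max(accuracy(gene_1, region, 0.5), accuracy(gene_2, region, 0.5))
combined_acc = accuracy(combined, region, 1.5)
```

Each gene alone misclassifies two of the seven pixels, while the thresholded sum matches the region exactly.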
196 Principle 2: Only look at combinations of small numbers of genes
197 In order to see how well one can do when looking at all genes at once, we ran
198 a support vector machine to classify cortical surface pixels based on their gene
199 expression profiles. We achieved a classification accuracy of about 81% (footnote 3). As noted
200 above, however, a classifier that looks at all the genes at once isn’t practically
201 useful.
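For readers unfamiliar with k-fold cross-validation, the accuracy estimate can be sketched as below. This does not use the Shogun package: the fold layout, the function names, and the toy one-dimensional threshold learner standing in for the SVM are all assumptions made for illustration. In practice the instances would normally be shuffled before being split into folds.

```python
def k_fold_indices(n, k):
    """Split range(n) into k contiguous, nearly equal folds."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cross_validated_accuracy(xs, ys, train, predict, k=5):
    """Train on k-1 folds, test on the held-out fold, and pool the results."""
    correct = 0
    for held_out in k_fold_indices(len(xs), k):
        held = set(held_out)
        train_idx = [i for i in range(len(xs)) if i not in held]
        model = train([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        correct += sum(predict(model, xs[i]) == ys[i] for i in held_out)
    return correct / len(xs)

def train_threshold(xs, ys):
    """Toy 1-D learner: threshold halfway between the two class means."""
    mean0 = sum(x for x, y in zip(xs, ys) if y == 0) / ys.count(0)
    mean1 = sum(x for x, y in zip(xs, ys) if y == 1) / ys.count(1)
    return (mean0 + mean1) / 2

def predict_threshold(threshold, x):
    return 1 if x > threshold else 0

# Hypothetical data: one expression value per pixel, label 1 = inside the area.
xs = [0.1, 0.9, 0.2, 0.8, 0.15, 0.85, 0.05, 0.95, 0.3, 0.7]
ys = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]
cv_accuracy = cross_validated_accuracy(xs, ys, train_threshold, predict_threshold, k=5)
```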
202 The requirement to find combinations of only a small number of genes prevents
203 us from straightforwardly applying many of the simplest techniques from
204 __________________________
205 1. “WW, C2 and coiled-coil domain containing 1”; EntrezGene ID 211652
206 2. “mitochondrial translational initiation factor 2”; EntrezGene ID 76784
207 3. Using the Shogun SVM package (todo:cite), with parameters type=GMNPSVM (multi-class
208 b-SVM), kernel = gaussian with sigma = 0.1, c = 10, epsilon = 1e-1. These are the
209 first parameters we tried, so presumably performance would improve with different choices of
210 parameters. 5-fold cross-validation was used.
215 Figure 1: Upper left: wwc1. Upper right: mtif2. Lower left: wwc1 + mtif2
216 (each pixel’s value on the lower left is the sum of the corresponding pixels in
217 the upper row). Within each picture, the vertical axis roughly corresponds to
218 anterior at the top and posterior at the bottom, and the horizontal axis roughly
219 corresponds to medial at the left and lateral at the right. The red outline is
220 the boundary of region MO. Pixels are colored approximately according to the
221 density of expressing cells underneath each pixel, with red meaning a lot of
222 expression and blue meaning little.
227 Figure 2: The top row shows the three genes which (individually) best predict
228 area AUD, according to logistic regression. The bottom row shows the three
229 genes which (individually) best match area AUD, according to gradient similar-
230 ity. From left to right and top to bottom, the genes are Ssr1, Efcbp1, Aph1a,
231 Ptk7, Aph1a again, and Lepr.
232 the field of supervised machine learning. In the parlance of machine learning,
233 our task combines feature selection with supervised learning.
234 Principle 3: Use geometry
235 To show that local geometry can provide useful information that cannot be
236 detected via pointwise analyses, consider Figure 2. The top row of Figure 2 displays
237 the 3 genes which best match area AUD, according to a pointwise method (footnote 4). The
238 bottom row displays the 3 genes which best match AUD according to a method
239 which considers local geometry (footnote 5). The pointwise method in the top row identifies
240 genes which express more strongly in AUD than outside of it; its weakness is that
241 this includes many genes which don’t have a salient border matching the areal
242 border. The geometric method identifies genes whose salient expression border
243 seems to partially line up with the border of AUD; its weakness is that this
244 includes genes which don’t express over the entire area. Genes which have high
245 rankings using both pointwise and border criteria, such as Aph1a in the example,
246 may be particularly good markers. None of these genes are, individually, a
247 perfect marker for AUD; we deliberately chose a “difficult” area in order to
248 better contrast pointwise with geometric methods.
249 __________________________
250 4. For each gene, a logistic regression was fit in which the response variable was whether or not a
251 surface pixel was within area AUD, and the predictor variable was the value of the expression
252 of the gene underneath that pixel. The resulting scores were used to rank the genes in terms
253 of how well they predict area AUD.
254 5. For each gene, the gradient similarity (see section ??) between (a) a map of the expression
255 of that gene on the cortical surface and (b) the shape of area AUD was calculated, and this
256 was used to rank the genes.
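The pointwise ranking described in footnote 4 might be sketched as follows. The footnote does not say which score was taken from each fit, so the use of mean log-likelihood here is an assumption, as are the plain gradient-ascent fitting routine, the function names and the toy data.

```python
import math

def fit_logistic(x, y, steps=2000, lr=0.5):
    """Fit p(y=1) = sigmoid(w*x + b) by gradient ascent on the mean log-likelihood."""
    w, b = 0.0, 0.0
    n = len(x)
    for _ in range(steps):
        gw = gb = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(w * xi + b)))
            gw += (yi - p) * xi
            gb += (yi - p)
        w += lr * gw / n
        b += lr * gb / n
    return w, b

def mean_log_likelihood(x, y, w, b):
    """Assumed per-gene score: higher means the gene predicts membership better."""
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for xi, yi in zip(x, y):
        p = 1.0 / (1.0 + math.exp(-(w * xi + b)))
        total += yi * math.log(p + eps) + (1 - yi) * math.log(1 - p + eps)
    return total / len(x)

def rank_genes(expression, in_area):
    """Fit one single-predictor model per gene and sort genes by score."""
    scores = {}
    for gene, x in expression.items():
        w, b = fit_logistic(x, in_area)
        scores[gene] = mean_log_likelihood(x, in_area, w, b)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical data: six surface pixels, the first three inside the area.
in_area = [1, 1, 1, 0, 0, 0]
expression = {
    "gene_good": [0.9, 0.8, 0.7, 0.2, 0.1, 0.3],
    "gene_noise": [0.5, 0.1, 0.9, 0.5, 0.9, 0.1],
}
ranking = rank_genes(expression, in_area)
```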
259 Principle 4: Work in 2-D whenever possible
260 In anatomy, the manifold of interest is usually either defined by a combination
261 of two relevant anatomical axes (todo), or by the surface of the structure (as is
262 the case with the cortex). In the former case, the manifold of interest is a plane,
263 but in the latter case it is curved. If the manifold is curved, there are various
264 methods for mapping the manifold into a plane.
265 The method that we will develop will begin by mapping the data into a
266 2-D plane. Although the manifold that characterizes cortical areas is known
267 to be the cortical surface, it remains to be seen which method of mapping the
268 manifold into a plane is optimal for this application. We will compare mappings
269 which attempt to preserve size (such as the one used by Caret??) with mappings
270 which preserve angle (conformal maps).
271 Although there is much 2-D organization in anatomy, there are also struc-
272 tures whose shape is fundamentally 3-dimensional. If possible, we would like
273 the method we develop to include a statistical test that warns the user if the
274 assumption of 2-D structure seems to be wrong.
275 ——
276 Massive new datasets obtained with techniques such as in situ hybridization
277 (ISH) and BAC-transgenics allow the expression levels of many genes at many
278 locations to be compared. This can be used to find marker genes for specific
279 anatomical structures, as well as to draw new anatomical maps. Our goal is
280 to develop automated methods to relate spatial variation in gene expression to
281 anatomy. We have five specific aims:
282 (1) develop an algorithm to screen spatial gene expression data for combi-
283 nations of marker genes which selectively target individual anatomical
284 structures
285 (2) develop an algorithm to screen spatial gene expression data for combina-
286 tions of marker genes which can be used to delineate most of the bound-
287 aries between a number of anatomical structures at once
288 (3) develop an algorithm to suggest new ways of dividing a structure up into
289 anatomical subregions, based on spatial patterns in gene expression
290 (4) create a flat (2-D) map of the mouse cerebral cortex that contains a flat-
291 tened version of the Allen Mouse Brain Atlas ISH dataset, as well as the
292 boundaries of anatomical areas within the cortex. For each cortical layer,
293 a layer-specific flat dataset will be created. A single combined flat dataset
294 will be created which averages information from all of the layers. These
295 datasets will be made available in both MATLAB and Caret formats.
296 (5) validate the methods developed in (1), (2) and (3) by applying them to
297 the cerebral cortex datasets created in (4)
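The combined flat dataset described in aim (4) could be computed along the following lines. An unweighted pixel-wise mean across layers is assumed here; the proposal does not specify whether layers would be weighted, and the function name and toy values are hypothetical.

```python
def average_layers(layer_maps):
    """Pixel-wise mean of several same-sized 2-D maps (one per cortical layer)."""
    n = len(layer_maps)
    rows, cols = len(layer_maps[0]), len(layer_maps[0][0])
    return [
        [sum(layer[i][j] for layer in layer_maps) / n for j in range(cols)]
        for i in range(rows)
    ]

# Two toy 1x2 layer-specific flat maps for a single gene.
combined_map = average_layers([[[1.0, 2.0]], [[3.0, 4.0]]])
```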
298 All algorithms that we develop will be implemented in an open-source soft-
299 ware toolkit. The toolkit, as well as the machine-readable datasets developed in
302 aim (4) and any other intermediate dataset we produce, will be published and
303 freely available for others to use.
304 In addition to developing generally useful methods, the application of these
305 methods to cerebral cortex will produce immediate benefits that are only one
306 step removed from clinical application, while also supporting the development
307 of new neuroanatomical techniques. The method developed in aim (1) will be
308 applied to each cortical area to find a set of marker genes. Currently, despite
309 the distinct roles of different cortical areas in both normal functioning and
310 disease processes, there are no known marker genes for many cortical areas.
311 Finding marker genes will be immediately useful for drug discovery as well as for
312 experimentation because once marker genes for an area are known, interventions
313 can be designed which selectively target that area.
314 The method developed in aim (2) will be used to find a small panel of genes
315 that can find most of the boundaries between areas in the cortex. Today, finding
316 cortical areal boundaries in a tissue sample is a manual process that requires a
317 skilled human to combine multiple visual cues over a large area of the cortical
318 surface. A panel of marker genes will allow the development of an ISH protocol
319 that will allow experimenters to more easily identify which anatomical areas are
320 present in small samples of cortex.
325 —-
326 New techniques allow the expression levels of many genes at many locations
327 to be compared. It is thought that even neighboring anatomical structures have
328 different gene expression profiles. We propose to develop automated methods
329 to relate the spatial variation in gene expression to anatomy. We will develop
330 two kinds of techniques:
331 (a) techniques to screen for combinations of marker genes which selectively
332 target anatomical structures
333 (b) techniques to suggest new ways of dividing a structure up into anatomical
334 subregions, based on the shapes of contours in the gene expression
335 The first kind of technique will be helpful for finding marker genes associated
336 with known anatomical features. The second kind of technique will be helpful in
337 creating new anatomical maps, maps which reflect differences in gene expression
338 the same way that existing maps reflect differences in histology.
339 We intend to develop our techniques using the adult mouse cerebral cortex
340 as a testbed. The Allen Brain Atlas has collected a dataset containing the
341 expression level of about 4000 genes* over a set of over 150000 voxels, with a
342 spatial resolution of approximately 200 microns[?].
343 We expect to discover sets of marker genes that pick out specific cortical
344 areas. This will allow the development of drugs and other interventions that
345 selectively target individual cortical areas. Therefore our research will lead
348 to application in drug discovery, in the development of other targeted clinical
349 interventions, and in the development of new experimental techniques.
350 The best way to divide up rodent cortex into areas has not been completely
351 determined, as can be seen by the differences in the recent maps given by Swan-
352 son on the one hand, and Paxinos and Franklin on the other. It is likely that our
353 study, by showing which areal divisions naturally follow from gene expression
354 data, as opposed to traditional histological data, will contribute to the creation
355 of a better map. While we do not here propose to analyze human gene expres-
356 sion data, it is conceivable that the methods we propose to develop could be
357 used to suggest modifications to the human cortical map as well.
358 In the following, we will only be talking about coronal data.
359 The Allen Brain Atlas provides “Smoothed Energy Volumes”, which are
360 One type of artifact in the Allen Brain Atlas data is what we call a “slice
361 artifact”. We have noticed two types of slice artifacts in the dataset. The first
362 type, a “missing slice artifact”, occurs when the ISH procedure on a slice did
363 not come out well. In this case, the Allen Brain investigators excluded the slice
364 at issue from the dataset. This means that no gene expression information is
365 available for that gene for the region of space covered by that slice. This results
366 in an expression level of zero being assigned to voxels covered by the slice. This
367 is partially but not completely ameliorated by the smoothing that is applied to
368 create the Smoothed Energy Volumes. The usual end result is that a region of
369 space which is shaped and oriented like a coronal slice is marked as having less
370 gene expression than surrounding regions.
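One simple way to flag candidate missing-slice artifacts is to look for coronal slices whose mean expression is much lower than that of their neighbors. The sketch below assumes the volume is indexed as volume[slice][row][column] along the rostrocaudal axis and uses an arbitrary threshold of one half; both are assumptions, and a genuine biological dip in expression would be flagged as well.

```python
def slice_means(volume):
    """Mean expression of each coronal slice of volume[slice][row][col]."""
    return [sum(sum(row) for row in sl) / (len(sl) * len(sl[0])) for sl in volume]

def suspect_missing_slices(volume, fraction=0.5):
    """Flag interior slices much dimmer than the average of their two neighbors."""
    means = slice_means(volume)
    flagged = []
    for k in range(1, len(means) - 1):
        neighbor_mean = (means[k - 1] + means[k + 1]) / 2
        if neighbor_mean > 0 and means[k] < fraction * neighbor_mean:
            flagged.append(k)
    return flagged

# Toy volume: five 2x2 slices, with slice 2 dimmed as if it had been excluded
# and then partially filled in by smoothing.
volume = [[[1.0, 1.0], [1.0, 1.0]] for _ in range(5)]
volume[2] = [[0.2, 0.2], [0.2, 0.2]]
flagged = suspect_missing_slices(volume)
```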
371 The second type of slice artifact is caused by the fact that all of the slices
372 have a consistent orientation. Since there may be artifacts (such as how well
373 the ISH worked) which are constant within each slice but which vary between
374 different slices, the result is that ceteris paribus, when one compares the genetic
375 data of a voxel to another voxel within the same coronal plane, one would expect
376 to find more similarity than if one compared a voxel to another voxel displaced
377 along the rostrocaudal axis.
378 We are enthusiastic about the sharing of methods, data, and results, and
379 at the conclusion of the project, we will make all of our data and computer
380 source code publicly available. Our goal is that replicating our results, or
381 applying the methods we develop to other targets, will be quick and easy for
382 other investigators. In order to aid in understanding and replicating our results,
383 we intend to include a software program which, when run, will take as input
384 the Allen Brain Atlas raw data, and produce as output all numbers and charts
385 found in publications resulting from the project.
386 To aid in the replication of our results, we will include a script which takes
387 as input the dataset in aim (3) and provides as output all of the tables and figures
388 in our publications.
389 We also expect to weigh in on the debate about how to best partition rodent
390 cortex
391 be useful for drug discovery as well
392 * Another 16000 genes are available, but they do not cover the entire cerebral
393 cortex with high spatial resolution.
396 User-definable ROIs; Combinatorial gene expression; Negative as well as positive
397 signal; Use geometry; Search for local boundaries if necessary; Flatmapped
398 Specific aims
399 Develop algorithms that find genetic markers for anatomical regions
400 1. Develop scoring measures for evaluating how good individual genes are at
401 marking areas: we will compare pointwise, geometric, and information-
402 theoretic measures.
403 2. Develop a procedure to find single marker genes for anatomical regions: for
404 each cortical area, by using or combining the scoring measures developed,
405 we will rank the genes by their ability to delineate each area.
406 3. Extend the procedure to handle difficult areas by using combinatorial cod-
407 ing: for areas that cannot be identified by any single gene, identify them
408 with a handful of genes. We will consider both (a) algorithms that incre-
409 mentally/greedily combine single gene markers into sets, such as forward
410 stepwise regression and decision trees, and also (b) supervised learning
411 techniques which use soft constraints to minimize the number of features,
412 such as sparse support vector machines.
413 4. Extend the procedure to handle difficult areas by combining or redrawing
414 the boundaries: An area may be difficult to identify because the bound-
415 aries are misdrawn, or because it does not “really” exist as a single area,
416 at least on the genetic level. We will develop extensions to our procedure
417 which (a) detect when a difficult area could be fit if its boundary were
418 redrawn slightly, and (b) detect when a difficult area could be combined
419 with adjacent areas to create a larger area which can be fit.
420 Apply these algorithms to the cortex
421 1. Create open source format conversion tools: we will create tools to bulk
422 download the ABA dataset and to convert between SEV, NIFTI and MAT-
423 LAB formats.
424 2. Flatmap the ABA cortex data: map the ABA data onto a plane and draw
425 the cortical area boundaries onto it.
426 3. Find layer boundaries: cluster similar voxels together in order to auto-
427 matically find the cortical layer boundaries.
428 4. Run the procedures that we developed on the cortex: we will present, for
429 each area, a short list of markers to identify that area; and we will also
430 present lists of “panels” of genes that can be used to delineate many areas
431 at once.
434 Develop algorithms to suggest a division of a structure into anatom-
435 ical parts
436 1. Explore dimensionality reduction algorithms applied to pixels: including
437 TODO
438 2. Explore dimensionality reduction algorithms applied to genes: including
439 TODO
440 3. Explore clustering algorithms applied to pixels: including TODO
441 4. Explore clustering algorithms applied to genes: including gene shaving,
442 TODO
443 5. Develop an algorithm to use dimensionality reduction and/or hierarchical
444 clustering to create anatomical maps
445 6. Run this algorithm on the cortex: present a hierarchical, genoarchitectonic
446 map of the cortex
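447 gradient similarity is calculated as:
448 $$\sum_{\text{pixels}} \cos\left(\left|\angle\nabla_1 - \angle\nabla_2\right|\right) \cdot \frac{|\nabla_1| + |\nabla_2|}{2} \cdot \frac{\text{pixel\_value}_1 + \text{pixel\_value}_2}{2}$$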
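Reading the gradient similarity as a sum over pixels of the cosine of the angle between the two gradients, weighted by the mean gradient magnitude and the mean pixel value, a direct implementation might look like the sketch below. The use of central differences on interior pixels is an assumption, since the gradient operator is not specified, and the two toy images are made up.

```python
import math

def central_gradient(img, i, j):
    """Central-difference gradient (d/dcol, d/drow) at interior pixel (i, j)."""
    gx = (img[i][j + 1] - img[i][j - 1]) / 2
    gy = (img[i + 1][j] - img[i - 1][j]) / 2
    return gx, gy

def gradient_similarity(img1, img2):
    """Sum over interior pixels of cos(angle diff) * mean |gradient| * mean value."""
    total = 0.0
    for i in range(1, len(img1) - 1):
        for j in range(1, len(img1[0]) - 1):
            g1 = central_gradient(img1, i, j)
            g2 = central_gradient(img2, i, j)
            # Where a gradient is zero its angle is arbitrary (atan2 gives 0).
            a1 = math.atan2(g1[1], g1[0])
            a2 = math.atan2(g2[1], g2[0])
            mean_mag = (math.hypot(*g1) + math.hypot(*g2)) / 2
            mean_val = (img1[i][j] + img2[i][j]) / 2
            total += math.cos(abs(a1 - a2)) * mean_mag * mean_val
    return total

# Toy 4x4 images: a left-to-right step edge and its mirror image.
step = [[0, 0, 1, 1] for _ in range(4)]
flipped = [[1, 1, 0, 0] for _ in range(4)]
```

With these toy images the score is positive when an image is compared with itself and negative when it is compared with its mirror image, whose gradients point the opposite way.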
452 (todo) Technically, we say that an anatomical structure has a fundamen-
453 tally 2-D organization when there exists a commonly used, generic, anatomical
454 structure-preserving map from 3-D space to a 2-D manifold.
455 Related work:
456 The Allen Brain Institute has developed an interactive web interface called
457 AGEA which allows an investigator to (1) calculate lists of genes which are selectively
458 overexpressed in certain anatomical regions (ABA calls this the “Gene
459 Finder” function), (2) visualize the correlation between the genetic profiles of
460 voxels in the dataset, and (3) visualize a hierarchical clustering of voxels in
461 the dataset [?]. AGEA is an impressive and useful tool; however, it does not
462 solve the same problems that we propose to solve with this project.
463 First we describe AGEA’s “Gene Finder”, and then compare it to our pro-
464 posed method for finding marker genes. AGEA’s Gene Finder first asks the
465 investigator to select a single “seed voxel” of interest. It then uses a clustering
466 method, combined with built-in knowledge of major anatomical structures, to
467 select two sets of voxels: an “ROI” and a “comparator region”*. The seed voxel
468 is always contained within the ROI, and the ROI is always contained within the
469 comparator region. The comparator region is similar but not identical to the
470 set of voxels making up the major anatomical region containing the ROI. Gene
471 Finder then looks for genes which can distinguish the ROI from the comparator
472 region. Specifically, it finds genes for which the ratio (expression energy in the
473 ROI) / (expression energy in the comparator region) is high.
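To make the ratio concrete, here is a minimal sketch of a pointwise expression energy ratio and a ranking of genes by it. This is not AGEA’s implementation: the use of the mean over voxels, the handling of a zero denominator, and all names (`expression_energy_ratio`, `rank_genes`) are assumptions made for illustration.

```python
def expression_energy_ratio(energy, roi, comparator):
    """Mean energy in the ROI divided by mean energy in the comparator region.

    energy: list of expression energies, indexed by voxel.
    roi, comparator: collections of voxel indices (the ROI lies inside the comparator).
    """
    roi_mean = sum(energy[v] for v in roi) / len(roi)
    comp_mean = sum(energy[v] for v in comparator) / len(comparator)
    if comp_mean == 0:
        return 0.0  # assumption: no expression anywhere counts as no selectivity
    return roi_mean / comp_mean

def rank_genes(energies, roi, comparator):
    """Return gene names sorted from highest to lowest ratio."""
    scores = {gene: expression_energy_ratio(e, roi, comparator)
              for gene, e in energies.items()}
    return sorted(scores, key=scores.get, reverse=True)
```

A gene expressed only inside the ROI scores well above 1, while a gene expressed uniformly across the comparator region scores exactly 1.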
474 Informally, the Gene Finder first infers an ROI based on clustering the seed
475 voxel with other voxels. Then, the Gene Finder finds genes which overexpress
476 in the ROI as compared to other voxels in the major anatomical region.
477 There are three major differences between our approach and Gene Finder.
First, Gene Finder focuses on individual genes and individual ROIs in isolation.
This works well for regions which can be picked out from all other regions by a
single gene, but not all regions can (todo). There are at least two ways in
which this approach can miss useful genes. First, a gene might be expressed in
only part of a region, while another gene is expressed in the rest of the
region**. Second, a gene might be expressed in a region but not in any of its
neighbors, while also being expressed in other, non-neighboring regions.
To take advantage of these types of genes, we propose to find combinations of
genes which, together, can identify the boundaries of all subregions within the
containing region.
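The two cases above can be sketched with sets of voxel indices. The data, the threshold, and the helper name `expressed` are invented for illustration; the sketch only shows how a union or an intersection of expression patterns can pick out a region that no single gene marks.

```python
def expressed(energy, threshold=1.0):
    """Set of voxel indices where a gene's expression energy exceeds a threshold."""
    return {v for v, e in enumerate(energy) if e > threshold}

# Toy 1-D "tissue" of 8 voxels; the target region is voxels 0-3,
# voxels 4-5 are its neighbors, voxels 6-7 are a distant region.
target_region = {0, 1, 2, 3}

gene_a = [9, 8, 0, 0, 0, 0, 0, 0]  # expressed in part of the region
gene_b = [0, 0, 7, 9, 0, 0, 0, 0]  # expressed in the rest of the region
gene_c = [9, 9, 9, 9, 0, 0, 8, 8]  # region plus a non-neighboring region
gene_d = [9, 9, 9, 9, 7, 7, 0, 0]  # region plus its neighbors

# Case 1: neither gene covers the region alone, but their union does.
union_ab = expressed(gene_a) | expressed(gene_b)

# Case 2: each gene also marks other tissue, but their intersection does not.
intersection_cd = expressed(gene_c) & expressed(gene_d)
```

Here both `union_ab` and `intersection_cd` equal `target_region`, although none of the four genes matches the region by itself.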
490 Second, Gene Finder uses a pointwise metric, namely expression energy ratio,
491 to decide whether a gene is good for picking out a region. We have found better
492 results by using metrics which take into account not just single voxels, but also
493 the local geometry of neighboring voxels, such as the local gradient (todo). In
494 addition, we have found that often the absence of gene expression can be used
495 as a marker, which will not be caught by Gene Finder’s expression energy ratio
496 (todo).
497 Third, Gene Finder chooses the ROI based only on the seed voxel. This
498 often does not permit the user to query the ROI that they are interested in. For
499 example, in all of our tests of Gene Finder in cortex, the ROIs chosen tend to
500 be cortical layers, rather than cortical areas.
501 In summary, when Gene Finder picks the ROI that you want, and when this
502 ROI can be easily picked out from neighboring regions by single genes which
503 selectively overexpress in the ROI compared to the entire major anatomical re-
504 gion, Gene Finder will work. However, Gene Finder will not pick cortical areas
505 as ROIs, and even if it could, many cortical areas cannot be uniquely picked out
506 by the overexpression of any single gene. By contrast, we will target cortical
507 areas, we will explore a variety of metrics which can complement the shortcom-
508 ings of expression energy ratio, and we will use the combinatorial expression of
509 genes to pick out cortical areas even when no individual gene will do.
510 * The terms “ROI” and “comparator region” are our own; the ABI calls
511 them the “local region” and the “larger anatomical context”. The ABI uses the
512 term “specificity comparator” to mean the major anatomic region containing
513 the ROI, which is not exactly identical to the comparator region.
514 ** In this case, the union of the area of expression of the two genes would
515 suffice; one could also imagine that there could be situations in which the in-
516 tersection of multiple genes would be needed, or a combination of unions and
517 intersections.
Now we describe AGEA’s hierarchical clustering, and compare it to our proposal.
The goal of AGEA’s hierarchical clustering is to generate a binary tree of
clusters, where a cluster is a collection of voxels. AGEA begins by computing
the Pearson correlation between each pair of voxels. They then employ a
recursive divisive (top-down) hierarchical clustering procedure on the voxels,
which means that they start with all of the voxels, divide them into clusters,
then divide each cluster into smaller clusters, and so on***. At each step, the
collection of voxels is partitioned into two smaller clusters in a way that
maximizes the following quantity: the average correlation between all possible
pairs of voxels containing one voxel from each cluster.
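The quantity named above can be written down directly. The sketch below computes the Pearson correlation between voxel expression profiles and then the average correlation over all pairs that take one voxel from each cluster, for a single candidate split. It does not reproduce AGEA’s search over partitions or its subsampling; the function names are hypothetical, and profiles are assumed to be non-constant so that the correlation is defined.

```python
import math
from itertools import product

def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)

def cross_cluster_mean_correlation(profiles, cluster_a, cluster_b):
    """Average correlation over all voxel pairs with one voxel in each cluster.

    profiles: list of per-voxel gene expression profiles.
    cluster_a, cluster_b: lists of voxel indices forming a candidate split.
    """
    pairs = list(product(cluster_a, cluster_b))
    return sum(pearson(profiles[i], profiles[j]) for i, j in pairs) / len(pairs)
```

A divisive procedure would evaluate a score of this kind for candidate bipartitions at each node of the tree and keep the best one according to its criterion.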
There are three major differences between our approach and AGEA’s hierarchical
clustering. First, AGEA’s clustering method separates cortical layers
before it separates cortical areas.
533 following procedure is used for the purpose of dividing a collection of voxels
534 into smaller clusters: partition the voxels into two sets, such that the following
535 quantity is maximized:
536 *** depending on which level of the tree is being created, the voxels are
537 subsampled in order to save time
538 does not allow the user to input anything other than a seed voxel; this means
539 that for each seed voxel, there is only one
540 The role of the “local region” is to serve as a region of interest for which
541 marker genes are desired; the role of the “larger anatomical context” is to be
542 the structure
543 There are two kinds of differences between AGEA and our project; differ-
544 ences that relate to the treatment of the cortex, and differences in the type of
545 generalizable methods being developed. As relates
546 indicate an ROI
557 —
558 The Allen Brain Institute has developed interactive tools called AGEA which
559 allow an investigator to explore simple correlation-based relationships between
560 voxels, genes, and clusters of voxels. There have not yet been any studies
561 which describe the results of applying AGEA to the cerebral cortex; however,
562 we suspect that the AGEA metrics are not optimal for the task of relating
genes to cortical areas. A voxel’s gene expression profile depends upon both
its cortical area and its cortical layer, but AGEA has no mechanism to
distinguish these two. As a result, voxels in the same layer but different areas
566 are often clustered together by AGEA. As part of the project, we will compare
567 the performance of our techniques against AGEA’s.
568 Another difference between our techniques and AGEA’s is that AGEA allows
569 the user to enter only a voxel location, and then to either explore the rest of
570 the brain’s relationship to that particular voxel, or explore a partitioning of
571 the brain based on pairwise voxel correlation. If the user is interested not in a
572 single voxel, but rather an entire anatomical structure, AGEA will only succeed
573 to the extent that the selected voxel is a typical representative of the structure.
576 As discussed in the previous paragraph, this poses problems for structures like
577 cortical areas, which (because of their division into cortical layers) do not have
578 a single “typical representative”.
579 By contrast, in our system, the user will start by selecting, not a single voxel,
580 but rather, an anatomical superstructure to be divided into pieces (for example,
581 the cerebral cortex). We expect that our methods will take into account not
582 just pairwise statistics between voxels, but also large-scale geometric features
583 (for example, the rapidity of change in gene expression as regional boundaries
584 are crossed) which optimize the discriminability of regions within the selected
585 superstructure.
586 —–
587 screen for combinations of marker genes which selectively target anatom-
588 ical structures pick delineate the boundaries between neighboring anatomical
589 structures. (b) techniques to screen for marker genes which pick out anatomical
590 structures of interest
591 , techniques which: (a) screen for marker genes , and (b) suggest new
592 anatomical maps based on
593 whose expression partitions the region of interest into its anatomical sub-
594 structures, and (b) use the natural contours of gene expression to suggest new
595 ways of dividing an organ into
596 The Allen Brain Atlas
597 –
598 to: brooksl@mail.nih.gov
Hi, I’m writing to confirm the applicability of a potential research project to
the challenge grant topic “New computational and statistical methods for the
analysis of large data sets from next-generation sequencing technologies”.
602 We want to develop methods for the analysis of gene expression datasets that
603 can be used to uncover the relationships between gene expression and anatomical
604 regions. Specifically, we want to develop techniques to (a) given a set of known
605 anatomical areas, identify genetic markers for each of these areas, and (b) given
606 an anatomical structure whose substructure is unknown, suggest a map, that
607 is, a division of the space into anatomical sub-structures, that represents the
608 boundaries inherent in the gene expression data.
609 We propose to develop our techniques on the Allen Brain Atlas mouse brain
610 gene expression dataset by finding genetic markers for anatomical areas within
611 the cerebral cortex. The Allen Brain Atlas contains a registered 3-D map of
612 gene expression data with 200-micron voxel resolution which was created from
613 in situ hybridization data. The dataset contains about 4000 genes which are
614 available at this resolution across the entire cerebral cortex.
615 Despite the distinct roles of different cortical areas in both normal function-
616 ing and disease processes, there are no known marker genes for many cortical
617 areas. This project will be immediately useful for both drug discovery and clini-
618 cal research because once the markers are known, interventions can be designed
619 which selectively target specific cortical areas.
The techniques we develop will be useful because they will be applicable to
the analysis of other anatomical areas, both in terms of finding marker genes
for known areas, and in terms of suggesting new anatomical subdivisions that
are based upon the gene expression data.
626 —-
627 It is likely that our study, by showing which areal divisions naturally fol-
628 low from gene expression data, as opposed to traditional histological data, will
629 contribute to the creation of
630 there are clear genetic or chemical markers known for only a few cortical
631 areas. This makes it difficult to target drugs to specific
632 As part of aims (1) and (5), we will discover sets of marker genes that pick
633 out specific cortical areas. This will allow the development of drugs and other
634 interventions that selectively target individual cortical areas. As part of aims
635 (2) and (5), we will also discover small panels of marker genes that can be used
636 to delineate most of the cortical areal map.
637 With aims (2) and (4), we
There are five principles
639 In addition to validating the usefulness of the algorithms, the application of
640 these methods to cerebral cortex will produce immediate benefits that are only
641 one step removed from clinical application.
642 todo: remember to check gensat, etc for validation (mention bias/variance)
643 Why it is useful to apply these methods to cortex
644 There is still room for debate as to exactly how the cortex should be parcellated
645 into areas.
646 The best way to divide up rodent cortex into areas has not been completely
647 determined,
648 not yet been accounted for in
649 that the expression of some genes will contain novel spatial patterns which
650 are not account
651 that a genoarchitectonic map
This principle is only applicable to aim 1 (marker genes). For aim 2
(partitioning a structure into anatomical subregions), we plan to work with
many genes at once.
todo: aim 2 b+s?
656 Principle 5: Interoperate with existing tools
657 In order for our software to be as useful as possible for our users, it will be
658 able to import and export data to standard formats so that users can use our
659 software in tandem with other software tools created by other teams. We will
660 support the following formats: NIFTI (Neuroimaging Informatics Technology
661 Initiative), SEV (Allen Brain Institute Smoothed Energy Volume), and MAT-
662 LAB. This ensures that our users will not have to exclusively rely on our tools
663 when analyzing data. For example, users will be able to use the data visualiza-
664 tion and analysis capabilities of MATLAB and Caret alongside our software.
667 To our knowledge, there is no currently available software to convert between
668 these formats, so we will also provide a format conversion tool. This may be
669 useful even for groups that don’t use any of our other software.
670 todo: is “marker gene” even a phrase that we should use at all?
671 note for aim 1 apps: combo of genes is for voxel, not within any single cell
672 , as when genetic markers allow the development of selective interventions;
673 the reason that one can be confident that the intervention is selective is that it
674 is only turned on when a certain combination of genes is turned on and off. The
675 result procedure is what assures us that when that combination is present, the
676 local tissue is probably part of a certain subregion.
677 The basic idea is that we want to find a procedure by
678 The task of finding genes that mark anatomical areas can be phrased in
679 terms of what the field of machine learning calls a “supervised learning” task.
680 The goal of this task is to learn a function (the “classifier”) which
If a person knows a combination of genes that mark an area, that implies
that the person can be told how strongly those genes are expressed in any voxel,
and the person can use this information to determine how
684 finding how to infer the areal identity of a voxel if given the gene expression
685 profile of that voxel.
686 For each voxel in the cortex, we want to start with data about the gene
687 expression
688 There are various ways to look for marker genes. We will define some terms,
689 and along the way we will describe a few design choices encountered in the
690 process of creating a marker gene finding method, and then we will present four
691 principles that describe which options we have chosen.
692 In developing a procedure for finding marker genes, we are developing a
693 procedure that takes a dataset of experimental observations and produces a
694 result. One can think of the result as merely a list of genes, but really the result
695 is an understanding of a predictive relationship between, on the one hand, the
696 expression levels of genes, and, on the other hand, anatomical subregions.
697 One way to more formally define this understanding is to look at it as a
698 procedure. In this view, the result of the learning procedure is itself a procedure.
699 The result procedure provides a way to use the gene expression profiles of voxels
700 in a tissue sample in order to determine where the subregions are.
This result procedure can be used directly, as when an experimenter has
a tissue sample and needs to know what subregions are present in it, and,
if multiple subregions are present, where they each are. Or it can be used
indirectly; imagine that the result procedure tells us that whenever a certain
combination of genes is expressed, the local tissue is probably part of a certain
subregion. This means that we can then confidently develop an intervention
which is triggered only when that combination of genes is expressed; and to
the extent that the result procedure is reliable, we know that the intervention
will only be triggered in the target subregion.
710 We said that the result procedure provides “a way to use the gene expression
711 profiles of voxels in a tissue sample” in order to “determine where the subregions
712 are”.
715 Does the result procedure get as input all of the gene expression profiles
716 of each voxel in the entire tissue sample, and produce as output all of the
717 subregional boundaries all at once?
718 it is helpful for the classifier to look at the global “shape” of gene expression
719 patterns over the whole structure, rather than just nearby voxels.
720 there is some small bit of additional information that can be gleaned from
721 knowing the
722 Design choices for a supervised learning procedure
723 After all,
724 there is a small correlation between the gene expression levels from distant
725 voxels and
726 Depending on how we intend to use the classifier, we may want to design it
727 so that
728 It is possible for many things to
729 The choice of which data is made part of an instance
730 what we seek is a procedure
731 partition the tissue sample into subregions.
732 each part of the anatomical structure
733 must be One way to rephrase this task is to say that, instead of searching
734 for the location of the subregions, we are looking to partition the tissue sample
735 into subregions.
760 We said that the result procedure provides “a way to use the gene expression
761 profiles of voxels in a tissue sample” in order to “determine where the subregions
762 are”.
763 Does the result procedure get as input all of the gene expression profiles
764 of each voxel in the entire tissue sample, and produce as output all of the
765 subregional boundaries all at once?
766 Or are we given one voxel at a time,
767 In the jargon of the field of machine learning, the result procedure is called
768 a classifier.
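As one possible illustration of a learning procedure whose result is itself a procedure, the sketch below trains a nearest-centroid classifier on labeled voxels and returns a function that assigns a subregion label to a new gene expression profile. Nearest-centroid is chosen here only because it is short; it is not the method proposed in this document, and the function name and data layout are assumptions.

```python
import math

def train_nearest_centroid(profiles, labels):
    """Learn one mean expression profile per subregion; return a classifier.

    profiles: list of per-voxel gene expression profiles (equal-length lists).
    labels: the known subregion label of each voxel.
    """
    sums, counts = {}, {}
    for profile, label in zip(profiles, labels):
        acc = sums.setdefault(label, [0.0] * len(profile))
        for k, value in enumerate(profile):
            acc[k] += value
        counts[label] = counts.get(label, 0) + 1
    centroids = {label: [s / counts[label] for s in acc]
                 for label, acc in sums.items()}

    def classify(profile):
        """The 'result procedure': map one voxel's profile to a subregion label."""
        return min(centroids, key=lambda label: math.dist(profile, centroids[label]))

    return classify
```

The training function plays the role of the learning procedure, and the returned `classify` function plays the role of the result procedure, applied here to one voxel at a time.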
779 single voxels, but rather groups of voxels, such that the groups can be placed
780 in some 2-D space. We will call such instances “pixels”.
781 We have been speaking as if instances necessarily correspond to single voxels.
782 But it is possible for instances to be groupings of many voxels, in which case
783 each grouping must be assigned the same label (that is, each voxel grouping
784 must stay inside a single anatomical subregion).
785 In some but not all cases, the groups are either rows or columns of voxels.
786 This is the case with the cerebral cortex, in which one may assume that columns
787 of voxels which run perpendicular to the cortical surface all share the same areal
788 identity. In the cortex, we call such an instance a “surface pixel”, because such
789 an instance represents the data associated with all voxels underneath a specific
790 patch of the cortical surface.
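A surface pixel can be sketched as a reduction over the depth axis of a voxel volume. The sketch assumes, as a strong simplification, that the volume has already been resampled so that the third index runs perpendicular to the cortical surface, and it assumes that the column is summarized by its mean; neither choice is specified in the text, and the function name is hypothetical.

```python
def surface_pixels(volume):
    """Collapse each column of voxels into one surface pixel.

    volume[i][j] is the list of expression values for the voxels lying
    underneath surface patch (i, j), ordered by cortical depth.
    Returns a 2-D grid holding the mean expression of each column.
    """
    return [[sum(column) / len(column) for column in row] for row in volume]
```

Because every voxel in a column is assumed to share the same areal identity, each surface pixel can carry a single label, which turns the 3-D labeling problem into a 2-D one.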