nsf

view grant.html @ 119:e61f822e0375

author bshanks@bshanks.dyndns.org
date Tue Jul 07 14:57:48 2009 -0700
parents ffa1390e4f39
children 94284c1ca133
line source
1 Introduction
2 Massive new datasets obtained with techniques such as in situ hybridization (ISH), immunohisto-
3 chemistry, in situ transgenic reporter, microarray voxelation, and others, allow the expression levels
4 of many genes at many locations to be compared. Our goal is to develop automated methods to
5 relate spatial variation in gene expression to anatomy. We want to find marker genes for specific
6 anatomical regions, and also to draw new anatomical maps based on gene expression patterns.
7 We will validate these methods by applying them to 46 anatomical areas within the cerebral cortex,
8 by using the Allen Mouse Brain Atlas coronal dataset (ABA).
9 This project has three primary goals:
10 (1) develop an algorithm to screen spatial gene expression data for combinations of marker
11 genes which selectively target anatomical regions.
12 (2) develop an algorithm to suggest new ways of carving up a structure into anatomically dis-
13 tinct regions, based on spatial patterns in gene expression.
14 (3) adapt our tools for the analysis of multi/hyperspectral imaging data from the Geographic
15 Information Systems (GIS) community.
16 We will create a 2-D “flat map” dataset of the mouse cerebral cortex that contains a flattened
17 version of the Allen Mouse Brain Atlas ISH data, as well as the boundaries of cortical anatomical
18 areas. We will use this dataset to validate the methods developed in (1) and (2). In addition to
19 its use in neuroscience, this dataset will be useful as a sample dataset for the machine learning
20 community.
21 Although our particular application involves the 3D spatial distribution of gene expression, the
22 methods we will develop will generalize to any high-dimensional data over points located in a low-
23 dimensional space. In particular, our methods could be applied to the analysis of multi/hyperspectral
24 imaging data, or alternately to genome-wide sequencing data derived from sets of tissues and dis-
25 ease states.
26 All algorithms that we develop will be implemented in a GPL open-source software toolkit. The
27 toolkit and the datasets will be published and freely available for others to use.
28 __________________
29 Background and related work
30 Cortical anatomy
31 The cortex is divided into areas and layers. Because of the cortical columnar organization, the
32 parcellation of the cortex into areas can be drawn as a 2-D map on the surface of the cortex. In the
33 third dimension, the boundaries between the areas continue downwards into the cortical depth,
34 perpendicular to the surface. The layer boundaries run parallel to the surface. One can picture an
35 area of the cortex as a slice of a six-layered cake1.
36 It is known that different cortical areas have distinct roles in both normal functioning and in
37 disease processes, yet there are no known marker genes for most cortical areas. When it is nec-
38 essary to divide a tissue sample into cortical areas, this is a manual process that requires a skilled
39 1Outside of isocortex, the number of layers varies.
42 human to combine multiple visual cues and interpret them in the context of their approximate
43 location upon the cortical surface.
44 Even the questions of how many areas should be recognized in cortex, and what their arrange-
45 ment is, are still not completely settled. A proposed division of the cortex into areas is called a
46 cortical map. In the rodent, the lack of a single agreed-upon map can be seen by contrasting the
47 recent maps given by Swanson[21] on the one hand, and Paxinos and Franklin[16] on the other.
48 While the maps are certainly very similar in their general arrangement, significant differences re-
49 main.
50 The Allen Mouse Brain Atlas dataset
51 The Allen Mouse Brain Atlas (ABA) data[13] were produced by doing in-situ hybridization on
52 slices of male, 56-day-old C57BL/6J mouse brains. Pictures were taken of the processed slice,
53 and these pictures were semi-automatically analyzed to create a digital measurement of gene
54 expression levels at each location in each slice. Per slice, cellular spatial resolution is achieved.
55 Using this method, a single physical slice can only be used to measure one single gene; many
56 different mouse brains were needed in order to measure the expression of many genes.
57 Mus musculus is thought to contain about 22,000 protein-coding genes[26]. The ABA contains
58 data on about 20,000 genes in sagittal sections, out of which over 4,000 genes are also measured
59 in coronal sections. Our dataset is derived from only the coronal subset of the ABA2. An auto-
60 mated nonlinear alignment procedure located the 2D data from the various slices in a single 3D
61 coordinate system. In the final 3D coordinate system, voxels are cubes with 200 microns on a
62 side. There are 67x41x58 = 159,326 voxels, of which 51,533 are in the brain[15]. For each voxel
63 and each gene, the expression energy[13] within that voxel is made available.
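The grid arithmetic quoted above can be checked with a short sketch. The grid shape, in-brain voxel count, and voxel size are taken from the text; the row-major ordering used by `flat_index` is only an assumption made for illustration and is not a statement about how the ABA files are laid out.

```python
# Grid dimensions, in-brain voxel count, and voxel size as quoted in the text.
GRID_SHAPE = (67, 41, 58)
VOXELS_IN_BRAIN = 51_533
VOXEL_SIDE_MICRONS = 200


def total_voxels(shape):
    """Number of voxels in a rectangular grid."""
    n = 1
    for extent in shape:
        n *= extent
    return n


def flat_index(x, y, z, shape=GRID_SHAPE):
    """Position of voxel (x, y, z) in a flat array (row-major order is assumed)."""
    nx, ny, nz = shape
    if not (0 <= x < nx and 0 <= y < ny and 0 <= z < nz):
        raise IndexError("voxel outside the grid")
    return (x * ny + y) * nz + z


total = total_voxels(GRID_SHAPE)
brain_fraction = VOXELS_IN_BRAIN / total
# Physical extent of the grid along each axis, in millimetres.
extent_mm = tuple(n * VOXEL_SIDE_MICRONS / 1000 for n in GRID_SHAPE)
```

Under these figures, roughly a third of the voxels in the bounding grid fall inside the brain.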
64 The ABA is not the only large public spatial gene expression dataset[8][25][5][14][24][4][23][20][3].
65 However, with the exception of the ABA, GenePaint[25], and EMAGE[24], most of the other re-
66 sources have not (yet) extracted the expression intensity from the ISH images and registered the
67 results into a single 3-D space.
68 The remainder of the background section will be divided into three parts, one for each major
69 goal.
70 Goal 1, From Areas to Genes: Given a map of regions, find genes that mark those regions
71 Machine learning terminology: classifiers The task of looking for marker genes for known
72 anatomical regions means that one is looking for a set of genes such that, if the expression level
73 of those genes is known, then the locations of the regions can be inferred.
74 If we define the regions so that they cover the entire anatomical structure to be subdivided,
75 and restrict ourselves to looking at one voxel at a time, we may say that we are using gene
76 expression in each voxel to assign that voxel to the proper area. We call this a classification
77 task, because each voxel is being assigned to a class (namely, its region). An understanding
78 of the relationship between the combination of gene expression levels and the locations of the
79 regions may be expressed as a function. The input to this function is a voxel, along with the gene
80 expression levels within that voxel; the output is the regional identity of the target voxel, that is, the
81 ____________________________________
82 2The sagittal data do not cover the entire cortex, and also have greater registration error[15]. Genes were selected
83 by the Allen Institute for coronal sectioning based on, “classes of known neuroscientific interest... or through post hoc
84 identification of a marked non-ubiquitous expression pattern”[15].
87 region to which the target voxel belongs. We call this function a classifier. In general, the input to
88 a classifier is called an instance, and the output is called a label (or a class label).
89 Our goal is not to produce a single classifier, but rather to develop an automated method for
90 determining a classifier for any known anatomical structure. Therefore, we seek a procedure by
91 which a gene expression dataset may be analyzed in concert with an anatomical atlas in order to
92 produce a classifier. The initial gene expression dataset used in the construction of the classifier
93 is called training data. In the machine learning literature, this sort of procedure may be thought
94 of as a supervised learning task, defined as a task in which the goal is to learn a mapping from
95 instances to labels, and the training data consists of a set of instances (voxels) for which the labels
96 (regions) are known.
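The terminology above can be made concrete with a minimal sketch of a supervised learning procedure. It uses a nearest-centroid rule, which is chosen here only because it is short; it is not the method proposed in this document. The expression values are invented toy numbers, and the region names SS and AUD are borrowed from the figures purely as labels.

```python
from collections import defaultdict


def train_nearest_centroid(instances, labels):
    """Learn one mean expression vector (centroid) per region label."""
    sums = {}
    counts = defaultdict(int)
    for features, label in zip(instances, labels):
        if label not in sums:
            sums[label] = [0.0] * len(features)
        for i, value in enumerate(features):
            sums[label][i] += value
        counts[label] += 1
    centroids = {
        label: [s / counts[label] for s in total] for label, total in sums.items()
    }

    def classifier(features):
        """Map one voxel's expression vector to the label of the closest centroid."""
        def sq_dist(centroid):
            return sum((a - b) ** 2 for a, b in zip(features, centroid))
        return min(centroids, key=lambda label: sq_dist(centroids[label]))

    return classifier


# Toy training data: each instance is (gene A level, gene B level) for one voxel.
training_instances = [(0.9, 0.1), (0.8, 0.2), (0.1, 0.9), (0.2, 0.7)]
training_labels = ["SS", "SS", "AUD", "AUD"]
classify = train_nearest_centroid(training_instances, training_labels)
```

Here `train_nearest_centroid` plays the role of the procedure that consumes training data, and the returned `classify` function is the classifier: it takes an instance and returns a label.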
97 Each gene expression level is called a feature, and the selection of which genes3 to look at is
98 called feature selection. Feature selection is one component of the task of learning a classifier.
99 One class of feature selection methods assigns some sort of score to each candidate gene.
100 The top-ranked genes are then chosen. Some scoring measures can assign a score to a set of
101 selected genes, not just to a single gene; in this case, a dynamic procedure may be used in which
102 features are added and subtracted from the selected set depending on how much they raise the
103 score. Such procedures are called “stepwise” or “greedy”.
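A greedy procedure of this kind might be sketched as follows. This is a forward-only variant (features are added but never removed), and the score used in the example, the accuracy of the logical AND of thresholded genes against a target mask, is an assumption chosen for brevity. The gene names and the binary expression values are hypothetical.

```python
def forward_stepwise(candidates, score, max_features):
    """Greedily grow a feature set, adding the gene that most raises score(set)."""
    selected = []
    best = score(selected)
    while len(selected) < max_features:
        trials = [(score(selected + [g]), g) for g in candidates if g not in selected]
        if not trials:
            break
        top_score, top_gene = max(trials)
        if top_score <= best:
            break  # no remaining gene improves the score
        selected.append(top_gene)
        best = top_score
    return selected, best


# Hypothetical thresholded expression (1 = expressed) over six pixels.
expression = {
    "geneA": [1, 1, 1, 0, 0, 0],
    "geneB": [1, 1, 0, 1, 1, 0],
    "geneC": [0, 1, 0, 0, 0, 1],
}
target = [1, 1, 0, 0, 0, 0]  # boolean mask of the target region


def and_accuracy(genes):
    """Fraction of pixels where the AND of the chosen genes equals the target."""
    predicted = [all(expression[g][i] for g in genes) for i in range(len(target))]
    return sum(p == bool(t) for p, t in zip(predicted, target)) / len(target)


selected, best = forward_stepwise(list(expression), and_accuracy, max_features=2)
```

In this toy case no single gene matches the target, but the pair geneA and geneB does, so the score must be evaluated on sets of genes rather than on genes one at a time.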
104 Although the classifier itself may only look at the gene expression data within each voxel be-
105 fore classifying that voxel, the algorithm which constructs the classifier may look over the entire
106 dataset. We can categorize score-based feature selection methods depending on how the score
107 is calculated. Often the score calculation consists of assigning a sub-score to each voxel, and
108 then aggregating these sub-scores into a final score. If only information from nearby voxels is
109 used to calculate a voxel’s sub-score, then we say it is a local scoring method. If only information
110 from the voxel itself is used to calculate a voxel’s sub-score, then we say it is a pointwise scoring
111 method.
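The distinction between pointwise and local scoring can be illustrated with a small sketch on a 2-D grid. Both sub-score formulas below are assumptions invented for this illustration; in particular, the local score is a generic finite-difference comparison and should not be read as the gradient similarity measure referred to elsewhere in this document.

```python
def pointwise_score(gene, mask):
    """Mean of sub-scores that each use only one pixel's own values."""
    subs = [
        1.0 - abs(g - m)
        for gene_row, mask_row in zip(gene, mask)
        for g, m in zip(gene_row, mask_row)
    ]
    return sum(subs) / len(subs)


def local_score(gene, mask):
    """Mean of sub-scores that also use each pixel's right and lower neighbors."""
    subs = []
    rows, cols = len(gene), len(gene[0])
    for r in range(rows - 1):
        for c in range(cols - 1):
            # Finite differences toward the right and lower neighbors.
            g_dx = gene[r][c + 1] - gene[r][c]
            g_dy = gene[r + 1][c] - gene[r][c]
            m_dx = mask[r][c + 1] - mask[r][c]
            m_dy = mask[r + 1][c] - mask[r][c]
            subs.append(1.0 - (abs(g_dx - m_dx) + abs(g_dy - m_dy)) / 2)
    return sum(subs) / len(subs)


# Toy 3x3 target region and a spatially uniform (uninformative) gene.
mask = [[0, 0, 0], [0, 1, 1], [0, 1, 1]]
flat_gene = [[0.5, 0.5, 0.5], [0.5, 0.5, 0.5], [0.5, 0.5, 0.5]]
```

The pointwise score never compares a pixel with its neighbors, so it is blind to the shape of an expression boundary; the local score is computed from differences between neighboring pixels and therefore responds to where boundaries lie.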
112 Our Strategy for Goal 1
113 Key questions when choosing a learning method are: What are the instances? What are the
114 features? How are the features chosen? Here are four principles that outline our answers to these
115 questions.
116 Principle 1: Combinatorial gene expression
117 It is too much to hope that every anatomical region of interest will be identified by a single
118 gene. For example, in the cortex, there are some areas which are not clearly delineated by any
119 gene included in the ABA coronal dataset. However, at least some of these areas can be delin-
120 eated by looking at combinations of genes (an example of an area for which multiple genes are
121 necessary and sufficient is provided in Preliminary Results, Figure 4). Therefore, each instance
122 should contain multiple features (genes).
123 Principle 2: Only look at combinations of small numbers of genes
124 When the classifier classifies a voxel, it is only allowed to look at the expression of the genes
125 which have been selected as features. The more data that are available to a classifier, the better
126 it can do. Why not include every gene as a feature? The reason is that we wish to employ the
127 classifier in situations in which it is not feasible to gather data about every gene. For example, if we
128 ____________________________________
129 3Strictly speaking, the features are gene expression levels, but we’ll call them genes.
132 want to use the expression of marker genes as a trigger for some regionally-targeted intervention,
133 then our intervention must contain a molecular mechanism to check the expression level of each
134 marker gene before it triggers. It is currently infeasible to design a molecular trigger that checks
135 the level of more than a handful of genes. Therefore, we must select only a few genes as features.
136 The requirement to find combinations of only a small number of genes prevents us from straightfor-
137 wardly applying many of the simplest techniques from the field of supervised machine learning.
138 In the parlance of machine learning, our task combines feature selection with supervised learning.
139 Principle 3: Use geometry in feature selection
140 When doing feature selection with score-based methods, the simplest thing to do would be
141 to score the performance of each voxel by itself and then combine these scores (pointwise scor-
142 ing). A more powerful approach is to also use information about the geometric relations between
143 each voxel and its neighbors; this requires non-pointwise, local scoring methods. See Preliminary
144 Results, figure 3 for evidence of the complementary nature of pointwise and local scoring methods.
145 Principle 4: Work in 2-D whenever possible
146 There are many anatomical structures which are commonly characterized in terms of a two-
147 dimensional manifold. When it is known that the structure that one is looking for is two-dimensional,
148 the results may be improved by allowing the analysis algorithm to take advantage of this prior
149 knowledge. In addition, it is easier for humans to visualize and work with 2-D data.
150 Goal 2, From Genes to Areas: given gene expression data, discover a map of regions
151 Machine learning terminology: clustering
152 If one is given a dataset consisting merely of instances, with no class labels, then analysis of
153 the dataset is referred to as unsupervised learning in the jargon of machine learning. One thing
154 that you can do with such a dataset is to group instances together. A set of similar instances is
155 called a cluster, and the activity of grouping the data into clusters is called clustering or cluster
156 analysis.
157 The task of deciding how to carve up a structure into anatomical regions can be put into these
158 terms. The instances are once again voxels (or pixels) along with their associated gene expression
159 profiles. We make the assumption that voxels from the same anatomical region have similar gene
160 expression profiles, at least compared to the other regions. This means that clustering voxels is
161 the same as finding potential regions; we seek a partitioning of the voxels into regions, that is, into
162 clusters of voxels with similar gene expression.
163 It is desirable to determine not just one set of regions, but also how these regions relate to
164 each other. The outcome of clustering may be a hierarchical tree of clusters, rather than a single
165 set of clusters which partition the voxels. This is called hierarchical clustering.
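As a rough sketch of hierarchical clustering, the following agglomerative procedure repeatedly merges the two closest clusters and records the merges as a nested tuple. Euclidean distance and average linkage are assumptions made here for illustration; other similarity measures and linkage rules are equally possible. The four expression profiles are invented toy values.

```python
def hierarchical_clustering(profiles):
    """Agglomerative clustering; returns the merge tree as nested tuples of indices."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    # Each cluster is (tree, member_indices); start with one cluster per instance.
    clusters = [(i, [i]) for i in range(len(profiles))]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Average linkage: mean distance over all cross-cluster pairs.
                pairs = [(a, b) for a in clusters[i][1] for b in clusters[j][1]]
                d = sum(dist(profiles[a], profiles[b]) for a, b in pairs) / len(pairs)
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = ((clusters[i][0], clusters[j][0]), clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0][0]


# Toy expression profiles (two genes) for four voxels.
profiles = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.8)]
tree = hierarchical_clustering(profiles)
```

Cutting the resulting tree at different depths yields coarser or finer partitions of the voxels, which is how a single hierarchy can express relationships between candidate regions.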
166 Similarity scores A crucial choice when designing a clustering method is how to measure
167 similarity, across either pairs of instances, or clusters, or both. There is much overlap between
168 scoring methods for feature selection (discussed above under Goal 1) and scoring methods for
169 similarity.
170 Dimensionality reduction In this section, we discuss reducing the length of the per-pixel gene
171 expression feature vector. By “dimension”, we mean the dimension of this vector, not the spatial
174 dimension of the underlying data.
177 Figure 1: Top row: Genes Nfic
178 and A930001M12Rik are the most
179 correlated with area SS (somatosen-
180 sory cortex). Bottom row: Genes
181 C130038G02Rik and Cacna1i are
182 those with the best fit using logistic
183 regression. Within each picture, the
184 vertical axis roughly corresponds to
185 anterior at the top and posterior at the
186 bottom, and the horizontal axis roughly
187 corresponds to medial at the left and
188 lateral at the right. The red outline is
189 the boundary of region SS. Pixels are
190 colored according to correlation, with
191 red meaning high correlation and blue
192 meaning low.
193 Unlike Goal 1, there is no externally-imposed need to select only a handful of informative genes for inclusion
194 in the instances. However, some clustering algorithms
195 perform better on small numbers of features4. There are
196 techniques which “summarize” a larger number of fea-
197 tures using a smaller number of features; these tech-
198 niques go by the name of feature extraction or dimen-
199 sionality reduction. The small set of features that such a
200 technique yields is called the reduced feature set. Note
201 that the features in the reduced feature set do not neces-
202 sarily correspond to genes; each feature in the reduced
203 set may be any function of the set of gene expression
204 levels.
205 Clustering genes rather than voxels Although the
206 ultimate goal is to cluster the instances (voxels or pixels),
207 one strategy to achieve this goal is to first cluster the
208 features (genes). There are two ways that clusters of
209 genes could be used.
210 Gene clusters could be used as part of dimensionality
211 reduction: rather than have one feature for each gene,
212 we could have one reduced feature for each gene cluster.
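One simple way to build such reduced features is sketched below: each gene cluster is summarized by the mean expression of its member genes. Taking the mean is an assumption made for illustration (any function of the cluster's expression levels could serve), and the gene and cluster names are hypothetical.

```python
def reduce_by_gene_cluster(pixel_profile, gene_clusters):
    """Collapse a per-gene expression dict into one mean value per gene cluster."""
    return {
        name: sum(pixel_profile[g] for g in genes) / len(genes)
        for name, genes in gene_clusters.items()
    }


# Hypothetical expression levels at one pixel, and a hypothetical gene clustering.
pixel_profile = {"g1": 0.25, "g2": 0.75, "g3": 1.0}
gene_clusters = {"cluster1": ["g1", "g2"], "cluster2": ["g3"]}
reduced = reduce_by_gene_cluster(pixel_profile, gene_clusters)
```

The reduced feature vector has one entry per gene cluster rather than one per gene, so its length is set by the number of clusters.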
213 Gene clusters could also be used to directly yield a
214 clustering on instances. This is because many genes
215 have an expression pattern which seems to pick out a
216 single, spatially contiguous region. This suggests the fol-
217 lowing procedure: cluster together genes which pick out
218 similar regions, and then use the more popular com-
219 mon regions as the final clusters. In Preliminary Results,
220 Figure 7, we show that a number of anatomically recog-
221 nized cortical regions, as well as some “superregions” formed by lumping together a few regions,
222 are associated with gene clusters in this fashion.
223 Goal 3: interoperability with multi/hyperspectral imaging analysis software
224 A typical color image associates each pixel with a vector of three values. Multispectral and hyper-
225 spectral images, however, are images which associate each pixel with a vector containing many
226 values. The different positions in the vector correspond to different bands of electromagnetic
227 wavelengths5.
228 Some analysis techniques for hyperspectral imaging, especially preprocessing and calibration
229 techniques, make use of the information that the different values captured at each pixel represent
230 ____________________________________
231 4First, because the number of features in the reduced dataset is less than in the original dataset, the running time of
232 clustering algorithms may be much less. Second, it is thought that some clustering algorithms may give better results
233 on reduced data.
234 5In hyperspectral imaging, the bands are adjacent, and the number of different bands is larger. For conciseness, we
235 discuss only hyperspectral imaging, but our methods are also well suited to multispectral imaging with many bands.
238 adjacent wavelengths of light, which can be combined to make a spectrum. Other analysis tech-
239 niques ignore the interpretation of the values measured, and their relationship to each other within
240 the electromagnetic spectrum, instead treating them blindly as completely separate features.
241 With both hyperspectral imaging and spatial gene expression data, each location in space
242 is associated with more than three numerical feature values. The analysis of hyperspectral im-
243 ages can involve supervised classification and unsupervised learning. Often hyperspectral images
244 come from satellites looking at the Earth, and it is desirable to classify what sort of objects occupy
245 a given area of land. Sometimes detailed training data is not available, in which case it is desirable
246 at least to cluster together those regions of land which contain similar objects.
247 We believe that it may be possible for these two different fields to share some common compu-
248 tational tools. To this end, we intend to make use of existing hyperspectral imaging software when
249 possible, and to develop new software so that it is easy to use for
250 hyperspectral image analysis as well as for our primary purpose of spatial gene expression
251 data analysis.
252 Related work
254 Figure 2: Gene Pitx2 is selectively underexpressed in area SS.
256 As noted above, the GIS community has developed tools for supervised
257 classification and unsupervised clustering in the context of the analysis
258 of hyperspectral imaging data. One tool is Spectral Python6. Spectral
259 Python implements various supervised and unsupervised classification
260 methods, as well as utility functions for loading, viewing, and saving
261 spatial data. Although Spectral Python has feature extraction methods
262 (such as principal components analysis) which create a small set of
263 new features computed based on the original features, it does not have
264 feature selection methods, that is, methods to select a small subset
265 out of the original features (although feature selection in hyperspectral
266 imaging has been investigated by others[19]).
267 There is a substantial body of work on the analysis of gene expression data. Most of this con-
268 cerns gene expression data which are not fundamentally spatial7. Here we review only that work
269 which concerns the automated analysis of spatial gene expression data with respect to anatomy.
270 Relating to Goal 1, GeneAtlas[5] and EMAGE [24] allow the user to construct a search query by
271 demarcating regions and then specifying either the strength of expression or the name of another
272 gene or dataset whose expression pattern is to be matched. Neither GeneAtlas nor EMAGE allow
273 one to search for combinations of genes that define a region in concert.
274 Relating to Goal 2, EMAGE[24] allows the user to select a dataset from among a large number
275 of alternatives, or by running a search query, and then to cluster the genes within that dataset.
276 EMAGE clusters via hierarchical complete linkage clustering.
277 [15] describes AGEA, “Anatomic Gene Expression Atlas”. AGEA has three components. Gene
278 Finder: The user selects a seed voxel and the system (1) chooses a cluster which includes the
279 seed voxel, (2) yields a list of genes which are overexpressed in that cluster. Correlation: The user
280 selects a seed voxel and the system then shows the user how much correlation there is between
281 the gene expression profile of the seed voxel and every other voxel. Clusters: AGEA includes a
282 ____________________________________
283 6http://spectralpython.sourceforge.net/
284 7By “fundamentally spatial” we mean that there is information from a large number of spatial locations indexed by
285 spatial coordinates; not just data which have only a few different locations or which is indexed by anatomical label.
288 preset hierarchical clustering of voxels based on a recursive bifurcation algorithm with correlation
289 as the similarity metric. AGEA has been applied to the cortex. The paper describes interesting
290 results on the structure of correlations between voxel gene expression profiles within a handful of
291 cortical areas. However, that analysis neither looks for genes marking cortical areas, nor does it
292 suggest a cortical map based on gene expression data. Neither of the other components of AGEA
293 can be applied to cortical areas; AGEA’s Gene Finder cannot be used to find marker genes for the
294 cortical areas; and AGEA’s hierarchical clustering does not produce clusters corresponding to the
295 cortical areas8.
298 Figure 3: The top row shows the two
299 genes which (individually) best predict
300 area AUD, according to logistic regres-
301 sion. The bottom row shows the two
302 genes which (individually) best match
303 area AUD, according to gradient sim-
304 ilarity. From left to right and top to
305 bottom, the genes are Ssr1, Efcbp1,
306 Ptk7, and Aph1a.
307 [6] looks at the mean expression level of genes within anatomical regions, and applies a Student’s t-test to determine
308 whether the mean expression level of a gene is
309 significantly higher in the target region. This relates to
310 our Goal 1. [6] also clusters genes, relating to our Goal
311 2. For each cluster, prototypical spatial expression pat-
312 terns were created by averaging the genes in the cluster.
313 The prototypes were analyzed manually, without cluster-
314 ing voxels.
315 These related works differ from our strategy for Goal
316 1 in at least three ways. First, they find only single genes,
317 whereas we will also look for combinations of genes.
318 Second, they usually can only use overexpression as
319 a marker, whereas we will also search for underexpres-
320 sion. Third, they use scores based on pointwise expres-
321 sion levels, whereas we will also use geometric scores
322 such as gradient similarity (described in Preliminary Re-
323 sults). Figures 4, 2, and 3 in the Preliminary Results
324 section contain evidence that each of our three choices
325 is the right one.
326 [10] describes a technique to find combinations of
327 marker genes to pick out an anatomical region. They
328 use an evolutionary algorithm to evolve logical operators which combine boolean (thresholded)
329 images in order to match a target image. They apply their technique for finding combinations of
330 marker genes for the purpose of clustering genes around a “seed gene”.
331 Relating to our Goal 2, some researchers have attempted to parcellate cortex on the basis of
332 non-gene expression data. For example, [17], [2], [18], and [1] associate spots on the cortex with
333 the radial profile9 of response to some stain ([12] uses MRI), extract features from this profile, and
334 then use similarity between surface pixels to cluster.
335 [22] describes an analysis of the anatomy of the hippocampus using the ABA dataset. In
336 addition to manual analysis, two clustering methods were employed, a modified Non-negative
337 Matrix Factorization (NNMF), and a hierarchical bifurcation clustering scheme using correlation as
338 ____________________________________
339 8In both cases, the cause is that pairwise correlations between the gene expression of voxels in different areas but
340 the same layer are often stronger than pairwise correlations between the gene expression of voxels in different layers
341 but the same area. Therefore, a pairwise voxel correlation clustering algorithm will tend to create clusters representing
342 cortical layers, not areas.
343 9A radial profile is a profile along a line perpendicular to the cortical surface.
346 similarity. The paper reports impressive results, demonstrating the usefulness of computational genomic
347 anatomy. We have run NNMF on the cortical dataset, and while the results are promising, other
348 methods may perform as well or better (see Preliminary Results, Figure 6).
349 Comparing previous work with our Goal 1, there has been fruitful work on finding marker genes,
350 but only one of the projects explored combinations of marker genes, and none of them compared
351 the results obtained by using different algorithms or scoring methods. Comparing previous work
352 with Goal 2, although some projects obtained clusterings, there has not been much comparison
353 between different algorithms or scoring methods, so it is likely that the best clustering method for
354 this application has not yet been found. Also, none of these projects did a separate dimensionality
355 reduction step before clustering pixels, or tried to cluster genes first in order to guide automated
356 clustering of pixels into spatial regions, or used co-clustering algorithms.
357 In summary, (a) only one of the previous projects explores combinations of marker genes, (b)
358 there has been almost no comparison of different algorithms or scoring methods, and (c) there
359 has been no work on computationally finding marker genes applied to cortical areas, or on finding
360 a hierarchical clustering that will yield a map of cortical areas de novo from gene expression data.
361 Our project is guided by a concrete application with a well-specified criterion of success (how
362 well we can find marker genes for / reproduce the layout of cortical areas), which will provide a
363 solid basis for comparing different methods.
364 _________________________________________________
365 Data sharing plan
368 Figure 4: Upper left: wwc1. Upper
369 right: mtif2. Lower left: wwc1 + mtif2
370 (each pixel’s value on the lower left is
371 the sum of the corresponding pixels in
372 the upper row).
373 We are enthusiastic about the sharing of methods and data, and at the conclusion of the project, we will make
374 all of our data and computer source code publicly
375 available, either in supplemental attachments to publications,
376 or on a website. The source code will be released under
377 the GNU General Public License. We intend to include a
378 software program which, when run, will take as input the
379 Allen Brain Atlas raw data, and produce as output all
380 numbers and charts found in publications resulting from
381 the project. Source code to be released will include ex-
382 tensions to Caret[7], an existing open-source scientific
383 imaging program, and to Spectral Python. Data to be
384 released will include the 2-D “flat map” dataset. This
385 dataset will be submitted to a machine learning dataset
386 repository.
387 Broader impacts
388 In addition to validating the usefulness of the algorithms,
389 the application of these methods to cortex will produce
390 immediate benefits, because there are currently no known genetic markers for most cortical areas.
391 The method developed in Goal 1 will be applied to each cortical area to find a set of marker
392 genes such that the combinatorial expression pattern of those genes uniquely picks out the target
393 area. Finding marker genes will be useful for drug discovery as well as for experimentation be-
394 cause marker genes can be used to design interventions which selectively target individual cortical
395 areas.
398 The application of the marker gene finding algorithm to the cortex will also support the develop-
399 ment of new neuroanatomical methods. In addition to finding markers for each individual cortical
400 area, we will find a small panel of genes that can find many of the areal boundaries at once.
401 The method developed in Goal 2 will provide a genoarchitectonic viewpoint that will contribute
402 to the creation of a better cortical map.
403 The methods we will develop will be applicable to other datasets beyond the brain, and even to
404 datasets outside of biology. The software we develop will be useful for the analysis of hyperspectral
405 images. Our project will draw attention to this area of overlap between neuroscience and GIS, and
406 may lead to future collaborations between these two fields. The cortical dataset that we produce
407 will be useful in the machine learning community as a sample dataset that new algorithms can be
408 tested against. The availability of this sample dataset to the machine learning community may lead
409 to more interest in the design of machine learning algorithms to analyze spatial gene expression.
410 _
411 Preliminary Results
412 Format conversion between SEV, MATLAB, NIFTI
413 We have created software to (politely) download all of the SEV files10 from the Allen Institute
414 website. We have also created software to convert between the SEV, MATLAB, and NIFTI file
415 formats, as well as some of Caret’s file formats.
416 Flatmap of cortex
417 We downloaded the ABA data and selected only those voxels which belong to cerebral cortex.
418 We divided the cortex into hemispheres. Using Caret[7], we created a mesh representation of the
419 surface of the selected voxels. For each gene, and for each node of the mesh, we calculated an
420 average of the gene expression of the voxels “underneath” that mesh node. We then flattened
421 the cortex, creating a two-dimensional mesh. We converted this grid into a MATLAB matrix. We
422 manually traced the boundaries of each of 46 cortical areas from the ABA coronal reference atlas
423 slides, and converted this region data into MATLAB format.
424 At this point, the data are in the form of a number of 2-D matrices, all in registration, with the
425 matrix entries representing a grid of points (pixels) over the cortical surface. There is one 2-D
426 matrix whose entries represent the regional label associated with each surface pixel. And for each
427 gene, there is a 2-D matrix whose entries represent the average expression level underneath each
428 surface pixel. The features and the target area are both functions on the surface pixels. They can
429 be referred to as scalar fields over the space of surface pixels; alternately, they can be thought of
430 as images which can be displayed on the flatmapped surface.
431 Feature selection and scoring methods
432 Underexpression of a gene can serve as a marker. Underexpression of a gene can sometimes
433 serve as a marker. For example, see Figure 2.
434 Correlation Recall that the instances are surface pixels, and consider the problem of attempt-
435 ing to classify each instance as either a member of a particular anatomical area, or not. The target
436 area can be represented as a boolean mask over the surface pixels.
437 Footnote 10: SEV is a sparse format for spatial data. It is the format in which the ABA data are made available.
440 We calculated the correlation between each gene and each cortical area. The top row of Figure
441 1 shows the three genes most correlated with area SS.
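The pointwise correlation score can be illustrated with a minimal sketch. It is not the code used to produce Figure 1; it assumes NumPy, that each gene is a 2-D array in registration with a boolean area mask, and the function names and toy arrays are hypothetical.

```python
import numpy as np

def correlation_with_area(expr, mask):
    """Pearson correlation between one gene's expression image and a boolean area mask."""
    x = np.asarray(expr, dtype=float).ravel()
    y = np.asarray(mask, dtype=float).ravel()
    x = x - x.mean()
    y = y - y.mean()
    denom = np.sqrt((x * x).sum() * (y * y).sum())
    return 0.0 if denom == 0 else float((x * y).sum() / denom)

def rank_genes_by_correlation(expr_stack, mask):
    """Return gene indices sorted from most to least correlated, plus the raw scores."""
    scores = np.array([correlation_with_area(g, mask) for g in expr_stack])
    return np.argsort(-scores), scores

# Toy data (hypothetical): a 4x4 surface whose target area is the top-left 2x2 block.
mask = np.zeros((4, 4), dtype=bool)
mask[:2, :2] = True
gene_a = mask.astype(float) * 3.0 + 0.5          # expressed only inside the area
gene_b = 1.0 - mask.astype(float)                # underexpressed inside the area
gene_c = (np.indices((4, 4)).sum(axis=0) % 2).astype(float)  # checkerboard, unrelated

order, scores = rank_genes_by_correlation([gene_a, gene_b, gene_c], mask)
```

Note that the underexpressed gene receives a strongly negative score, so ranking by absolute value would be one way to also surface underexpression markers.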
442 Conditional entropy
443 For each region, we created and ran a forward stepwise procedure which attempted to find
444 pairs of genes such that the conditional entropy of the target area’s boolean mask, conditioned
445 upon the gene pair’s thresholded expression levels, is minimized.
446 This finds pairs of genes which are most informative (at least at these threshold levels) relative
447 to the question, “Is this surface pixel a member of the target area?”. The advantage over linear
448 methods such as logistic regression is that this takes account of arbitrarily nonlinear relationships;
449 for example, if the XOR of two variables predicts the target, conditional entropy would notice,
450 whereas linear methods would not.
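To make the XOR remark concrete, the following sketch computes the conditional entropy of a boolean target given one or two thresholded genes. It is an illustration under stated assumptions, not our pilot implementation: thresholding is assumed to have been done already, the greedy pair search is a simplified stand-in for the forward stepwise procedure, and all names are hypothetical.

```python
from collections import Counter
from math import log2

def conditional_entropy(target, *features):
    """H(target | features) in bits; needs at least one feature sequence."""
    n = len(target)
    joint = Counter(zip(target, *features))   # counts of (t, f1, f2, ...)
    cond = Counter(zip(*features))            # counts of (f1, f2, ...)
    h = 0.0
    for key, count in joint.items():
        h -= (count / n) * log2(count / cond[key[1:]])
    return h

def best_pair_forward(target, binarized_genes):
    """Greedy search: best single gene first, then the best partner for it."""
    idx = range(len(binarized_genes))
    first = min(idx, key=lambda i: conditional_entropy(target, binarized_genes[i]))
    second = min((j for j in idx if j != first),
                 key=lambda j: conditional_entropy(
                     target, binarized_genes[first], binarized_genes[j]))
    return first, second

# Toy data (hypothetical): the target is the XOR of genes 0 and 1; gene 2 is constant.
g0 = [0, 0, 1, 1]
g1 = [0, 1, 0, 1]
g2 = [0, 0, 0, 0]
target = [a ^ b for a, b in zip(g0, g1)]

h_single = conditional_entropy(target, g0)      # one gene alone is uninformative
h_pair = conditional_entropy(target, g0, g1)    # the pair determines the target
pair = best_pair_forward(target, [g0, g1, g2])
```

In this toy case each gene alone leaves one full bit of uncertainty, while the pair leaves none, which is the situation a linear model on the two genes cannot capture.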
451 Gradient similarity We noticed that the previous two scoring methods, which are pointwise,
452 often found genes whose pattern of expression did not look similar in shape to the target region.
453 For this reason we designed a non-pointwise scoring method to detect when a gene had a pattern
454 of expression which looked like it had a boundary whose shape is similar to the shape of the target
455 region. We call this scoring method “gradient similarity”. The formula is:
456 $$\sum_{\text{pixel} \in \text{pixels}} \cos(\angle\nabla_1 - \angle\nabla_2) \cdot \frac{|\nabla_1| + |\nabla_2|}{2} \cdot \frac{\text{pixel\_value}_1 + \text{pixel\_value}_2}{2}$$
460 where $\nabla_1$ and $\nabla_2$ are the gradient vectors of the two images at the current pixel; $\angle\nabla_i$ is the
461 angle of the gradient of image $i$ at the current pixel; $|\nabla_i|$ is the magnitude of the gradient of image
462 $i$ at the current pixel; and $\text{pixel\_value}_i$ is the value of the current pixel in image $i$.
463 The intuition is that we want to see if the borders of the pattern in the two images are similar; if
464 the borders are similar, then both images will have corresponding pixels with large gradients (be-
465 cause this is a border) which are oriented in a similar direction (because the borders are similar).
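A minimal sketch of the gradient similarity score follows. It assumes NumPy and finite-difference gradients via `numpy.gradient`; how gradients were actually estimated for our results is not specified here, so that choice, the function name, and the toy images are assumptions.

```python
import numpy as np

def gradient_similarity(img1, img2):
    """Sum over pixels of cos(angle difference) * mean gradient magnitude * mean pixel value."""
    a = np.asarray(img1, dtype=float)
    b = np.asarray(img2, dtype=float)
    gy1, gx1 = np.gradient(a)          # derivatives along rows (y), then columns (x)
    gy2, gx2 = np.gradient(b)
    mag1, mag2 = np.hypot(gx1, gy1), np.hypot(gx2, gy2)
    ang1, ang2 = np.arctan2(gy1, gx1), np.arctan2(gy2, gx2)
    return float(np.sum(np.cos(ang1 - ang2) * (mag1 + mag2) / 2 * (a + b) / 2))

# Toy images (hypothetical): a vertical step edge, the same edge rotated, and inverted.
edge = np.zeros((6, 6))
edge[:, 3:] = 1.0
same = gradient_similarity(edge, edge)            # identical borders
rotated = gradient_similarity(edge, edge.T)       # perpendicular borders
inverted = gradient_similarity(edge, 1.0 - edge)  # same border, opposite polarity
```

Identical borders score highest, perpendicular borders score lower, and a border of opposite polarity scores negative, so an underexpression marker would need the sign handled separately.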
466 Gradient similarity provides information complementary to correlation
467 To show that gradient similarity can provide useful information that cannot be detected via
468 pointwise analyses, consider Fig. 3. The pointwise method in the top row identifies genes which
469 express more strongly in AUD than outside of it; its weakness is that this includes many areas
470 which don’t have a salient border matching the areal border. The geometric method identifies
471 genes whose salient expression border seems to partially line up with the border of AUD; its
472 weakness is that this includes genes which don’t express over the entire area.
473 Areas which can be identified by single genes Using gradient similarity, we have already
474 found single genes which roughly identify some areas and groupings of areas. For each of these
475 areas, an example of a gene which roughly identifies it is shown in Figure 5. We have not yet
476 cross-verified these genes in other atlases.
477 In addition, there are a number of areas which are almost identified by single genes: COAa+NLOT
478 (anterior part of cortical amygdalar area, nucleus of the lateral olfactory tract), ENT (entorhinal),
479 ACAv (ventral anterior cingulate), VIS (visual), AUD (auditory).
480 These results validate our expectation that the ABA dataset can be exploited to find marker
481 genes for many cortical areas, while also validating the relevance of our new scoring method,
482 gradient similarity.
489 Figure 5: From left to right and top to bottom, single genes which roughly identify areas SS (somatosensory
490 primary + supplemental), SSs (supplemental somatosensory), PIR (piriform), FRP (frontal pole), RSP
491 (retrosplenial), and COApm (cortical amygdalar, posterior part, medial zone). Grouping some areas together, we
492 have also found genes to identify the groups ACA+PL+ILA+DP+ORB+MO (anterior cingulate, prelimbic,
493 infralimbic, dorsal peduncular, orbital, motor) and posterior and lateral visual (VISpm, VISpl, VISl, VISp;
494 posteromedial, posterolateral, lateral, and primary visual; the posterior and lateral visual area is
495 distinguished from its neighbors, but not from the entire rest of the cortex). The genes are Pitx2, Aldh1a2,
496 Ppfibp1, Slco1a5, Tshz2, Trhr, Col12a1, Ets1.
509 Combinations of multiple genes are useful and necessary for some areas
511 In Figure 4, we give an example of a cortical area which is not marked by any single gene, but which can be
512 identified combinatorially. According to logistic regression, gene Wwc1 is the best-fit single gene for
513 predicting whether or not a pixel on the cortical surface belongs to the motor area (area MO). The upper-left
514 picture in Figure 4 shows Wwc1’s spatial expression pattern over the cortex. The lower-right boundary of MO is
515 represented reasonably well by this gene, but the gene overshoots the upper-left boundary. This flattened 2-D
516 representation does not show it, but the area corresponding to the overshoot is the medial surface of the
517 cortex. MO is only found on the dorsal surface. Gene Mtif2 is shown in the upper-right. Mtif2 captures MO’s
518 upper-left boundary, but not its lower-right boundary. Mtif2 does not express very much on the medial surface.
519 By adding together the values at each pixel in these two figures, we get the lower-left image. This
520 combination captures area MO much better than any single gene.
530 This shows that developing a method to find combinations of marker genes is both feasible and
531 necessary.
533 Multivariate supervised learning
534 Forward stepwise logistic regression Logistic regression is a popular method for predictive modeling of
535 categorical data. As a pilot run, for five cortical areas (SS, AUD, RSP, VIS, and MO), we performed forward
536 stepwise logistic regression to find single genes, pairs of genes, and triplets of genes which predict areal
537 identity. This is an example of feature selection integrated with prediction using a stepwise wrapper. Some of
538 the single genes found are shown in various figures throughout this document, and Figure 4 shows a combination
539 of genes which was found.
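The stepwise wrapper idea can be sketched as follows. This is an illustrative toy, not our pilot code: it assumes NumPy, fits logistic regression by plain gradient descent with hypothetical step-size and iteration settings, and selects genes greedily by training log-loss. The toy data imitate the Figure 4 situation, in which two genes each capture one boundary of the target.

```python
import numpy as np

def fit_logistic(X, y, steps=2000, lr=0.5):
    """Fit logistic regression by gradient descent; return (weights, mean log-loss)."""
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])    # prepend an intercept column
    w = np.zeros(Xb.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-Xb @ w))
        w -= lr * Xb.T @ (p - y) / len(y)
    p = np.clip(1.0 / (1.0 + np.exp(-Xb @ w)), 1e-12, 1 - 1e-12)
    loss = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
    return w, float(loss)

def forward_stepwise(X, y, max_genes=3):
    """Greedily add the gene (column) that most reduces the training log-loss."""
    chosen = []
    for _ in range(max_genes):
        candidates = [j for j in range(X.shape[1]) if j not in chosen]
        losses = {j: fit_logistic(X[:, chosen + [j]], y)[1] for j in candidates}
        chosen.append(min(losses, key=losses.get))
    return chosen

# Toy data (hypothetical): 30 pixels along a line; the target area is pixels 10..19.
pos = np.arange(30)
gene0 = (pos >= 10).astype(float)   # captures one boundary, overshoots the other
gene1 = (pos < 20).astype(float)    # captures the other boundary
gene2 = (pos % 2).astype(float)     # unrelated to the target
X = np.column_stack([gene0, gene1, gene2])
y = ((pos >= 10) & (pos < 20)).astype(float)

chosen = forward_stepwise(X, y, max_genes=2)
```

A real run would score candidates on held-out pixels rather than training loss; that refinement is omitted here to keep the sketch short.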
545 SVM on all genes at once
546 In order to see how well one can do when looking at all genes at once, we ran a support vector machine to
547 classify cortical surface pixels based on their gene expression profiles. We achieved classification accuracy
548 of about 81% (5-fold cross-validation). However, as noted above, a classifier that looks at all the genes at
549 once isn’t as practically useful as a classifier that uses only a few genes.
557 Data-driven redrawing of the cortical map
558 We have applied the following dimensionality reduction algorithms to reduce the dimensionality
559 of the gene expression profile associated with each pixel: Principal Components Analysis (PCA),
560 Simple PCA, Multi-Dimensional Scaling, Isomap, Landmark Isomap, Laplacian eigenmaps, Local
561 Tangent Space Alignment, Stochastic Proximity Embedding, Fast Maximum Variance Unfolding,
562 Non-negative Matrix Factorization (NNMF). Space constraints prevent us from showing many of
563 the results, but as a sample, PCA, NNMF, and landmark Isomap are shown in the first, second,
564 and third rows of Figure 6.
565 After applying the dimensionality reduction, we ran clustering algorithms on the reduced data.
566 To date we have tried k-means and spectral clustering. The results of k-means after PCA, NNMF,
567 and landmark Isomap are shown in the bottom row of Figure 6. To compare, the leftmost picture
568 on the bottom row of Figure 6 shows some of the major subdivisions of cortex. These results show
569 that different dimensionality reduction techniques capture different aspects of the data and lead
570 to different clusterings, indicating the utility of our proposal to produce a detailed comparison of
571 these techniques as applied to the domain of genomic anatomy.
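The reduce-then-cluster pipeline can be sketched in a few lines. This is a simplified illustration assuming NumPy, with PCA computed via the singular value decomposition and a basic Lloyd's-algorithm k-means; it does not reproduce the toolboxes or settings used for Figure 6, and the toy profiles are hypothetical.

```python
import numpy as np

def pca_reduce(profiles, n_components):
    """Project pixel-by-gene profiles onto their first principal components."""
    centered = profiles - profiles.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)  # rows of vt: principal axes
    return centered @ vt[:n_components].T

def kmeans(points, k, n_iter=50, seed=0):
    """Basic Lloyd's algorithm; returns one cluster label per row of points."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for c in range(k):
            if np.any(labels == c):
                centers[c] = points[labels == c].mean(axis=0)
    return labels

# Toy data (hypothetical): 20 pixels x 5 genes, two groups with distinct profiles.
rng = np.random.default_rng(1)
profiles = np.vstack([np.zeros((10, 5)), np.full((10, 5), 5.0)])
profiles = profiles + 0.1 * rng.standard_normal((20, 5))

reduced = pca_reduce(profiles, n_components=2)
labels = kmeans(reduced, k=2)
```

Reshaping `labels` back onto the flatmap grid would give a cluster image comparable in form to the bottom row of Figure 6.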
572 Many areas are captured by clusters of genes We also clustered the genes using gradient
573 similarity to see if the spatial regions defined by any clusters matched known anatomical regions.
574 Figure 7 shows, for ten sample gene clusters, each cluster’s average expression pattern, compared
575 to a known anatomical boundary. This suggests that it is worth attempting to cluster genes,
576 and then to use the results to cluster pixels.
577 Our plan: what remains to be done
578 Flatmap cortex and segment cortical layers
579 There are multiple ways to flatten 3-D data into 2-D. We will compare mappings from manifolds to
580 planes which attempt to preserve size (such as the one used by Caret[7]) with mappings which
581 preserve angle (conformal maps). We will also develop a segmentation algorithm to automatically
582 identify the layer boundaries.
583 Develop algorithms that find genetic markers for anatomical regions
584 Scoring measures and feature selection We will develop scoring methods for evaluating how
585 good individual genes are at marking areas. We will compare pointwise, geometric, and information-theoretic
586 measures. We have already developed one entirely new scoring method (gradient similarity),
587 but we may develop more. Scoring measures that we will explore will include the L1 norm, correlation,
588 expression energy ratio, conditional entropy, gradient similarity, Jaccard similarity, Dice
589 similarity, Hough transform, and statistical tests such as Student’s t-test and the Mann-Whitney
590 U test (a non-parametric test). In addition, any classifier induces a scoring measure on genes by
591 taking the prediction error when using that gene to predict the target.
592 Using some combination of these measures, we will develop a procedure to find single marker
593 genes for anatomical regions: for each cortical area, we will rank the genes by their ability to
594 delineate that area. We will quantitatively compare the list of single genes generated by our
595 method to the lists generated by methods which are mentioned in Related Work.
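Two of the set-overlap measures listed above, Jaccard and Dice similarity, can be sketched as gene-scoring functions. The sketch assumes NumPy, a single fixed expression threshold (a hypothetical simplification), and hypothetical function names.

```python
import numpy as np

def jaccard(a, b):
    """|A and B| / |A or B| for two boolean images."""
    a, b = np.asarray(a, dtype=bool), np.asarray(b, dtype=bool)
    union = np.logical_or(a, b).sum()
    return 1.0 if union == 0 else float(np.logical_and(a, b).sum() / union)

def dice(a, b):
    """2 |A and B| / (|A| + |B|) for two boolean images."""
    a, b = np.asarray(a, dtype=bool), np.asarray(b, dtype=bool)
    total = a.sum() + b.sum()
    return 1.0 if total == 0 else float(2 * np.logical_and(a, b).sum() / total)

def rank_genes(expr_stack, mask, threshold, score=jaccard):
    """Rank genes by how well their thresholded expression overlaps the target mask."""
    scores = [score(np.asarray(g) > threshold, mask) for g in expr_stack]
    return sorted(range(len(scores)), key=lambda i: -scores[i])

# Toy data (hypothetical): the target area is the top-left 2x2 block of a 4x4 surface.
mask = np.zeros((4, 4), dtype=bool)
mask[:2, :2] = True
gene0 = np.zeros((4, 4))
gene0[:2, :] = 1.0                 # overshoots: expressed in the top two rows
gene1 = mask.astype(float)         # expressed exactly in the target area

j0 = jaccard(gene0 > 0.5, mask)
d0 = dice(gene0 > 0.5, mask)
ranking = rank_genes([gene0, gene1], mask, threshold=0.5)
```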
599 Figure 6: First row: the first 6 reduced dimensions, using PCA. Second row: the first 6 reduced dimensions,
600 using NNMF. Third row: the first 6 reduced dimensions, using landmark Isomap. Bottom row: examples of k-means
601 clustering applied to reduced datasets to find 7 clusters. Left: 19 of the major subdivisions of the cortex.
602 Second from left: PCA. Third from left: NNMF. Right: landmark Isomap. Additional details: In the third and
603 fourth rows, 7 dimensions were found, but only 6 displayed. In the last row: for PCA, 50 dimensions were used;
604 for NNMF, 6 dimensions were used; for landmark Isomap, 7 dimensions were used.
608 Some cortical areas have no single marker genes but can be identified by combinatorial coding. This requires
609 multivariate scoring measures and feature selection procedures. Many of the measures, such as expression
610 energy, gradient similarity, Jaccard, Dice, Hough, Student’s t, and Mann-Whitney U, are univariate. We will
611 extend these scoring measures for use in multivariate feature selection, that is, for scoring how well
612 combinations of genes, rather than individual genes, can distinguish a target area. There are existing
613 multivariate forms of some of the univariate scoring measures; for example, Hotelling’s T-square is a
614 multivariate analog of Student’s t.
635 We will develop a feature selection procedure for choosing the best small set of marker genes for a given
636 anatomical area. In addition to using the scoring measures that we develop, we will also explore (a) feature
637 selection using a stepwise wrapper over “vanilla” classifiers such as logistic regression, (b) supervised
638 learning methods such as decision trees which incrementally/greedily combine single gene
639 markers into sets, and (c) supervised learning methods which use soft constraints to minimize
640 the number of features used, such as sparse support vector machines (SVMs).
642 Since errors of displacement and of shape may cause genes and target areas to match less
643 than they should, we will consider the robustness of feature selection methods in the presence of
644 error. Some of these methods, such as the Hough transform, are designed to be resistant in the
645 presence of error, but many are not.
646 An area may be difficult to identify because the boundaries are misdrawn in the atlas, or be-
647 cause the shape of the natural domain of gene expression corresponding to the area is different
648 from the shape of the area as recognized by anatomists. We will develop extensions to our procedure
649 which (a) detect when a difficult area could be fit if its boundary were redrawn slightly (see footnote 12),
650 and (b) detect when a difficult area could be combined with adjacent areas to create a larger area
651 which can be fit.
652 Footnote 12: Not just any redrawing is acceptable, only those which appear to be justified as a natural spatial
653 domain of gene expression by multiple sources of evidence. Interestingly, the need to detect “natural spatial
654 domains of gene expression” in a data-driven fashion means that the methods of Goal 2 might be useful in
655 achieving Goal 1 as well, particularly discriminative dimensionality reduction.
658 A future publication on the method that we develop in Goal 1 will review the scoring measures
659 and quantitatively compare their performance in order to provide a foundation for future research
660 on methods of marker gene finding. We will measure the robustness of the scoring measures as
661 well as their absolute performance on our dataset.
662 Develop algorithms to suggest a division of a structure into anatomical parts
664 Figure 7: Prototypes corresponding to sample gene clusters, clustered by gradient similarity. Region boundaries
665 for the region that most matches each prototype are overlaid.
666 Dimensionality reduction on gene expression profiles We have already described the application of ten
667 dimensionality reduction algorithms for the purpose of replacing the gene expression profiles, which are
668 vectors of about 4000 gene expression levels, with a smaller number of features. We plan to further explore
669 and interpret these results, as well as to apply other unsupervised learning algorithms, including
670 independent components analysis, self-organizing maps, and generative models such as deep Boltzmann
671 machines. We will explore ways to quantitatively compare the relevance of the different dimensionality
672 reduction methods for identifying cortical areal boundaries.
682 Dimensionality reduction on pixels Instead of applying dimensionality reduction to the gene
683 expression profiles, the same techniques can be applied to the pixels. It is possible that
684 the features generated in this way by some dimensionality reduction techniques will directly correspond
685 to interesting spatial regions.
686 Clustering and segmentation on pixels We will explore clustering and image segmentation
687 algorithms in order to segment the pixels into regions. We will explore k-means, spectral clustering,
688 gene shaving[9], recursive division clustering, multivariate generalizations of edge detectors,
689 multivariate generalizations of watershed transformations, region growing, active contours, graph
690 partitioning methods, and recursive agglomerative clustering with various linkage functions. These
691 methods can be combined with dimensionality reduction.
692 Clustering on genes We have already shown that the procedure of clustering genes according
693 to gradient similarity, and then creating an averaged prototype of each cluster’s expression pattern,
694 yields some spatial patterns which match cortical areas (Figure 7). We will further explore the
695 clustering of genes.
696 In addition to using the cluster expression prototypes directly to identify spatial regions, this
697 might be useful as a component of dimensionality reduction. For example, one could imagine
698 clustering similar genes and then replacing their expression levels with a single average expression
703 level, thereby removing some redundancy from the gene expression profiles. One could then
704 perform clustering on pixels (possibly after a second dimensionality reduction step) in order to
705 identify spatial regions. It remains to be seen whether removal of redundancy would help or hurt
706 the ultimate goal of identifying interesting spatial regions.
707 Co-clustering We will explore some algorithms which simultaneously incorporate clustering
708 on instances and on features (in our case, pixels and genes), for example, IRM[11]. These are
709 called co-clustering or biclustering algorithms.
710 Compare different methods In order to tell which method is best for genomic anatomy, for
711 each experimental method we will compare the cortical map found by unsupervised learning to a
712 cortical map derived from the Allen Reference Atlas. We will explore various quantitative metrics
713 that purport to measure how similar two clusterings are, such as Jaccard, Rand index, Fowlkes-
714 Mallows, variation of information, Larsen, Van Dongen, and others.
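Two of the pair-counting metrics named above can be sketched using only the Python standard library. Both compare clusterings through pairs of pixels, so they are insensitive to how clusters are numbered. The quadratic loop over pairs is an assumption made for clarity; a contingency-table formulation would be preferable for a full cortical map. The labelings are hypothetical.

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Fraction of pixel pairs on which two clusterings agree (same/same or different/different)."""
    agree = total = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        agree += same_a == same_b
        total += 1
    return agree / total

def pair_jaccard(labels_a, labels_b):
    """Pairs co-clustered in both maps, divided by pairs co-clustered in at least one."""
    both = either = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        both += same_a and same_b
        either += same_a or same_b
    return both / either if either else 1.0

# Toy labelings (hypothetical) of six pixels.
atlas = [0, 0, 0, 1, 1, 1]        # reference map
relabeled = [1, 1, 1, 0, 0, 0]    # same partition, different cluster numbers
shifted = [0, 0, 1, 1, 1, 1]      # one pixel assigned to the other region

ri_same = rand_index(atlas, relabeled)
ri_shift = rand_index(atlas, shifted)
pj_shift = pair_jaccard(atlas, shifted)
```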
715 Discriminative dimensionality reduction In addition to using a purely data-driven approach
716 to identify spatial regions, it might be useful to see how well the known regions can be recon-
717 structed from a small number of features, even if those features are chosen by using knowledge of
718 the regions. For example, linear discriminant analysis could be used as a dimensionality reduction
719 technique in order to identify a few features which are the best linear summary of gene expression
720 profiles for the purpose of discriminating between regions. This reduced feature set could then be
721 used to cluster pixels into regions. Perhaps the resulting clusters will be similar to the reference
722 atlas, yet more faithful to natural spatial domains of gene expression than the reference atlas is.
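As a simplified illustration of discriminative dimensionality reduction, the sketch below computes a two-class Fisher discriminant direction, which is the special case of linear discriminant analysis for one region versus the rest. It assumes NumPy; the small ridge term, the function name, and the toy data are assumptions, and a multi-region analysis would yield several discriminant directions rather than one.

```python
import numpy as np

def fisher_direction(profiles, in_region, ridge=1e-6):
    """Unit vector w maximizing between-class over within-class scatter (two classes)."""
    inside, outside = profiles[in_region], profiles[~in_region]
    m1, m0 = inside.mean(axis=0), outside.mean(axis=0)
    within = (inside - m1).T @ (inside - m1) + (outside - m0).T @ (outside - m0)
    within += ridge * np.eye(within.shape[0])     # keeps the solve well conditioned
    w = np.linalg.solve(within, m1 - m0)
    return w / np.linalg.norm(w)

# Toy data (hypothetical): 40 pixels x 3 genes; only gene 0 differs between regions.
rng = np.random.default_rng(0)
profiles = 0.5 * rng.standard_normal((40, 3))
in_region = np.arange(40) < 20
profiles[in_region, 0] += 5.0

w = fisher_direction(profiles, in_region)
projection = profiles @ w      # one discriminative feature per pixel
```

The projected values could then be passed to a pixel clustering step in place of the full expression profiles.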
723 Apply the new methods to the cortex
724 Using the methods developed in Goal 1, we will present, for each cortical area, a short list of
725 markers to identify that area; and we will also present lists of “panels” of genes that can be used
726 to delineate many areas at once.
727 Because in most cases the ABA coronal dataset only contains one ISH per gene, it is possible
728 for an unrelated combination of genes to seem to identify an area when in fact it is only coinci-
729 dence. There are three ways we will validate our marker genes to guard against this. First, we
730 will confirm that putative combinations of marker genes express the same pattern in both hemi-
731 spheres. Second, we will manually validate our final results on other gene expression datasets
732 such as EMAGE, GeneAtlas, and GENSAT[8]. Third, we may conduct ISH experiments jointly with
733 collaborators to get further data on genes of particular interest.
734 Using the methods developed in Goal 2, we will present one or more hierarchical cortical
735 maps. We will identify and explain how the statistical structure in the gene expression data led to
736 any unexpected or interesting features of these maps, and we will provide biological hypotheses
737 to interpret any new cortical areas, or groupings of areas, which are discovered.
738 Apply the new methods to hyperspectral datasets
739 Our software will be able to read and write file formats common in the hyperspectral imaging
740 community such as Erdas LAN and ENVI, and it will be able to convert between the SEV and NIFTI
741 formats from neuroscience and the ENVI format from GIS. The methods developed in Goals 1 and
742 2 will be implemented either as part of Spectral Python or as a separate tool that interoperates
743 with Spectral Python. The methods will be run on hyperspectral satellite image datasets, and their
744 performance will be compared to existing hyperspectral analysis techniques.
747 References Cited
748 [1] Chris Adamson, Leigh Johnston, Terrie Inder, Sandra Rees, Iven Mareels, and Gary Egan.
749 A Tracking Approach to Parcellation of the Cerebral Cortex, volume 3749/2005 of Lecture
750 Notes in Computer Science, pages 294–301. Springer Berlin / Heidelberg, 2005.
751 [2] J. Annese, A. Pitiot, I. D. Dinov, and A. W. Toga. A myelo-architectonic method for the structural
752 classification of cortical areas. NeuroImage, 21(1):15–26, 2004.
753 [3] Tanya Barrett, Dennis B. Troup, Stephen E. Wilhite, Pierre Ledoux, Dmitry Rudnev, Carlos
754 Evangelista, Irene F. Kim, Alexandra Soboleva, Maxim Tomashevsky, and Ron Edgar. NCBI
755 GEO: mining tens of millions of expression profiles–database and tools update. Nucl. Acids
756 Res., 35(suppl_1):D760–765, 2007.
757 [4] George W. Bell, Tatiana A. Yatskievych, and Parker B. Antin. GEISHA, a whole-mount in
758 situ hybridization gene expression screen in chicken embryos. Developmental Dynamics,
759 229(3):677–687, 2004.
760 [5] James P Carson, Tao Ju, Hui-Chen Lu, Christina Thaller, Mei Xu, Sarah L Pallas, Michael C
761 Crair, Joe Warren, Wah Chiu, and Gregor Eichele. A digital atlas to characterize the mouse
762 brain transcriptome. PLoS Comput Biol, 1(4):e41, 2005.
763 [6] Mark H. Chin, Alex B. Geng, Arshad H. Khan, Wei-Jun Qian, Vladislav A. Petyuk, Jyl Boline,
764 Shawn Levy, Arthur W. Toga, Richard D. Smith, Richard M. Leahy, and Desmond J. Smith.
765 A genome-scale map of expression for a mouse brain section obtained using voxelation.
766 Physiol. Genomics, 30(3):313–321, August 2007.
767 [7] D C Van Essen, H A Drury, J Dickson, J Harwell, D Hanlon, and C H Anderson. An integrated
768 software suite for surface-based analyses of cerebral cortex. Journal of the American Medical
769 Informatics Association: JAMIA, 8(5):443–59, 2001. PMID: 11522765.
770 [8] Shiaoching Gong, Chen Zheng, Martin L. Doughty, Kasia Losos, Nicholas Didkovsky, Uta B.
771 Schambra, Norma J. Nowak, Alexandra Joyner, Gabrielle Leblanc, Mary E. Hatten, and
772 Nathaniel Heintz. A gene expression atlas of the central nervous system based on bacterial
773 artificial chromosomes. Nature, 425(6961):917–925, October 2003.
774 [9] Trevor Hastie, Robert Tibshirani, Michael Eisen, Ash Alizadeh, Ronald Levy, Louis Staudt,
775 Wing Chan, David Botstein, and Patrick Brown. ‘Gene shaving’ as a method for identifying distinct
776 sets of genes with similar expression patterns. Genome Biology, 1(2):research0003.1–
777 research0003.21, 2000.
778 [10] Jano Hemert and Richard Baldock. Matching Spatial Regions with Combinations of Interact-
779 ing Gene Expression Patterns, volume 13 of Communications in Computer and Information
780 Science, pages 347–361. Springer Berlin Heidelberg, 2008.
781 [11] C Kemp, JB Tenenbaum, TL Griffiths, T Yamada, and N Ueda. Learning systems of concepts
782 with an infinite relational model. In AAAI, 2006.
783 [12] F. Kruggel, M. K. Brückner, Th. Arendt, C. J. Wiggins, and D. Y. von Cramon. Analyzing the
784 neocortical fine-structure. Medical Image Analysis, 7(3):251–264, September 2003.
787 [13] Ed S. Lein, Michael J. Hawrylycz, Nancy Ao, Mikael Ayres, Amy Bensinger, Amy Bernard,
788 Andrew F. Boe, Mark S. Boguski, Kevin S. Brockway, Emi J. Byrnes, Lin Chen, Li Chen,
789 Tsuey-Ming Chen, Mei Chi Chin, Jimmy Chong, Brian E. Crook, Aneta Czaplinska, Chinh N.
790 Dang, Suvro Datta, Nick R. Dee, Aimee L. Desaki, Tsega Desta, Ellen Diep, Tim A. Dolbeare,
791 Matthew J. Donelan, Hong-Wei Dong, Jennifer G. Dougherty, Ben J. Duncan, Amanda J.
792 Ebbert, Gregor Eichele, Lili K. Estin, Casey Faber, Benjamin A. Facer, Rick Fields, Shanna R.
793 Fischer, Tim P. Fliss, Cliff Frensley, Sabrina N. Gates, Katie J. Glattfelder, Kevin R. Halverson,
794 Matthew R. Hart, John G. Hohmann, Maureen P. Howell, Darren P. Jeung, Rebecca A. John-
795 son, Patrick T. Karr, Reena Kawal, Jolene M. Kidney, Rachel H. Knapik, Chihchau L. Kuan,
796 James H. Lake, Annabel R. Laramee, Kirk D. Larsen, Christopher Lau, Tracy A. Lemon,
797 Agnes J. Liang, Ying Liu, Lon T. Luong, Jesse Michaels, Judith J. Morgan, Rebecca J. Mor-
798 gan, Marty T. Mortrud, Nerick F. Mosqueda, Lydia L. Ng, Randy Ng, Geralyn J. Orta, Car-
799 oline C. Overly, Tu H. Pak, Sheana E. Parry, Sayan D. Pathak, Owen C. Pearson, Ralph B.
800 Puchalski, Zackery L. Riley, Hannah R. Rockett, Stephen A. Rowland, Joshua J. Royall,
801 Marcos J. Ruiz, Nadia R. Sarno, Katherine Schaffnit, Nadiya V. Shapovalova, Taz Sivisay,
802 Clifford R. Slaughterbeck, Simon C. Smith, Kimberly A. Smith, Bryan I. Smith, Andy J. Sodt,
803 Nick N. Stewart, Kenda-Ruth Stumpf, Susan M. Sunkin, Madhavi Sutram, Angelene Tam,
804 Carey D. Teemer, Christina Thaller, Carol L. Thompson, Lee R. Varnam, Axel Visel, Ray M.
805 Whitlock, Paul E. Wohnoutka, Crissa K. Wolkey, Victoria Y. Wong, Matthew Wood, Murat B.
806 Yaylaoglu, Rob C. Young, Brian L. Youngstrom, Xu Feng Yuan, Bin Zhang, Theresa A. Zwing-
807 man, and Allan R. Jones. Genome-wide atlas of gene expression in the adult mouse brain.
808 Nature, 445(7124):168–176, 2007.
809 [14] Susan Magdaleno, Patricia Jensen, Craig L. Brumwell, Anna Seal, Karen Lehman, Andrew
810 Asbury, Tony Cheung, Tommie Cornelius, Diana M. Batten, Christopher Eden, Shannon M.
811 Norland, Dennis S. Rice, Nilesh Dosooye, Sundeep Shakya, Perdeep Mehta, and Tom Cur-
812 ran. BGEM: an in situ hybridization database of gene expression in the embryonic and adult
813 mouse nervous system. PLoS Biology, 4(4):e86, April 2006.
814 [15] Lydia Ng, Amy Bernard, Chris Lau, Caroline C Overly, Hong-Wei Dong, Chihchau Kuan,
815 Sayan Pathak, Susan M Sunkin, Chinh Dang, Jason W Bohland, Hemant Bokil, Partha P
816 Mitra, Luis Puelles, John Hohmann, David J Anderson, Ed S Lein, Allan R Jones, and Michael
817 Hawrylycz. An anatomic gene expression atlas of the adult mouse brain. Nat Neurosci,
818 12(3):356–362, March 2009.
819 [16] George Paxinos and Keith B.J. Franklin. The Mouse Brain in Stereotaxic Coordinates. Academic
820 Press, 2nd edition, July 2001.
821 [17] A. Schleicher, N. Palomero-Gallagher, P. Morosan, S. Eickhoff, T. Kowalski, K. Vos,
822 K. Amunts, and K. Zilles. Quantitative architectural analysis: a new approach to cortical
823 mapping. Anatomy and Embryology, 210(5):373–386, December 2005.
824 [18] Oliver Schmitt, Lars Hömke, and Lutz Dümbgen. Detection of cortical transition regions utilizing
825 statistical analyses of excess masses. NeuroImage, 19(1):42–63, May 2003.
826 [19] S.B. Serpico and L. Bruzzone. A new search algorithm for feature selection in hyperspec-
827 tral remote sensing images. Geoscience and Remote Sensing, IEEE Transactions on,
828 39(7):1360–1367, 2001.
831 [20] Constance M. Smith, Jacqueline H. Finger, Terry F. Hayamizu, Ingeborg J. McCright, Janan T.
832 Eppig, James A. Kadin, Joel E. Richardson, and Martin Ringwald. The mouse gene expression
833 database (GXD): 2007 update. Nucl. Acids Res., 35(suppl_1):D618–623, 2007.
834 [21] Larry Swanson. Brain Maps: Structure of the Rat Brain. Academic Press, 3rd edition, November
835 2003.
836 [22] Carol L. Thompson, Sayan D. Pathak, Andreas Jeromin, Lydia L. Ng, Cameron R. MacPher-
837 son, Marty T. Mortrud, Allison Cusick, Zackery L. Riley, Susan M. Sunkin, Amy Bernard,
838 Ralph B. Puchalski, Fred H. Gage, Allan R. Jones, Vladimir B. Bajic, Michael J. Hawrylycz,
839 and Ed S. Lein. Genomic anatomy of the hippocampus. Neuron, 60(6):1010–1021, December
840 2008.
841 [23] Pavel Tomancak, Amy Beaton, Richard Weiszmann, Elaine Kwan, ShengQiang Shu,
842 Suzanna E Lewis, Stephen Richards, Michael Ashburner, Volker Hartenstein, Susan E Cel-
843 niker, and Gerald M Rubin. Systematic determination of patterns of gene expression during
844 Drosophila embryogenesis. Genome Biology, 3(12):research0088.1–0088.14, 2002. PMC151190.
845 [24] Shanmugasundaram Venkataraman, Peter Stevenson, Yiya Yang, Lorna Richardson,
846 Nicholas Burton, Thomas P. Perry, Paul Smith, Richard A. Baldock, Duncan R. Davidson,
847 and Jeffrey H. Christiansen. EMAGE Edinburgh Mouse Atlas of Gene Expression: 2008 update.
848 Nucl. Acids Res., 36(suppl_1):D860–865, 2008.
849 [25] Axel Visel, Christina Thaller, and Gregor Eichele. GenePaint.org: an atlas of gene expression
850 patterns in the mouse embryo. Nucl. Acids Res., 32(suppl_1):D552–556, 2004.
851 [26] Robert H Waterston, Kerstin Lindblad-Toh, Ewan Birney, Jane Rogers, Josep F Abril, Pankaj
852 Agarwal, Richa Agarwala, Rachel Ainscough, Marina Alexandersson, Peter An, Stylianos E
853 Antonarakis, John Attwood, Robert Baertsch, Jonathon Bailey, Karen Barlow, Stephan Beck,
854 Eric Berry, Bruce Birren, Toby Bloom, Peer Bork, Marc Botcherby, Nicolas Bray, Michael R
855 Brent, Daniel G Brown, Stephen D Brown, Carol Bult, John Burton, Jonathan Butler,
856 Robert D Campbell, Piero Carninci, Simon Cawley, Francesca Chiaromonte, Asif T Chin-
857 walla, Deanna M Church, Michele Clamp, Christopher Clee, Francis S Collins, Lisa L Cook,
858 Richard R Copley, Alan Coulson, Olivier Couronne, James Cuff, Val Curwen, Tim Cutts,
859 Mark Daly, Robert David, Joy Davies, Kimberly D Delehaunty, Justin Deri, Emmanouil T Der-
860 mitzakis, Colin Dewey, Nicholas J Dickens, Mark Diekhans, Sheila Dodge, Inna Dubchak,
861 Diane M Dunn, Sean R Eddy, Laura Elnitski, Richard D Emes, Pallavi Eswara, Eduardo
862 Eyras, Adam Felsenfeld, Ginger A Fewell, Paul Flicek, Karen Foley, Wayne N Frankel, Lu-
863 cinda A Fulton, Robert S Fulton, Terrence S Furey, Diane Gage, Richard A Gibbs, Gustavo
864 Glusman, Sante Gnerre, Nick Goldman, Leo Goodstadt, Darren Grafham, Tina A Graves,
865 Eric D Green, Simon Gregory, Roderic Guigó, Mark Guyer, Ross C Hardison, David Haussler,
866 Yoshihide Hayashizaki, LaDeana W Hillier, Angela Hinrichs, Wratko Hlavina, Timothy Holzer,
867 Fan Hsu, Axin Hua, Tim Hubbard, Adrienne Hunt, Ian Jackson, David B Jaffe, L Steven John-
868 son, Matthew Jones, Thomas A Jones, Ann Joy, Michael Kamal, Elinor K Karlsson, Donna
869 Karolchik, Arkadiusz Kasprzyk, Jun Kawai, Evan Keibler, Cristyn Kells, W James Kent, An-
870 drew Kirby, Diana L Kolbe, Ian Korf, Raju S Kucherlapati, Edward J Kulbokas, David Kulp,
871 Tom Landers, J P Leger, Steven Leonard, Ivica Letunic, Rosie Levine, Jia Li, Ming Li, Chris-
872 tine Lloyd, Susan Lucas, Bin Ma, Donna R Maglott, Elaine R Mardis, Lucy Matthews, Evan
875 Mauceli, John H Mayer, Megan McCarthy, W Richard McCombie, Stuart McLaren, Kirsten
876 McLay, John D McPherson, Jim Meldrim, Beverley Meredith, Jill P Mesirov, Webb Miller, Tra-
877 cie L Miner, Emmanuel Mongin, Kate T Montgomery, Michael Morgan, Richard Mott, James C
878 Mullikin, Donna M Muzny, William E Nash, Joanne O Nelson, Michael N Nhan, Robert Nicol,
879 Zemin Ning, Chad Nusbaum, Michael J O’Connor, Yasushi Okazaki, Karen Oliver, Emma
880 Overton-Larty, Lior Pachter, Genís Parra, Kymberlie H Pepin, Jane Peterson, Pavel Pevzner,
881 Robert Plumb, Craig S Pohl, Alex Poliakov, Tracy C Ponce, Chris P Ponting, Simon Potter,
882 Michael Quail, Alexandre Reymond, Bruce A Roe, Krishna M Roskin, Edward M Rubin, Alistair G
883 Rust, Ralph Santos, Victor Sapojnikov, Brian Schultz, Jörg Schultz, Matthias S Schwartz,
884 Scott Schwartz, Carol Scott, Steven Seaman, Steve Searle, Ted Sharpe, Andrew Sheridan,
885 Ratna Shownkeen, Sarah Sims, Jonathan B Singer, Guy Slater, Arian Smit, Douglas R Smith,
886 Brian Spencer, Arne Stabenau, Nicole Stange-Thomann, Charles Sugnet, Mikita Suyama,
887 Glenn Tesler, Johanna Thompson, David Torrents, Evanne Trevaskis, John Tromp, Cather-
888 ine Ucla, Abel Ureta-Vidal, Jade P Vinson, Andrew C Von Niederhausern, Claire M Wade,
889 Melanie Wall, Ryan J Weber, Robert B Weiss, Michael C Wendl, Anthony P West, Kris
890 Wetterstrand, Raymond Wheeler, Simon Whelan, Jamey Wierzbowski, David Willey, Sophie
891 Williams, Richard K Wilson, Eitan Winter, Kim C Worley, Dudley Wyman, Shan Yang, Shiaw-
892 Pyng Yang, Evgeny M Zdobnov, Michael C Zody, and Eric S Lander. Initial sequencing and
892 comparative analysis of the mouse genome. Nature, 420(6915):520–62, December 2002.
894 PMID: 12466850.