nsf
view grant.html @ 121:3aeb56c97327
.
author | bshanks@bshanks.dyndns.org |
---|---|
date | Wed Jul 08 05:18:30 2009 -0700 (16 years ago) |
parents | dad49a6f95b6 |
children |
line source
1 Introduction
2 Massive new datasets obtained with techniques such as in situ hybridization (ISH), immunohisto-
3 chemistry, in situ transgenic reporter, microarray voxelation, and others, allow the expression levels
4 of many genes at many locations to be compared. Our goal is to develop automated methods to
5 relate spatial variation in gene expression to anatomy. We want to find marker genes for specific
6 anatomical regions, and also to draw new anatomical maps based on gene expression patterns.
7 We will validate these methods by applying them to 46 anatomical areas within the cerebral cortex,
8 by using the Allen Mouse Brain Atlas coronal dataset (ABA).
9 This project has three primary goals:
10 (1) develop an algorithm to screen spatial gene expression data for combinations of marker
11 genes which selectively target anatomical regions.
12 (2) develop an algorithm to suggest new ways of carving up a structure into anatomically dis-
13 tinct regions, based on spatial patterns in gene expression.
14 (3) adapt our tools for the analysis of multi/hyperspectral imaging data from the Geographic
15 Information Systems (GIS) community.
16 We will create a 2-D “flat map” dataset of the mouse cerebral cortex that contains a flattened
17 version of the Allen Mouse Brain Atlas ISH data, as well as the boundaries of cortical anatomical
18 areas. We will use this dataset to validate the methods developed in (1) and (2). In addition to
19 its use in neuroscience, this dataset will be useful as a sample dataset for the machine learning
20 community.
21 Although our particular application involves the 3D spatial distribution of gene expression, the
22 methods we will develop will generalize to any high-dimensional data over points located in a low-
23 dimensional space. In particular, our methods could be applied to the analysis of multi/hyperspectral
24 imaging data, or alternately to genome-wide sequencing data derived from sets of tissues and dis-
25 ease states.
26 All algorithms that we develop will be implemented in a GPL open-source software toolkit. The
27 toolkit and the datasets will be published and freely available for others to use.
28 __________________
29 Background and related work
30 Cortical anatomy
31 The cortex is divided into areas and layers. Because of the cortical columnar organization, the
32 parcellation of the cortex into areas can be drawn as a 2-D map on the surface of the cortex. In the
33 third dimension, the boundaries between the areas continue downwards into the cortical depth,
34 perpendicular to the surface. The layer boundaries run parallel to the surface. One can picture an
35 area of the cortex as a slice of a six-layered cake (footnote 1).
36 It is known that different cortical areas have distinct roles in both normal functioning and in
37 disease processes, yet there are no known marker genes for most cortical areas. When it is nec-
38 essary to divide a tissue sample into cortical areas, this is a manual process that requires a skilled
39 [Footnote 1: Outside of isocortex, the number of layers varies.]
42 human to combine multiple visual cues and interpret them in the context of their approximate
43 location upon the cortical surface.
44 Even the questions of how many areas should be recognized in cortex, and what their arrange-
45 ment is, are still not completely settled. A proposed division of the cortex into areas is called a
46 cortical map. In the rodent, the lack of a single agreed-upon map can be seen by contrasting the
47 recent maps given by Swanson[22] on the one hand, and Paxinos and Franklin[17] on the other.
48 While the maps are certainly very similar in their general arrangement, significant differences re-
49 main.
50 The Allen Mouse Brain Atlas dataset
51 The Allen Mouse Brain Atlas (ABA) data[14] were produced by doing in-situ hybridization on
52 slices of male, 56-day-old C57BL/6J mouse brains. Pictures were taken of the processed slice,
53 and these pictures were semi-automatically analyzed to create a digital measurement of gene
54 expression levels at each location in each slice. Per slice, cellular spatial resolution is achieved.
55 Using this method, a single physical slice can only be used to measure one single gene; many
56 different mouse brains were needed in order to measure the expression of many genes.
57 Mus musculus is thought to contain about 22,000 protein-coding genes[27]. The ABA contains
58 data on about 20,000 genes in sagittal sections, out of which over 4,000 genes are also measured
59 in coronal sections. Our dataset is derived from only the coronal subset of the ABA (footnote 2). An automated
60 nonlinear alignment procedure located the 2D data from the various slices in a single 3D
61 coordinate system. In the final 3D coordinate system, voxels are cubes measuring 200 microns on a
62 side. There are 67x41x58 = 159,326 voxels, of which 51,533 are in the brain[16]. For each voxel
63 and each gene, the expression energy[14] within that voxel is made available.
64 The ABA is not the only large public spatial gene expression dataset[9][26][6][15][25][4][24][21][3].
65 However, with the exception of the ABA, GenePaint[26], and EMAGE[25], most of the other re-
66 sources have not (yet) extracted the expression intensity from the ISH images and registered the
67 results into a single 3-D space.
68 The remainder of the background section will be divided into three parts, one for each major
69 goal.
70 Goal 1, From Areas to Genes: Given a map of regions, find genes that mark those regions
71 Machine learning terminology: classifiers The task of looking for marker genes for known
72 anatomical regions means that one is looking for a set of genes such that, if the expression level
73 of those genes is known, then the locations of the regions can be inferred.
74 If we define the regions so that they cover the entire anatomical structure to be subdivided,
75 and restrict ourselves to looking at one voxel at a time, we may say that we are using gene
76 expression in each voxel to assign that voxel to the proper area. We call this a classification
77 task, because each voxel is being assigned to a class (namely, its region). An understanding
78 of the relationship between the combination of gene expression levels and the locations of the
79 regions may be expressed as a function. The input to this function is a voxel, along with the gene
80 expression levels within that voxel; the output is the regional identity of the target voxel, that is, the
82 [Footnote 2: The sagittal data do not cover the entire cortex, and also have greater registration error[16]. Genes were selected
83 by the Allen Institute for coronal sectioning based on “classes of known neuroscientific interest... or through post hoc
84 identification of a marked non-ubiquitous expression pattern”[16].]
87 region to which the target voxel belongs. We call this function a classifier. In general, the input to
88 a classifier is called an instance, and the output is called a label (or a class label).
89 Our goal is not to produce a single classifier, but rather to develop an automated method for
90 determining a classifier for any known anatomical structure. Therefore, we seek a procedure by
91 which a gene expression dataset may be analyzed in concert with an anatomical atlas in order to
92 produce a classifier. The initial gene expression dataset used in the construction of the classifier
93 is called training data. In the machine learning literature, this sort of procedure may be thought
94 of as a supervised learning task, defined as a task in which the goal is to learn a mapping from
95 instances to labels, and the training data consists of a set of instances (voxels) for which the labels
96 (regions) are known.
97 Each gene expression level is called a feature, and the selection of which genes (footnote 3) to look at is
98 called feature selection. Feature selection is one component of the task of learning a classifier.
99 One class of feature selection methods assigns some sort of score to each candidate gene.
100 The top-ranked genes are then chosen. Some scoring measures can assign a score to a set of
101 selected genes, not just to a single gene; in this case, a dynamic procedure may be used in which
102 features are added and subtracted from the selected set depending on how much they raise the
103 score. Such procedures are called “stepwise” or “greedy”.
104 Although the classifier itself may only look at the gene expression data within each voxel be-
105 fore classifying that voxel, the algorithm which constructs the classifier may look over the entire
106 dataset. We can categorize score-based feature selection methods depending on how the score
107 is calculated. Often the score calculation consists of assigning a sub-score to each voxel, and
108 then aggregating these sub-scores into a final score. If only information from nearby voxels is
109 used to calculate a voxel’s sub-score, then we say it is a local scoring method. If only information
110 from the voxel itself is used to calculate a voxel’s sub-score, then we say it is a pointwise scoring
111 method.
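The distinction between pointwise and local scoring can be sketched in a few lines of Python. This is only an illustration: the function names, the absolute-difference sub-score, the 4-neighbour window, and the assumption that expression images are scaled to the range [0, 1] are all choices made for this example, not the scoring methods used in the project.

```python
import numpy as np

def pointwise_score(gene, mask):
    """Aggregate per-pixel sub-scores that use only the pixel's own values."""
    sub = 1.0 - np.abs(gene - mask)      # 1 where expression matches the mask
    return float(sub.mean())

def neighbourhood_mean(img):
    """Mean of each pixel and its 4 neighbours (edges are replicated)."""
    p = np.pad(img, 1, mode="edge")
    return (p[1:-1, 1:-1] + p[:-2, 1:-1] + p[2:, 1:-1]
            + p[1:-1, :-2] + p[1:-1, 2:]) / 5.0

def local_score(gene, mask):
    """Aggregate per-pixel sub-scores that also use nearby pixels."""
    sub = 1.0 - np.abs(neighbourhood_mean(gene) - neighbourhood_mean(mask))
    return float(sub.mean())

def rank_genes(genes, mask, score, top_k):
    """Score every candidate gene and return the names of the top_k."""
    scores = {name: score(img, mask) for name, img in genes.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```

Calling `rank_genes(genes, mask, pointwise_score, 2)` with a dictionary of same-shaped arrays returns the two best-scoring gene names; passing `local_score` instead changes only how each sub-score is computed.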
112 Our Strategy for Goal 1
113 Key questions when choosing a learning method are: What are the instances? What are the
114 features? How are the features chosen? Here are four principles that outline our answers to these
115 questions.
116 Principle 1: Combinatorial gene expression
117 It is too much to hope that every anatomical region of interest will be identified by a single
118 gene. For example, in the cortex, there are some areas which are not clearly delineated by any
119 gene included in the ABA coronal dataset. However, at least some of these areas can be delin-
120 eated by looking at combinations of genes (an example of an area for which multiple genes are
121 necessary and sufficient is provided in Preliminary Results, Figure 4). Therefore, each instance
122 should contain multiple features (genes).
123 Principle 2: Only look at combinations of small numbers of genes
124 When the classifier classifies a voxel, it is only allowed to look at the expression of the genes
125 which have been selected as features. The more data that are available to a classifier, the better
126 it can perform. Why not include every gene as a feature? The reason is that we wish to employ the
127 classifier in situations in which it is not feasible to gather data about every gene. For example, if we
129 [Footnote 3: Strictly speaking, the features are gene expression levels, but we’ll call them genes.]
132 want to use the expression of marker genes as a trigger for some regionally-targeted intervention,
133 then our intervention must contain a molecular mechanism to check the expression level of each
134 marker gene before it triggers. It is currently infeasible to design a molecular trigger that checks
135 the level of more than a handful of genes. Therefore, we must select only a few genes as features.
136 The requirement to find combinations of only a small number of genes prevents us from straightforwardly
137 applying many of the simplest techniques from the field of supervised machine learning.
138 In the parlance of machine learning, our task combines feature selection with supervised learning.
139 Principle 3: Use geometry in feature selection
140 When doing feature selection with score-based methods, the simplest thing to do would be
141 to score the performance of each voxel by itself and then combine these scores (pointwise scor-
142 ing). A more powerful approach is to also use information about the geometric relations between
143 each voxel and its neighbors; this requires non-pointwise, local scoring methods. See Preliminary
144 Results, figure 3 for evidence of the complementary nature of pointwise and local scoring methods.
145 Principle 4: Work in 2-D whenever possible
146 There are many anatomical structures which are commonly characterized in terms of a two-
147 dimensional manifold. When it is known that the structure that one is looking for is two-dimensional,
148 the results may be improved by allowing the analysis algorithm to take advantage of this prior
149 knowledge. In addition, it is easier for humans to visualize and work with 2-D data.
150 Goal 2, From Genes to Areas: given gene expression data, discover a map of regions
151 Machine learning terminology: clustering
152 If one is given a dataset consisting merely of instances, with no class labels, then analysis of
153 the dataset is referred to as unsupervised learning in the jargon of machine learning. One thing
154 that you can do with such a dataset is to group instances together. A set of similar instances is
155 called a cluster, and the activity of grouping the data into clusters is called clustering or cluster
156 analysis.
157 The task of deciding how to carve up a structure into anatomical regions can be put into these
158 terms. The instances are once again voxels (or pixels) along with their associated gene expression
159 profiles. We make the assumption that voxels from the same anatomical region have similar gene
160 expression profiles, at least compared to the other regions. This means that clustering voxels is
161 the same as finding potential regions; we seek a partitioning of the voxels into regions, that is, into
162 clusters of voxels with similar gene expression.
163 It is desirable to determine not just one set of regions, but also how these regions relate to
164 each other. The outcome of clustering may be a hierarchical tree of clusters, rather than a single
165 set of clusters which partition the voxels. This is called hierarchical clustering.
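As a minimal sketch of hierarchical clustering on expression profiles, the following naive agglomerative procedure repeatedly merges the two closest clusters and records each merge, so that the merge list describes the tree. Average linkage and the distance 1 minus Pearson correlation are assumptions made for this example, as is the function name `agglomerate`; the source does not fix a particular linkage or similarity measure at this point.

```python
import numpy as np

def agglomerate(profiles, n_clusters):
    """Naive average-linkage agglomerative clustering.

    profiles: array of shape (n_instances, n_genes).
    Returns the final clusters (lists of instance indices) and the
    sequence of merges, which together describe the cluster tree.
    """
    dist = 1.0 - np.corrcoef(profiles)        # instance-by-instance distance
    clusters = [[i] for i in range(len(profiles))]
    merges = []
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Average distance between all members of the two clusters.
                d = dist[np.ix_(clusters[a], clusters[b])].mean()
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters, merges
```

This version is cubic or worse in the number of instances and is meant only to make the idea concrete; a real analysis over tens of thousands of voxels would need a more efficient implementation.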
166 Similarity scores A crucial choice when designing a clustering method is how to measure
167 similarity, across either pairs of instances, or clusters, or both. There is much overlap between
168 scoring methods for feature selection (discussed above under Goal 1) and scoring methods for
169 similarity.
170 Dimensionality reduction In this section, we discuss reducing the length of the per-pixel gene
171 expression feature vector. By “dimension”, we mean the dimension of this vector, not the spatial
174 dimension of the underlying data.
177 Figure 1: Top row: Genes Nfic and A930001M12Rik are the most correlated with area SS
178 (somatosensory cortex). Bottom row: Genes C130038G02Rik and Cacna1i are those with the best
179 fit using logistic regression. Within each picture, the vertical axis roughly corresponds to
180 anterior at the top and posterior at the bottom, and the horizontal axis roughly corresponds to
181 medial at the left and lateral at the right. The red outline is the boundary of region SS.
182 Pixels are colored according to correlation, with red meaning high correlation and blue meaning low.
192 Unlike Goal 1, there is no externally-imposed need to
193 select only a handful of informative genes for inclusion
194 in the instances. However, some clustering algorithms
195 perform better on small numbers of features (footnote 4). There are
196 techniques which “summarize” a larger number of fea-
197 tures using a smaller number of features; these tech-
198 niques go by the name of feature extraction or dimen-
199 sionality reduction. The small set of features that such a
200 technique yields is called the reduced feature set. Note
201 that the features in the reduced feature set do not neces-
202 sarily correspond to genes; each feature in the reduced
203 set may be any function of the set of gene expression
204 levels.
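One common way to obtain a reduced feature set is principal components analysis (PCA). The sketch below is an illustration under that assumption, not a statement of the method the project will adopt; the function name `pca_reduce` and the layout of `features` (one row per pixel, one column per gene) are hypothetical.

```python
import numpy as np

def pca_reduce(features, n_components):
    """Project instances onto their first n_components principal axes.

    features: array of shape (n_instances, n_genes).
    Returns the reduced features and the fraction of variance that
    each kept component explains.
    """
    centered = features - features.mean(axis=0)
    # Rows of vt are the principal axes, ordered by decreasing singular value.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    reduced = centered @ vt[:n_components].T
    explained = singular_values ** 2 / np.sum(singular_values ** 2)
    return reduced, explained[:n_components]
```

Each column of the returned array is a linear combination of all gene expression levels, which illustrates why a reduced feature need not correspond to any single gene.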
205 Clustering genes rather than voxels Although the
206 ultimate goal is to cluster the instances (voxels or pixels),
207 one strategy to achieve this goal is to first cluster the
208 features (genes). There are two ways that clusters of
209 genes could be used.
210 Gene clusters could be used as part of dimensionality
211 reduction: rather than have one feature for each gene,
212 we could have one reduced feature for each gene cluster.
213 Gene clusters could also be used to directly yield a
214 clustering on instances. This is because many genes
215 have an expression pattern which seems to pick out a
216 single, spatially contiguous region. This suggests the following
217 procedure: cluster together genes which pick out
218 similar regions, and then use the more popular common
219 regions as the final clusters. In Preliminary Results,
220 Figure 7, we show that a number of anatomically recog-
221 nized cortical regions, as well as some “superregions” formed by lumping together a few regions,
222 are associated with gene clusters in this fashion.
223 Goal 3: interoperability with multi/hyperspectral imaging analysis software
224 A typical color image associates each pixel with a vector of three values. Multispectral and hyper-
225 spectral images, however, are images which associate each pixel with a vector containing many
226 values. The different positions in the vector correspond to different bands of electromagnetic
227 wavelengths (footnote 5).
228 Some analysis techniques for hyperspectral imaging, especially preprocessing and calibration
229 techniques, make use of the information that the different values captured at each pixel represent
231 [Footnote 4: First, because the number of features in the reduced dataset is less than in the original dataset, the running time of
232 clustering algorithms may be much less. Second, it is thought that some clustering algorithms may give better results
233 on reduced data.]
234 [Footnote 5: In hyperspectral imaging, the bands are adjacent, and the number of different bands is larger. For conciseness, we
235 discuss only hyperspectral imaging, but our methods are also well suited to multispectral imaging with many bands.]
238 adjacent wavelengths of light, which can be combined to make a spectrum. Other analysis tech-
239 niques ignore the interpretation of the values measured, and their relationship to each other within
240 the electromagnetic spectrum, instead treating them blindly as completely separate features.
241 With both hyperspectral imaging and spatial gene expression data, each location in space
242 is associated with more than three numerical feature values. The analysis of hyperspectral im-
243 ages can involve supervised classification and unsupervised learning. Often hyperspectral images
244 come from satellites looking at the Earth, and it is desirable to classify what sort of objects occupy
245 a given area of land. Sometimes detailed training data is not available, in which case it is desirable
246 at least to cluster together those regions of land which contain similar objects.
247 We believe that it may be possible for these two different fields to share some common computational
248 tools. To this end, we intend to make use of existing hyperspectral imaging software when
249 possible, and to develop new software in such a way as to make it easy to use for the purpose
250 of hyperspectral image analysis, as well as for our primary purpose of spatial gene expression
251 data analysis.
252 Related work
254 Figure 2: Gene Pitx2 is selectively underexpressed in area SS.
256 As noted above, the GIS community has developed tools for supervised
257 classification and unsupervised clustering in the context of the analysis
258 of hyperspectral imaging data. One tool is Spectral Python[5]. Spectral
259 Python implements various supervised and unsupervised classification
260 methods, as well as utility functions for loading, viewing, and saving
261 spatial data. Although Spectral Python has feature extraction methods
262 (such as principal components analysis) which create a small set of
263 new features computed based on the original features, it does not have
264 feature selection methods, that is, methods to select a small subset
265 out of the original features (although feature selection in hyperspectral
266 imaging has been investigated by others[20]).
267 There is a substantial body of work on the analysis of gene expression data. Most of this concerns
268 gene expression data which are not fundamentally spatial (footnote 6). Here we review only that work
269 which concerns the automated analysis of spatial gene expression data with respect to anatomy.
270 Relating to Goal 1, GeneAtlas[6] and EMAGE[25] allow the user to construct a search query by
271 demarcating regions and then specifying either the strength of expression or the name of another
272 gene or dataset whose expression pattern is to be matched. Neither GeneAtlas nor EMAGE allows
273 one to search for combinations of genes that define a region in concert.
274 Relating to Goal 2, EMAGE[25] allows the user to select a dataset from among a large number
275 of alternatives, or by running a search query, and then to cluster the genes within that dataset.
276 EMAGE clusters via hierarchical complete linkage clustering.
277 [16] describes AGEA, the “Anatomic Gene Expression Atlas”. AGEA has three components. Gene
278 Finder: The user selects a seed voxel and the system (1) chooses a cluster which includes the
279 seed voxel, (2) yields a list of genes which are overexpressed in that cluster. Correlation: The user
280 selects a seed voxel and the system then shows the user how much correlation there is between
281 the gene expression profile of the seed voxel and every other voxel. Clusters: AGEA includes a
282 preset hierarchical clustering of voxels based on a recursive bifurcation algorithm with correlation
284 [Footnote 6: By “fundamentally spatial” we mean that there is information from a large number of spatial locations indexed by
285 spatial coordinates; not just data which have only a few different locations or which are indexed by anatomical label.]
288 as the similarity metric. AGEA has been applied to the cortex. The paper describes interesting
289 results on the structure of correlations between voxel gene expression profiles within a handful of
290 cortical areas. However, that analysis neither looks for genes marking cortical areas, nor does it
291 suggest a cortical map based on gene expression data. Neither of the other components of AGEA
292 can be applied to cortical areas; AGEA’s Gene Finder cannot be used to find marker genes for the
293 cortical areas; and AGEA’s hierarchical clustering does not produce clusters corresponding to the
294 cortical areas (footnote 7).
297 Figure 3: The top row shows the two genes which (individually) best predict area AUD, according to
298 logistic regression. The bottom row shows the two genes which (individually) best match area AUD,
299 according to gradient similarity. From left to right and top to bottom, the genes are Ssr1, Efcbp1,
300 Ptk7, and Aph1a.
305 [7] looks at the mean expression level of genes within
306 anatomical regions, and applies a Student’s t-test to de-
307 termine whether the mean expression level of a gene is
308 significantly higher in the target region. This relates to
309 our Goal 1. [7] also clusters genes, relating to our Goal
310 2. For each cluster, prototypical spatial expression pat-
311 terns were created by averaging the genes in the cluster.
312 The prototypes were analyzed manually, without cluster-
313 ing voxels.
314 These related works differ from our strategy for Goal
315 1 in at least three ways. First, they find only single genes,
316 whereas we will also look for combinations of genes.
317 Second, they usually can only use overexpression as
318 a marker, whereas we will also search for underexpres-
319 sion. Third, they use scores based on pointwise expres-
320 sion levels, whereas we will also use geometric scores
321 such as gradient similarity (described in Preliminary Re-
322 sults). Figures 4, 2, and 3 in the Preliminary Results
323 section contain evidence that each of our three choices
324 is the right one.
325 [11] describes a technique to find combinations of
326 marker genes to pick out an anatomical region. They
327 use an evolutionary algorithm to evolve logical operators which combine boolean (thresholded)
328 images in order to match a target image. They apply their technique for finding combinations of
329 marker genes for the purpose of clustering genes around a “seed gene”.
330 Relating to our Goal 2, some researchers have attempted to parcellate cortex on the basis of
331 non-gene expression data. For example, [18], [2], [19], and [1] associate spots on the cortex with
332 the radial profile (footnote 8) of response to some stain ([13] uses MRI), extract features from this profile, and
333 then use similarity between surface pixels to cluster.
334 [23] describes an analysis of the anatomy of the hippocampus using the ABA dataset. In
335 addition to manual analysis, two clustering methods were employed, a modified Non-negative
336 Matrix Factorization (NNMF), and a hierarchical bifurcation clustering scheme using correlation as
337 similarity. The paper yielded impressive results, proving the usefulness of computational genomic
339 [Footnote 7: In both cases, the cause is that pairwise correlations between the gene expression of voxels in different areas but
340 the same layer are often stronger than pairwise correlations between the gene expression of voxels in different layers
341 but the same area. Therefore, a pairwise voxel correlation clustering algorithm will tend to create clusters representing
342 cortical layers, not areas.]
343 [Footnote 8: A radial profile is a profile along a line perpendicular to the cortical surface.]
346 anatomy. We have run NNMF on the cortical dataset, and while the results are promising, other
347 methods may perform as well or better (see Preliminary Results, Figure 6).
348 Comparing previous work with our Goal 1, there has been fruitful work on finding marker genes,
349 but only one of the projects explored combinations of marker genes, and none of them compared
350 the results obtained by using different algorithms or scoring methods. Comparing previous work
351 with Goal 2, although some projects obtained clusterings, there has not been much comparison
352 between different algorithms or scoring methods, so it is likely that the best clustering method for
353 this application has not yet been found. Also, none of these projects did a separate dimensionality
354 reduction step before clustering pixels, or tried to cluster genes first in order to guide automated
355 clustering of pixels into spatial regions, or used co-clustering algorithms.
356 In summary, (a) only one of the previous projects explores combinations of marker genes, (b)
357 there has been almost no comparison of different algorithms or scoring methods, and (c) there
358 has been no work on computationally finding marker genes applied to cortical areas, or on finding
359 a hierarchical clustering that will yield a map of cortical areas de novo from gene expression data.
360 Our project is guided by a concrete application with a well-specified criterion of success (how
361 well we can find marker genes for / reproduce the layout of cortical areas), which will provide a
362 solid basis for comparing different methods.
363 _________________________________________________
364 Data sharing plan
367 Figure 4: Upper left: wwc1. Upper right: mtif2. Lower left: wwc1 + mtif2 (each pixel’s value on the
368 lower left is the sum of the corresponding pixels in the upper row).
371 We are enthusiastic about the sharing of methods and
372 data, and at the conclusion of the project, we will make
373 all of our data and computer source code publicly available,
374 either in supplemental attachments to publications,
375 or on a website. The source code will be released under
376 the GNU General Public License. We intend to include a software
377 program which, when run, will take as input the
378 Allen Brain Atlas raw data, and produce as output all
379 numbers and charts found in publications resulting from
380 the project. Source code to be released will include ex-
381 tensions to Caret[8], an existing open-source scientific
382 imaging program, and to Spectral Python. Data to be
383 released will include the 2-D “flat map” dataset. This
384 dataset will be submitted to a machine learning dataset
385 repository.
386 Broader impacts
387 In addition to validating the usefulness of the algorithms,
388 the application of these methods to cortex will produce
389 immediate benefits, because there are currently no known genetic markers for most cortical areas.
390 The method developed in Goal 1 will be applied to each cortical area to find a set of marker
391 genes such that the combinatorial expression pattern of those genes uniquely picks out the target
392 area. Finding marker genes will be useful for drug discovery as well as for experimentation be-
393 cause marker genes can be used to design interventions which selectively target individual cortical
394 areas.
395 The application of the marker gene finding algorithm to the cortex will also support the development
398 of new neuroanatomical methods. In addition to finding markers for each individual cortical
399 area, we will find a small panel of genes that can find many of the areal boundaries at once.
400 The method developed in Goal 2 will provide a genoarchitectonic viewpoint that will contribute
401 to the creation of a better cortical map.
402 The methods we will develop will be applicable to other datasets beyond the brain, and even to
403 datasets outside of biology. The software we develop will be useful for the analysis of hyperspectral
404 images. Our project will draw attention to this area of overlap between neuroscience and GIS, and
405 may lead to future collaborations between these two fields. The cortical dataset that we produce
406 will be useful in the machine learning community as a sample dataset that new algorithms can be
407 tested against. The availability of this sample dataset to the machine learning community may lead
408 to more interest in the design of machine learning algorithms to analyze spatial gene expression.
409 _
410 Preliminary Results
411 Format conversion between SEV, MATLAB, NIFTI
412 We have created software to (politely) download all of the SEV files (footnote 9) from the Allen Institute website.
413 We have also created software to convert between the SEV, MATLAB, and NIFTI file formats,
414 as well as some of Caret’s file formats.
415 Flatmap of cortex
416 We downloaded the ABA data and selected only those voxels which belong to cerebral cortex.
417 We divided the cortex into hemispheres. Using Caret[8], we created a mesh representation of the
418 surface of the selected voxels. For each gene, and for each node of the mesh, we calculated an
419 average of the gene expression of the voxels “underneath” that mesh node. We then flattened
420 the cortex, creating a two-dimensional mesh. We converted this grid into a MATLAB matrix. We
421 manually traced the boundaries of each of 46 cortical areas from the ABA coronal reference atlas
422 slides, and converted this region data into MATLAB format.
423 At this point, the data are in the form of a number of 2-D matrices, all in registration, with the
424 matrix entries representing a grid of points (pixels) over the cortical surface. There is one 2-D
425 matrix whose entries represent the regional label associated with each surface pixel. And for each
426 gene, there is a 2-D matrix whose entries represent the average expression level underneath each
427 surface pixel. The features and the target area are both functions on the surface pixels. They can
428 be referred to as scalar fields over the space of surface pixels; alternately, they can be thought of
429 as images which can be displayed on the flatmapped surface.
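The data layout described above (one label matrix plus one expression matrix per gene, all in registration) can be turned into the instance-by-feature table that most learning methods expect. The sketch below uses NumPy arrays rather than MATLAB matrices; the function name `build_instances` and the convention that label 0 marks pixels outside the cortex are assumptions made for this example.

```python
import numpy as np

def build_instances(gene_images, region_labels, outside=0):
    """Stack registered 2-D gene images into an (n_pixels, n_genes) table.

    gene_images: dict mapping gene name -> 2-D array of expression levels.
    region_labels: 2-D integer array of the same shape, giving the region
        label of each surface pixel (`outside` marks non-cortex pixels;
        this convention is hypothetical).
    """
    names = sorted(gene_images)
    on_cortex = region_labels != outside
    # Boolean indexing keeps pixels in the same row-major order for every gene.
    features = np.stack([gene_images[n][on_cortex] for n in names], axis=1)
    labels = region_labels[on_cortex]
    return names, features, labels
```

Each row of `features` is then one instance (a surface pixel), each column is one feature (a gene), and `labels` holds the corresponding region for supervised methods.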
430 Feature selection and scoring methods
431 Underexpression of a gene can serve as a marker Underexpression of a gene can sometimes
432 serve as a marker. For example, see Figure 2.
433 Correlation Recall that the instances are surface pixels, and consider the problem of attempt-
434 ing to classify each instance as either a member of a particular anatomical area, or not. The target
435 area can be represented as a boolean mask over the surface pixels.
436 We calculated the correlation between each gene and each cortical area. The top row of Figure
437 1 shows the three genes most correlated with area SS.
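As an illustration of the pointwise correlation score described above, the following is a minimal sketch, not the software used for the proposal. It assumes each gene's flatmapped image and the target area's boolean mask have been flattened into equal-length lists in the same pixel order; the gene names and the function names `pearson` and `rank_genes_by_correlation` are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # constant image: correlation is undefined, treated as 0 here
    return cov / math.sqrt(var_x * var_y)

def rank_genes_by_correlation(gene_images, area_mask):
    """Sort gene names from most to least correlated with the area mask.

    gene_images: dict mapping gene name -> flat list of expression values
    area_mask:   flat list of 0/1 values, same pixel order as the images
    """
    scores = {g: pearson(img, area_mask) for g, img in gene_images.items()}
    return sorted(scores, key=scores.get, reverse=True), scores

# Toy 6-pixel "cortex" with made-up genes (illustrative values only).
mask = [1, 1, 1, 0, 0, 0]
genes = {
    "geneA": [0.9, 0.8, 0.7, 0.1, 0.2, 0.1],  # expressed inside the area
    "geneB": [0.5, 0.4, 0.6, 0.5, 0.4, 0.6],  # unrelated to the area
    "geneC": [0.1, 0.2, 0.1, 0.9, 0.8, 0.9],  # underexpressed inside the area
}
ranking, scores = rank_genes_by_correlation(genes, mask)
```

Ranking by the absolute value of the score instead would also surface underexpressed genes such as the hypothetical `geneC`.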
438 Footnote 9: SEV is a sparse format for spatial data. It is the format in which the ABA data is made available.
441 Conditional entropy
442 For each region, we created and ran a forward stepwise procedure which attempted to find
443 pairs of genes such that the conditional entropy of the target area’s boolean mask, conditioned
444 upon the gene pair’s thresholded expression levels, is minimized.
445 This finds pairs of genes which are most informative (at least at these threshold levels) relative
446 to the question, “Is this surface pixel a member of the target area?”. The advantage over linear
447 methods such as logistic regression is that this takes account of arbitrarily nonlinear relationships;
448 for example, if the XOR of two variables predicts the target, conditional entropy would notice,
449 whereas linear methods would not.
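The conditional entropy score for a gene pair can be sketched as follows. This is an illustrative estimate from pixel counts, assuming a single fixed threshold of 0.5 for both genes; the threshold, the toy data, and the function names are assumptions, and the forward stepwise search over pairs is not shown.

```python
import math
from collections import Counter

def conditional_entropy(target, conditions):
    """Estimate H(target | conditions) in bits from co-occurrence counts.

    target:     one label per pixel (here 0/1 mask values)
    conditions: one hashable value per pixel (here thresholded expression)
    """
    n = len(target)
    joint = Counter(zip(conditions, target))
    marginal = Counter(conditions)
    h = 0.0
    for (cond, _label), count in joint.items():
        p_joint = count / n
        p_label_given_cond = count / marginal[cond]
        h -= p_joint * math.log2(p_label_given_cond)
    return h

def threshold(image, level):
    """Binarize an expression image at the given level."""
    return [1 if v > level else 0 for v in image]

def pair_score(gene1, gene2, mask, level=0.5):
    """Conditional entropy of the mask given two thresholded genes."""
    conds = list(zip(threshold(gene1, level), threshold(gene2, level)))
    return conditional_entropy(mask, conds)

# Toy example in which the mask is the XOR of two thresholded genes.
g1 = [0.9, 0.9, 0.1, 0.1] * 2
g2 = [0.9, 0.1, 0.9, 0.1] * 2
mask = [0, 1, 1, 0] * 2
h_pair = pair_score(g1, g2, mask)
h_single = conditional_entropy(mask, threshold(g1, 0.5))
```

In this toy case the pair leaves no uncertainty about the mask (`h_pair` is zero bits), while either gene alone leaves a full bit, which is the XOR situation mentioned in the text.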
450 Gradient similarity We noticed that the previous two scoring methods, which are pointwise,
451 often found genes whose pattern of expression did not look similar in shape to the target region.
452 For this reason we designed a non-pointwise scoring method to detect when a gene had a pattern
453 of expression which looked like it had a boundary whose shape is similar to the shape of the target
454 region. We call this scoring method “gradient similarity”. The formula is:
455 $$\sum_{\mathrm{pixel} \in \mathrm{pixels}} \cos(\angle\nabla_1 - \angle\nabla_2) \cdot \frac{|\nabla_1| + |\nabla_2|}{2} \cdot \frac{\mathrm{pixel\_value}_1 + \mathrm{pixel\_value}_2}{2}$$
459 where ∇1 and ∇2 are the gradient vectors of the two images at the current pixel; ∠∇i is the
460 angle of the gradient of image i at the current pixel; |∇i| is the magnitude of the gradient of image
461 i at the current pixel; and pixel_valuei is the value of the current pixel in image i.
462 The intuition is that we want to see if the borders of the pattern in the two images are similar; if
463 the borders are similar, then both images will have corresponding pixels with large gradients (be-
464 cause this is a border) which are oriented in a similar direction (because the borders are similar).
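A direct transcription of the gradient similarity formula is sketched below. The text does not say how the gradients are estimated, so the finite-difference scheme here is an assumption, as is the treatment of pixels with zero gradient, whose angle is taken to be 0 (the value `math.atan2(0, 0)` returns). Images are assumed to be equal-sized lists of rows.

```python
import math

def gradient(image, r, c):
    """Finite-difference gradient (d/dcol, d/drow) at pixel (r, c).

    Central differences in the interior, one-sided at the image edge.
    """
    rows, cols = len(image), len(image[0])
    r0, r1 = max(r - 1, 0), min(r + 1, rows - 1)
    c0, c1 = max(c - 1, 0), min(c + 1, cols - 1)
    gx = (image[r][c1] - image[r][c0]) / (c1 - c0) if c1 > c0 else 0.0
    gy = (image[r1][c] - image[r0][c]) / (r1 - r0) if r1 > r0 else 0.0
    return gx, gy

def gradient_similarity(img1, img2):
    """Sum over pixels of cos(angle difference) * mean magnitude * mean value."""
    total = 0.0
    for r in range(len(img1)):
        for c in range(len(img1[0])):
            gx1, gy1 = gradient(img1, r, c)
            gx2, gy2 = gradient(img2, r, c)
            angle_diff = math.atan2(gy1, gx1) - math.atan2(gy2, gx2)
            mean_mag = (math.hypot(gx1, gy1) + math.hypot(gx2, gy2)) / 2
            mean_val = (img1[r][c] + img2[r][c]) / 2
            total += math.cos(angle_diff) * mean_mag * mean_val
    return total

# Toy 4x4 images: a target with a vertical border, a gene sharing that
# border, and a gene whose border runs perpendicular to it.
target = [[0, 0, 1, 1] for _ in range(4)]
aligned = [[0, 0, 1, 1] for _ in range(4)]
perpendicular = [[0, 0, 0, 0], [0, 0, 0, 0], [1, 1, 1, 1], [1, 1, 1, 1]]
sim_aligned = gradient_similarity(target, aligned)
sim_perpendicular = gradient_similarity(target, perpendicular)
```

On this toy data the gene whose border lines up with the target's border scores higher than the gene with the perpendicular border.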
465 Gradient similarity provides information complementary to correlation
466 To show that gradient similarity can provide useful information that cannot be detected via
467 pointwise analyses, consider Fig. 3. The pointwise method in the top row identifies genes which
468 express more strongly in AUD than outside of it; its weakness is that this includes many genes
469 which don’t have a salient border matching the areal border. The geometric method identifies
470 genes whose salient expression border seems to partially line up with the border of AUD; its
471 weakness is that this includes genes which don’t express over the entire area.
472 Areas which can be identified by single genes Using gradient similarity, we have already
473 found single genes which roughly identify some areas and groupings of areas. For each of these
474 areas, an example of a gene which roughly identifies it is shown in Figure 5. We have not yet
475 cross-verified these genes in other atlases.
476 In addition, there are a number of areas which are almost identified by single genes: COAa+NLOT
477 (anterior part of cortical amygdalar area, nucleus of the lateral olfactory tract), ENT (entorhinal),
478 ACAv (ventral anterior cingulate), VIS (visual), AUD (auditory).
479 These results validate our expectation that the ABA dataset can be exploited to find marker
480 genes for many cortical areas, while also validating the relevancy of our new scoring method,
481 gradient similarity.
488 Figure 5: From left to right and top
489 to bottom, single genes which roughly
490 identify areas SS (somatosensory pri-
491 mary + supplemental), SSs (supple-
492 mental somatosensory), PIR (piriform),
493 FRP (frontal pole), RSP (retrosplenial),
494 COApm (Cortical amygdalar, poste-
495 rior part, medial zone). Grouping
496 some areas together, we have also
497 found genes to identify the groups
498 ACA+PL+ILA+DP+ORB+MO (anterior
499 cingulate, prelimbic, infralimbic, dor-
500 sal peduncular, orbital, motor), poste-
501 rior and lateral visual (VISpm, VISpl,
502 VISl, VISp; posteromedial, posterolateral,
503 lateral, and primary visual; the
504 posterior and lateral visual area is dis-
505 tinguished from its neighbors, but not
506 from the entire rest of the cortex). The
507 genes are Pitx2, Aldh1a2, Ppfibp1,
508 Slco1a5, Tshz2, Trhr, Col12a1, Ets1.
509 Combinations of multiple genes are useful and necessary for some areas
510 In Figure 4, we give an example of a cortical area
511 which is not marked by any single gene, but which can be
512 identified combinatorially. According to logistic regres-
513 sion, gene wwc1 is the best fit single gene for predicting
514 whether or not a pixel on the cortical surface belongs to
515 the motor area (area MO). The upper-left picture in Fig-
516 ure 4 shows wwc1’s spatial expression pattern over the
517 cortex. The lower-right boundary of MO is represented
518 reasonably well by this gene, but the gene overshoots
519 the upper-left boundary. This flattened 2-D representa-
520 tion does not show it, but the area corresponding to the
521 overshoot is the medial surface of the cortex. MO is only
522 found on the dorsal surface. Gene mtif2 is shown in the
523 upper-right. Mtif2 captures MO’s upper-left boundary, but
524 not its lower-right boundary. Mtif2 does not express very
525 much on the medial surface. By adding together the val-
526 ues at each pixel in these two figures, we get the lower-
527 left image. This combination captures area MO much
528 better than any single gene.
529 This shows that our proposal to develop a method to
530 find combinations of marker genes is both possible and
531 necessary.
532 Multivariate supervised learning
533 Forward stepwise logistic regression Logistic regres-
534 sion is a popular method for predictive modeling of cat-
535 egorical data. As a pilot run, for five cortical areas (SS,
536 AUD, RSP, VIS, and MO), we performed forward step-
537 wise logistic regression to find single genes, pairs of
538 genes, and triplets of genes which predict areal identity.
539 This is an example of feature selection integrated with
540 prediction using a stepwise wrapper. Some of the sin-
541 gle genes found were shown in various figures through-
542 out this document, and Figure 4 shows a combination of
543 genes which was found.
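The stepwise wrapper idea can be sketched as below: at each step, the gene whose addition gives the lowest training log-loss is added to the selected set. This is a simplified illustration rather than the pilot code; the plain gradient-descent fit, the step count, the learning rate, and the use of training loss (instead of held-out error) are all assumptions, and the toy genes are made up to mimic the two-gene situation in Figure 4.

```python
import math

def fit_logistic(rows, labels, steps=2000, lr=0.5):
    """Fit logistic regression by batch gradient descent; return mean log-loss."""
    n, d = len(rows), len(rows[0])
    w, b = [0.0] * d, 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, y in zip(rows, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            for j in range(d):
                grad_w[j] += (p - y) * x[j]
            grad_b += p - y
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    loss = 0.0
    for x, y in zip(rows, labels):
        z = sum(wi * xi for wi, xi in zip(w, x)) + b
        loss += math.log(1.0 + math.exp(-z if y == 1 else z))
    return loss / n

def forward_stepwise(gene_images, mask, max_genes=2):
    """Greedily add the gene that most reduces the training log-loss."""
    selected, best_loss = [], float("inf")
    for _ in range(min(max_genes, len(gene_images))):
        best_gene, best_loss = None, float("inf")
        for gene in gene_images:
            if gene in selected:
                continue
            names = selected + [gene]
            rows = list(zip(*(gene_images[g] for g in names)))
            loss = fit_logistic(rows, mask)
            if loss < best_loss:
                best_gene, best_loss = gene, loss
        selected.append(best_gene)
    return selected, best_loss

# Toy 9-pixel strip: the target is the middle third. geneA overshoots on
# one side, geneB on the other, and geneC is unrelated.
mask = [0, 0, 0, 1, 1, 1, 0, 0, 0]
genes = {
    "geneA": [1, 1, 1, 1, 1, 1, 0, 0, 0],
    "geneB": [0, 0, 0, 1, 1, 1, 1, 1, 1],
    "geneC": [0.5, 0.2, 0.7, 0.4, 0.6, 0.3, 0.5, 0.2, 0.7],
}
selected, final_loss = forward_stepwise(genes, mask, max_genes=2)
```

Neither hypothetical gene alone separates the target, but the selected pair does, so the loss drops sharply at the second step.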
544 SVM on all genes at once
545 In order to see how well one can do when looking at
546 all genes at once, we ran a support vector machine to
547 classify cortical surface pixels based on their gene ex-
548 pression profiles. We achieved classification accuracy of
549 about 81% (5-fold cross-validation). However, as noted above, a classifier that
554 looks at all the genes at once isn’t as practically useful
555 as a classifier that uses only a few genes.
556 Data-driven redrawing of the cortical map
557 We have applied the following dimensionality reduction algorithms to reduce the dimensionality
558 of the gene expression profile associated with each pixel: Principal Components Analysis (PCA),
559 Simple PCA, Multi-Dimensional Scaling, Isomap, Landmark Isomap, Laplacian eigenmaps, Local
560 Tangent Space Alignment, Stochastic Proximity Embedding, Fast Maximum Variance Unfolding,
561 Non-negative Matrix Factorization (NNMF). Space constraints prevent us from showing many of
562 the results, but as a sample, PCA, NNMF, and landmark Isomap are shown in the first, second,
563 and third rows of Figure 6.
564 After applying the dimensionality reduction, we ran clustering algorithms on the reduced data.
565 To date we have tried k-means and spectral clustering. The results of k-means after PCA, NNMF,
566 and landmark Isomap are shown in the bottom row of Figure 6. To compare, the leftmost picture
567 on the bottom row of Figure 6 shows some of the major subdivisions of cortex. These results show
568 that different dimensionality reduction techniques capture different aspects of the data and lead
569 to different clusterings, indicating the utility of our proposal to produce a detailed comparison of
570 these techniques as applied to the domain of genomic anatomy.
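The clustering step can be illustrated with a bare-bones k-means (Lloyd's algorithm) applied to already-reduced pixel profiles. This sketch assumes the dimensionality reduction has been done elsewhere and that each pixel is a short tuple of reduced coordinates; the deterministic choice of the first k points as starting centers is a simplification made for reproducibility, not a recommendation.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iterations=20):
    """Lloyd's algorithm, starting from the first k points as centers."""
    centers = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each pixel goes to its nearest center.
        for i, p in enumerate(points):
            labels[i] = min(range(k), key=lambda c: squared_distance(p, centers[c]))
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers

# Toy reduced profiles for six pixels forming two well-separated groups.
reduced_profiles = [
    (0.0, 0.1), (0.1, 0.0), (0.2, 0.1),
    (5.0, 5.1), (5.1, 4.9), (4.9, 5.0),
]
labels, centers = kmeans(reduced_profiles, k=2)
```

Writing each pixel's cluster label back to its position on the flatmap gives a candidate parcellation image that can be compared with the reference areas.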
571 Many areas are captured by clusters of genes We also clustered the genes using gradient
572 similarity to see if the spatial regions defined by any clusters matched known anatomical regions.
573 Figure 7 shows, for ten sample gene clusters, each cluster’s average expression pattern, com-
574 pared to a known anatomical boundary. This suggests that it is worth attempting to cluster genes,
575 and then to use the results to cluster pixels.
576 Our plan: what remains to be done
577 Flatmap cortex and segment cortical layers
578 There are multiple ways to flatten 3-D data into 2-D. We will compare mappings from manifolds to
579 planes which attempt to preserve size (such as the one used by Caret[8]) with mappings which
580 preserve angle (conformal maps). We will also develop a segmentation algorithm to automatically
581 identify the layer boundaries.
582 Develop algorithms that find genetic markers for anatomical regions
583 Scoring measures and feature selection We will develop scoring methods for evaluating how
584 good individual genes are at marking areas. We will compare pointwise, geometric, and information-
585 theoretic measures. We already developed one entirely new scoring method (gradient similarity),
586 but we may develop more. Scoring measures that we will explore will include the L1 norm, cor-
587 relation, expression energy ratio, conditional entropy, gradient similarity, Jaccard similarity, Dice
588 similarity, Hough transform, and statistical tests such as Student’s t-test, and the Mann-Whitney
589 U test (a non-parametric test). In addition, any classifier induces a scoring measure on genes by
590 taking the prediction error when using that gene to predict the target.
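Two of the overlap measures listed above, Jaccard and Dice similarity, can be sketched for a single gene as follows. Both compare sets of pixels, so the gene's expression image is binarized first; the threshold of 0.5 and the toy values are assumptions.

```python
def binarize(image, level=0.5):
    """Mark pixels whose expression exceeds the given level."""
    return [1 if v > level else 0 for v in image]

def jaccard(a, b):
    """|A and B| / |A or B| for two 0/1 masks of equal length."""
    inter = sum(1 for x, y in zip(a, b) if x and y)
    union = sum(1 for x, y in zip(a, b) if x or y)
    return inter / union if union else 1.0

def dice(a, b):
    """2 |A and B| / (|A| + |B|) for two 0/1 masks of equal length."""
    inter = sum(1 for x, y in zip(a, b) if x and y)
    total = sum(1 for x in a if x) + sum(1 for y in b if y)
    return 2 * inter / total if total else 1.0

# Toy example: the gene covers two of the three target pixels and
# spills over into one pixel outside the target.
area_mask = [1, 1, 1, 0, 0, 0]
expression = [0.9, 0.8, 0.2, 0.7, 0.1, 0.0]
gene_mask = binarize(expression)
jaccard_score = jaccard(gene_mask, area_mask)
dice_score = dice(gene_mask, area_mask)
```

Ranking the genes for an area then amounts to sorting them by whichever score is chosen.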
591 Using some combination of these measures, we will develop a procedure to find single marker
592 genes for anatomical regions: for each cortical area, we will rank the genes by their ability to
593 delineate that area. We will quantitatively compare the list of single genes generated by our
594 method to the lists generated by methods which are mentioned in Related Work.
598 Figure 6: First row: the first 6 reduced dimensions, using PCA. Sec-
599 ond row: the first 6 reduced dimensions, using NNMF. Third row: the
600 first six reduced dimensions, using landmark Isomap. Bottom row:
601 examples of k-means clustering applied to reduced datasets to find
602 7 clusters. Left: 19 of the major subdivisions of the cortex. Sec-
603 ond from left: PCA. Third from left: NNMF. Right: Landmark Isomap.
604 Additional details: In the third and fourth rows, 7 dimensions were
605 found, but only 6 displayed. In the last row: for PCA, 50 dimensions
606 were used; for NNMF, 6 dimensions were used; for landmark Isomap,
607 7 dimensions were used.
608 Some cortical areas have no single marker genes but
609 can be identified by com-
610 binatorial coding. This re-
611 quires multivariate scoring
612 measures and feature se-
613 lection procedures. Many
614 of the measures, such
615 as expression energy, gra-
616 dient similarity, Jaccard,
617 Dice, Hough, Student’s t,
618 and Mann-Whitney U are
619 univariate. We will ex-
620 tend these scoring mea-
621 sures for use in multivariate
622 feature selection, that is,
623 for scoring how well com-
624 binations of genes, rather
625 than individual genes, can
626 distinguish a target area.
627 There are existing mul-
628 tivariate forms of some
629 of the univariate scoring
630 measures, for example,
631 Hotelling’s T-square is a
632 multivariate analog of Stu-
633 dent’s t.
634 We will develop a fea-
635 ture selection procedure for choosing the best small set of marker genes for a given anatomical
636 area. In addition to using the scoring measures that we develop, we will also explore (a) feature
637 selection using a stepwise wrapper over “vanilla” classifiers such as logistic regression, (b) super-
638 vised learning methods such as decision trees which incrementally/greedily combine single gene
639 markers into sets, and (c) supervised learning methods which use soft constraints to minimize
640 number of features used, such as sparse support vector machines (SVMs).
641 Since errors of displacement and of shape may cause genes and target areas to match less
642 than they should, we will consider the robustness of feature selection methods in the presence of
643 error. Some of these methods, such as the Hough transform, are designed to be resistant in the
644 presence of error, but many are not.
645 An area may be difficult to identify because the boundaries are misdrawn in the atlas, or be-
646 cause the shape of the natural domain of gene expression corresponding to the area is different
647 from the shape of the area as recognized by anatomists. We will develop extensions to our pro-
648 cedure which (a) detect when a difficult area could be fit if its boundary were redrawn slightly11,
649 ____________________________________
650 11Not just any redrawing is acceptable, only those which appear to be justified as a natural spatial domain of gene ex-
651 pression by multiple sources of evidence. Interestingly, the need to detect “natural spatial domains of gene expression”
652 in a data-driven fashion means that the methods of Goal 2 might be useful in achieving Goal 1, as well – particularly
655 and (b) detect when a difficult area could be combined with adjacent areas to create a larger area
656 which can be fit.
657 A future publication on the method that we develop in Goal 1 will review the scoring measures
658 and quantitatively compare their performance in order to provide a foundation for future research
659 into methods for finding marker genes. We will measure the robustness of the scoring measures as
660 well as their absolute performance on our dataset.
661 Develop algorithms to suggest a division of a structure into anatomical parts
663 Figure 7: Prototypes corresponding to sample gene clus-
664 ters, clustered by gradient similarity. Region boundaries for
665 the region that most matches each prototype are overlaid.
666 Dimensionality reduction on gene expression profiles We have already
667 described the application of
668 ten dimensionality reduction algo-
669 rithms for the purpose of replacing
670 the gene expression profiles, which
671 are vectors of about 4000 gene ex-
672 pression levels, with a smaller num-
673 ber of features. We plan to further ex-
674 plore and interpret these results, as
675 well as to apply other unsupervised
676 learning algorithms, including inde-
677 pendent components analysis, self-
678 organizing maps, and generative models such as deep Boltzmann machines. We will explore
679 ways to quantitatively compare the relevance of the different dimensionality reduction methods for
680 identifying cortical areal boundaries.
681 Dimensionality reduction on pixels Instead of applying dimensionality reduction to the gene
682 expression profiles, the same techniques can be applied to the pixels. It is possible that
683 the features generated in this way by some dimensionality reduction techniques will directly corre-
684 spond to interesting spatial regions.
685 Clustering and segmentation on pixels We will explore clustering and image segmentation
686 algorithms in order to segment the pixels into regions. We will explore k-means, spectral cluster-
687 ing, gene shaving[10], recursive division clustering, multivariate generalizations of edge detectors,
688 multivariate generalizations of watershed transformations, region growing, active contours, graph
689 partitioning methods, and recursive agglomerative clustering with various linkage functions. These
690 methods can be combined with dimensionality reduction.
691 Clustering on genes We have already shown that the procedure of clustering genes according
692 to gradient similarity, and then creating an averaged prototype of each cluster’s expression pattern,
693 yields some spatial patterns which match cortical areas (Figure 7). We will further explore the
694 clustering of genes.
695 In addition to using the cluster expression prototypes directly to identify spatial regions, this
696 might be useful as a component of dimensionality reduction. For example, one could imagine
697 clustering similar genes and then replacing their expression levels with a single average expression
698 ____________________________________
699 (footnote 11, continued) discriminative dimensionality reduction.
702 level, thereby removing some redundancy from the gene expression profiles. One could then
703 perform clustering on pixels (possibly after a second dimensionality reduction step) in order to
704 identify spatial regions. It remains to be seen whether removal of redundancy would help or hurt
705 the ultimate goal of identifying interesting spatial regions.
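The redundancy-removal idea above, replacing each cluster of similar genes with one averaged expression pattern, can be sketched as follows. The cluster assignments are taken as given (how they are obtained is a separate question), and the gene and cluster names are made up.

```python
def cluster_prototypes(gene_images, gene_clusters):
    """Average the expression images of the genes in each cluster.

    gene_images:   dict mapping gene name -> flat list of expression values
    gene_clusters: dict mapping cluster id -> list of gene names
    Returns a dict mapping cluster id -> averaged image (the prototype).
    """
    prototypes = {}
    for cluster_id, names in gene_clusters.items():
        images = [gene_images[name] for name in names]
        prototypes[cluster_id] = [sum(vals) / len(vals) for vals in zip(*images)]
    return prototypes

# Toy example: three genes over four pixels, grouped into two clusters.
gene_images = {
    "geneA": [1.0, 1.0, 0.0, 0.0],
    "geneB": [0.5, 1.0, 0.0, 0.5],
    "geneC": [0.0, 0.0, 1.0, 1.0],
}
gene_clusters = {"cluster1": ["geneA", "geneB"], "cluster2": ["geneC"]}
prototypes = cluster_prototypes(gene_images, gene_clusters)
# Each pixel's profile now has one entry per cluster instead of one per gene.
reduced_profiles = list(zip(*(prototypes[c] for c in sorted(prototypes))))
```

The resulting per-pixel profiles could then be passed to a pixel clustering step, optionally after a second dimensionality reduction.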
706 Co-clustering We will explore some algorithms which simultaneously incorporate clustering
707 on instances and on features (in our case, pixels and genes), for example, IRM[12]. These are
708 called co-clustering or biclustering algorithms.
709 Compare different methods In order to tell which method is best for genomic anatomy, for
710 each experimental method we will compare the cortical map found by unsupervised learning to a
711 cortical map derived from the Allen Reference Atlas. We will explore various quantitative metrics
712 that purport to measure how similar two clusterings are, such as Jaccard, Rand index, Fowlkes-
713 Mallows, variation of information, Larsen, Van Dongen, and others.
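As one concrete instance of such a metric, the Rand index counts the fraction of pixel pairs on which two clusterings agree, that is, pairs placed together in both or apart in both. The sketch below assumes both maps are given as flat lists of region labels in the same pixel order; label values need not match between the two maps.

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Fraction of pixel pairs treated the same way by both clusterings."""
    agree = 0
    pairs = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        if same_a == same_b:
            agree += 1
        pairs += 1
    return agree / pairs

# Toy example: a reference map and two candidate maps over six pixels.
atlas = [0, 0, 0, 1, 1, 1]
relabeled = [1, 1, 1, 0, 0, 0]   # same partition, different label names
shifted = [0, 0, 1, 1, 1, 1]     # boundary moved by one pixel
ri_relabeled = rand_index(atlas, relabeled)
ri_shifted = rand_index(atlas, shifted)
```

This pairwise loop is quadratic in the number of pixels; for full-resolution maps a contingency-table formulation would be the more practical choice.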
714 Discriminative dimensionality reduction In addition to using a purely data-driven approach
715 to identify spatial regions, it might be useful to see how well the known regions can be recon-
716 structed from a small number of features, even if those features are chosen by using knowledge of
717 the regions. For example, linear discriminant analysis could be used as a dimensionality reduction
718 technique in order to identify a few features which are the best linear summary of gene expression
719 profiles for the purpose of discriminating between regions. This reduced feature set could then be
720 used to cluster pixels into regions. Perhaps the resulting clusters will be similar to the reference
721 atlas, yet more faithful to natural spatial domains of gene expression than the reference atlas is.
722 Apply the new methods to the cortex
723 Using the methods developed in Goal 1, we will present, for each cortical area, a short list of
724 markers to identify that area; and we will also present lists of “panels” of genes that can be used
725 to delineate many areas at once.
726 Because in most cases the ABA coronal dataset only contains one ISH per gene, it is possible
727 for an unrelated combination of genes to seem to identify an area when in fact it is only coinci-
728 dence. There are three ways we will validate our marker genes to guard against this. First, we
729 will confirm that putative combinations of marker genes express the same pattern in both hemi-
730 spheres. Second, we will manually validate our final results on other gene expression datasets
731 such as EMAGE, GeneAtlas, and GENSAT[9]. Third, we may conduct ISH experiments jointly with
732 collaborators to get further data on genes of particular interest.
733 Using the methods developed in Goal 2, we will present one or more hierarchical cortical
734 maps. We will identify and explain how the statistical structure in the gene expression data led to
735 any unexpected or interesting features of these maps, and we will provide biological hypotheses
736 to interpret any new cortical areas, or groupings of areas, which are discovered.
737 Apply the new methods to hyperspectral datasets
738 Our software will be able to read and write file formats common in the hyperspectral imaging
739 community such as Erdas LAN and ENVI, and it will be able to convert between the SEV and NIFTI
740 formats from neuroscience and the ENVI format from GIS. The methods developed in Goals 1 and
741 2 will be implemented either as part of Spectral Python or as a separate tool that interoperates
742 with Spectral Python. The methods will be run on hyperspectral satellite image datasets, and their
743 performance will be compared to existing hyperspectral analysis techniques.
746 References Cited
747 [1] Chris Adamson, Leigh Johnston, Terrie Inder, Sandra Rees, Iven Mareels, and Gary Egan.
748 A tracking approach to parcellation of the cerebral cortex. In Medical Image Computing
749 and Computer-Assisted Intervention MICCAI 2005, volume 3749/2005 of Lecture Notes in
750 Computer Science, pages 294–301. Springer Berlin / Heidelberg, 2005.
751 [2] J. Annese, A. Pitiot, I. D. Dinov, and A. W. Toga. A myelo-architectonic method for the struc-
752 tural classification of cortical areas. NeuroImage, 21(1):15–26, 2004.
753 [3] Tanya Barrett, Dennis B. Troup, Stephen E. Wilhite, Pierre Ledoux, Dmitry Rudnev, Carlos
754 Evangelista, Irene F. Kim, Alexandra Soboleva, Maxim Tomashevsky, and Ron Edgar. NCBI
755 GEO: mining tens of millions of expression profiles–database and tools update. Nucl. Acids
756 Res., 35(suppl_1):D760–765, 2007.
757 [4] George W. Bell, Tatiana A. Yatskievych, and Parker B. Antin. GEISHA, a whole-mount in
758 situ hybridization gene expression screen in chicken embryos. Developmental Dynamics,
759 229(3):677–687, 2004.
760 [5] Thomas Boggs. Spectral python. http://spectralpython.sourceforge.net/, July 2008.
761 [6] James P Carson, Tao Ju, Hui-Chen Lu, Christina Thaller, Mei Xu, Sarah L Pallas, Michael C
762 Crair, Joe Warren, Wah Chiu, and Gregor Eichele. A digital atlas to characterize the mouse
763 brain transcriptome. PLoS Comput Biol, 1(4):e41, 2005.
764 [7] Mark H. Chin, Alex B. Geng, Arshad H. Khan, Wei-Jun Qian, Vladislav A. Petyuk, Jyl Boline,
765 Shawn Levy, Arthur W. Toga, Richard D. Smith, Richard M. Leahy, and Desmond J. Smith.
766 A genome-scale map of expression for a mouse brain section obtained using voxelation.
767 Physiol. Genomics, 30(3):313–321, August 2007.
768 [8] D C Van Essen, H A Drury, J Dickson, J Harwell, D Hanlon, and C H Anderson. An integrated
769 software suite for surface-based analyses of cerebral cortex. Journal of the American Medical
770 Informatics Association: JAMIA, 8(5):443–59, 2001. PMID: 11522765.
771 [9] Shiaoching Gong, Chen Zheng, Martin L. Doughty, Kasia Losos, Nicholas Didkovsky, Uta B.
772 Schambra, Norma J. Nowak, Alexandra Joyner, Gabrielle Leblanc, Mary E. Hatten, and
773 Nathaniel Heintz. A gene expression atlas of the central nervous system based on bacte-
774 rial artificial chromosomes. Nature, 425(6961):917–925, October 2003.
775 [10] Trevor Hastie, Robert Tibshirani, Michael Eisen, Ash Alizadeh, Ronald Levy, Louis Staudt,
776 Wing Chan, David Botstein, and Patrick Brown. ’Gene shaving’ as a method for identifying dis-
777 tinct sets of genes with similar expression patterns. Genome Biology, 1(2):research0003.1–
778 research0003.21, 2000.
779 [11] Jano Hemert and Richard Baldock. Matching spatial regions with combinations of interact-
780 ing gene expression patterns. In Bioinformatics Research and Development, volume 13 of
781 Communications in Computer and Information Science, pages 347–361. Springer Berlin Hei-
782 delberg, 2008.
785 [12] C Kemp, JB Tenenbaum, TL Griffiths, T Yamada, and N Ueda. Learning systems of concepts
786 with an infinite relational model. In AAAI, 2006.
787 [13] F. Kruggel, M. K. Brückner, Th. Arendt, C. J. Wiggins, and D. Y. von Cramon. Analyzing the
788 neocortical fine-structure. Medical Image Analysis, 7(3):251–264, September 2003.
789 [14] Ed S. Lein, Michael J. Hawrylycz, Nancy Ao, Mikael Ayres, Amy Bensinger, Amy Bernard,
790 Andrew F. Boe, Mark S. Boguski, Kevin S. Brockway, Emi J. Byrnes, Lin Chen, Li Chen,
791 Tsuey-Ming Chen, Mei Chi Chin, Jimmy Chong, Brian E. Crook, Aneta Czaplinska, Chinh N.
792 Dang, Suvro Datta, Nick R. Dee, Aimee L. Desaki, Tsega Desta, Ellen Diep, Tim A. Dolbeare,
793 Matthew J. Donelan, Hong-Wei Dong, Jennifer G. Dougherty, Ben J. Duncan, Amanda J.
794 Ebbert, Gregor Eichele, Lili K. Estin, Casey Faber, Benjamin A. Facer, Rick Fields, Shanna R.
795 Fischer, Tim P. Fliss, Cliff Frensley, Sabrina N. Gates, Katie J. Glattfelder, Kevin R. Halverson,
796 Matthew R. Hart, John G. Hohmann, Maureen P. Howell, Darren P. Jeung, Rebecca A. John-
797 son, Patrick T. Karr, Reena Kawal, Jolene M. Kidney, Rachel H. Knapik, Chihchau L. Kuan,
798 James H. Lake, Annabel R. Laramee, Kirk D. Larsen, Christopher Lau, Tracy A. Lemon,
799 Agnes J. Liang, Ying Liu, Lon T. Luong, Jesse Michaels, Judith J. Morgan, Rebecca J. Mor-
800 gan, Marty T. Mortrud, Nerick F. Mosqueda, Lydia L. Ng, Randy Ng, Geralyn J. Orta, Car-
801 oline C. Overly, Tu H. Pak, Sheana E. Parry, Sayan D. Pathak, Owen C. Pearson, Ralph B.
802 Puchalski, Zackery L. Riley, Hannah R. Rockett, Stephen A. Rowland, Joshua J. Royall,
803 Marcos J. Ruiz, Nadia R. Sarno, Katherine Schaffnit, Nadiya V. Shapovalova, Taz Sivisay,
804 Clifford R. Slaughterbeck, Simon C. Smith, Kimberly A. Smith, Bryan I. Smith, Andy J. Sodt,
805 Nick N. Stewart, Kenda-Ruth Stumpf, Susan M. Sunkin, Madhavi Sutram, Angelene Tam,
806 Carey D. Teemer, Christina Thaller, Carol L. Thompson, Lee R. Varnam, Axel Visel, Ray M.
807 Whitlock, Paul E. Wohnoutka, Crissa K. Wolkey, Victoria Y. Wong, Matthew Wood, Murat B.
808 Yaylaoglu, Rob C. Young, Brian L. Youngstrom, Xu Feng Yuan, Bin Zhang, Theresa A. Zwing-
809 man, and Allan R. Jones. Genome-wide atlas of gene expression in the adult mouse brain.
810 Nature, 445(7124):168–176, 2007.
811 [15] Susan Magdaleno, Patricia Jensen, Craig L. Brumwell, Anna Seal, Karen Lehman, Andrew
812 Asbury, Tony Cheung, Tommie Cornelius, Diana M. Batten, Christopher Eden, Shannon M.
813 Norland, Dennis S. Rice, Nilesh Dosooye, Sundeep Shakya, Perdeep Mehta, and Tom Cur-
814 ran. BGEM: an in situ hybridization database of gene expression in the embryonic and adult
815 mouse nervous system. PLoS Biology, 4(4):e86, April 2006.
816 [16] Lydia Ng, Amy Bernard, Chris Lau, Caroline C Overly, Hong-Wei Dong, Chihchau Kuan,
817 Sayan Pathak, Susan M Sunkin, Chinh Dang, Jason W Bohland, Hemant Bokil, Partha P
818 Mitra, Luis Puelles, John Hohmann, David J Anderson, Ed S Lein, Allan R Jones, and Michael
819 Hawrylycz. An anatomic gene expression atlas of the adult mouse brain. Nat Neurosci,
820 12(3):356–362, March 2009.
821 [17] George Paxinos and Keith B.J. Franklin. The Mouse Brain in Stereotaxic Coordinates. Aca-
822 demic Press, 2 edition, July 2001.
823 [18] A. Schleicher, N. Palomero-Gallagher, P. Morosan, S. Eickhoff, T. Kowalski, K. Vos,
824 K. Amunts, and K. Zilles. Quantitative architectural analysis: a new approach to cortical
825 mapping. Anatomy and Embryology, 210(5):373–386, December 2005.
828 [19] Oliver Schmitt, Lars Hömke, and Lutz Dümbgen. Detection of cortical transition regions utilizing
829 statistical analyses of excess masses. NeuroImage, 19(1):42–63, May 2003.
830 [20] S.B. Serpico and L. Bruzzone. A new search algorithm for feature selection in hyperspec-
831 tral remote sensing images. Geoscience and Remote Sensing, IEEE Transactions on,
832 39(7):1360–1367, 2001.
833 [21] Constance M. Smith, Jacqueline H. Finger, Terry F. Hayamizu, Ingeborg J. McCright, Janan T.
834 Eppig, James A. Kadin, Joel E. Richardson, and Martin Ringwald. The mouse gene expres-
835 sion database (GXD): 2007 update. Nucl. Acids Res., 35(suppl_1):D618–623, 2007.
836 [22] Larry Swanson. Brain Maps: Structure of the Rat Brain. Academic Press, 3 edition, November
837 2003.
838 [23] Carol L. Thompson, Sayan D. Pathak, Andreas Jeromin, Lydia L. Ng, Cameron R. MacPher-
839 son, Marty T. Mortrud, Allison Cusick, Zackery L. Riley, Susan M. Sunkin, Amy Bernard,
840 Ralph B. Puchalski, Fred H. Gage, Allan R. Jones, Vladimir B. Bajic, Michael J. Hawrylycz,
841 and Ed S. Lein. Genomic anatomy of the hippocampus. Neuron, 60(6):1010–1021, Decem-
842 ber 2008.
843 [24] Pavel Tomancak, Amy Beaton, Richard Weiszmann, Elaine Kwan, ShengQiang Shu,
844 Suzanna E Lewis, Stephen Richards, Michael Ashburner, Volker Hartenstein, Susan E Cel-
845 niker, and Gerald M Rubin. Systematic determination of patterns of gene expression during
846 Drosophila embryogenesis. Genome Biology, 3(12):research0088.1–research0088.14, 2002. PMC151190.
847 [25] Shanmugasundaram Venkataraman, Peter Stevenson, Yiya Yang, Lorna Richardson,
848 Nicholas Burton, Thomas P. Perry, Paul Smith, Richard A. Baldock, Duncan R. Davidson,
849 and Jeffrey H. Christiansen. EMAGE edinburgh mouse atlas of gene expression: 2008 up-
850 date. Nucl. Acids Res., 36(suppl_1):D860–865, 2008.
851 [26] Axel Visel, Christina Thaller, and Gregor Eichele. GenePaint.org: an atlas of gene expression
852 patterns in the mouse embryo. Nucl. Acids Res., 32(suppl_1):D552–556, 2004.
853 [27] Robert H Waterston, Kerstin Lindblad-Toh, Ewan Birney, Jane Rogers, Josep F Abril, Pankaj
854 Agarwal, Richa Agarwala, Rachel Ainscough, Marina Alexandersson, Peter An, Stylianos E
855 Antonarakis, John Attwood, Robert Baertsch, Jonathon Bailey, Karen Barlow, Stephan Beck,
856 Eric Berry, Bruce Birren, Toby Bloom, Peer Bork, Marc Botcherby, Nicolas Bray, Michael R
857 Brent, Daniel G Brown, Stephen D Brown, Carol Bult, John Burton, Jonathan Butler,
858 Robert D Campbell, Piero Carninci, Simon Cawley, Francesca Chiaromonte, Asif T Chin-
859 walla, Deanna M Church, Michele Clamp, Christopher Clee, Francis S Collins, Lisa L Cook,
860 Richard R Copley, Alan Coulson, Olivier Couronne, James Cuff, Val Curwen, Tim Cutts,
861 Mark Daly, Robert David, Joy Davies, Kimberly D Delehaunty, Justin Deri, Emmanouil T Der-
862 mitzakis, Colin Dewey, Nicholas J Dickens, Mark Diekhans, Sheila Dodge, Inna Dubchak,
863 Diane M Dunn, Sean R Eddy, Laura Elnitski, Richard D Emes, Pallavi Eswara, Eduardo
864 Eyras, Adam Felsenfeld, Ginger A Fewell, Paul Flicek, Karen Foley, Wayne N Frankel, Lu-
865 cinda A Fulton, Robert S Fulton, Terrence S Furey, Diane Gage, Richard A Gibbs, Gustavo
866 Glusman, Sante Gnerre, Nick Goldman, Leo Goodstadt, Darren Grafham, Tina A Graves,
867 Eric D Green, Simon Gregory, Roderic Guigó, Mark Guyer, Ross C Hardison, David Haussler,
870 Yoshihide Hayashizaki, LaDeana W Hillier, Angela Hinrichs, Wratko Hlavina, Timothy Holzer,
871 Fan Hsu, Axin Hua, Tim Hubbard, Adrienne Hunt, Ian Jackson, David B Jaffe, L Steven John-
872 son, Matthew Jones, Thomas A Jones, Ann Joy, Michael Kamal, Elinor K Karlsson, Donna
873 Karolchik, Arkadiusz Kasprzyk, Jun Kawai, Evan Keibler, Cristyn Kells, W James Kent, An-
874 drew Kirby, Diana L Kolbe, Ian Korf, Raju S Kucherlapati, Edward J Kulbokas, David Kulp,
875 Tom Landers, J P Leger, Steven Leonard, Ivica Letunic, Rosie Levine, Jia Li, Ming Li, Chris-
876 tine Lloyd, Susan Lucas, Bin Ma, Donna R Maglott, Elaine R Mardis, Lucy Matthews, Evan
877 Mauceli, John H Mayer, Megan McCarthy, W Richard McCombie, Stuart McLaren, Kirsten
878 McLay, John D McPherson, Jim Meldrim, Beverley Meredith, Jill P Mesirov, Webb Miller, Tra-
879 cie L Miner, Emmanuel Mongin, Kate T Montgomery, Michael Morgan, Richard Mott, James C
880 Mullikin, Donna M Muzny, William E Nash, Joanne O Nelson, Michael N Nhan, Robert Nicol,
881 Zemin Ning, Chad Nusbaum, Michael J O’Connor, Yasushi Okazaki, Karen Oliver, Emma
882 Overton-Larty, Lior Pachter, Genís Parra, Kymberlie H Pepin, Jane Peterson, Pavel Pevzner,
883 Robert Plumb, Craig S Pohl, Alex Poliakov, Tracy C Ponce, Chris P Ponting, Simon Potter,
884 Michael Quail, Alexandre Reymond, Bruce A Roe, Krishna M Roskin, Edward M Rubin,
885 Alistair G Rust, Ralph Santos, Victor Sapojnikov, Brian Schultz, Jörg Schultz, Matthias S Schwartz,
886 Scott Schwartz, Carol Scott, Steven Seaman, Steve Searle, Ted Sharpe, Andrew Sheridan,
887 Ratna Shownkeen, Sarah Sims, Jonathan B Singer, Guy Slater, Arian Smit, Douglas R Smith,
888 Brian Spencer, Arne Stabenau, Nicole Stange-Thomann, Charles Sugnet, Mikita Suyama,
889 Glenn Tesler, Johanna Thompson, David Torrents, Evanne Trevaskis, John Tromp, Cather-
890 ine Ucla, Abel Ureta-Vidal, Jade P Vinson, Andrew C Von Niederhausern, Claire M Wade,
891 Melanie Wall, Ryan J Weber, Robert B Weiss, Michael C Wendl, Anthony P West, Kris
892 Wetterstrand, Raymond Wheeler, Simon Whelan, Jamey Wierzbowski, David Willey, Sophie
893 Williams, Richard K Wilson, Eitan Winter, Kim C Worley, Dudley Wyman, Shan Yang, Shiaw-
894 Pyng Yang, Evgeny M Zdobnov, Michael C Zody, and Eric S Lander. Initial sequencing and
895 comparative analysis of the mouse genome. Nature, 420(6915):520–62, December 2002.
896 PMID: 12466850.