nsf: e61f822e0375 grant.html

view grant.html @ 119:e61f822e0375

author	bshanks@bshanks.dyndns.org
date	Tue Jul 07 14:57:48 2009 -0700
parents	ffa1390e4f39
children	94284c1ca133

1 Introduction

2 Massive new datasets obtained with techniques such as in situ hybridization (ISH), immunohisto-

3 chemistry, in situ transgenic reporter, microarray voxelation, and others, allow the expression levels

4 of many genes at many locations to be compared. Our goal is to develop automated methods to

5 relate spatial variation in gene expression to anatomy. We want to find marker genes for specific

6 anatomical regions, and also to draw new anatomical maps based on gene expression patterns.

7 We will validate these methods by applying them to 46 anatomical areas within the cerebral cortex,

8 by using the Allen Mouse Brain Atlas coronal dataset (ABA).

9 This project has three primary goals:

10 (1) develop an algorithm to screen spatial gene expression data for combinations of marker

11 genes which selectively target anatomical regions.

12 (2) develop an algorithm to suggest new ways of carving up a structure into anatomically dis-

13 tinct regions, based on spatial patterns in gene expression.

14 (3) adapt our tools for the analysis of multi/hyperspectral imaging data from the Geographic

15 Information Systems (GIS) community.

16 We will create a 2-D “flat map” dataset of the mouse cerebral cortex that contains a flattened

17 version of the Allen Mouse Brain Atlas ISH data, as well as the boundaries of cortical anatomical

18 areas. We will use this dataset to validate the methods developed in (1) and (2). In addition to

19 its use in neuroscience, this dataset will be useful as a sample dataset for the machine learning

20 community.

21 Although our particular application involves the 3D spatial distribution of gene expression, the

22 methods we will develop will generalize to any high-dimensional data over points located in a low-

23 dimensional space. In particular, our methods could be applied to the analysis of multi/hyperspectral

24 imaging data, or alternately to genome-wide sequencing data derived from sets of tissues and dis-

25 ease states.

26 All algorithms that we develop will be implemented in a GPL open-source software toolkit. The

27 toolkit and the datasets will be published and freely available for others to use.

28 __________________

29 Background and related work

30 Cortical anatomy

31 The cortex is divided into areas and layers. Because of the cortical columnar organization, the

32 parcellation of the cortex into areas can be drawn as a 2-D map on the surface of the cortex. In the

33 third dimension, the boundaries between the areas continue downwards into the cortical depth,

34 perpendicular to the surface. The layer boundaries run parallel to the surface. One can picture an

35 area of the cortex as a slice of a six-layered cake (footnote 1).

36 It is known that different cortical areas have distinct roles in both normal functioning and in

37 disease processes, yet there are no known marker genes for most cortical areas. When it is nec-

38 essary to divide a tissue sample into cortical areas, this is a manual process that requires a skilled

39 Footnote 1: Outside of isocortex, the number of layers varies.

42 human to combine multiple visual cues and interpret them in the context of their approximate

43 location upon the cortical surface.

44 Even the questions of how many areas should be recognized in cortex, and what their arrange-

45 ment is, are still not completely settled. A proposed division of the cortex into areas is called a

46 cortical map. In the rodent, the lack of a single agreed-upon map can be seen by contrasting the

47 recent maps given by Swanson[21] on the one hand, and Paxinos and Franklin[16] on the other.

48 While the maps are certainly very similar in their general arrangement, significant differences re-

49 main.

50 The Allen Mouse Brain Atlas dataset

51 The Allen Mouse Brain Atlas (ABA) data[13] were produced by doing in-situ hybridization on

52 slices of male, 56-day-old C57BL/6J mouse brains. Pictures were taken of the processed slice,

53 and these pictures were semi-automatically analyzed to create a digital measurement of gene

54 expression levels at each location in each slice. Per slice, cellular spatial resolution is achieved.

55 Using this method, a single physical slice can only be used to measure one single gene; many

56 different mouse brains were needed in order to measure the expression of many genes.

57 Mus musculus is thought to contain about 22,000 protein-coding genes[26]. The ABA contains

58 data on about 20,000 genes in sagittal sections, out of which over 4,000 genes are also measured

59 in coronal sections. Our dataset is derived from only the coronal subset of the ABA (footnote 2). An

60 automated nonlinear alignment procedure located the 2D data from the various slices in a single 3D

61 coordinate system. In the final 3D coordinate system, voxels are cubes with 200 microns on a

62 side. There are 67x41x58 = 159,326 voxels, of which 51,533 are in the brain[15]. For each voxel

63 and each gene, the expression energy[13] within that voxel is made available.

64 The ABA is not the only large public spatial gene expression dataset[8][25][5][14][24][4][23][20][3].

65 However, with the exception of the ABA, GenePaint[25], and EMAGE[24], most of the other re-

66 sources have not (yet) extracted the expression intensity from the ISH images and registered the

67 results into a single 3-D space.

68 The remainder of the background section will be divided into three parts, one for each major

69 goal.

70 Goal 1, From Areas to Genes: Given a map of regions, find genes that mark those regions

71 Machine learning terminology: classifiers. The task of looking for marker genes for known

72 anatomical regions means that one is looking for a set of genes such that, if the expression level

73 of those genes is known, then the locations of the regions can be inferred.

74 If we define the regions so that they cover the entire anatomical structure to be subdivided,

75 and restrict ourselves to looking at one voxel at a time, we may say that we are using gene

76 expression in each voxel to assign that voxel to the proper area. We call this a classification

77 task, because each voxel is being assigned to a class (namely, its region). An understanding

78 of the relationship between the combination of gene expression levels and the locations of the

79 regions may be expressed as a function. The input to this function is a voxel, along with the gene

80 expression levels within that voxel; the output is the regional identity of the target voxel, that is, the

81 ____________________________________

82 Footnote 2: The sagittal data do not cover the entire cortex, and also have greater registration error[15]. Genes were selected

83 by the Allen Institute for coronal sectioning based on, “classes of known neuroscientific interest... or through post hoc

84 identification of a marked non-ubiquitous expression pattern”[15].

87 region to which the target voxel belongs. We call this function a classifier. In general, the input to

88 a classifier is called an instance, and the output is called a label (or a class label).

89 Our goal is not to produce a single classifier, but rather to develop an automated method for

90 determining a classifier for any known anatomical structure. Therefore, we seek a procedure by

91 which a gene expression dataset may be analyzed in concert with an anatomical atlas in order to

92 produce a classifier. The initial gene expression dataset used in the construction of the classifier

93 is called training data. In the machine learning literature, this sort of procedure may be thought

94 of as a supervised learning task, defined as a task in which the goal is to learn a mapping from

95 instances to labels, and the training data consists of a set of instances (voxels) for which the labels

96 (regions) are known.
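
The terminology above (instance, label, training data, classifier) can be illustrated with a minimal sketch. It uses a nearest-centroid rule, which is chosen here only because it is short; it is not the method proposed in this document. The gene names, expression values, and the function name `train_nearest_centroid` are hypothetical.

```python
from typing import Callable, Dict, List

# An instance is one voxel's expression levels keyed by gene name (toy values).
Instance = Dict[str, float]
# A classifier maps an instance to a label (the name of a region).
Classifier = Callable[[Instance], str]


def train_nearest_centroid(instances: List[Instance], labels: List[str]) -> Classifier:
    """Supervised learning sketch: build a classifier from labeled voxels."""
    genes = sorted(instances[0])
    sums: Dict[str, List[float]] = {}
    counts: Dict[str, int] = {}
    for inst, lab in zip(instances, labels):
        acc = sums.setdefault(lab, [0.0] * len(genes))
        for i, g in enumerate(genes):
            acc[i] += inst[g]
        counts[lab] = counts.get(lab, 0) + 1
    # One mean expression profile (centroid) per region.
    centroids = {lab: [s / counts[lab] for s in acc] for lab, acc in sums.items()}

    def classify(inst: Instance) -> str:
        def dist2(centroid: List[float]) -> float:
            return sum((inst[g] - centroid[i]) ** 2 for i, g in enumerate(genes))
        # Assign the voxel to the region whose centroid is closest.
        return min(centroids, key=lambda lab: dist2(centroids[lab]))

    return classify


# Training data: voxels whose regions are already known (hypothetical values).
training_voxels = [
    {"geneA": 0.9, "geneB": 0.1},
    {"geneA": 0.8, "geneB": 0.2},
    {"geneA": 0.1, "geneB": 0.9},
    {"geneA": 0.2, "geneB": 0.7},
]
training_labels = ["SS", "SS", "AUD", "AUD"]

classifier = train_nearest_centroid(training_voxels, training_labels)
predicted = classifier({"geneA": 0.85, "geneB": 0.15})
```

The procedure `train_nearest_centroid` plays the role of the learning method, and the function it returns plays the role of the classifier.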

97 Each gene expression level is called a feature, and the selection of which genes (footnote 3) to look at is

98 called feature selection. Feature selection is one component of the task of learning a classifier.

99 One class of feature selection methods assigns some sort of score to each candidate gene.

100 The top-ranked genes are then chosen. Some scoring measures can assign a score to a set of

101 selected genes, not just to a single gene; in this case, a dynamic procedure may be used in which

102 features are added and subtracted from the selected set depending on how much they raise the

103 score. Such procedures are called “stepwise” or “greedy”.

104 Although the classifier itself may only look at the gene expression data within each voxel be-

105 fore classifying that voxel, the algorithm which constructs the classifier may look over the entire

106 dataset. We can categorize score-based feature selection methods depending on how the score

107 is calculated. Often the score calculation consists of assigning a sub-score to each voxel, and

108 then aggregating these sub-scores into a final score. If only information from nearby voxels is

109 used to calculate a voxel’s sub-score, then we say it is a local scoring method. If only information

110 from the voxel itself is used to calculate a voxel’s sub-score, then we say it is a pointwise scoring

111 method.
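
The distinction between pointwise and local scoring can be made concrete with a small sketch. Both scoring functions below are illustrative assumptions and are not the specific scores used in this proposal: the pointwise score is a mean negative squared error, and the local score compares finite differences between each pixel and its right and lower neighbors.

```python
from typing import List

Grid = List[List[float]]


def pointwise_score(expr: Grid, mask: Grid) -> float:
    """Each pixel's sub-score uses only that pixel: negative squared error."""
    subs = [-(e - m) ** 2
            for expr_row, mask_row in zip(expr, mask)
            for e, m in zip(expr_row, mask_row)]
    return sum(subs) / len(subs)


def local_score(expr: Grid, mask: Grid) -> float:
    """Each pixel's sub-score also uses its right and lower neighbors.

    It rewards changes in expression that line up with changes in the mask.
    """
    subs = []
    rows, cols = len(expr), len(expr[0])
    for r in range(rows - 1):
        for c in range(cols - 1):
            dx_e = expr[r][c + 1] - expr[r][c]
            dy_e = expr[r + 1][c] - expr[r][c]
            dx_m = mask[r][c + 1] - mask[r][c]
            dy_m = mask[r + 1][c] - mask[r][c]
            subs.append(dx_e * dx_m + dy_e * dy_m)
    return sum(subs) / len(subs)


# Toy target area: the right half of a 3x4 grid.
mask = [[0.0, 0.0, 1.0, 1.0] for _ in range(3)]
# Hypothetical gene with the right boundary but a large constant offset.
gene_offset = [[v + 5.0 for v in row] for row in mask]
# Hypothetical gene with uniform expression everywhere.
gene_flat = [[0.5] * 4 for _ in range(3)]
```

Here the pointwise score prefers `gene_flat`, while the local score prefers `gene_offset`, whose boundary coincides with that of the target area. This illustrates how the two kinds of score can be complementary.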

112 Our Strategy for Goal 1

113 Key questions when choosing a learning method are: What are the instances? What are the

114 features? How are the features chosen? Here are four principles that outline our answers to these

115 questions.

116 Principle 1: Combinatorial gene expression

117 It is too much to hope that every anatomical region of interest will be identified by a single

118 gene. For example, in the cortex, there are some areas which are not clearly delineated by any

119 gene included in the ABA coronal dataset. However, at least some of these areas can be delin-

120 eated by looking at combinations of genes (an example of an area for which multiple genes are

121 necessary and sufficient is provided in Preliminary Results, Figure 4). Therefore, each instance

122 should contain multiple features (genes).

123 Principle 2: Only look at combinations of small numbers of genes

124 When the classifier classifies a voxel, it is only allowed to look at the expression of the genes

125 which have been selected as features. The more data that are available to a classifier, the better

126 that it can do. Why not include every gene as a feature? The reason is that we wish to employ the

127 classifier in situations in which it is not feasible to gather data about every gene. For example, if we

128 ____________________________________

129 Footnote 3: Strictly speaking, the features are gene expression levels, but we’ll call them genes.

132 want to use the expression of marker genes as a trigger for some regionally-targeted intervention,

133 then our intervention must contain a molecular mechanism to check the expression level of each

134 marker gene before it triggers. It is currently infeasible to design a molecular trigger that checks

135 the level of more than a handful of genes. Therefore, we must select only a few genes as features.

136 The requirement to find combinations of only a small number of genes prevents us from

137 straightforwardly applying many of the simplest techniques from the field of supervised machine learning.

138 In the parlance of machine learning, our task combines feature selection with supervised learning.

139 Principle 3: Use geometry in feature selection

140 When doing feature selection with score-based methods, the simplest thing to do would be

141 to score the performance of each voxel by itself and then combine these scores (pointwise scor-

142 ing). A more powerful approach is to also use information about the geometric relations between

143 each voxel and its neighbors; this requires non-pointwise, local scoring methods. See Preliminary

144 Results, figure 3 for evidence of the complementary nature of pointwise and local scoring methods.

145 Principle 4: Work in 2-D whenever possible

146 There are many anatomical structures which are commonly characterized in terms of a two-

147 dimensional manifold. When it is known that the structure that one is looking for is two-dimensional,

148 the results may be improved by allowing the analysis algorithm to take advantage of this prior

149 knowledge. In addition, it is easier for humans to visualize and work with 2-D data.

150 Goal 2, From Genes to Areas: given gene expression data, discover a map of regions

151 Machine learning terminology: clustering

152 If one is given a dataset consisting merely of instances, with no class labels, then analysis of

153 the dataset is referred to as unsupervised learning in the jargon of machine learning. One thing

154 that you can do with such a dataset is to group instances together. A set of similar instances is

155 called a cluster, and the activity of grouping the data into clusters is called clustering or cluster

156 analysis.

157 The task of deciding how to carve up a structure into anatomical regions can be put into these

158 terms. The instances are once again voxels (or pixels) along with their associated gene expression

159 profiles. We make the assumption that voxels from the same anatomical region have similar gene

160 expression profiles, at least compared to the other regions. This means that clustering voxels is

161 the same as finding potential regions; we seek a partitioning of the voxels into regions, that is, into

162 clusters of voxels with similar gene expression.

163 It is desirable to determine not just one set of regions, but also how these regions relate to

164 each other. The outcome of clustering may be a hierarchical tree of clusters, rather than a single

165 set of clusters which partition the voxels. This is called hierarchical clustering.
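
As a minimal sketch of hierarchical clustering, the following agglomerative procedure repeatedly merges the two closest clusters until a single tree remains. The choice of average linkage on squared Euclidean distance, the function names, and the toy expression profiles are assumptions made for illustration only.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two expression profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def leaves(tree):
    """Indices of all voxels under a tree node (an int or a pair of subtrees)."""
    if isinstance(tree, int):
        return [tree]
    return leaves(tree[0]) + leaves(tree[1])


def hierarchical_cluster(profiles):
    """Return a nested-tuple tree whose leaves are voxel indices."""
    clusters = list(range(len(profiles)))

    def linkage(t1, t2):
        # Average linkage: mean distance over all cross-cluster voxel pairs.
        pairs = [(i, j) for i in leaves(t1) for j in leaves(t2)]
        return sum(sq_dist(profiles[i], profiles[j]) for i, j in pairs) / len(pairs)

    while len(clusters) > 1:
        candidates = [(linkage(clusters[i], clusters[j]), i, j)
                      for i in range(len(clusters))
                      for j in range(i + 1, len(clusters))]
        _, a, b = min(candidates)
        merged = (clusters[a], clusters[b])
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return clusters[0]


# Hypothetical profiles for four voxels over two genes.
profiles = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
tree = hierarchical_cluster(profiles)
```

Cutting the resulting tree at its top level yields two clusters; cutting it lower yields a finer partition, which is how a hierarchy relates the regions to each other.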

166 Similarity scores. A crucial choice when designing a clustering method is how to measure

167 similarity, across either pairs of instances, or clusters, or both. There is much overlap between

168 scoring methods for feature selection (discussed above under Goal 1) and scoring methods for

169 similarity.

170 Dimensionality reduction. In this section, we discuss reducing the length of the per-pixel gene

171 expression feature vector. By “dimension”, we mean the dimension of this vector, not the spatial

174 dimension of the underlying data.

177 Figure 1: Top row: Genes Nfic and A930001M12Rik are the most correlated with area SS (somatosensory cortex). Bottom row: Genes C130038G02Rik and Cacna1i are those with the best fit using logistic regression. Within each picture, the vertical axis roughly corresponds to anterior at the top and posterior at the bottom, and the horizontal axis roughly corresponds to medial at the left and lateral at the right. The red outline is the boundary of region SS. Pixels are colored according to correlation, with red meaning high correlation and blue meaning low.

192 Unlike Goal 1, there is no externally-imposed need to

193 select only a handful of informative genes for inclusion

194 in the instances. However, some clustering algorithms

195 perform better on small numbers of features (footnote 4). There are

196 techniques which “summarize” a larger number of fea-

197 tures using a smaller number of features; these tech-

198 niques go by the name of feature extraction or dimen-

199 sionality reduction. The small set of features that such a

200 technique yields is called the reduced feature set. Note

201 that the features in the reduced feature set do not neces-

202 sarily correspond to genes; each feature in the reduced

203 set may be any function of the set of gene expression

204 levels.

205 Clustering genes rather than voxels. Although the

206 ultimate goal is to cluster the instances (voxels or pixels),

207 one strategy to achieve this goal is to first cluster the

208 features (genes). There are two ways that clusters of

209 genes could be used.

210 Gene clusters could be used as part of dimensionality

211 reduction: rather than have one feature for each gene,

212 we could have one reduced feature for each gene cluster.

213 Gene clusters could also be used to directly yield a

214 clustering on instances. This is because many genes

215 have an expression pattern which seems to pick out a

216 single, spatially contiguous region. This suggests the fol-

217 lowing procedure: cluster together genes which pick out

218 similar regions, and then use the more popular

219 common regions as the final clusters. In Preliminary Results,

220 Figure 7, we show that a number of anatomically recog-

221 nized cortical regions, as well as some “superregions” formed by lumping together a few regions,

222 are associated with gene clusters in this fashion.
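
The first of the two uses of gene clusters described above, one reduced feature per gene cluster, might look like the following sketch. Taking the mean over the genes in each cluster is an assumption made here for illustration; any other summary function could be substituted. The gene names, values, and the function name `reduce_features` are hypothetical.

```python
def reduce_features(expression, gene_clusters):
    """Summarize each gene cluster by one reduced feature per pixel.

    expression: dict mapping gene name -> list of expression values, one per pixel.
    gene_clusters: list of clusters, each a list of gene names.
    Returns a list with one reduced feature (list of per-pixel means) per cluster.
    """
    reduced = []
    for cluster in gene_clusters:
        n_pixels = len(expression[cluster[0]])
        reduced.append([
            sum(expression[g][p] for g in cluster) / len(cluster)
            for p in range(n_pixels)
        ])
    return reduced


# Hypothetical expression of three genes over three pixels.
expression = {
    "geneA": [1.0, 0.0, 0.0],
    "geneB": [3.0, 0.0, 2.0],
    "geneC": [0.0, 4.0, 0.0],
}
gene_clusters = [["geneA", "geneB"], ["geneC"]]
reduced = reduce_features(expression, gene_clusters)
```

Three gene features are reduced to two cluster features; note that the first reduced feature does not correspond to any single gene.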

223 Goal 3: interoperability with multi/hyperspectral imaging analysis software

224 A typical color image associates each pixel with a vector of three values. Multispectral and hyper-

225 spectral images, however, are images which associate each pixel with a vector containing many

226 values. The different positions in the vector correspond to different bands of electromagnetic

227 wavelengths (footnote 5).

228 Some analysis techniques for hyperspectral imaging, especially preprocessing and calibration

229 techniques, make use of the information that the different values captured at each pixel represent

230 ____________________________________

231 Footnote 4: First, because the number of features in the reduced dataset is less than in the original dataset, the running time of

232 clustering algorithms may be much less. Second, it is thought that some clustering algorithms may give better results

233 on reduced data.

234 Footnote 5: In hyperspectral imaging, the bands are adjacent, and the number of different bands is larger. For conciseness, we

235 discuss only hyperspectral imaging, but our methods are also well suited to multispectral imaging with many bands.

238 adjacent wavelengths of light, which can be combined to make a spectrum. Other analysis tech-

239 niques ignore the interpretation of the values measured, and their relationship to each other within

240 the electromagnetic spectrum, instead treating them blindly as completely separate features.

241 With both hyperspectral imaging and spatial gene expression data, each location in space

242 is associated with more than three numerical feature values. The analysis of hyperspectral im-

243 ages can involve supervised classification and unsupervised learning. Often hyperspectral images

244 come from satellites looking at the Earth, and it is desirable to classify what sort of objects occupy

245 a given area of land. Sometimes detailed training data is not available, in which case it is desirable

246 at least to cluster together those regions of land which contain similar objects.

247 We believe that it may be possible for these two different fields to share some common compu-

248 tational tools. To this end, we intend to make use of existing hyperspectral imaging software when

249 possible, and to develop new software in such a way so as to make it easy to use for the purpose

250 of hyperspectral image analysis, as well as for our primary purpose of spatial gene expression

251 data analysis.

252 Related work

254 Figure 2: Gene Pitx2 is selectively underexpressed in area SS.

256 As noted above, the GIS community has developed tools for supervised

257 classification and unsupervised clustering in the context of the analysis

258 of hyperspectral imaging data. One tool is Spectral Python (footnote 6). Spectral

259 Python implements various supervised and unsupervised classification

260 methods, as well as utility functions for loading, viewing, and saving

261 spatial data. Although Spectral Python has feature extraction methods

262 (such as principal components analysis) which create a small set of

263 new features computed based on the original features, it does not have

264 feature selection methods, that is, methods to select a small subset

265 out of the original features (although feature selection in hyperspectral

266 imaging has been investigated by others[19]).

267 There is a substantial body of work on the analysis of gene expression data. Most of this

268 concerns gene expression data which are not fundamentally spatial (footnote 7). Here we review only that work

269 which concerns the automated analysis of spatial gene expression data with respect to anatomy.

270 Relating to Goal 1, GeneAtlas[5] and EMAGE[24] allow the user to construct a search query by

271 demarcating regions and then specifying either the strength of expression or the name of another

272 gene or dataset whose expression pattern is to be matched. Neither GeneAtlas nor EMAGE allows

273 one to search for combinations of genes that define a region in concert.

274 Relating to Goal 2, EMAGE[24] allows the user to select a dataset from among a large number

275 of alternatives, or by running a search query, and then to cluster the genes within that dataset.

276 EMAGE clusters via hierarchical complete linkage clustering.

277 [15] describes AGEA, “Anatomic Gene Expression Atlas”. AGEA has three components. Gene

278 Finder: The user selects a seed voxel and the system (1) chooses a cluster which includes the

279 seed voxel, (2) yields a list of genes which are overexpressed in that cluster. Correlation: The user

280 selects a seed voxel and the system then shows the user how much correlation there is between

281 the gene expression profile of the seed voxel and every other voxel. Clusters: AGEA includes a

282 ____________________________________

283 Footnote 6: http://spectralpython.sourceforge.net/

284 Footnote 7: By “fundamentally spatial” we mean that there is information from a large number of spatial locations indexed by

285 spatial coordinates; not just data which have only a few different locations or which are indexed by anatomical label.

288 preset hierarchical clustering of voxels based on a recursive bifurcation algorithm with correlation

289 as the similarity metric. AGEA has been applied to the cortex. The paper describes interesting

290 results on the structure of correlations between voxel gene expression profiles within a handful of

291 cortical areas. However, that analysis neither looks for genes marking cortical areas, nor does it

292 suggest a cortical map based on gene expression data. Neither of the other components of AGEA

293 can be applied to cortical areas: AGEA’s Gene Finder cannot be used to find marker genes for the

294 cortical areas, and AGEA’s hierarchical clustering does not produce clusters corresponding to the

295 cortical areas (footnote 8).

298 Figure 3: The top row shows the two genes which (individually) best predict area AUD, according to logistic regression. The bottom row shows the two genes which (individually) best match area AUD, according to gradient similarity. From left to right and top to bottom, the genes are Ssr1, Efcbp1, Ptk7, and Aph1a.

306 [6] looks at the mean expression level of genes within

307 anatomical regions, and applies a Student’s t-test to de-

308 termine whether the mean expression level of a gene is

309 significantly higher in the target region. This relates to

310 our Goal 1. [6] also clusters genes, relating to our Goal

311 2. For each cluster, prototypical spatial expression pat-

312 terns were created by averaging the genes in the cluster.

313 The prototypes were analyzed manually, without cluster-

314 ing voxels.

315 These related works differ from our strategy for Goal

316 1 in at least three ways. First, they find only single genes,

317 whereas we will also look for combinations of genes.

318 Second, they usually can only use overexpression as

319 a marker, whereas we will also search for underexpres-

320 sion. Third, they use scores based on pointwise expres-

321 sion levels, whereas we will also use geometric scores

322 such as gradient similarity (described in Preliminary Re-

323 sults). Figures 4, 2, and 3 in the Preliminary Results

324 section contain evidence that each of our three choices

325 is the right one.

326 [10] describes a technique to find combinations of

327 marker genes to pick out an anatomical region. They

328 use an evolutionary algorithm to evolve logical operators which combine boolean (thresholded)

329 images in order to match a target image. They apply their technique for finding combinations of

330 marker genes for the purpose of clustering genes around a “seed gene”.

331 Relating to our Goal 2, some researchers have attempted to parcellate cortex on the basis of

332 non-gene expression data. For example, [17], [2], [18], and [1] associate spots on the cortex with

333 the radial profile (footnote 9) of response to some stain ([12] uses MRI), extract features from this profile, and

334 then use similarity between surface pixels to cluster.

335 [22] describes an analysis of the anatomy of the hippocampus using the ABA dataset. In

336 addition to manual analysis, two clustering methods were employed, a modified Non-negative

337 Matrix Factorization (NNMF), and a hierarchical bifurcation clustering scheme using correlation as

338 ____________________________________

339 Footnote 8: In both cases, the cause is that pairwise correlations between the gene expression of voxels in different areas but

340 the same layer are often stronger than pairwise correlations between the gene expression of voxels in different layers

341 but the same area. Therefore, a pairwise voxel correlation clustering algorithm will tend to create clusters representing

342 cortical layers, not areas.

343 Footnote 9: A radial profile is a profile along a line perpendicular to the cortical surface.

346 similarity. The paper yielded impressive results, demonstrating the usefulness of computational genomic

347 anatomy. We have run NNMF on the cortical dataset, and while the results are promising, other

348 methods may perform as well or better (see Preliminary Results, Figure 6).

349 Comparing previous work with our Goal 1, there has been fruitful work on finding marker genes,

350 but only one of the projects explored combinations of marker genes, and none of them compared

351 the results obtained by using different algorithms or scoring methods. Comparing previous work

352 with Goal 2, although some projects obtained clusterings, there has not been much comparison

353 between different algorithms or scoring methods, so it is likely that the best clustering method for

354 this application has not yet been found. Also, none of these projects did a separate dimensionality

355 reduction step before clustering pixels, or tried to cluster genes first in order to guide automated

356 clustering of pixels into spatial regions, or used co-clustering algorithms.

357 In summary, (a) only one of the previous projects explores combinations of marker genes, (b)

358 there has been almost no comparison of different algorithms or scoring methods, and (c) there

359 has been no work on computationally finding marker genes applied to cortical areas, or on finding

360 a hierarchical clustering that will yield a map of cortical areas de novo from gene expression data.

361 Our project is guided by a concrete application with a well-specified criterion of success (how

362 well we can find marker genes for / reproduce the layout of cortical areas), which will provide a

363 solid basis for comparing different methods.

364 _________________________________________________

365 Data sharing plan

368 Figure 4: Upper left: wwc1. Upper right: mtif2. Lower left: wwc1 + mtif2 (each pixel’s value on the lower left is the sum of the corresponding pixels in the upper row).

372 We are enthusiastic about the sharing of methods and

373 data, and at the conclusion of the project, we will make

374 all of our data and computer source code publicly

375 available, either in supplemental attachments to publications,

376 or on a website. The source code will be released under

377 the GNU General Public License. We intend to include a

378 software program which, when run, will take as input the

379 Allen Brain Atlas raw data, and produce as output all

380 numbers and charts found in publications resulting from

381 the project. Source code to be released will include ex-

382 tensions to Caret[7], an existing open-source scientific

383 imaging program, and to Spectral Python. Data to be

384 released will include the 2-D “flat map” dataset. This

385 dataset will be submitted to a machine learning dataset

386 repository.

387 Broader impacts

388 In addition to validating the usefulness of the algorithms,

389 the application of these methods to cortex will produce

390 immediate benefits, because there are currently no known genetic markers for most cortical areas.

391 The method developed in Goal 1 will be applied to each cortical area to find a set of marker

392 genes such that the combinatorial expression pattern of those genes uniquely picks out the target

393 area. Finding marker genes will be useful for drug discovery as well as for experimentation be-

394 cause marker genes can be used to design interventions which selectively target individual cortical

395 areas.

398 The application of the marker gene finding algorithm to the cortex will also support the develop-

399 ment of new neuroanatomical methods. In addition to finding markers for each individual cortical

400 area, we will find a small panel of genes that can find many of the areal boundaries at once.

401 The method developed in Goal 2 will provide a genoarchitectonic viewpoint that will contribute

402 to the creation of a better cortical map.

403 The methods we will develop will be applicable to other datasets beyond the brain, and even to

404 datasets outside of biology. The software we develop will be useful for the analysis of hyperspectral

405 images. Our project will draw attention to this area of overlap between neuroscience and GIS, and

406 may lead to future collaborations between these two fields. The cortical dataset that we produce

407 will be useful in the machine learning community as a sample dataset that new algorithms can be

408 tested against. The availability of this sample dataset to the machine learning community may lead

409 to more interest in the design of machine learning algorithms to analyze spatial gene expression.

410 __________________

411 Preliminary Results

412 Format conversion between SEV, MATLAB, NIFTI

413 We have created software to (politely) download all of the SEV files (footnote 10) from the Allen Institute

414 website. We have also created software to convert between the SEV, MATLAB, and NIFTI file

415 formats, as well as some of Caret’s file formats.

416 Flatmap of cortex

417 We downloaded the ABA data and selected only those voxels which belong to cerebral cortex.

418 We divided the cortex into hemispheres. Using Caret[7], we created a mesh representation of the

419 surface of the selected voxels. For each gene, and for each node of the mesh, we calculated an

420 average of the gene expression of the voxels “underneath” that mesh node. We then flattened

421 the cortex, creating a two-dimensional mesh. We converted this grid into a MATLAB matrix. We

422 manually traced the boundaries of each of 46 cortical areas from the ABA coronal reference atlas

423 slides, and converted this region data into MATLAB format.

424 At this point, the data are in the form of a number of 2-D matrices, all in registration, with the

425 matrix entries representing a grid of points (pixels) over the cortical surface. There is one 2-D

426 matrix whose entries represent the regional label associated with each surface pixel. And for each

427 gene, there is a 2-D matrix whose entries represent the average expression level underneath each

428 surface pixel. The features and the target area are both functions on the surface pixels. They can

429 be referred to as scalar fields over the space of surface pixels; alternately, they can be thought of

430 as images which can be displayed on the flatmapped surface.
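
As a minimal sketch of this layout (assuming NumPy arrays in place of the MATLAB matrices; the array names, grid size, and label codes below are hypothetical and chosen only for illustration), the registered 2-D matrices can be stacked so that each surface pixel becomes one instance with one feature per gene:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: a 4x5 grid of surface pixels and 3 genes.
n_rows, n_cols, n_genes = 4, 5, 3

# One 2-D matrix of regional labels (hypothetical codes 0, 1, 2).
labels = rng.integers(0, 3, size=(n_rows, n_cols))

# One 2-D matrix of average expression per gene, all in registration.
expression = rng.random((n_genes, n_rows, n_cols))

# Flatten the grid: rows are instances (pixels), columns are features (genes).
X = expression.reshape(n_genes, -1).T   # shape (n_pixels, n_genes)
y = labels.ravel()                      # shape (n_pixels,)

# Boolean mask for one target area, as an image and as a label vector.
target_mask = labels == 1
target_vec = target_mask.ravel()
```

Pixel (r, c) of the grid corresponds to row r * n_cols + c of X, so a score computed per instance can be reshaped back into an image on the flatmap.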

431 Feature selection and scoring methods

432 Underexpression of a gene can serve as a marker. Underexpression of a gene can sometimes

433 serve as a marker. For example, see Figure 2.

434 Correlation Recall that the instances are surface pixels, and consider the problem of attempt-

435 ing to classify each instance as either a member of a particular anatomical area, or not. The target

436 area can be represented as a boolean mask over the surface pixels.

437 ¹⁰ SEV is a sparse format for spatial data. It is the format in which the ABA data is made available.

440 We calculated the correlation between each gene and each cortical area. The top row of Figure

441 1 shows the three genes most correlated with area SS.
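
The correlation score can be sketched as follows. This is an illustrative NumPy version, not the code used to produce Figure 1; the function name `correlation_scores` and the toy arrays are assumptions, and constant genes are arbitrarily given a score of 0.

```python
import numpy as np

def correlation_scores(expression, target_mask):
    """Pearson correlation of each gene image with a boolean area mask.

    expression: array (n_genes, n_rows, n_cols)
    target_mask: boolean array (n_rows, n_cols)
    """
    X = expression.reshape(expression.shape[0], -1).astype(float)
    t = target_mask.ravel().astype(float)
    Xc = X - X.mean(axis=1, keepdims=True)
    tc = t - t.mean()
    denom = np.sqrt((Xc ** 2).sum(axis=1) * (tc ** 2).sum())
    # Genes with constant expression get a score of 0 rather than NaN.
    return np.divide(Xc @ tc, denom, out=np.zeros(len(X)), where=denom > 0)

# Toy example: gene 0 matches the mask, gene 1 is its complement, gene 2 is flat.
mask = np.zeros((4, 4), dtype=bool)
mask[:2, :2] = True
genes = np.stack([mask.astype(float), 1.0 - mask, np.ones((4, 4))])
scores = correlation_scores(genes, mask)
ranking = np.argsort(-scores)   # most correlated gene first
```

A strongly negative score would flag a gene that is underexpressed inside the area, so ranking by absolute value is one way to catch both kinds of marker.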

442 Conditional entropy

443 For each region, we created and ran a forward stepwise procedure which attempted to find

444 pairs of genes such that the conditional entropy of the target area’s boolean mask, conditioned

445 upon the gene pair’s thresholded expression levels, is minimized.

446 This finds pairs of genes which are most informative (at least at these threshold levels) relative

447 to the question, “Is this surface pixel a member of the target area?”. The advantage over linear

448 methods such as logistic regression is that this takes account of arbitrarily nonlinear relationships;

449 for example, if the XOR of two variables predicts the target, conditional entropy would notice,

450 whereas linear methods would not.
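
The XOR case can be illustrated with a small sketch. Note the assumptions: this version searches all gene pairs exhaustively rather than using the forward stepwise procedure described above, the threshold of 0.5 is arbitrary, and the function names and toy data are hypothetical.

```python
import numpy as np
from itertools import combinations

def entropy(labels):
    """Shannon entropy (bits) of a 1-D array of discrete labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def conditional_entropy(target, a, b):
    """H(target | a, b) for boolean vectors target, a, b."""
    joint = 2 * a.astype(int) + b.astype(int)   # 4 possible gene-pair states
    h = 0.0
    for state in np.unique(joint):
        sel = joint == state
        h += sel.mean() * entropy(target[sel])
    return h

def best_pair(X, target, thresholds):
    """Exhaustive search over gene pairs; X is (n_pixels, n_genes)."""
    B = X > thresholds                           # thresholded expression
    pairs = combinations(range(X.shape[1]), 2)
    return min(pairs, key=lambda ij: conditional_entropy(target, B[:, ij[0]], B[:, ij[1]]))

# Toy data in which the target is the XOR of thresholded genes 0 and 2.
rng = np.random.default_rng(0)
X = rng.random((400, 4))
target = (X[:, 0] > 0.5) ^ (X[:, 2] > 0.5)
pair = best_pair(X, target, thresholds=np.full(4, 0.5))
```

In this toy case neither gene alone is informative about the target, which is also why a purely greedy single-gene first step could miss such a pair.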

451 Gradient similarity We noticed that the previous two scoring methods, which are pointwise,

452 often found genes whose pattern of expression did not look similar in shape to the target region.

453 For this reason we designed a non-pointwise scoring method to detect when a gene had a pattern

454 of expression which looked like it had a boundary whose shape is similar to the shape of the target

455 region. We call this scoring method “gradient similarity”. The formula is:

456 \[ \sum_{\text{pixel} \in \text{pixels}} \cos(\angle\nabla_1 - \angle\nabla_2) \cdot \frac{|\nabla_1| + |\nabla_2|}{2} \cdot \frac{\text{pixel\_value}_1 + \text{pixel\_value}_2}{2} \]

460 where ∇_1 and ∇_2 are the gradient vectors of the two images at the current pixel; ∠∇_i is the

461 angle of the gradient of image i at the current pixel; |∇_i| is the magnitude of the gradient of image

462 i at the current pixel; and pixel_value_i is the value of the current pixel in image i.

463 The intuition is that we want to see if the borders of the pattern in the two images are similar; if

464 the borders are similar, then both images will have corresponding pixels with large gradients (be-

465 cause this is a border) which are oriented in a similar direction (because the borders are similar).
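
A minimal NumPy sketch of this score follows. It assumes finite-difference gradients from `numpy.gradient` and computes the cosine of the angle difference as a normalized dot product; setting that cosine to 0 wherever either gradient vanishes (so the angle is undefined) is our assumption for the sketch, not something the formula specifies.

```python
import numpy as np

def gradient_similarity(img1, img2):
    """Sum over pixels of cos(angle difference) * mean gradient magnitude * mean pixel value."""
    img1 = np.asarray(img1, dtype=float)
    img2 = np.asarray(img2, dtype=float)
    gy1, gx1 = np.gradient(img1)
    gy2, gx2 = np.gradient(img2)
    mag1 = np.hypot(gx1, gy1)
    mag2 = np.hypot(gx2, gy2)
    # cos(angle1 - angle2) equals the normalized dot product of the two gradients.
    dot = gx1 * gx2 + gy1 * gy2
    norm = mag1 * mag2
    cos_diff = np.divide(dot, norm, out=np.zeros_like(dot), where=norm > 0)
    per_pixel = cos_diff * (mag1 + mag2) / 2 * (img1 + img2) / 2
    return float(per_pixel.sum())

# Toy images: a has a vertical border, b has a horizontal border.
a = np.zeros((8, 8))
a[:, 4:] = 1.0
b = np.zeros((8, 8))
b[4:, :] = 1.0
same = gradient_similarity(a, a)        # borders coincide
different = gradient_similarity(a, b)   # borders are perpendicular
```

On these toy images the score is large when the borders coincide and vanishes when they are perpendicular, which matches the intuition given above.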

466 Gradient similarity provides information complementary to correlation

467 To show that gradient similarity can provide useful information that cannot be detected via

468 pointwise analyses, consider Fig. 3. The pointwise method in the top row identifies genes which

469 express more strongly in AUD than outside of it; its weakness is that this includes many genes

470 which don’t have a salient border matching the areal border. The geometric method identifies

471 genes whose salient expression border seems to partially line up with the border of AUD; its

472 weakness is that this includes genes which don’t express over the entire area.

473 Areas which can be identified by single genes Using gradient similarity, we have already

474 found single genes which roughly identify some areas and groupings of areas. For each of these

475 areas, an example of a gene which roughly identifies it is shown in Figure 5. We have not yet

476 cross-verified these genes in other atlases.

477 In addition, there are a number of areas which are almost identified by single genes: COAa+NLOT

478 (anterior part of cortical amygdalar area, nucleus of the lateral olfactory tract), ENT (entorhinal),

479 ACAv (ventral anterior cingulate), VIS (visual), AUD (auditory).

480 These results validate our expectation that the ABA dataset can be exploited to find marker

481 genes for many cortical areas, while also validating the relevance of our new scoring method,

482 gradient similarity.


489 Figure 5: From left to right and top to bottom, single genes which roughly identify areas SS

490 (somatosensory primary + supplemental), SSs (supplemental somatosensory), PIR (piriform), FRP

491 (frontal pole), RSP (retrosplenial), COApm (cortical amygdalar, posterior part, medial zone).

492 Grouping some areas together, we have also found genes to identify the groups

493 ACA+PL+ILA+DP+ORB+MO (anterior cingulate, prelimbic, infralimbic, dorsal peduncular, orbital,

494 motor), posterior and lateral visual (VISpm, VISpl, VISl, VISp; posteromedial, posterolateral,

495 lateral, and primary visual; the posterior and lateral visual area is distinguished from its

496 neighbors, but not from the entire rest of the cortex). The genes are Pitx2, Aldh1a2, Ppfibp1,

497 Slco1a5, Tshz2, Trhr, Col12a1, Ets1.

509 Combinations of multiple genes are useful and necessary for some areas

511 In Figure 4, we give an example of a cortical area which is not marked by any single gene, but

512 which can be identified combinatorially. According to logistic regression, gene Wwc1 is the

513 best-fitting single gene for predicting whether or not a pixel on the cortical surface belongs to

514 the motor area (area MO). The upper-left picture in Figure 4 shows Wwc1’s spatial expression

515 pattern over the cortex. The lower-right boundary of MO is represented reasonably well by this

516 gene, but the gene overshoots the upper-left boundary. This flattened 2-D representation does

517 not show it, but the area corresponding to the overshoot is the medial surface of the cortex,

518 whereas MO is only found on the dorsal surface. Gene Mtif2 is shown in the upper-right. Mtif2

519 captures MO’s upper-left boundary, but not its lower-right boundary, and it does not express

520 very much on the medial surface. By adding together the values at each pixel in these two

521 images, we get the lower-left image. This combination captures area MO much better than any

522 single gene.

530 This shows that our proposal to develop a method to find combinations of marker genes is both

531 feasible and necessary.

533 Multivariate supervised learning

534 Forward stepwise logistic regression Logistic regres-

535 sion is a popular method for predictive modeling of cat-

536 egorical data. As a pilot run, for five cortical areas (SS,

537 AUD, RSP, VIS, and MO), we performed forward step-

538 wise logistic regression to find single genes, pairs of

539 genes, and triplets of genes which predict areal identity.

540 This is an example of feature selection integrated with

541 prediction using a stepwise wrapper. Some of the sin-

542 gle genes found were shown in various figures through-

543 out this document, and Figure 4 shows a combination of

544 genes which was found.
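
A stepwise wrapper of this kind can be sketched in a few lines. The version below is illustrative only: it assumes a plain gradient-descent logistic regression written in NumPy and scores candidate genes by in-sample log-loss, whereas a practical wrapper would more likely use a library solver and held-out data. All function names and the toy data are hypothetical.

```python
import numpy as np

def fit_logistic(X, y, n_iter=500, lr=0.5):
    """Gradient-descent logistic regression; returns weights (bias first)."""
    A = np.column_stack([np.ones(len(X)), X])
    w = np.zeros(A.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-A @ w))
        w -= lr * A.T @ (p - y) / len(y)
    return w

def log_loss(X, y, w):
    """Mean negative log-likelihood, computed in a numerically stable form."""
    A = np.column_stack([np.ones(len(X)), X])
    z = A @ w
    return float(np.mean(np.logaddexp(0.0, z) - y * z))

def forward_stepwise(X, y, max_genes=3):
    """Greedily add the gene that most reduces the log-loss of the refit model."""
    selected = []
    for _ in range(max_genes):
        remaining = [j for j in range(X.shape[1]) if j not in selected]

        def loss_with(j):
            cols = selected + [j]
            return log_loss(X[:, cols], y, fit_logistic(X[:, cols], y))

        selected.append(min(remaining, key=loss_with))
    return selected

# Toy data: a pixel is in the target area when genes 0 and 3 are jointly high.
rng = np.random.default_rng(0)
X = rng.random((300, 5))
y = (X[:, 0] + X[:, 3] > 1.0).astype(float)
selected = forward_stepwise(X, y, max_genes=2)
```

The toy target loosely mirrors the MO example, where the sum of two expression patterns delineates the area better than either pattern alone.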

545 SVM on all genes at once

546 In order to see how well one can do when looking at

547 all genes at once, we ran a support vector machine to

548 classify cortical surface pixels based on their gene expression profiles. We achieved a

549 classification accuracy of about 81% under 5-fold cross-validation. However, as noted above, a

550 classifier that looks at all the genes at once isn’t as practically useful as a classifier that

551 uses only a few genes.

557 Data-driven redrawing of the cortical map

558 We have applied the following dimensionality reduction algorithms to reduce the dimensionality

559 of the gene expression profile associated with each pixel: Principal Components Analysis (PCA),

560 Simple PCA, Multi-Dimensional Scaling, Isomap, Landmark Isomap, Laplacian eigenmaps, Local

561 Tangent Space Alignment, Stochastic Proximity Embedding, Fast Maximum Variance Unfolding,

562 Non-negative Matrix Factorization (NNMF). Space constraints prevent us from showing many of

563 the results, but as a sample, PCA, NNMF, and landmark Isomap are shown in the first, second,

564 and third rows of Figure 6.

565 After applying the dimensionality reduction, we ran clustering algorithms on the reduced data.

566 To date we have tried k-means and spectral clustering. The results of k-means after PCA, NNMF,

567 and landmark Isomap are shown in the bottom row of Figure 6. To compare, the leftmost picture

568 on the bottom row of Figure 6 shows some of the major subdivisions of cortex. These results show

569 that different dimensionality reduction techniques capture different aspects of the data and lead

570 to different clusterings, indicating the utility of our proposal to produce a detailed comparison of

571 these techniques as applied to the domain of genomic anatomy.
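
The reduce-then-cluster pipeline can be sketched as follows, assuming PCA via the singular value decomposition and a basic Lloyd's k-means with a deterministic farthest-point initialisation. These are stand-ins chosen to keep the example self-contained; they are not the implementations used to produce Figure 6, and the toy "cortex" is synthetic.

```python
import numpy as np

def pca_reduce(X, n_components):
    """Project rows of X (pixels x genes) onto the top principal components."""
    Xc = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ vt[:n_components].T

def kmeans(Z, k, n_iter=50):
    """Basic Lloyd's algorithm; returns one cluster label per row of Z."""
    # Farthest-point initialisation keeps this sketch deterministic.
    centers = [Z[0]]
    for _ in range(1, k):
        d = np.min([((Z - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(Z[d.argmax()])
    centers = np.array(centers, dtype=float)
    for _ in range(n_iter):
        dists = ((Z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for c in range(k):
            if np.any(labels == c):     # leave an empty cluster's center unchanged
                centers[c] = Z[labels == c].mean(axis=0)
    return labels

# Synthetic flatmap: two regions with different expression profiles plus noise.
rng = np.random.default_rng(0)
n_rows, n_cols, n_genes = 6, 6, 20
profile_a, profile_b = rng.random(n_genes), rng.random(n_genes)
region = np.zeros((n_rows, n_cols), dtype=int)
region[:, 3:] = 1
X = np.where(region.ravel()[:, None] == 0, profile_a, profile_b)
X = X + 0.01 * rng.standard_normal(X.shape)

cluster_map = kmeans(pca_reduce(X, 2), k=2).reshape(n_rows, n_cols)
```

Reshaping the cluster labels back onto the pixel grid gives an image that can be compared directly with the reference areal boundaries.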

572 Many areas are captured by clusters of genes We also clustered the genes using gradient

573 similarity to see if the spatial regions defined by any clusters matched known anatomical regions.

574 Figure 7 shows, for ten sample gene clusters, each cluster’s average expression pattern, com-

575 pared to a known anatomical boundary. This suggests that it is worth attempting to cluster genes,

576 and then to use the results to cluster pixels.

577 Our plan: what remains to be done

578 Flatmap cortex and segment cortical layers

579 There are multiple ways to flatten 3-D data into 2-D. We will compare mappings from manifolds to

580 planes which attempt to preserve size (such as the one used by Caret[7]) with mappings which

581 preserve angle (conformal maps). We will also develop a segmentation algorithm to automatically

582 identify the layer boundaries.

583 Develop algorithms that find genetic markers for anatomical regions

584 Scoring measures and feature selection We will develop scoring methods for evaluating how

585 good individual genes are at marking areas. We will compare pointwise, geometric, and information-

586 theoretic measures. We already developed one entirely new scoring method (gradient similarity),

587 but we may develop more. Scoring measures that we will explore will include the L1 norm, cor-

588 relation, expression energy ratio, conditional entropy, gradient similarity, Jaccard similarity, Dice

589 similarity, Hough transform, and statistical tests such as Student’s t-test, and the Mann-Whitney

590 U test (a non-parametric test). In addition, any classifier induces a scoring measure on genes by

591 taking the prediction error when using that gene to predict the target.
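
Two of the overlap measures listed above are simple enough to sketch directly. The example assumes the gene image has already been thresholded into a boolean mask (the choice of threshold is left open here), and returning 1.0 for two empty masks is a convention we adopt for the sketch.

```python
import numpy as np

def jaccard(a, b):
    """Jaccard similarity |A and B| / |A or B| of two boolean masks."""
    a, b = np.asarray(a, dtype=bool), np.asarray(b, dtype=bool)
    union = np.logical_or(a, b).sum()
    return float(np.logical_and(a, b).sum() / union) if union else 1.0

def dice(a, b):
    """Dice similarity 2 |A and B| / (|A| + |B|) of two boolean masks."""
    a, b = np.asarray(a, dtype=bool), np.asarray(b, dtype=bool)
    total = a.sum() + b.sum()
    return float(2 * np.logical_and(a, b).sum() / total) if total else 1.0

# Toy example: the thresholded gene covers the target area plus one extra column.
target = np.zeros((4, 4), dtype=bool)
target[:2, :2] = True            # 4 pixels
gene_on = np.zeros((4, 4), dtype=bool)
gene_on[:2, :3] = True           # 6 pixels, 4 of them inside the target
j_score = jaccard(gene_on, target)
d_score = dice(gene_on, target)
```

Both scores penalize the overshoot, with Dice being the more lenient of the two for the same overlap.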

592 Using some combination of these measures, we will develop a procedure to find single marker

593 genes for anatomical regions: for each cortical area, we will rank the genes by their ability to

594 delineate that area. We will quantitatively compare the list of single genes generated by our

595 method to the lists generated by methods which are mentioned in Related Work.


599 Figure 6: First row: the first 6 reduced dimensions, using PCA. Sec-

600 ond row: the first 6 reduced dimensions, using NNMF. Third row: the

601 first six reduced dimensions, using landmark Isomap. Bottom row:

602 examples of k-means clustering applied to reduced datasets to find

603 7 clusters. Left: 19 of the major subdivisions of the cortex. Sec-

604 ond from left: PCA. Third from left: NNMF. Right: Landmark Isomap.

605 Additional details: In the third and fourth rows, 7 dimensions were

606 found, but only 6 displayed. In the last row: for PCA, 50 dimensions

607 were used; for NNMF, 6 dimensions were used; for landmark Isomap,

608 7 dimensions were used.

609 Some cortical areas have no single marker genes but can be identified by combinatorial coding.

610 This requires multivariate scoring measures and feature selection procedures. Many of the

611 measures, such as expression energy, gradient similarity, Jaccard, Dice, Hough, Student’s t,

612 and Mann-Whitney U, are univariate. We will extend these scoring measures for use in

613 multivariate feature selection, that is, for scoring how well combinations of genes, rather

614 than individual genes, can distinguish a target area. There are existing multivariate forms of

615 some of the univariate scoring measures; for example, Hotelling’s T-square is a multivariate

616 analog of Student’s t.

635 We will develop a fea-

636 ture selection procedure for choosing the best small set of marker genes for a given anatomical

637 area. In addition to using the scoring measures that we develop, we will also explore (a) feature

638 selection using a stepwise wrapper over “vanilla” classifiers such as logistic regression, (b) super-

639 vised learning methods such as decision trees which incrementally/greedily combine single gene

640 markers into sets, and (c) supervised learning methods which use soft constraints to minimize

641 number of features used, such as sparse support vector machines (SVMs).

642 Since errors of displacement and of shape may cause genes and target areas to match less

643 than they should, we will consider the robustness of feature selection methods in the presence of

644 error. Some of these methods, such as the Hough transform, are designed to be resistant in the

645 presence of error, but many are not.

646 An area may be difficult to identify because the boundaries are misdrawn in the atlas, or be-

647 cause the shape of the natural domain of gene expression corresponding to the area is different

648 from the shape of the area as recognized by anatomists. We will develop extensions to our

649 procedure which (a) detect when a difficult area could be fit if its boundary were redrawn slightly¹²,

650 ____________________________________

651 ¹² Not just any redrawing is acceptable, only those which appear to be justified as a natural spatial domain of gene

652 expression by multiple sources of evidence. Interestingly, the need to detect “natural spatial domains of gene expression”

653 in a data-driven fashion means that the methods of Goal 2 might be useful in achieving Goal 1, as well – particularly


656 and (b) detect when a difficult area could be combined with adjacent areas to create a larger area

657 which can be fit.

658 A future publication on the method that we develop in Goal 1 will review the scoring measures

659 and quantitatively compare their performance in order to provide a foundation for future research

660 on methods for finding marker genes. We will measure the robustness of the scoring measures as

661 well as their absolute performance on our dataset.

662 Develop algorithms to suggest a division of a structure into anatomical parts

664 Figure 7: Prototypes corresponding to sample gene clus-

665 ters, clustered by gradient similarity. Region boundaries for

666 the region that most matches each prototype are overlaid.

667 Dimensionality reduction on gene expression profiles We have already described the application of

668 ten dimensionality reduction algorithms for the purpose of replacing the gene expression profiles,

669 which are vectors of about 4000 gene expression levels, with a smaller number of features. We plan

670 to further explore and interpret these results, as well as to apply other unsupervised learning

671 algorithms, including independent components analysis, self-organizing maps, and generative

672 models such as deep Boltzmann machines. We will explore

680 ways to quantitatively compare the relevance of the different dimensionality reduction methods for

681 identifying cortical areal boundaries.

682 Dimensionality reduction on pixels Instead of applying dimensionality reduction to the gene

683 expression profiles, the same techniques can be applied to the pixels. It is possible that

684 the features generated in this way by some dimensionality reduction techniques will directly corre-

685 spond to interesting spatial regions.

686 Clustering and segmentation on pixels We will explore clustering and image segmentation

687 algorithms in order to segment the pixels into regions. We will explore k-means, spectral cluster-

688 ing, gene shaving[9], recursive division clustering, multivariate generalizations of edge detectors,

689 multivariate generalizations of watershed transformations, region growing, active contours, graph

690 partitioning methods, and recursive agglomerative clustering with various linkage functions. These

691 methods can be combined with dimensionality reduction.

692 Clustering on genes We have already shown that the procedure of clustering genes according

693 to gradient similarity, and then creating an averaged prototype of each cluster’s expression pattern,

694 yields some spatial patterns which match cortical areas (Figure 7). We will further explore the

695 clustering of genes.

696 In addition to using the cluster expression prototypes directly to identify spatial regions, this

697 might be useful as a component of dimensionality reduction. For example, one could imagine

698 clustering similar genes and then replacing their expression levels with a single average expression

699 ____________________________________

700 discriminative dimensionality reduction.


703 level, thereby removing some redundancy from the gene expression profiles. One could then

704 perform clustering on pixels (possibly after a second dimensionality reduction step) in order to

705 identify spatial regions. It remains to be seen whether removal of redundancy would help or hurt

706 the ultimate goal of identifying interesting spatial regions.

707 Co-clustering We will explore some algorithms which simultaneously incorporate clustering

708 on instances and on features (in our case, pixels and genes), for example, IRM[11]. These are

709 called co-clustering or biclustering algorithms.

710 Compare different methods In order to tell which method is best for genomic anatomy, for

711 each experimental method we will compare the cortical map found by unsupervised learning to a

712 cortical map derived from the Allen Reference Atlas. We will explore various quantitative metrics

713 that purport to measure how similar two clusterings are, such as Jaccard, Rand index, Fowlkes-

714 Mallows, variation of information, Larsen, Van Dongen, and others.
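
As one concrete instance of such a metric, the (unadjusted) Rand index can be sketched as below. It is the fraction of pixel pairs on which two labelings agree about whether the pair shares a cluster. The pairwise-matrix implementation is an assumption made for brevity and uses memory quadratic in the number of pixels.

```python
import numpy as np

def rand_index(labels_a, labels_b):
    """Fraction of pixel pairs treated consistently (together or apart) by both labelings."""
    a = np.asarray(labels_a).ravel()
    b = np.asarray(labels_b).ravel()
    same_a = a[:, None] == a[None, :]
    same_b = b[:, None] == b[None, :]
    iu = np.triu_indices(len(a), k=1)   # each unordered pair once
    return float((same_a[iu] == same_b[iu]).mean())

# Toy labelings of four pixels.
reference = [0, 0, 1, 1]
relabeled = [1, 1, 0, 0]    # same partition, different cluster names
crossed = [0, 1, 0, 1]      # a different partition
ri_same = rand_index(reference, relabeled)
ri_diff = rand_index(reference, crossed)
```

Because the index depends only on co-membership, it is unaffected by how clusters are named, which matters when comparing an unsupervised map against atlas labels.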

715 Discriminative dimensionality reduction In addition to using a purely data-driven approach

716 to identify spatial regions, it might be useful to see how well the known regions can be recon-

717 structed from a small number of features, even if those features are chosen by using knowledge of

718 the regions. For example, linear discriminant analysis could be used as a dimensionality reduction

719 technique in order to identify a few features which are the best linear summary of gene expression

720 profiles for the purpose of discriminating between regions. This reduced feature set could then be

721 used to cluster pixels into regions. Perhaps the resulting clusters will be similar to the reference

722 atlas, yet more faithful to natural spatial domains of gene expression than the reference atlas is.

723 Apply the new methods to the cortex

724 Using the methods developed in Goal 1, we will present, for each cortical area, a short list of

725 markers to identify that area; and we will also present lists of “panels” of genes that can be used

726 to delineate many areas at once.

727 Because in most cases the ABA coronal dataset only contains one ISH per gene, it is possible

728 for an unrelated combination of genes to seem to identify an area when in fact it is only coinci-

729 dence. There are three ways we will validate our marker genes to guard against this. First, we

730 will confirm that putative combinations of marker genes express the same pattern in both hemi-

731 spheres. Second, we will manually validate our final results on other gene expression datasets

732 such as EMAGE, GeneAtlas, and GENSAT[8]. Third, we may conduct ISH experiments jointly with

733 collaborators to get further data on genes of particular interest.

734 Using the methods developed in Goal 2, we will present one or more hierarchical cortical

735 maps. We will identify and explain how the statistical structure in the gene expression data led to

736 any unexpected or interesting features of these maps, and we will provide biological hypotheses

737 to interpret any new cortical areas, or groupings of areas, which are discovered.

738 Apply the new methods to hyperspectral datasets

739 Our software will be able to read and write file formats common in the hyperspectral imaging

740 community such as Erdas LAN and ENVI, and it will be able to convert between the SEV and NIFTI

741 formats from neuroscience and the ENVI format from GIS. The methods developed in Goals 1 and

742 2 will be implemented either as part of Spectral Python or as a separate tool that interoperates

743 with Spectral Python. The methods will be run on hyperspectral satellite image datasets, and their

744 performance will be compared to existing hyperspectral analysis techniques.


747 References Cited

748 [1] Chris Adamson, Leigh Johnston, Terrie Inder, Sandra Rees, Iven Mareels, and Gary Egan.

749 A Tracking Approach to Parcellation of the Cerebral Cortex, volume 3749/2005 of Lecture

750 Notes in Computer Science, pages 294–301. Springer Berlin / Heidelberg, 2005.

751 [2] J. Annese, A. Pitiot, I. D. Dinov, and A. W. Toga. A myelo-architectonic method for the struc-

752 tural classification of cortical areas. NeuroImage, 21(1):15–26, 2004.

753 [3] Tanya Barrett, Dennis B. Troup, Stephen E. Wilhite, Pierre Ledoux, Dmitry Rudnev, Carlos

754 Evangelista, Irene F. Kim, Alexandra Soboleva, Maxim Tomashevsky, and Ron Edgar. NCBI

755 GEO: mining tens of millions of expression profiles–database and tools update. Nucl. Acids

756 Res., 35(suppl_1):D760–765, 2007.

757 [4] George W. Bell, Tatiana A. Yatskievych, and Parker B. Antin. GEISHA, a whole-mount in

758 situ hybridization gene expression screen in chicken embryos. Developmental Dynamics,

759 229(3):677–687, 2004.

760 [5] James P Carson, Tao Ju, Hui-Chen Lu, Christina Thaller, Mei Xu, Sarah L Pallas, Michael C

761 Crair, Joe Warren, Wah Chiu, and Gregor Eichele. A digital atlas to characterize the mouse

762 brain transcriptome. PLoS Comput Biol, 1(4):e41, 2005.

763 [6] Mark H. Chin, Alex B. Geng, Arshad H. Khan, Wei-Jun Qian, Vladislav A. Petyuk, Jyl Boline,

764 Shawn Levy, Arthur W. Toga, Richard D. Smith, Richard M. Leahy, and Desmond J. Smith.

765 A genome-scale map of expression for a mouse brain section obtained using voxelation.

766 Physiol. Genomics, 30(3):313–321, August 2007.

767 [7] D C Van Essen, H A Drury, J Dickson, J Harwell, D Hanlon, and C H Anderson. An integrated

768 software suite for surface-based analyses of cerebral cortex. Journal of the American Medical

769 Informatics Association: JAMIA, 8(5):443–59, 2001. PMID: 11522765.

770 [8] Shiaoching Gong, Chen Zheng, Martin L. Doughty, Kasia Losos, Nicholas Didkovsky, Uta B.

771 Schambra, Norma J. Nowak, Alexandra Joyner, Gabrielle Leblanc, Mary E. Hatten, and

772 Nathaniel Heintz. A gene expression atlas of the central nervous system based on bacte-

773 rial artificial chromosomes. Nature, 425(6961):917–925, October 2003.

774 [9] Trevor Hastie, Robert Tibshirani, Michael Eisen, Ash Alizadeh, Ronald Levy, Louis Staudt,

775 Wing Chan, David Botstein, and Patrick Brown. ’Gene shaving’ as a method for identifying dis-

776 tinct sets of genes with similar expression patterns. Genome Biology, 1(2):research0003.1–

777 research0003.21, 2000.

778 [10] Jano Hemert and Richard Baldock. Matching Spatial Regions with Combinations of Interact-

779 ing Gene Expression Patterns, volume 13 of Communications in Computer and Information

780 Science, pages 347–361. Springer Berlin Heidelberg, 2008.

781 [11] C Kemp, JB Tenenbaum, TL Griffiths, T Yamada, and N Ueda. Learning systems of concepts

782 with an infinite relational model. In AAAI, 2006.

783 [12] F. Kruggel, M. K. Brückner, Th. Arendt, C. J. Wiggins, and D. Y. von Cramon. Analyzing the

784 neocortical fine-structure. Medical Image Analysis, 7(3):251–264, September 2003.


787 [13] Ed S. Lein, Michael J. Hawrylycz, Nancy Ao, Mikael Ayres, Amy Bensinger, Amy Bernard,

788 Andrew F. Boe, Mark S. Boguski, Kevin S. Brockway, Emi J. Byrnes, Lin Chen, Li Chen,

789 Tsuey-Ming Chen, Mei Chi Chin, Jimmy Chong, Brian E. Crook, Aneta Czaplinska, Chinh N.

790 Dang, Suvro Datta, Nick R. Dee, Aimee L. Desaki, Tsega Desta, Ellen Diep, Tim A. Dolbeare,

791 Matthew J. Donelan, Hong-Wei Dong, Jennifer G. Dougherty, Ben J. Duncan, Amanda J.

792 Ebbert, Gregor Eichele, Lili K. Estin, Casey Faber, Benjamin A. Facer, Rick Fields, Shanna R.

793 Fischer, Tim P. Fliss, Cliff Frensley, Sabrina N. Gates, Katie J. Glattfelder, Kevin R. Halverson,

794 Matthew R. Hart, John G. Hohmann, Maureen P. Howell, Darren P. Jeung, Rebecca A. John-

795 son, Patrick T. Karr, Reena Kawal, Jolene M. Kidney, Rachel H. Knapik, Chihchau L. Kuan,

796 James H. Lake, Annabel R. Laramee, Kirk D. Larsen, Christopher Lau, Tracy A. Lemon,

797 Agnes J. Liang, Ying Liu, Lon T. Luong, Jesse Michaels, Judith J. Morgan, Rebecca J. Mor-

798 gan, Marty T. Mortrud, Nerick F. Mosqueda, Lydia L. Ng, Randy Ng, Geralyn J. Orta, Car-

799 oline C. Overly, Tu H. Pak, Sheana E. Parry, Sayan D. Pathak, Owen C. Pearson, Ralph B.

800 Puchalski, Zackery L. Riley, Hannah R. Rockett, Stephen A. Rowland, Joshua J. Royall,

801 Marcos J. Ruiz, Nadia R. Sarno, Katherine Schaffnit, Nadiya V. Shapovalova, Taz Sivisay,

802 Clifford R. Slaughterbeck, Simon C. Smith, Kimberly A. Smith, Bryan I. Smith, Andy J. Sodt,

803 Nick N. Stewart, Kenda-Ruth Stumpf, Susan M. Sunkin, Madhavi Sutram, Angelene Tam,

804 Carey D. Teemer, Christina Thaller, Carol L. Thompson, Lee R. Varnam, Axel Visel, Ray M.

805 Whitlock, Paul E. Wohnoutka, Crissa K. Wolkey, Victoria Y. Wong, Matthew Wood, Murat B.

806 Yaylaoglu, Rob C. Young, Brian L. Youngstrom, Xu Feng Yuan, Bin Zhang, Theresa A. Zwing-

807 man, and Allan R. Jones. Genome-wide atlas of gene expression in the adult mouse brain.

808 Nature, 445(7124):168–176, 2007.

809 [14] Susan Magdaleno, Patricia Jensen, Craig L. Brumwell, Anna Seal, Karen Lehman, Andrew

810 Asbury, Tony Cheung, Tommie Cornelius, Diana M. Batten, Christopher Eden, Shannon M.

811 Norland, Dennis S. Rice, Nilesh Dosooye, Sundeep Shakya, Perdeep Mehta, and Tom Cur-

812 ran. BGEM: an in situ hybridization database of gene expression in the embryonic and adult

813 mouse nervous system. PLoS Biology, 4(4):e86, April 2006.

814 [15] Lydia Ng, Amy Bernard, Chris Lau, Caroline C Overly, Hong-Wei Dong, Chihchau Kuan,

815 Sayan Pathak, Susan M Sunkin, Chinh Dang, Jason W Bohland, Hemant Bokil, Partha P

816 Mitra, Luis Puelles, John Hohmann, David J Anderson, Ed S Lein, Allan R Jones, and Michael

817 Hawrylycz. An anatomic gene expression atlas of the adult mouse brain. Nat Neurosci,

818 12(3):356–362, March 2009.

819 [16] George Paxinos and Keith B.J. Franklin. The Mouse Brain in Stereotaxic Coordinates. Aca-

820 demic Press, 2 edition, July 2001.

821 [17] A. Schleicher, N. Palomero-Gallagher, P. Morosan, S. Eickhoff, T. Kowalski, K. Vos,

822 K. Amunts, and K. Zilles. Quantitative architectural analysis: a new approach to cortical

823 mapping. Anatomy and Embryology, 210(5):373–386, December 2005.

824 [18] Oliver Schmitt, Lars Hömke, and Lutz Dümbgen. Detection of cortical transition regions utilizing

825 statistical analyses of excess masses. NeuroImage, 19(1):42–63, May 2003.

826 [19] S.B. Serpico and L. Bruzzone. A new search algorithm for feature selection in hyperspec-

827 tral remote sensing images. Geoscience and Remote Sensing, IEEE Transactions on,

828 39(7):1360–1367, 2001.


831 [20] Constance M. Smith, Jacqueline H. Finger, Terry F. Hayamizu, Ingeborg J. McCright, Janan T.

832 Eppig, James A. Kadin, Joel E. Richardson, and Martin Ringwald. The mouse gene expres-

833 sion database (GXD): 2007 update. Nucl. Acids Res., 35(suppl_1):D618–623, 2007.

834 [21] Larry Swanson. Brain Maps: Structure of the Rat Brain. Academic Press, 3 edition, November

835 2003.

836 [22] Carol L. Thompson, Sayan D. Pathak, Andreas Jeromin, Lydia L. Ng, Cameron R. MacPher-

837 son, Marty T. Mortrud, Allison Cusick, Zackery L. Riley, Susan M. Sunkin, Amy Bernard,

838 Ralph B. Puchalski, Fred H. Gage, Allan R. Jones, Vladimir B. Bajic, Michael J. Hawrylycz,

839 and Ed S. Lein. Genomic anatomy of the hippocampus. Neuron, 60(6):1010–1021, Decem-

840 ber 2008.

841 [23] Pavel Tomancak, Amy Beaton, Richard Weiszmann, Elaine Kwan, ShengQiang Shu,

842 Suzanna E Lewis, Stephen Richards, Michael Ashburner, Volker Hartenstein, Susan E Cel-

843 niker, and Gerald M Rubin. Systematic determination of patterns of gene expression during

844 Drosophila embryogenesis. Genome Biology, 3(12):research0088.1–research0088.14, 2002. PMC151190.

845 [24] Shanmugasundaram Venkataraman, Peter Stevenson, Yiya Yang, Lorna Richardson,

846 Nicholas Burton, Thomas P. Perry, Paul Smith, Richard A. Baldock, Duncan R. Davidson,

847 and Jeffrey H. Christiansen. EMAGE Edinburgh Mouse Atlas of Gene Expression: 2008 update.

848 Nucl. Acids Res., 36(suppl_1):D860–865, 2008.

849 [25] Axel Visel, Christina Thaller, and Gregor Eichele. GenePaint.org: an atlas of gene expression

850 patterns in the mouse embryo. Nucl. Acids Res., 32(suppl_1):D552–556, 2004.

851 [26] Robert H Waterston, Kerstin Lindblad-Toh, Ewan Birney, Jane Rogers, Josep F Abril, Pankaj

852 Agarwal, Richa Agarwala, Rachel Ainscough, Marina Alexandersson, Peter An, Stylianos E

853 Antonarakis, John Attwood, Robert Baertsch, Jonathon Bailey, Karen Barlow, Stephan Beck,

854 Eric Berry, Bruce Birren, Toby Bloom, Peer Bork, Marc Botcherby, Nicolas Bray, Michael R

855 Brent, Daniel G Brown, Stephen D Brown, Carol Bult, John Burton, Jonathan Butler,

856 Robert D Campbell, Piero Carninci, Simon Cawley, Francesca Chiaromonte, Asif T Chin-

857 walla, Deanna M Church, Michele Clamp, Christopher Clee, Francis S Collins, Lisa L Cook,

858 Richard R Copley, Alan Coulson, Olivier Couronne, James Cuff, Val Curwen, Tim Cutts,

859 Mark Daly, Robert David, Joy Davies, Kimberly D Delehaunty, Justin Deri, Emmanouil T Der-

860 mitzakis, Colin Dewey, Nicholas J Dickens, Mark Diekhans, Sheila Dodge, Inna Dubchak,

861 Diane M Dunn, Sean R Eddy, Laura Elnitski, Richard D Emes, Pallavi Eswara, Eduardo

862 Eyras, Adam Felsenfeld, Ginger A Fewell, Paul Flicek, Karen Foley, Wayne N Frankel, Lu-

863 cinda A Fulton, Robert S Fulton, Terrence S Furey, Diane Gage, Richard A Gibbs, Gustavo

864 Glusman, Sante Gnerre, Nick Goldman, Leo Goodstadt, Darren Grafham, Tina A Graves,

865 Eric D Green, Simon Gregory, Roderic Guigó, Mark Guyer, Ross C Hardison, David Haussler,

866 Yoshihide Hayashizaki, LaDeana W Hillier, Angela Hinrichs, Wratko Hlavina, Timothy Holzer,

867 Fan Hsu, Axin Hua, Tim Hubbard, Adrienne Hunt, Ian Jackson, David B Jaffe, L Steven John-

868 son, Matthew Jones, Thomas A Jones, Ann Joy, Michael Kamal, Elinor K Karlsson, Donna

869 Karolchik, Arkadiusz Kasprzyk, Jun Kawai, Evan Keibler, Cristyn Kells, W James Kent, An-

870 drew Kirby, Diana L Kolbe, Ian Korf, Raju S Kucherlapati, Edward J Kulbokas, David Kulp,

871 Tom Landers, J P Leger, Steven Leonard, Ivica Letunic, Rosie Levine, Jia Li, Ming Li, Chris-

872 tine Lloyd, Susan Lucas, Bin Ma, Donna R Maglott, Elaine R Mardis, Lucy Matthews, Evan

875 Mauceli, John H Mayer, Megan McCarthy, W Richard McCombie, Stuart McLaren, Kirsten

876 McLay, John D McPherson, Jim Meldrim, Beverley Meredith, Jill P Mesirov, Webb Miller, Tra-

877 cie L Miner, Emmanuel Mongin, Kate T Montgomery, Michael Morgan, Richard Mott, James C

878 Mullikin, Donna M Muzny, William E Nash, Joanne O Nelson, Michael N Nhan, Robert Nicol,

879 Zemin Ning, Chad Nusbaum, Michael J O’Connor, Yasushi Okazaki, Karen Oliver, Emma

880 Overton-Larty, Lior Pachter, Genís Parra, Kymberlie H Pepin, Jane Peterson, Pavel Pevzner,

881 Robert Plumb, Craig S Pohl, Alex Poliakov, Tracy C Ponce, Chris P Ponting, Simon Potter,

882 Michael Quail, Alexandre Reymond, Bruce A Roe, Krishna M Roskin, Edward M Rubin,

883 Alistair G Rust, Ralph Santos, Victor Sapojnikov, Brian Schultz, Jörg Schultz, Matthias S Schwartz,

884 Scott Schwartz, Carol Scott, Steven Seaman, Steve Searle, Ted Sharpe, Andrew Sheridan,

885 Ratna Shownkeen, Sarah Sims, Jonathan B Singer, Guy Slater, Arian Smit, Douglas R Smith,

886 Brian Spencer, Arne Stabenau, Nicole Stange-Thomann, Charles Sugnet, Mikita Suyama,

887 Glenn Tesler, Johanna Thompson, David Torrents, Evanne Trevaskis, John Tromp, Cather-

888 ine Ucla, Abel Ureta-Vidal, Jade P Vinson, Andrew C Von Niederhausern, Claire M Wade,

889 Melanie Wall, Ryan J Weber, Robert B Weiss, Michael C Wendl, Anthony P West, Kris

890 Wetterstrand, Raymond Wheeler, Simon Whelan, Jamey Wierzbowski, David Willey, Sophie

891 Williams, Richard K Wilson, Eitan Winter, Kim C Worley, Dudley Wyman, Shan Yang, Shiaw-

892 Pyng Yang, Evgeny M Zdobnov, Michael C Zody, and Eric S Lander. Initial sequencing and

893 comparative analysis of the mouse genome. Nature, 420(6915):520–62, December 2002.

894 PMID: 12466850.
