nsf: 3aeb56c97327 grant.html


author	bshanks@bshanks.dyndns.org
date	Wed Jul 08 05:18:30 2009 -0700

1 Introduction

2 Massive new datasets obtained with techniques such as in situ hybridization (ISH), immunohisto-

3 chemistry, in situ transgenic reporter, microarray voxelation, and others, allow the expression levels

4 of many genes at many locations to be compared. Our goal is to develop automated methods to

5 relate spatial variation in gene expression to anatomy. We want to find marker genes for specific

6 anatomical regions, and also to draw new anatomical maps based on gene expression patterns.

7 We will validate these methods by applying them to 46 anatomical areas within the cerebral cortex,

8 using the Allen Mouse Brain Atlas coronal dataset (ABA).

9 This project has three primary goals:

10 (1) develop an algorithm to screen spatial gene expression data for combinations of marker

11 genes which selectively target anatomical regions.

12 (2) develop an algorithm to suggest new ways of carving up a structure into anatomically dis-

13 tinct regions, based on spatial patterns in gene expression.

14 (3) adapt our tools for the analysis of multi/hyperspectral imaging data from the Geographic

15 Information Systems (GIS) community.

16 We will create a 2-D “flat map” dataset of the mouse cerebral cortex that contains a flattened

17 version of the Allen Mouse Brain Atlas ISH data, as well as the boundaries of cortical anatomical

18 areas. We will use this dataset to validate the methods developed in (1) and (2). In addition to

19 its use in neuroscience, this dataset will be useful as a sample dataset for the machine learning

20 community.

21 Although our particular application involves the 3D spatial distribution of gene expression, the

22 methods we will develop will generalize to any high-dimensional data over points located in a low-

23 dimensional space. In particular, our methods could be applied to the analysis of multi/hyperspectral

24 imaging data, or alternately to genome-wide sequencing data derived from sets of tissues and dis-

25 ease states.

26 All algorithms that we develop will be implemented in a GPL open-source software toolkit. The

27 toolkit and the datasets will be published and freely available for others to use.

28 __________________

29 Background and related work

30 Cortical anatomy

31 The cortex is divided into areas and layers. Because of the cortical columnar organization, the

32 parcellation of the cortex into areas can be drawn as a 2-D map on the surface of the cortex. In the

33 third dimension, the boundaries between the areas continue downwards into the cortical depth,

34 perpendicular to the surface. The layer boundaries run parallel to the surface. One can picture an

35 area of the cortex as a slice of a six-layered cake1.

36 It is known that different cortical areas have distinct roles in both normal functioning and in

37 disease processes, yet there are no known marker genes for most cortical areas. When it is nec-

38 essary to divide a tissue sample into cortical areas, this is a manual process that requires a skilled

39 1Outside of isocortex, the number of layers varies.


42 human to combine multiple visual cues and interpret them in the context of their approximate

43 location upon the cortical surface.

44 Even the questions of how many areas should be recognized in cortex, and what their arrange-

45 ment is, are still not completely settled. A proposed division of the cortex into areas is called a

46 cortical map. In the rodent, the lack of a single agreed-upon map can be seen by contrasting the

47 recent maps given by Swanson[22] on the one hand, and Paxinos and Franklin[17] on the other.

48 While the maps are certainly very similar in their general arrangement, significant differences re-

49 main.

50 The Allen Mouse Brain Atlas dataset

51 The Allen Mouse Brain Atlas (ABA) data[14] were produced by doing in-situ hybridization on

52 slices of male, 56-day-old C57BL/6J mouse brains. Pictures were taken of the processed slice,

53 and these pictures were semi-automatically analyzed to create a digital measurement of gene

54 expression levels at each location in each slice. Per slice, cellular spatial resolution is achieved.

55 Using this method, a single physical slice can only be used to measure one single gene; many

56 different mouse brains were needed in order to measure the expression of many genes.

57 Mus musculus is thought to contain about 22,000 protein-coding genes[27]. The ABA contains

58 data on about 20,000 genes in sagittal sections, out of which over 4,000 genes are also measured

59 in coronal sections. Our dataset is derived from only the coronal subset of the ABA2. An auto-

60 mated nonlinear alignment procedure located the 2D data from the various slices in a single 3D

61 coordinate system. In the final 3D coordinate system, voxels are cubes measuring 200 microns on a

62 side. There are 67x41x58 = 159,326 voxels, of which 51,533 are in the brain[16]. For each voxel

63 and each gene, the expression energy[14] within that voxel is made available.
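
As a quick check of the figures above, the following Python sketch recomputes the voxel count from the stated grid dimensions and the fraction of voxels that lie in the brain. It is not part of the ABA tooling; the helper names and the row-major ordering used in `flat_index` are assumptions made for illustration, since the text does not say how voxels are ordered.

```python
# Grid dimensions, in-brain voxel count, and voxel size as stated in the text.
GRID_SHAPE = (67, 41, 58)
VOXELS_IN_BRAIN = 51_533
VOXEL_SIDE_MICRONS = 200


def total_voxels(shape):
    """Number of voxels in a rectangular grid."""
    count = 1
    for extent in shape:
        count *= extent
    return count


def flat_index(x, y, z, shape):
    """Row-major linear index of voxel (x, y, z).

    The ordering is an assumption for illustration only.
    """
    nx, ny, nz = shape
    if not (0 <= x < nx and 0 <= y < ny and 0 <= z < nz):
        raise IndexError("voxel coordinate out of range")
    return (x * ny + y) * nz + z


total = total_voxels(GRID_SHAPE)            # 67 * 41 * 58
brain_fraction = VOXELS_IN_BRAIN / total    # roughly one third of the grid
```

Roughly a third of the bounding grid falls inside the brain, which is why a sparse storage format is a natural fit for this kind of data.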

64 The ABA is not the only large public spatial gene expression dataset[9][26][6][15][25][4][24][21][3].

65 However, with the exception of the ABA, GenePaint[26], and EMAGE[25], most of the other re-

66 sources have not (yet) extracted the expression intensity from the ISH images and registered the

67 results into a single 3-D space.

68 The remainder of the background section will be divided into three parts, one for each major

69 goal.

70 Goal 1, From Areas to Genes: Given a map of regions, find genes that mark those regions

71 Machine learning terminology: classifiers The task of looking for marker genes for known

72 anatomical regions means that one is looking for a set of genes such that, if the expression level

73 of those genes is known, then the locations of the regions can be inferred.

74 If we define the regions so that they cover the entire anatomical structure to be subdivided,

75 and restrict ourselves to looking at one voxel at a time, we may say that we are using gene

76 expression in each voxel to assign that voxel to the proper area. We call this a classification

77 task, because each voxel is being assigned to a class (namely, its region). An understanding

78 of the relationship between the combination of gene expression levels and the locations of the

79 regions may be expressed as a function. The input to this function is a voxel, along with the gene

80 expression levels within that voxel; the output is the regional identity of the target voxel, that is, the

81 ____________________________________

82 2The sagittal data do not cover the entire cortex, and also have greater registration error[16]. Genes were selected

83 by the Allen Institute for coronal sectioning based on, “classes of known neuroscientific interest... or through post hoc

84 identification of a marked non-ubiquitous expression pattern”[16].


87 region to which the target voxel belongs. We call this function a classifier. In general, the input to

88 a classifier is called an instance, and the output is called a label (or a class label).

89 Our goal is not to produce a single classifier, but rather to develop an automated method for

90 determining a classifier for any known anatomical structure. Therefore, we seek a procedure by

91 which a gene expression dataset may be analyzed in concert with an anatomical atlas in order to

92 produce a classifier. The initial gene expression dataset used in the construction of the classifier

93 is called training data. In the machine learning literature, this sort of procedure may be thought

94 of as a supervised learning task, defined as a task in which the goal is to learn a mapping from

95 instances to labels, and the training data consists of a set of instances (voxels) for which the labels

96 (regions) are known.
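
To make the terminology concrete, here is a minimal Python sketch of a supervised learning procedure: it takes training data (instances with known labels) and returns a classifier, that is, a function from an instance to a label. The nearest-centroid rule, the function name `train_nearest_centroid`, and the toy expression values are all assumptions chosen for brevity; this is not the method proposed in this document.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Sequence

Instance = Sequence[float]   # expression levels of the selected genes in one voxel
Label = str                  # name of the region the voxel belongs to
Classifier = Callable[[Instance], Label]


def train_nearest_centroid(instances: Sequence[Instance],
                           labels: Sequence[Label]) -> Classifier:
    """Learn a classifier from labeled voxels (illustrative rule only)."""
    if len(instances) != len(labels) or not instances:
        raise ValueError("need one label per instance and at least one instance")

    # Group the training instances by their known region.
    grouped: Dict[Label, List[Instance]] = defaultdict(list)
    for instance, label in zip(instances, labels):
        grouped[label].append(instance)

    # One centroid (mean expression vector) per region.
    centroids = {
        label: [sum(column) / len(members) for column in zip(*members)]
        for label, members in grouped.items()
    }

    def classify(instance: Instance) -> Label:
        def squared_distance(label: Label) -> float:
            return sum((a - b) ** 2 for a, b in zip(instance, centroids[label]))
        return min(centroids, key=squared_distance)

    return classify


# Toy training data: two features (genes) per voxel, two regions.
training_voxels = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.7]]
training_regions = ["SS", "SS", "AUD", "AUD"]

classifier = train_nearest_centroid(training_voxels, training_regions)
predicted = classifier([0.7, 0.3])
```

The point of the sketch is the shape of the interfaces: the learning procedure sees the whole training set, while the resulting classifier sees only one voxel's features at a time.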

97 Each gene expression level is called a feature, and the selection of which genes3 to look at is

98 called feature selection. Feature selection is one component of the task of learning a classifier.

99 One class of feature selection methods assigns some sort of score to each candidate gene.

100 The top-ranked genes are then chosen. Some scoring measures can assign a score to a set of

101 selected genes, not just to a single gene; in this case, a dynamic procedure may be used in which

102 features are added and subtracted from the selected set depending on how much they raise the

103 score. Such procedures are called “stepwise” or “greedy”.

104 Although the classifier itself may only look at the gene expression data within each voxel be-

105 fore classifying that voxel, the algorithm which constructs the classifier may look over the entire

106 dataset. We can categorize score-based feature selection methods depending on how the score

107 is calculated. Often the score calculation consists of assigning a sub-score to each voxel, and

108 then aggregating these sub-scores into a final score. If only information from nearby voxels is

109 used to calculate a voxel’s sub-score, then we say it is a local scoring method. If only information

110 from the voxel itself is used to calculate a voxel’s sub-score, then we say it is a pointwise scoring

111 method.
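
The distinction between pointwise and local scoring can be illustrated with a small Python sketch on a 2-D grid. Both score functions, the threshold, and the `min_jump` parameter are assumptions invented for this example; they are not the scoring methods used in this project. The pointwise sub-score looks only at the pixel itself, whereas the local sub-score compares each pixel with its four neighbors.

```python
def pointwise_score(expr, mask, threshold=0.5):
    """Mean of per-pixel sub-scores that use only the pixel's own value."""
    subs = [float((value > threshold) == inside)
            for expr_row, mask_row in zip(expr, mask)
            for value, inside in zip(expr_row, mask_row)]
    return sum(subs) / len(subs)


def local_score(expr, mask, min_jump=0.2):
    """Mean of per-pixel sub-scores that also use the 4-neighborhood.

    A neighbor pair agrees when a jump in expression larger than min_jump
    occurs exactly where the region mask changes.
    """
    rows, cols = len(expr), len(expr[0])
    subs = []
    for r in range(rows):
        for c in range(cols):
            agreements = []
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols:
                    jump = abs(expr[r][c] - expr[rr][cc]) > min_jump
                    edge = mask[r][c] != mask[rr][cc]
                    agreements.append(jump == edge)
            if agreements:
                subs.append(sum(agreements) / len(agreements))
    return sum(subs) / len(subs)


# A gene whose expression steps up at the region boundary but stays
# below the pointwise threshold everywhere.
expr = [[0.1, 0.1, 0.4, 0.4],
        [0.1, 0.1, 0.4, 0.4]]
mask = [[False, False, True, True],
        [False, False, True, True]]

p_score = pointwise_score(expr, mask)
l_score = local_score(expr, mask)
```

In this synthetic case the pointwise score is no better than chance while the local score is perfect, because the gene's boundary lines up with the region's boundary even though its absolute level is low.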

112 Our Strategy for Goal 1

113 Key questions when choosing a learning method are: What are the instances? What are the

114 features? How are the features chosen? Here are four principles that outline our answers to these

115 questions.

116 Principle 1: Combinatorial gene expression

117 It is too much to hope that every anatomical region of interest will be identified by a single

118 gene. For example, in the cortex, there are some areas which are not clearly delineated by any

119 gene included in the ABA coronal dataset. However, at least some of these areas can be delin-

120 eated by looking at combinations of genes (an example of an area for which multiple genes are

121 necessary and sufficient is provided in Preliminary Results, Figure 4). Therefore, each instance

122 should contain multiple features (genes).

123 Principle 2: Only look at combinations of small numbers of genes

124 When the classifier classifies a voxel, it is only allowed to look at the expression of the genes

125 which have been selected as features. The more data that are available to a classifier, the better

126 it can do. Why not include every gene as a feature? The reason is that we wish to employ the

127 classifier in situations in which it is not feasible to gather data about every gene. For example, if we

128 ____________________________________

129 3Strictly speaking, the features are gene expression levels, but we’ll call them genes.


132 want to use the expression of marker genes as a trigger for some regionally-targeted intervention,

133 then our intervention must contain a molecular mechanism to check the expression level of each

134 marker gene before it triggers. It is currently infeasible to design a molecular trigger that checks

135 the level of more than a handful of genes. Therefore, we must select only a few genes as features.

136 The requirement to find combinations of only a small number of genes prevents us from straightfor-

137 wardly applying many of the simplest techniques from the field of supervised machine learning.

138 In the parlance of machine learning, our task combines feature selection with supervised learning.

139 Principle 3: Use geometry in feature selection

140 When doing feature selection with score-based methods, the simplest thing to do would be

141 to score the performance of each voxel by itself and then combine these scores (pointwise scor-

142 ing). A more powerful approach is to also use information about the geometric relations between

143 each voxel and its neighbors; this requires non-pointwise, local scoring methods. See Preliminary

144 Results, figure 3 for evidence of the complementary nature of pointwise and local scoring methods.

145 Principle 4: Work in 2-D whenever possible

146 There are many anatomical structures which are commonly characterized in terms of a two-

147 dimensional manifold. When it is known that the structure that one is looking for is two-dimensional,

148 the results may be improved by allowing the analysis algorithm to take advantage of this prior

149 knowledge. In addition, it is easier for humans to visualize and work with 2-D data.

150 Goal 2, From Genes to Areas: given gene expression data, discover a map of regions

151 Machine learning terminology: clustering

152 If one is given a dataset consisting merely of instances, with no class labels, then analysis of

153 the dataset is referred to as unsupervised learning in the jargon of machine learning. One thing

154 that you can do with such a dataset is to group instances together. A set of similar instances is

155 called a cluster, and the activity of grouping the data into clusters is called clustering or cluster

156 analysis.

157 The task of deciding how to carve up a structure into anatomical regions can be put into these

158 terms. The instances are once again voxels (or pixels) along with their associated gene expression

159 profiles. We make the assumption that voxels from the same anatomical region have similar gene

160 expression profiles, at least compared to the other regions. This means that clustering voxels is

161 the same as finding potential regions; we seek a partitioning of the voxels into regions, that is, into

162 clusters of voxels with similar gene expression.

163 It is desirable to determine not just one set of regions, but also how these regions relate to

164 each other. The outcome of clustering may be a hierarchical tree of clusters, rather than a single

165 set of clusters which partition the voxels. This is called hierarchical clustering.
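
A hierarchical clustering can be represented as a nested tree whose leaves are instances. The Python sketch below builds such a tree by repeatedly merging the two closest clusters. Average linkage, Euclidean distance, and all names here are assumptions for illustration; the document does not commit to a particular linkage or similarity measure at this point.

```python
import math
from itertools import combinations


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def hierarchical_clustering(profiles):
    """Agglomerative clustering; returns a nested tuple of instance names.

    profiles -- dict mapping an instance name to its expression profile
    """
    if not profiles:
        raise ValueError("need at least one instance")
    # Each cluster is a pair (tree, list of member names).
    clusters = [(name, [name]) for name in profiles]

    def average_linkage(pair):
        (_, members_a), (_, members_b) = pair
        distances = [euclidean(profiles[a], profiles[b])
                     for a in members_a for b in members_b]
        return sum(distances) / len(distances)

    while len(clusters) > 1:
        left, right = min(combinations(clusters, 2), key=average_linkage)
        clusters.remove(left)
        clusters.remove(right)
        clusters.append(((left[0], right[0]), left[1] + right[1]))
    return clusters[0][0]


def leaves(tree):
    """Flatten a nested tuple tree into a list of instance names."""
    if isinstance(tree, str):
        return [tree]
    return [leaf for subtree in tree for leaf in leaves(subtree)]


# Toy profiles: two voxels with low expression, two with high expression.
voxel_profiles = {
    "v1": [0.0, 0.0],
    "v2": [0.2, 0.0],
    "v3": [1.0, 1.0],
    "v4": [1.0, 0.9],
}
tree = hierarchical_clustering(voxel_profiles)
top_level_split = {frozenset(leaves(subtree)) for subtree in tree}
```

Cutting the tree at different depths yields coarser or finer partitions, which is what allows the relationships between candidate regions to be read off the hierarchy.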

166 Similarity scores A crucial choice when designing a clustering method is how to measure

167 similarity, across either pairs of instances, or clusters, or both. There is much overlap between

168 scoring methods for feature selection (discussed above under Goal 1) and scoring methods for

169 similarity.

170 Dimensionality reduction In this section, we discuss reducing the length of the per-pixel gene

171 expression feature vector. By “dimension”, we mean the dimension of this vector, not the spatial


174 dimension of the underlying data.


177 Figure 1: Top row: Genes Nfic and A930001M12Rik are the most correlated with area SS (somatosensory cortex). Bottom row: Genes C130038G02Rik and Cacna1i are those with the best fit using logistic regression. Within each picture, the vertical axis roughly corresponds to anterior at the top and posterior at the bottom, and the horizontal axis roughly corresponds to medial at the left and lateral at the right. The red outline is the boundary of region SS. Pixels are colored according to correlation, with red meaning high correlation and blue meaning low.

192 Unlike Goal 1, there is no externally-imposed need to

193 select only a handful of informative genes for inclusion

194 in the instances. However, some clustering algorithms

195 perform better on small numbers of features4. There are

196 techniques which “summarize” a larger number of fea-

197 tures using a smaller number of features; these tech-

198 niques go by the name of feature extraction or dimen-

199 sionality reduction. The small set of features that such a

200 technique yields is called the reduced feature set. Note

201 that the features in the reduced feature set do not neces-

202 sarily correspond to genes; each feature in the reduced

203 set may be any function of the set of gene expression

204 levels.

205 Clustering genes rather than voxels Although the

206 ultimate goal is to cluster the instances (voxels or pixels),

207 one strategy to achieve this goal is to first cluster the

208 features (genes). There are two ways that clusters of

209 genes could be used.

210 Gene clusters could be used as part of dimensionality

211 reduction: rather than have one feature for each gene,

212 we could have one reduced feature for each gene cluster.

213 Gene clusters could also be used to directly yield a

214 clustering on instances. This is because many genes

215 have an expression pattern which seems to pick out a

216 single, spatially contiguous region. This suggests the fol-

217 lowing procedure: cluster together genes which pick out

218 similar regions, and then use the more popular com-

219 mon regions as the final clusters. In Preliminary Results,

220 Figure 7, we show that a number of anatomically recog-

221 nized cortical regions, as well as some “superregions” formed by lumping together a few regions,

222 are associated with gene clusters in this fashion.
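
The first use of gene clusters described above, one reduced feature per gene cluster, can be sketched in a few lines of Python. Averaging the member genes is an assumption made here for simplicity; any function of the clustered genes' expression levels could serve as the reduced feature. The function name, gene names, and values are hypothetical.

```python
def reduce_by_gene_clusters(expression, gene_clusters):
    """Replace each gene cluster by the per-pixel mean of its member genes.

    expression    -- dict: gene name -> list of per-pixel expression values
    gene_clusters -- dict: cluster name -> list of gene names in that cluster
    Returns a dict: cluster name -> list of per-pixel reduced feature values.
    """
    reduced = {}
    for cluster_name, genes in gene_clusters.items():
        if not genes:
            raise ValueError(f"cluster {cluster_name!r} has no genes")
        per_pixel = zip(*(expression[gene] for gene in genes))
        reduced[cluster_name] = [sum(values) / len(genes) for values in per_pixel]
    return reduced


# Three genes measured at three pixels, grouped into two clusters.
expression = {
    "geneA": [1.0, 0.0, 0.0],
    "geneB": [3.0, 0.0, 2.0],
    "geneC": [0.0, 4.0, 4.0],
}
gene_clusters = {"cluster1": ["geneA", "geneB"], "cluster2": ["geneC"]}

reduced_features = reduce_by_gene_clusters(expression, gene_clusters)
```

Here each pixel's feature vector shrinks from three values to two, and the reduced features no longer correspond to individual genes.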

223 Goal 3: interoperability with multi/hyperspectral imaging analysis software

224 A typical color image associates each pixel with a vector of three values. Multispectral and hyper-

225 spectral images, however, are images which associate each pixel with a vector containing many

226 values. The different positions in the vector correspond to different bands of electromagnetic

227 wavelengths5.

228 Some analysis techniques for hyperspectral imaging, especially preprocessing and calibration

229 techniques, make use of the information that the different values captured at each pixel represent

230 ____________________________________

231 4First, because the number of features in the reduced dataset is less than in the original dataset, the running time of

232 clustering algorithms may be much less. Second, it is thought that some clustering algorithms may give better results

233 on reduced data.

234 5In hyperspectral imaging, the bands are adjacent, and the number of different bands is larger. For conciseness, we

235 discuss only hyperspectral imaging, but our methods are also well suited to multispectral imaging with many bands.


238 adjacent wavelengths of light, which can be combined to make a spectrum. Other analysis tech-

239 niques ignore the interpretation of the values measured, and their relationship to each other within

240 the electromagnetic spectrum, instead treating them blindly as completely separate features.

241 With both hyperspectral imaging and spatial gene expression data, each location in space

242 is associated with more than three numerical feature values. The analysis of hyperspectral im-

243 ages can involve supervised classification and unsupervised learning. Often hyperspectral images

244 come from satellites looking at the Earth, and it is desirable to classify what sort of objects occupy

245 a given area of land. Sometimes detailed training data is not available, in which case it is desirable

246 at least to cluster together those regions of land which contain similar objects.

247 We believe that it may be possible for these two different fields to share some common compu-

248 tational tools. To this end, we intend to make use of existing hyperspectral imaging software when

249 possible, and to develop new software in such a way so as to make it easy to use for the purpose

250 of hyperspectral image analysis, as well as for our primary purpose of spatial gene expression

251 data analysis.

252 Related work

254 Figure 2: Gene Pitx2 is selectively underexpressed in area SS.

256 As noted above, the GIS community has developed tools for supervised

257 classification and unsupervised clustering in the context of the analysis

258 of hyperspectral imaging data. One tool is Spectral Python[5]. Spectral

259 Python implements various supervised and unsupervised classification

260 methods, as well as utility functions for loading, viewing, and saving

261 spatial data. Although Spectral Python has feature extraction methods

262 (such as principal components analysis) which create a small set of

263 new features computed based on the original features, it does not have

264 feature selection methods, that is, methods to select a small subset

265 out of the original features (although feature selection in hyperspectral

266 imaging has been investigated by others[20]).

267 There is a substantial body of work on the analysis of gene expression data. Most of this con-

268 cerns gene expression data which are not fundamentally spatial6. Here we review only that work

269 which concerns the automated analysis of spatial gene expression data with respect to anatomy.

270 Relating to Goal 1, GeneAtlas[6] and EMAGE [25] allow the user to construct a search query by

271 demarcating regions and then specifying either the strength of expression or the name of another

272 gene or dataset whose expression pattern is to be matched. Neither GeneAtlas nor EMAGE allow

273 one to search for combinations of genes that define a region in concert.

274 Relating to Goal 2, EMAGE[25] allows the user to select a dataset from among a large number

275 of alternatives, or by running a search query, and then to cluster the genes within that dataset.

276 EMAGE clusters via hierarchical complete linkage clustering.

277 [16] describes AGEA, “Anatomic Gene Expression Atlas”. AGEA has three components. Gene

278 Finder: The user selects a seed voxel and the system (1) chooses a cluster which includes the

279 seed voxel, (2) yields a list of genes which are overexpressed in that cluster. Correlation: The user

280 selects a seed voxel and the system then shows the user how much correlation there is between

281 the gene expression profile of the seed voxel and every other voxel. Clusters: AGEA includes a

282 preset hierarchical clustering of voxels based on a recursive bifurcation algorithm with correlation

283 ____________________________________

284 6By “fundamentally spatial” we mean that there is information from a large number of spatial locations indexed by

285 spatial coordinates; not just data which have only a few different locations or which is indexed by anatomical label.


288 as the similarity metric. AGEA has been applied to the cortex. The paper describes interesting

289 results on the structure of correlations between voxel gene expression profiles within a handful of

290 cortical areas. However, that analysis neither looks for genes marking cortical areas, nor does it

291 suggest a cortical map based on gene expression data. Neither of the other components of AGEA

292 can be applied to cortical areas; AGEA’s Gene Finder cannot be used to find marker genes for the

293 cortical areas; and AGEA’s hierarchical clustering does not produce clusters corresponding to the

294 cortical areas7.

297 Figure 3: The top row shows the two genes which (individually) best predict area AUD, according to logistic regression. The bottom row shows the two genes which (individually) best match area AUD, according to gradient similarity. From left to right and top to bottom, the genes are Ssr1, Efcbp1, Ptk7, and Aph1a.

305 [7] looks at the mean expression level of genes within

306 anatomical regions, and applies a Student’s t-test to de-

307 termine whether the mean expression level of a gene is

308 significantly higher in the target region. This relates to

309 our Goal 1. [7] also clusters genes, relating to our Goal

310 2. For each cluster, prototypical spatial expression pat-

311 terns were created by averaging the genes in the cluster.

312 The prototypes were analyzed manually, without cluster-

313 ing voxels.

314 These related works differ from our strategy for Goal

315 1 in at least three ways. First, they find only single genes,

316 whereas we will also look for combinations of genes.

317 Second, they usually can only use overexpression as

318 a marker, whereas we will also search for underexpres-

319 sion. Third, they use scores based on pointwise expres-

320 sion levels, whereas we will also use geometric scores

321 such as gradient similarity (described in Preliminary Re-

322 sults). Figures 4, 2, and 3 in the Preliminary Results

323 section contain evidence that each of our three choices

324 is the right one.

325 [11] describes a technique to find combinations of

326 marker genes to pick out an anatomical region. They

327 use an evolutionary algorithm to evolve logical operators which combine boolean (thresholded)

328 images in order to match a target image. They apply their technique for finding combinations of

329 marker genes for the purpose of clustering genes around a “seed gene”.

330 Relating to our Goal 2, some researchers have attempted to parcellate cortex on the basis of

331 non-gene expression data. For example, [18], [2], [19], and [1] associate spots on the cortex with

332 the radial profile8 of response to some stain ([13] uses MRI), extract features from this profile, and

333 then use similarity between surface pixels to cluster.

334 [23] describes an analysis of the anatomy of the hippocampus using the ABA dataset. In

335 addition to manual analysis, two clustering methods were employed, a modified Non-negative

336 Matrix Factorization (NNMF), and a hierarchical bifurcation clustering scheme using correlation as

337 similarity. The paper yielded impressive results, proving the usefulness of computational genomic

338 ____________________________________

339 7In both cases, the cause is that pairwise correlations between the gene expression of voxels in different areas but

340 the same layer are often stronger than pairwise correlations between the gene expression of voxels in different layers

341 but the same area. Therefore, a pairwise voxel correlation clustering algorithm will tend to create clusters representing

342 cortical layers, not areas.

343 8A radial profile is a profile along a line perpendicular to the cortical surface.


346 anatomy. We have run NNMF on the cortical dataset, and while the results are promising, other

347 methods may perform as well or better (see Preliminary Results, Figure 6).

348 Comparing previous work with our Goal 1, there has been fruitful work on finding marker genes,

349 but only one of the projects explored combinations of marker genes, and none of them compared

350 the results obtained by using different algorithms or scoring methods. Comparing previous work

351 with Goal 2, although some projects obtained clusterings, there has not been much comparison

352 between different algorithms or scoring methods, so it is likely that the best clustering method for

353 this application has not yet been found. Also, none of these projects did a separate dimensionality

354 reduction step before clustering pixels, or tried to cluster genes first in order to guide automated

355 clustering of pixels into spatial regions, or used co-clustering algorithms.

356 In summary, (a) only one of the previous projects explores combinations of marker genes, (b)

357 there has been almost no comparison of different algorithms or scoring methods, and (c) there

358 has been no work on computationally finding marker genes applied to cortical areas, or on finding

359 a hierarchical clustering that will yield a map of cortical areas de novo from gene expression data.

360 Our project is guided by a concrete application with a well-specified criterion of success (how

361 well we can find marker genes for / reproduce the layout of cortical areas), which will provide a

362 solid basis for comparing different methods.

363 _________________________________________________

364 Data sharing plan

367 Figure 4: Upper left: wwc1. Upper right: mtif2. Lower left: wwc1 + mtif2 (each pixel’s value on the lower left is the sum of the corresponding pixels in the upper row).

371 We are enthusiastic about the sharing of methods and

372 data, and at the conclusion of the project, we will make

373 all of our data and computer source code publicly avail-

374 able, either in supplemental attachments to publications,

375 or on a website. The source code will be released under

376 the GNU General Public License. We intend to include a soft-

377 ware program which, when run, will take as input the

378 Allen Brain Atlas raw data, and produce as output all

379 numbers and charts found in publications resulting from

380 the project. Source code to be released will include ex-

381 tensions to Caret[8], an existing open-source scientific

382 imaging program, and to Spectral Python. Data to be

383 released will include the 2-D “flat map” dataset. This

384 dataset will be submitted to a machine learning dataset

385 repository.

386 Broader impacts

387 In addition to validating the usefulness of the algorithms,

388 the application of these methods to cortex will produce

389 immediate benefits, because there are currently no known genetic markers for most cortical areas.

390 The method developed in Goal 1 will be applied to each cortical area to find a set of marker

391 genes such that the combinatorial expression pattern of those genes uniquely picks out the target

392 area. Finding marker genes will be useful for drug discovery as well as for experimentation be-

393 cause marker genes can be used to design interventions which selectively target individual cortical

394 areas.

395 The application of the marker gene finding algorithm to the cortex will also support the develop-


398 ment of new neuroanatomical methods. In addition to finding markers for each individual cortical

399 area, we will find a small panel of genes that can find many of the areal boundaries at once.

400 The method developed in Goal 2 will provide a genoarchitectonic viewpoint that will contribute

401 to the creation of a better cortical map.

402 The methods we will develop will be applicable to other datasets beyond the brain, and even to

403 datasets outside of biology. The software we develop will be useful for the analysis of hyperspectral

404 images. Our project will draw attention to this area of overlap between neuroscience and GIS, and

405 may lead to future collaborations between these two fields. The cortical dataset that we produce

406 will be useful in the machine learning community as a sample dataset that new algorithms can be

407 tested against. The availability of this sample dataset to the machine learning community may lead

408 to more interest in the design of machine learning algorithms to analyze spatial gene expression.

409 _

410 Preliminary Results

411 Format conversion between SEV, MATLAB, NIFTI

412 We have created software to (politely) download all of the SEV files9 from the Allen Institute web-

413 site. We have also created software to convert between the SEV, MATLAB, and NIFTI file formats,

414 as well as some of Caret’s file formats.

415 Flatmap of cortex

416 We downloaded the ABA data and selected only those voxels which belong to cerebral cortex.

417 We divided the cortex into hemispheres. Using Caret[8], we created a mesh representation of the

418 surface of the selected voxels. For each gene, and for each node of the mesh, we calculated an

419 average of the gene expression of the voxels “underneath” that mesh node. We then flattened

420 the cortex, creating a two-dimensional mesh. We converted this grid into a MATLAB matrix. We

421 manually traced the boundaries of each of 46 cortical areas from the ABA coronal reference atlas

422 slides, and converted this region data into MATLAB format.

423 At this point, the data are in the form of a number of 2-D matrices, all in registration, with the

424 matrix entries representing a grid of points (pixels) over the cortical surface. There is one 2-D

425 matrix whose entries represent the regional label associated with each surface pixel. And for each

426 gene, there is a 2-D matrix whose entries represent the average expression level underneath each

427 surface pixel. The features and the target area are both functions on the surface pixels. They can

428 be referred to as scalar fields over the space of surface pixels; alternately, they can be thought of

429 as images which can be displayed on the flatmapped surface.

430 Feature selection and scoring methods

431 Underexpression of a gene can serve as a marker Underexpression of a gene can sometimes

432 serve as a marker. For example, see Figure 2.

433 Correlation Recall that the instances are surface pixels, and consider the problem of attempt-

434 ing to classify each instance as either a member of a particular anatomical area, or not. The target

435 area can be represented as a boolean mask over the surface pixels.

436 We calculated the correlation between each gene and each cortical area. The top row of Figure

437 1 shows the three genes most correlated with area SS.
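
The correlation score can be illustrated with a minimal sketch. It is not the code used to produce the results above: the function names, the gene names, and the toy 2x3 "flatmap" are all hypothetical, and each image is assumed to be a 2-D list in registration with the boolean mask of the target area.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: a constant image is scored as uninformative
    return cov / math.sqrt(var_x * var_y)

def rank_genes_by_correlation(expression, mask):
    """expression: dict of gene name -> 2-D list; mask: 2-D list of 0/1 labels."""
    flat_mask = [v for row in mask for v in row]
    scores = {}
    for gene, image in expression.items():
        flat_image = [v for row in image for v in row]
        scores[gene] = pearson(flat_image, flat_mask)
    # Most positively correlated genes first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Toy data: the target area is the set of pixels labelled 1.
mask = [[1, 1, 0],
        [1, 0, 0]]
expression = {
    "geneA": [[0.9, 0.8, 0.1], [0.7, 0.2, 0.0]],  # expressed inside the area
    "geneB": [[0.1, 0.2, 0.9], [0.1, 0.8, 0.9]],  # underexpressed inside the area
    "geneC": [[0.5, 0.5, 0.5], [0.5, 0.5, 0.5]],  # constant everywhere
}
ranking = rank_genes_by_correlation(expression, mask)
```

In this toy example geneB receives a strongly negative score, which is how an underexpression marker would show up under a correlation measure.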

438 Footnote 9: SEV is a sparse format for spatial data. It is the format in which the ABA data is made available.

441 Conditional entropy

442 For each region, we created and ran a forward stepwise procedure which attempted to find

443 pairs of genes such that the conditional entropy of the target area’s boolean mask, conditioned

444 upon the gene pair’s thresholded expression levels, is minimized.

445 This finds pairs of genes which are most informative (at least at these threshold levels) relative

446 to the question, “Is this surface pixel a member of the target area?”. The advantage over linear

447 methods such as logistic regression is that this takes account of arbitrarily nonlinear relationships;

448 for example, if the XOR of two variables predicts the target, conditional entropy would notice,

449 whereas linear methods would not.
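
The XOR remark can be checked with a small sketch of the conditional entropy score. This is an illustration under our own assumptions, not the stepwise procedure itself: the function name and the four-pixel toy data are hypothetical, and the expression levels are assumed to be already thresholded to 0/1.

```python
import math
from collections import Counter

def conditional_entropy(labels, features):
    """H(labels | features) in bits.

    labels:   sequence of class labels, one per pixel
    features: sequence of hashable feature tuples, one per pixel
    """
    n = len(labels)
    joint = Counter(zip(features, labels))  # counts of (feature, label) pairs
    marginal = Counter(features)            # counts of feature values
    h = 0.0
    for (feature, _label), count in joint.items():
        # -sum over (x, y) of p(x, y) * log2 p(y | x)
        h -= (count / n) * math.log2(count / marginal[feature])
    return h

# Toy data: four pixels, two thresholded genes, target = XOR of the two genes.
gene_a = [0, 0, 1, 1]
gene_b = [0, 1, 0, 1]
target = [a ^ b for a, b in zip(gene_a, gene_b)]

h_a_alone = conditional_entropy(target, [(a,) for a in gene_a])
h_b_alone = conditional_entropy(target, [(b,) for b in gene_b])
h_pair = conditional_entropy(target, list(zip(gene_a, gene_b)))
```

Each gene alone leaves the full one bit of uncertainty about the target, while the pair removes it entirely. This also suggests a caveat for a greedy stepwise search: on a pure XOR pattern the first step has no signal with which to choose a gene.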

450 Gradient similarity We noticed that the previous two scoring methods, which are pointwise,

451 often found genes whose pattern of expression did not look similar in shape to the target region.

452 For this reason we designed a non-pointwise scoring method to detect when a gene had a pattern

453 of expression which looked like it had a boundary whose shape is similar to the shape of the target

454 region. We call this scoring method “gradient similarity”. The formula is:

455 $$\sum_{pixel \in pixels} \cos(\angle\nabla_1 - \angle\nabla_2) \cdot \frac{|\nabla_1| + |\nabla_2|}{2} \cdot \frac{pixel\_value_1 + pixel\_value_2}{2}$$

459 where $\nabla_1$ and $\nabla_2$ are the gradient vectors of the two images at the current pixel; $\angle\nabla_i$ is the

460 angle of the gradient of image i at the current pixel; $|\nabla_i|$ is the magnitude of the gradient of image

461 i at the current pixel; and $pixel\_value_i$ is the value of the current pixel in image i.

462 The intuition is that we want to see if the borders of the pattern in the two images are similar; if

463 the borders are similar, then both images will have corresponding pixels with large gradients (be-

464 cause this is a border) which are oriented in a similar direction (because the borders are similar).
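
A minimal sketch of the gradient similarity score follows. Several choices here are assumptions that the formula does not specify: gradients are estimated by central differences, only interior pixels are summed, and pixels where either gradient is zero (so that its angle is undefined) contribute nothing. The function names and toy images are hypothetical.

```python
import math

def gradient(image, r, c):
    """Central-difference gradient (d/dcolumn, d/drow) at an interior pixel."""
    gx = (image[r][c + 1] - image[r][c - 1]) / 2.0
    gy = (image[r + 1][c] - image[r - 1][c]) / 2.0
    return gx, gy

def gradient_similarity(img1, img2):
    """Sum over interior pixels of
    cos(angle1 - angle2) * mean gradient magnitude * mean pixel value."""
    rows, cols = len(img1), len(img1[0])
    total = 0.0
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            g1x, g1y = gradient(img1, r, c)
            g2x, g2y = gradient(img2, r, c)
            m1 = math.hypot(g1x, g1y)
            m2 = math.hypot(g2x, g2y)
            if m1 == 0.0 or m2 == 0.0:
                continue  # assumption: undefined angle contributes nothing
            angle_term = math.cos(math.atan2(g1y, g1x) - math.atan2(g2y, g2x))
            magnitude_term = (m1 + m2) / 2.0
            value_term = (img1[r][c] + img2[r][c]) / 2.0
            total += angle_term * magnitude_term * value_term
    return total

# Toy 5x6 images. The target occupies the left half of the image.
target = [[1, 1, 1, 0, 0, 0] for _ in range(5)]
same_border = [row[:] for row in target]             # identical border
reversed_border = [[1 - v for v in row] for row in target]  # same border, opposite sign
orthogonal_border = [[1] * 6, [1] * 6, [0] * 6, [0] * 6, [0] * 6]  # horizontal border

score_same = gradient_similarity(target, same_border)
score_reversed = gradient_similarity(target, reversed_border)
score_orthogonal = gradient_similarity(target, orthogonal_border)
```

In this toy case a matching border scores positive, a border at right angles scores approximately zero, and the same border with reversed polarity scores negative.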

465 Gradient similarity provides information complementary to correlation

466 To show that gradient similarity can provide useful information that cannot be detected via

467 pointwise analyses, consider Fig. 3. The pointwise method in the top row identifies genes which

468 express more strongly in AUD than outside of it; its weakness is that this includes many genes

469 whose expression doesn’t have a salient border matching the areal border. The geometric method identifies

470 genes whose salient expression border seems to partially line up with the border of AUD; its

471 weakness is that this includes genes which don’t express over the entire area.

472 Areas which can be identified by single genes Using gradient similarity, we have already

473 found single genes which roughly identify some areas and groupings of areas. For each of these

474 areas, an example of a gene which roughly identifies it is shown in Figure 5. We have not yet

475 cross-verified these genes in other atlases.

476 In addition, there are a number of areas which are almost identified by single genes: COAa+NLOT

477 (anterior part of cortical amygdalar area, nucleus of the lateral olfactory tract), ENT (entorhinal),

478 ACAv (ventral anterior cingulate), VIS (visual), AUD (auditory).

479 These results validate our expectation that the ABA dataset can be exploited to find marker

480 genes for many cortical areas, while also validating the relevancy of our new scoring method,

481 gradient similarity.


488 Figure 5: From left to right and top to bottom, single genes which roughly identify areas SS

489 (somatosensory primary + supplemental), SSs (supplemental somatosensory), PIR (piriform), FRP (frontal

490 pole), RSP (retrosplenial), COApm (cortical amygdalar, posterior part, medial zone). Grouping some areas

491 together, we have also found genes to identify the groups ACA+PL+ILA+DP+ORB+MO (anterior cingulate,

492 prelimbic, infralimbic, dorsal peduncular, orbital, motor) and posterior and lateral visual (VISpm,

493 VISpl, VISl, VISp; posteromedial, posterolateral, lateral, and primary visual; the posterior and lateral

494 visual area is distinguished from its neighbors, but not from the entire rest of the cortex). The genes

495 are Pitx2, Aldh1a2, Ppfibp1, Slco1a5, Tshz2, Trhr, Col12a1, Ets1.

508 Combinations of multiple genes are useful and necessary for some areas

510 In Figure 4, we give an example of a cortical area

511 which is not marked by any single gene, but which can be

512 identified combinatorially. According to logistic regression,

513 gene Wwc1 is the best-fit single gene for predicting

514 whether or not a pixel on the cortical surface belongs to

515 the motor area (area MO). The upper-left picture in Figure 4

516 shows Wwc1’s spatial expression pattern over the

517 cortex. The lower-right boundary of MO is represented

518 reasonably well by this gene, but the gene overshoots

519 the upper-left boundary. This flattened 2-D representa-

520 tion does not show it, but the area corresponding to the

521 overshoot is the medial surface of the cortex. MO is only

522 found on the dorsal surface. Gene Mtif2 is shown in the

523 upper-right. Mtif2 captures MO’s upper-left boundary, but

524 not its lower-right boundary. Mtif2 does not express very

525 much on the medial surface. By adding together the val-

526 ues at each pixel in these two figures, we get the lower-

527 left image. This combination captures area MO much

528 better than any single gene.

529 This shows that our proposal to develop a method to

530 find combinations of marker genes is both possible and

531 necessary.

532 Multivariate supervised learning

533 Forward stepwise logistic regression Logistic regres-

534 sion is a popular method for predictive modeling of cat-

535 egorical data. As a pilot run, for five cortical areas (SS,

536 AUD, RSP, VIS, and MO), we performed forward step-

537 wise logistic regression to find single genes, pairs of

538 genes, and triplets of genes which predict areal identity.

539 This is an example of feature selection integrated with

540 prediction using a stepwise wrapper. Some of the sin-

541 gle genes found were shown in various figures through-

542 out this document, and Figure 4 shows a combination of

543 genes which was found.
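
The stepwise wrapper can be sketched as follows. To keep the example self-contained, logistic regression is replaced by a much simpler stand-in scorer (the best accuracy obtainable by thresholding the sum of the selected genes, in the spirit of the Figure 4 example), so this is an illustration of the wrapper only. All function names, gene names, and data are hypothetical.

```python
def summed_threshold_accuracy(selected, expression, mask):
    """Stand-in scorer (NOT logistic regression): best accuracy obtained by
    thresholding the per-pixel sum of the selected genes' expression."""
    n = len(mask)
    sums = [sum(expression[g][i] for g in selected) for i in range(n)]
    best = 0.0
    for threshold in sorted(set(sums)):
        correct = sum((s >= threshold) == bool(m) for s, m in zip(sums, mask))
        best = max(best, correct / n)
    return best

def forward_stepwise(expression, mask, max_genes=3, score=summed_threshold_accuracy):
    """Greedily add the gene that most improves the score; stop when no
    remaining gene improves it or max_genes is reached."""
    selected, history = [], []
    best_score = 0.0
    while len(selected) < max_genes:
        candidates = [g for g in sorted(expression) if g not in selected]
        if not candidates:
            break
        scored = [(score(selected + [g], expression, mask), g) for g in candidates]
        top_score, top_gene = max(scored, key=lambda pair: pair[0])
        if top_score <= best_score:
            break  # no remaining gene improves the score
        selected.append(top_gene)
        best_score = top_score
        history.append((list(selected), top_score))
    return history

# Toy data over 8 pixels (flattened). The target area is pixels 2..4.
mask = [0, 0, 1, 1, 1, 0, 0, 0]
expression = {
    "geneA": [1, 1, 1, 1, 1, 0, 0, 0],  # overshoots the area on one side
    "geneB": [0, 0, 1, 1, 1, 1, 1, 1],  # overshoots on the other side
    "geneC": [1, 0, 1, 0, 1, 0, 1, 0],  # unrelated pattern
}
history = forward_stepwise(expression, mask)
```

Here neither geneA nor geneB matches the area alone, but their sum does, so the wrapper selects the pair and then stops because adding geneC does not help.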

544 SVM on all genes at once

545 In order to see how well one can do when looking at

546 all genes at once, we ran a support vector machine to

547 classify cortical surface pixels based on their gene expression

548 profiles. We achieved classification accuracy of

549 about 81% (see footnote 10). However, as noted above, a classifier that

550 looks at all the genes at once isn’t as practically useful

551 as a classifier that uses only a few genes.

552 ____________________________________

553 Footnote 10: 5-fold cross-validation.

556 Data-driven redrawing of the cortical map

557 We have applied the following dimensionality reduction algorithms to reduce the dimensionality

558 of the gene expression profile associated with each pixel: Principal Components Analysis (PCA),

559 Simple PCA, Multi-Dimensional Scaling, Isomap, Landmark Isomap, Laplacian eigenmaps, Local

560 Tangent Space Alignment, Stochastic Proximity Embedding, Fast Maximum Variance Unfolding,

561 Non-negative Matrix Factorization (NNMF). Space constraints prevent us from showing many of

562 the results, but as a sample, PCA, NNMF, and landmark Isomap are shown in the first, second,

563 and third rows of Figure 6.

564 After applying the dimensionality reduction, we ran clustering algorithms on the reduced data.

565 To date we have tried k-means and spectral clustering. The results of k-means after PCA, NNMF,

566 and landmark Isomap are shown in the bottom row of Figure 6. To compare, the leftmost picture

567 on the bottom row of Figure 6 shows some of the major subdivisions of cortex. These results show

568 that different dimensionality reduction techniques capture different aspects of the data and lead

569 to different clusterings, indicating the utility of our proposal to produce a detailed comparison of

570 these techniques as applied to the domain of genomic anatomy.

571 Many areas are captured by clusters of genes We also clustered the genes using gradient

572 similarity to see if the spatial regions defined by any clusters matched known anatomical regions.

573 Figure 7 shows, for ten sample gene clusters, each cluster’s average expression pattern, com-

574 pared to a known anatomical boundary. This suggests that it is worth attempting to cluster genes,

575 and then to use the results to cluster pixels.

576 Our plan: what remains to be done

577 Flatmap cortex and segment cortical layers

578 There are multiple ways to flatten 3-D data into 2-D. We will compare mappings from manifolds to

579 planes which attempt to preserve size (such as the one used by Caret[8]) with mappings which

580 preserve angle (conformal maps). We will also develop a segmentation algorithm to automatically

581 identify the layer boundaries.

582 Develop algorithms that find genetic markers for anatomical regions

583 Scoring measures and feature selection We will develop scoring methods for evaluating how

584 good individual genes are at marking areas. We will compare pointwise, geometric, and

585 information-theoretic measures. We have already developed one entirely new scoring method (gradient similarity),

586 but we may develop more. Scoring measures that we will explore will include the L1 norm, cor-

587 relation, expression energy ratio, conditional entropy, gradient similarity, Jaccard similarity, Dice

588 similarity, Hough transform, and statistical tests such as Student’s t-test, and the Mann-Whitney

589 U test (a non-parametric test). In addition, any classifier induces a scoring measure on genes by

590 taking the prediction error when using that gene to predict the target.
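
Two of the listed measures, Jaccard and Dice similarity, can be sketched directly on boolean masks. The example assumes that a gene's expression image has already been thresholded to 0/1; how to choose that threshold is left open. The function names and the toy data are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity |A and B| / |A or B| of two equal-length 0/1 sequences."""
    intersection = sum(1 for x, y in zip(a, b) if x and y)
    union = sum(1 for x, y in zip(a, b) if x or y)
    return intersection / union if union else 1.0  # two empty masks: treated as identical

def dice(a, b):
    """Dice similarity 2 * |A and B| / (|A| + |B|) of two equal-length 0/1 sequences."""
    intersection = sum(1 for x, y in zip(a, b) if x and y)
    total = sum(1 for x in a if x) + sum(1 for y in b if y)
    return 2 * intersection / total if total else 1.0

# Toy data: a target area and a thresholded gene that is shifted by one pixel.
target_mask = [1, 1, 1, 0, 0, 0]
gene_mask = [0, 1, 1, 1, 0, 0]

j = jaccard(target_mask, gene_mask)
d = dice(target_mask, gene_mask)
```

The two measures are monotonically related (Dice = 2J / (1 + J)), so they rank genes identically for a single target area and differ only in scale.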

591 Using some combination of these measures, we will develop a procedure to find single marker

592 genes for anatomical regions: for each cortical area, we will rank the genes by their ability to

593 delineate that area. We will quantitatively compare the list of single genes generated by our

594 method to the lists generated by methods which are mentioned in Related Work.


598 Figure 6: First row: the first 6 reduced dimensions, using PCA. Sec-

599 ond row: the first 6 reduced dimensions, using NNMF. Third row: the

600 first six reduced dimensions, using landmark Isomap. Bottom row:

601 examples of k-means clustering applied to reduced datasets to find

602 7 clusters. Left: 19 of the major subdivisions of the cortex. Sec-

603 ond from left: PCA. Third from left: NNMF. Right: Landmark Isomap.

604 Additional details: In the third and fourth rows, 7 dimensions were

605 found, but only 6 displayed. In the last row: for PCA, 50 dimensions

606 were used; for NNMF, 6 dimensions were used; for landmark Isomap,

607 7 dimensions were used.

608 Some cortical areas have no single marker genes but can be identified by combinatorial coding. This

609 requires multivariate scoring measures and feature selection procedures. Many of the measures, such

610 as expression energy, gradient similarity, Jaccard, Dice, Hough, Student’s t, and Mann-Whitney U, are

611 univariate. We will extend these scoring measures for use in multivariate feature selection, that is,

612 for scoring how well combinations of genes, rather than individual genes, can distinguish a target area.

613 There are existing multivariate forms of some of the univariate scoring measures; for example,

614 Hotelling’s T-square is a multivariate analog of Student’s t.

634 We will develop a feature selection procedure for choosing the best small set of marker genes for a given anatomical

636 area. In addition to using the scoring measures that we develop, we will also explore (a) feature

637 selection using a stepwise wrapper over “vanilla” classifiers such as logistic regression, (b) super-

638 vised learning methods such as decision trees which incrementally/greedily combine single gene

639 markers into sets, and (c) supervised learning methods which use soft constraints to minimize

640 number of features used, such as sparse support vector machines (SVMs).

641 Since errors of displacement and of shape may cause genes and target areas to match less

642 than they should, we will consider the robustness of feature selection methods in the presence of

643 error. Some of these methods, such as the Hough transform, are designed to be resistant in the

644 presence of error, but many are not.

645 An area may be difficult to identify because the boundaries are misdrawn in the atlas, or be-

646 cause the shape of the natural domain of gene expression corresponding to the area is different

647 from the shape of the area as recognized by anatomists. We will develop extensions to our procedure

648 which (a) detect when a difficult area could be fit if its boundary were redrawn slightly (see footnote 11),

649 and (b) detect when a difficult area could be combined with adjacent areas to create a larger area

650 which can be fit.

651 ____________________________________

652 Footnote 11: Not just any redrawing is acceptable, only those which appear to be justified as a natural spatial domain of gene expression

653 by multiple sources of evidence. Interestingly, the need to detect “natural spatial domains of gene expression”

654 in a data-driven fashion means that the methods of Goal 2 might be useful in achieving Goal 1 as well, particularly

657 A future publication on the method that we develop in Goal 1 will review the scoring measures

658 and quantitatively compare their performance in order to provide a foundation for future research

659 of methods of marker gene finding. We will measure the robustness of the scoring measures as

660 well as their absolute performance on our dataset.

661 Develop algorithms to suggest a division of a structure into anatomical parts

662

663 Figure 7: Prototypes corresponding to sample gene clus-

664 ters, clustered by gradient similarity. Region boundaries for

665 the region that most matches each prototype are overlaid.

666 Dimensionality reduction on gene expression profiles We have already described the application of

667 ten dimensionality reduction algorithms for the purpose of replacing the gene expression profiles, which

668 are vectors of about 4000 gene expression levels, with a smaller number of features. We plan to further

669 explore and interpret these results, as well as to apply other unsupervised learning algorithms, including

670 independent components analysis, self-organizing maps, and generative models such as deep Boltzmann machines. We will explore

679 ways to quantitatively compare the relevance of the different dimensionality reduction methods for

680 identifying cortical areal boundaries.

681 Dimensionality reduction on pixels Instead of applying dimensionality reduction to the gene

682 expression profiles, the same techniques can be applied instead to the pixels. It is possible that

683 the features generated in this way by some dimensionality reduction techniques will directly corre-

684 spond to interesting spatial regions.

685 Clustering and segmentation on pixels We will explore clustering and image segmentation

686 algorithms in order to segment the pixels into regions. We will explore k-means, spectral cluster-

687 ing, gene shaving[10], recursive division clustering, multivariate generalizations of edge detectors,

688 multivariate generalizations of watershed transformations, region growing, active contours, graph

689 partitioning methods, and recursive agglomerative clustering with various linkage functions. These

690 methods can be combined with dimensionality reduction.

691 Clustering on genes We have already shown that the procedure of clustering genes according

692 to gradient similarity, and then creating an averaged prototype of each cluster’s expression pattern,

693 yields some spatial patterns which match cortical areas (Figure 7). We will further explore the

694 clustering of genes.

695 In addition to using the cluster expression prototypes directly to identify spatial regions, this

696 might be useful as a component of dimensionality reduction. For example, one could imagine

697 clustering similar genes and then replacing their expression levels with a single average expression

702 level, thereby removing some redundancy from the gene expression profiles. One could then

703 perform clustering on pixels (possibly after a second dimensionality reduction step) in order to

704 identify spatial regions. It remains to be seen whether removal of redundancy would help or hurt

705 the ultimate goal of identifying interesting spatial regions.

____________________________________

Footnote 11 (continued): discriminative dimensionality reduction.

706 Co-clustering We will explore some algorithms which simultaneously incorporate clustering

707 on instances and on features (in our case, pixels and genes), for example, IRM[12]. These are

708 called co-clustering or biclustering algorithms.

709 Compare different methods In order to tell which method is best for genomic anatomy, for

710 each experimental method we will compare the cortical map found by unsupervised learning to a

711 cortical map derived from the Allen Reference Atlas. We will explore various quantitative metrics

712 that purport to measure how similar two clusterings are, such as Jaccard, Rand index, Fowlkes-

713 Mallows, variation of information, Larsen, Van Dongen, and others.
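
As an illustration of one of these metrics, the Rand index can be sketched in a few lines: it is the fraction of pixel pairs on which two clusterings agree (both place the pair together, or both place it apart). The function name and the six-pixel toy labelings are hypothetical, and this plain version is not corrected for chance agreement.

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Fraction of item pairs that the two clusterings treat the same way."""
    agree = 0
    total = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_in_a = labels_a[i] == labels_a[j]
        same_in_b = labels_b[i] == labels_b[j]
        if same_in_a == same_in_b:
            agree += 1
        total += 1
    return agree / total

# Toy data: one label per pixel.
atlas = [0, 0, 0, 1, 1, 1]       # reference parcellation
relabeled = [5, 5, 5, 9, 9, 9]   # same partition, different cluster names
finer = [0, 0, 1, 1, 2, 2]       # a different, finer partition

ri_relabeled = rand_index(atlas, relabeled)
ri_finer = rand_index(atlas, finer)
```

Because the index depends only on which pixels are grouped together, it is unaffected by how the clusters happen to be named, which matters when comparing an unsupervised map against an atlas.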

714 Discriminative dimensionality reduction In addition to using a purely data-driven approach

715 to identify spatial regions, it might be useful to see how well the known regions can be recon-

716 structed from a small number of features, even if those features are chosen by using knowledge of

717 the regions. For example, linear discriminant analysis could be used as a dimensionality reduction

718 technique in order to identify a few features which are the best linear summary of gene expression

719 profiles for the purpose of discriminating between regions. This reduced feature set could then be

720 used to cluster pixels into regions. Perhaps the resulting clusters will be similar to the reference

721 atlas, yet more faithful to natural spatial domains of gene expression than the reference atlas is.

722 Apply the new methods to the cortex

723 Using the methods developed in Goal 1, we will present, for each cortical area, a short list of

724 markers to identify that area; and we will also present lists of “panels” of genes that can be used

725 to delineate many areas at once.

726 Because in most cases the ABA coronal dataset only contains one ISH per gene, it is possible

727 for an unrelated combination of genes to seem to identify an area when in fact it is only coinci-

728 dence. There are three ways we will validate our marker genes to guard against this. First, we

729 will confirm that putative combinations of marker genes express the same pattern in both hemi-

730 spheres. Second, we will manually validate our final results on other gene expression datasets

731 such as EMAGE, GeneAtlas, and GENSAT[9]. Third, we may conduct ISH experiments jointly with

732 collaborators to get further data on genes of particular interest.

733 Using the methods developed in Goal 2, we will present one or more hierarchical cortical

734 maps. We will identify and explain how the statistical structure in the gene expression data led to

735 any unexpected or interesting features of these maps, and we will provide biological hypotheses

736 to interpret any new cortical areas, or groupings of areas, which are discovered.

737 Apply the new methods to hyperspectral datasets

738 Our software will be able to read and write file formats common in the hyperspectral imaging

739 community such as Erdas LAN and ENVI, and it will be able to convert between the SEV and NIFTI

740 formats from neuroscience and the ENVI format from GIS. The methods developed in Goals 1 and

741 2 will be implemented either as part of Spectral Python or as a separate tool that interoperates

742 with Spectral Python. The methods will be run on hyperspectral satellite image datasets, and their

743 performance will be compared to existing hyperspectral analysis techniques.


746 References Cited

747 [1] Chris Adamson, Leigh Johnston, Terrie Inder, Sandra Rees, Iven Mareels, and Gary Egan.

748 A tracking approach to parcellation of the cerebral cortex. In Medical Image Computing

749 and Computer-Assisted Intervention MICCAI 2005, volume 3749/2005 of Lecture Notes in

750 Computer Science, pages 294–301. Springer Berlin / Heidelberg, 2005.

751 [2] J. Annese, A. Pitiot, I. D. Dinov, and A. W. Toga. A myelo-architectonic method for the struc-

752 tural classification of cortical areas. NeuroImage, 21(1):15–26, 2004.

753 [3] Tanya Barrett, Dennis B. Troup, Stephen E. Wilhite, Pierre Ledoux, Dmitry Rudnev, Carlos

754 Evangelista, Irene F. Kim, Alexandra Soboleva, Maxim Tomashevsky, and Ron Edgar. NCBI

755 GEO: mining tens of millions of expression profiles–database and tools update. Nucl. Acids

756 Res., 35(suppl_1):D760–765, 2007.

757 [4] George W. Bell, Tatiana A. Yatskievych, and Parker B. Antin. GEISHA, a whole-mount in

758 situ hybridization gene expression screen in chicken embryos. Developmental Dynamics,

759 229(3):677–687, 2004.

760 [5] Thomas Boggs. Spectral python. http://spectralpython.sourceforge.net/, July 2008.

761 [6] James P Carson, Tao Ju, Hui-Chen Lu, Christina Thaller, Mei Xu, Sarah L Pallas, Michael C

762 Crair, Joe Warren, Wah Chiu, and Gregor Eichele. A digital atlas to characterize the mouse

763 brain transcriptome. PLoS Comput Biol, 1(4):e41, 2005.

764 [7] Mark H. Chin, Alex B. Geng, Arshad H. Khan, Wei-Jun Qian, Vladislav A. Petyuk, Jyl Boline,

765 Shawn Levy, Arthur W. Toga, Richard D. Smith, Richard M. Leahy, and Desmond J. Smith.

766 A genome-scale map of expression for a mouse brain section obtained using voxelation.

767 Physiol. Genomics, 30(3):313–321, August 2007.

768 [8] D C Van Essen, H A Drury, J Dickson, J Harwell, D Hanlon, and C H Anderson. An integrated

769 software suite for surface-based analyses of cerebral cortex. Journal of the American Medical

770 Informatics Association: JAMIA, 8(5):443–59, 2001. PMID: 11522765.

771 [9] Shiaoching Gong, Chen Zheng, Martin L. Doughty, Kasia Losos, Nicholas Didkovsky, Uta B.

772 Schambra, Norma J. Nowak, Alexandra Joyner, Gabrielle Leblanc, Mary E. Hatten, and

773 Nathaniel Heintz. A gene expression atlas of the central nervous system based on bacte-

774 rial artificial chromosomes. Nature, 425(6961):917–925, October 2003.

775 [10] Trevor Hastie, Robert Tibshirani, Michael Eisen, Ash Alizadeh, Ronald Levy, Louis Staudt,

776 Wing Chan, David Botstein, and Patrick Brown. ’Gene shaving’ as a method for identifying dis-

777 tinct sets of genes with similar expression patterns. Genome Biology, 1(2):research0003.1–

778 research0003.21, 2000.

779 [11] Jano van Hemert and Richard Baldock. Matching spatial regions with combinations of interact-

780 ing gene expression patterns. In Bioinformatics Research and Development, volume 13 of

781 Communications in Computer and Information Science, pages 347–361. Springer Berlin Hei-

782 delberg, 2008.


785 [12] C Kemp, JB Tenenbaum, TL Griffiths, T Yamada, and N Ueda. Learning systems of concepts

786 with an infinite relational model. In AAAI, 2006.

787 [13] F. Kruggel, M. K. Brückner, Th. Arendt, C. J. Wiggins, and D. Y. von Cramon. Analyzing the

788 neocortical fine-structure. Medical Image Analysis, 7(3):251–264, September 2003.

789 [14] Ed S. Lein, Michael J. Hawrylycz, Nancy Ao, Mikael Ayres, Amy Bensinger, Amy Bernard,

790 Andrew F. Boe, Mark S. Boguski, Kevin S. Brockway, Emi J. Byrnes, Lin Chen, Li Chen,

791 Tsuey-Ming Chen, Mei Chi Chin, Jimmy Chong, Brian E. Crook, Aneta Czaplinska, Chinh N.

792 Dang, Suvro Datta, Nick R. Dee, Aimee L. Desaki, Tsega Desta, Ellen Diep, Tim A. Dolbeare,

793 Matthew J. Donelan, Hong-Wei Dong, Jennifer G. Dougherty, Ben J. Duncan, Amanda J.

794 Ebbert, Gregor Eichele, Lili K. Estin, Casey Faber, Benjamin A. Facer, Rick Fields, Shanna R.

795 Fischer, Tim P. Fliss, Cliff Frensley, Sabrina N. Gates, Katie J. Glattfelder, Kevin R. Halverson,

796 Matthew R. Hart, John G. Hohmann, Maureen P. Howell, Darren P. Jeung, Rebecca A. John-

797 son, Patrick T. Karr, Reena Kawal, Jolene M. Kidney, Rachel H. Knapik, Chihchau L. Kuan,

798 James H. Lake, Annabel R. Laramee, Kirk D. Larsen, Christopher Lau, Tracy A. Lemon,

799 Agnes J. Liang, Ying Liu, Lon T. Luong, Jesse Michaels, Judith J. Morgan, Rebecca J. Mor-

800 gan, Marty T. Mortrud, Nerick F. Mosqueda, Lydia L. Ng, Randy Ng, Geralyn J. Orta, Car-

801 oline C. Overly, Tu H. Pak, Sheana E. Parry, Sayan D. Pathak, Owen C. Pearson, Ralph B.

802 Puchalski, Zackery L. Riley, Hannah R. Rockett, Stephen A. Rowland, Joshua J. Royall,

803 Marcos J. Ruiz, Nadia R. Sarno, Katherine Schaffnit, Nadiya V. Shapovalova, Taz Sivisay,

804 Clifford R. Slaughterbeck, Simon C. Smith, Kimberly A. Smith, Bryan I. Smith, Andy J. Sodt,

805 Nick N. Stewart, Kenda-Ruth Stumpf, Susan M. Sunkin, Madhavi Sutram, Angelene Tam,

806 Carey D. Teemer, Christina Thaller, Carol L. Thompson, Lee R. Varnam, Axel Visel, Ray M.

807 Whitlock, Paul E. Wohnoutka, Crissa K. Wolkey, Victoria Y. Wong, Matthew Wood, Murat B.

808 Yaylaoglu, Rob C. Young, Brian L. Youngstrom, Xu Feng Yuan, Bin Zhang, Theresa A. Zwing-

809 man, and Allan R. Jones. Genome-wide atlas of gene expression in the adult mouse brain.

810 Nature, 445(7124):168–176, 2007.

811 [15] Susan Magdaleno, Patricia Jensen, Craig L. Brumwell, Anna Seal, Karen Lehman, Andrew

812 Asbury, Tony Cheung, Tommie Cornelius, Diana M. Batten, Christopher Eden, Shannon M.

813 Norland, Dennis S. Rice, Nilesh Dosooye, Sundeep Shakya, Perdeep Mehta, and Tom Cur-

814 ran. BGEM: an in situ hybridization database of gene expression in the embryonic and adult

815 mouse nervous system. PLoS Biology, 4(4):e86, April 2006.

816 [16] Lydia Ng, Amy Bernard, Chris Lau, Caroline C Overly, Hong-Wei Dong, Chihchau Kuan,

817 Sayan Pathak, Susan M Sunkin, Chinh Dang, Jason W Bohland, Hemant Bokil, Partha P

818 Mitra, Luis Puelles, John Hohmann, David J Anderson, Ed S Lein, Allan R Jones, and Michael

819 Hawrylycz. An anatomic gene expression atlas of the adult mouse brain. Nat Neurosci,

820 12(3):356–362, March 2009.

821 [17] George Paxinos and Keith B.J. Franklin. The Mouse Brain in Stereotaxic Coordinates. Aca-

822 demic Press, 2 edition, July 2001.

823 [18] A. Schleicher, N. Palomero-Gallagher, P. Morosan, S. Eickhoff, T. Kowalski, K. Vos,

824 K. Amunts, and K. Zilles. Quantitative architectural analysis: a new approach to cortical

825 mapping. Anatomy and Embryology, 210(5):373–386, December 2005.


828 [19] Oliver Schmitt, Lars Hömke, and Lutz Dümbgen. Detection of cortical transition regions utilizing

829 statistical analyses of excess masses. NeuroImage, 19(1):42–63, May 2003.

830 [20] S.B. Serpico and L. Bruzzone. A new search algorithm for feature selection in hyperspec-

831 tral remote sensing images. Geoscience and Remote Sensing, IEEE Transactions on,

832 39(7):1360–1367, 2001.

833 [21] Constance M. Smith, Jacqueline H. Finger, Terry F. Hayamizu, Ingeborg J. McCright, Janan T.

834 Eppig, James A. Kadin, Joel E. Richardson, and Martin Ringwald. The mouse gene expres-

835 sion database (GXD): 2007 update. Nucl. Acids Res., 35(suppl_1):D618–623, 2007.

836 [22] Larry Swanson. Brain Maps: Structure of the Rat Brain. Academic Press, 3 edition, November

837 2003.

838 [23] Carol L. Thompson, Sayan D. Pathak, Andreas Jeromin, Lydia L. Ng, Cameron R. MacPher-

839 son, Marty T. Mortrud, Allison Cusick, Zackery L. Riley, Susan M. Sunkin, Amy Bernard,

840 Ralph B. Puchalski, Fred H. Gage, Allan R. Jones, Vladimir B. Bajic, Michael J. Hawrylycz,

841 and Ed S. Lein. Genomic anatomy of the hippocampus. Neuron, 60(6):1010–1021, Decem-

842 ber 2008.

843 [24] Pavel Tomancak, Amy Beaton, Richard Weiszmann, Elaine Kwan, ShengQiang Shu,

844 Suzanna E Lewis, Stephen Richards, Michael Ashburner, Volker Hartenstein, Susan E Cel-

845 niker, and Gerald M Rubin. Systematic determination of patterns of gene expression during

846 Drosophila embryogenesis. Genome Biology, 3(12):research0088.1–research0088.14, 2002. PMC151190.

847 [25] Shanmugasundaram Venkataraman, Peter Stevenson, Yiya Yang, Lorna Richardson,

848 Nicholas Burton, Thomas P. Perry, Paul Smith, Richard A. Baldock, Duncan R. Davidson,

849 and Jeffrey H. Christiansen. EMAGE edinburgh mouse atlas of gene expression: 2008 up-

850 date. Nucl. Acids Res., 36(suppl_1):D860–865, 2008.

851 [26] Axel Visel, Christina Thaller, and Gregor Eichele. GenePaint.org: an atlas of gene expression

852 patterns in the mouse embryo. Nucl. Acids Res., 32(suppl_1):D552–556, 2004.

853 [27] Robert H Waterston, Kerstin Lindblad-Toh, Ewan Birney, Jane Rogers, Josep F Abril, Pankaj

854 Agarwal, Richa Agarwala, Rachel Ainscough, Marina Alexandersson, Peter An, Stylianos E

855 Antonarakis, John Attwood, Robert Baertsch, Jonathon Bailey, Karen Barlow, Stephan Beck,

856 Eric Berry, Bruce Birren, Toby Bloom, Peer Bork, Marc Botcherby, Nicolas Bray, Michael R

857 Brent, Daniel G Brown, Stephen D Brown, Carol Bult, John Burton, Jonathan Butler,

858 Robert D Campbell, Piero Carninci, Simon Cawley, Francesca Chiaromonte, Asif T Chin-

859 walla, Deanna M Church, Michele Clamp, Christopher Clee, Francis S Collins, Lisa L Cook,

860 Richard R Copley, Alan Coulson, Olivier Couronne, James Cuff, Val Curwen, Tim Cutts,

861 Mark Daly, Robert David, Joy Davies, Kimberly D Delehaunty, Justin Deri, Emmanouil T Der-

862 mitzakis, Colin Dewey, Nicholas J Dickens, Mark Diekhans, Sheila Dodge, Inna Dubchak,

863 Diane M Dunn, Sean R Eddy, Laura Elnitski, Richard D Emes, Pallavi Eswara, Eduardo

864 Eyras, Adam Felsenfeld, Ginger A Fewell, Paul Flicek, Karen Foley, Wayne N Frankel, Lu-

865 cinda A Fulton, Robert S Fulton, Terrence S Furey, Diane Gage, Richard A Gibbs, Gustavo

866 Glusman, Sante Gnerre, Nick Goldman, Leo Goodstadt, Darren Grafham, Tina A Graves,

867 Eric D Green, Simon Gregory, Roderic Guigó, Mark Guyer, Ross C Hardison, David Haussler,

870 Yoshihide Hayashizaki, LaDeana W Hillier, Angela Hinrichs, Wratko Hlavina, Timothy Holzer,

871 Fan Hsu, Axin Hua, Tim Hubbard, Adrienne Hunt, Ian Jackson, David B Jaffe, L Steven John-

872 son, Matthew Jones, Thomas A Jones, Ann Joy, Michael Kamal, Elinor K Karlsson, Donna

873 Karolchik, Arkadiusz Kasprzyk, Jun Kawai, Evan Keibler, Cristyn Kells, W James Kent, An-

874 drew Kirby, Diana L Kolbe, Ian Korf, Raju S Kucherlapati, Edward J Kulbokas, David Kulp,

875 Tom Landers, J P Leger, Steven Leonard, Ivica Letunic, Rosie Levine, Jia Li, Ming Li, Chris-

876 tine Lloyd, Susan Lucas, Bin Ma, Donna R Maglott, Elaine R Mardis, Lucy Matthews, Evan

877 Mauceli, John H Mayer, Megan McCarthy, W Richard McCombie, Stuart McLaren, Kirsten

878 McLay, John D McPherson, Jim Meldrim, Beverley Meredith, Jill P Mesirov, Webb Miller, Tra-

879 cie L Miner, Emmanuel Mongin, Kate T Montgomery, Michael Morgan, Richard Mott, James C

880 Mullikin, Donna M Muzny, William E Nash, Joanne O Nelson, Michael N Nhan, Robert Nicol,

881 Zemin Ning, Chad Nusbaum, Michael J O’Connor, Yasushi Okazaki, Karen Oliver, Emma

882 Overton-Larty, Lior Pachter, Genís Parra, Kymberlie H Pepin, Jane Peterson, Pavel Pevzner,

883 Robert Plumb, Craig S Pohl, Alex Poliakov, Tracy C Ponce, Chris P Ponting, Simon Potter,

884 Michael Quail, Alexandre Reymond, Bruce A Roe, Krishna M Roskin, Edward M Rubin, Alistair G Rust,

885 Ralph Santos, Victor Sapojnikov, Brian Schultz, Jörg Schultz, Matthias S Schwartz,

886 Scott Schwartz, Carol Scott, Steven Seaman, Steve Searle, Ted Sharpe, Andrew Sheridan,

887 Ratna Shownkeen, Sarah Sims, Jonathan B Singer, Guy Slater, Arian Smit, Douglas R Smith,

888 Brian Spencer, Arne Stabenau, Nicole Stange-Thomann, Charles Sugnet, Mikita Suyama,

889 Glenn Tesler, Johanna Thompson, David Torrents, Evanne Trevaskis, John Tromp, Cather-

890 ine Ucla, Abel Ureta-Vidal, Jade P Vinson, Andrew C Von Niederhausern, Claire M Wade,

891 Melanie Wall, Ryan J Weber, Robert B Weiss, Michael C Wendl, Anthony P West, Kris

892 Wetterstrand, Raymond Wheeler, Simon Whelan, Jamey Wierzbowski, David Willey, Sophie

893 Williams, Richard K Wilson, Eitan Winter, Kim C Worley, Dudley Wyman, Shan Yang, Shiaw-

894 Pyng Yang, Evgeny M Zdobnov, Michael C Zody, and Eric S Lander. Initial sequencing and

895 comparative analysis of the mouse genome. Nature, 420(6915):520–62, December 2002.

896 PMID: 12466850.
