cg: 56a898ced81d grant.html

cg

view grant.html @ 14:56a898ced81d

author	bshanks@bshanks.dyndns.org
date	Sat Apr 11 21:43:12 2009 -0700 (16 years ago)
parents	3bc61ab8e776
children	395faa66383e

line source

1 Specific aims

2 todo test4

3 Massive new datasets obtained with techniques such as in situ hybridization

4 (ISH) and BAC-transgenics allow the expression levels of many genes at many

5 locations to be compared. Our goal is to develop automated methods to relate

6 spatial variation in gene expression to anatomy. We want to find marker genes

7 for specific anatomical regions, and also to draw new anatomical maps based on

8 gene expression patterns. We have three specific aims:

9 (1) develop an algorithm to screen spatial gene expression data for combina-

10 tions of marker genes which selectively target anatomical regions

11 (2) develop an algorithm to suggest new ways of carving up a structure into

12 anatomical subregions, based on spatial patterns in gene expression

13 (3) create a 2-D “flat map” dataset of the mouse cerebral cortex that contains

14 a flattened version of the Allen Mouse Brain Atlas ISH data, as well as

15 the boundaries of cortical anatomical areas. Use this dataset to validate

16 the methods developed in (1) and (2).

17 In addition to validating the usefulness of the algorithms, the application of

18 these methods to cerebral cortex will produce immediate benefits, because there

19 are currently no known genetic markers for many cortical areas. The results

20 of the project will support the development of new ways to selectively target

21 cortical areas, and it will support the development of a method for identifying

22 the cortical areal boundaries present in small tissue samples.

23 All algorithms that we develop will be implemented in an open-source soft-

24 ware toolkit. The toolkit, as well as the machine-readable datasets developed

25 in aim (3), will be published and freely available for others to use.

26 Background and significance

27 Aim 1

28 Machine learning terminology

29 The task of looking for marker genes for anatomical subregions means that one

30 is looking for a set of genes such that, if the expression level of those genes is

31 known, then the locations of the subregions can be inferred.

32 If we define the subregions so that they cover the entire anatomical structure

33 to be divided, then instead of saying that we are using gene expression to find

34 the locations of the subregions, we may say that we are using gene expression to

35 determine to which subregion each voxel within the structure belongs. We call

36 this a classification task, because each voxel is being assigned to a class (namely,

37 its subregion).

40 Therefore, an understanding of the relationship between the combination of

41 their expression levels and the locations of the subregions may be expressed as

42 a function. The input to this function is a voxel, along with the gene expression

43 levels within that voxel; the output is the subregional identity of the target

44 voxel, that is, the subregion to which the target voxel belongs. We call this

45 function a classifier. In general, the input to a classifier is called an instance,

46 and the output is called a label.

47 The object of aim 1 is not to produce a single classifier, but rather to develop

48 an automated method for determining a classifier for any known anatomical

49 structure. Therefore, we seek a procedure by which a gene expression dataset

50 may be analyzed in concert with an anatomical atlas in order to produce a

51 classifier. Such a procedure is a type of machine learning procedure. The

52 construction of the classifier is called training (also learning), and the initial

53 gene expression dataset used in the construction of the classifier is called training

54 data.

55 In the machine learning literature, this sort of procedure may be thought

56 of as a supervised learning task, defined as a task in which the goal is to learn

57 a mapping from instances to labels, and the training data consists of a set of

58 instances (voxels) for which the labels (subregions) are known.

59 Each gene expression level is called a feature, and the selection of which

60 genes to include is called feature selection. Feature selection is one component

61 of the task of learning a classifier. Some methods for learning classifiers start

62 out with a separate feature selection phase, whereas other methods combine

63 feature selection with other aspects of training.

64 One class of feature selection methods assigns some sort of score to each

65 candidate gene. The top-ranked genes are then chosen. Some scoring measures

66 can assign a score to a set of selected genes, not just to a single gene; in this

67 case, a dynamic procedure may be used in which features are added and sub-

68 tracted from the selected set depending on how much they raise the score. Such

69 procedures are called “stepwise” or “greedy”.
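
The stepwise procedure described above can be sketched as follows. This is a minimal illustration, not the method the project will adopt: it only adds genes (a full stepwise procedure may also remove them), and the gene names, the toy data, and the `toy_score` function are hypothetical.

```python
from typing import Callable, FrozenSet, List, Sequence

# Hypothetical toy data: 6 voxels, the first 4 of which lie in the target region.
LABELS = [1, 1, 1, 1, 0, 0]
EXPRESSION = {
    "gene_a": [1, 1, 0, 0, 0, 0],  # marks half of the region
    "gene_b": [0, 0, 1, 1, 0, 0],  # marks the other half
    "gene_c": [1, 0, 0, 0, 1, 1],  # mostly expressed outside the region
}


def toy_score(genes: FrozenSet[str]) -> float:
    """Score a gene set: fraction of voxels classified correctly when a voxel
    is called 'in region' if the summed expression of the set is at least 1."""
    correct = 0
    for i, label in enumerate(LABELS):
        predicted = 1 if sum(EXPRESSION[g][i] for g in genes) >= 1 else 0
        correct += int(predicted == label)
    return correct / len(LABELS)


def greedy_forward_selection(
    candidates: Sequence[str],
    score: Callable[[FrozenSet[str]], float],
    max_genes: int,
) -> List[str]:
    """Repeatedly add the single gene that raises the set score the most;
    stop when no gene improves the score or max_genes is reached."""
    selected: List[str] = []
    best_score = score(frozenset())
    while len(selected) < max_genes:
        best_gene = None
        for gene in candidates:
            if gene in selected:
                continue
            trial = score(frozenset(selected + [gene]))
            if trial > best_score:  # strictly better than anything seen so far
                best_gene, best_score = gene, trial
        if best_gene is None:
            break
        selected.append(best_gene)
    return selected


chosen = greedy_forward_selection(list(EXPRESSION), toy_score, max_genes=3)
```

In this toy case neither gene_a nor gene_b marks the whole region alone, but the greedy search combines them and declines to add gene_c because it would lower the score.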

70 Although the classifier itself may only look at the gene expression data within

71 each voxel before classifying that voxel, the learning algorithm which constructs

72 the classifier may look over the entire dataset. We can categorize score-based

73 feature selection methods depending on how the score is calculated. Often

74 the score calculation consists of assigning a sub-score to each voxel, and then

75 aggregating these sub-scores into a final score (the aggregation is often a sum or

76 a sum of squares). If only information from nearby voxels is used to calculate a

77 voxel’s sub-score, then we say it is a local scoring method. If only information

78 from the voxel itself is used to calculate a voxel’s sub-score, then we say it is a

79 pointwise scoring method.
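
To make the distinction concrete, here is a small sketch of one pointwise and one local score on a 2-D grid. Both scoring formulas are hypothetical choices made for illustration (negative squared error, and agreement of neighbor differences); they are not the measures used in this project.

```python
from typing import List

Grid = List[List[float]]


def pointwise_score(expr: Grid, mask: Grid) -> float:
    """Each pixel's sub-score uses only that pixel: the negative squared
    difference between expression and region membership. Sub-scores are summed."""
    return -sum(
        (e - m) ** 2
        for expr_row, mask_row in zip(expr, mask)
        for e, m in zip(expr_row, mask_row)
    )


def local_score(expr: Grid, mask: Grid) -> float:
    """Each pixel's sub-score also uses its right and lower neighbors: it is
    positive where a step in expression coincides with a step in the mask."""
    rows, cols = len(expr), len(expr[0])
    total = 0.0
    for r in range(rows):
        for c in range(cols):
            if c + 1 < cols:
                total += (expr[r][c + 1] - expr[r][c]) * (mask[r][c + 1] - mask[r][c])
            if r + 1 < rows:
                total += (expr[r + 1][c] - expr[r][c]) * (mask[r + 1][c] - mask[r][c])
    return total


# Hypothetical 2x4 example: the region occupies the right half of the grid.
mask = [[0.0, 0.0, 1.0, 1.0], [0.0, 0.0, 1.0, 1.0]]
step_gene = [[0.0, 0.0, 1.0, 1.0], [0.0, 0.0, 1.0, 1.0]]  # border matches the region
flat_gene = [[1.0, 1.0, 1.0, 1.0], [1.0, 1.0, 1.0, 1.0]]  # expressed everywhere
```

The flat gene is penalized by the pointwise score for expressing outside the region, and it earns nothing from the local score because it has no border at all.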

80 Key questions when choosing a learning method are: What are the instances?

81 What are the features? How are the features chosen? Here are four principles

82 that outline our answers to these questions.

85 Principle 1: Combinatorial gene expression

86 Above, we defined an “instance” as the combination of a voxel with the “asso-

87 ciated gene expression data”. In our case this refers to the expression level of

88 genes within the voxel, but should we include the expression levels of all genes,

89 or only a few of them?

90 It is too much to hope that every anatomical region of interest will be iden-

91 tified by a single gene. For example, in the cortex, there are some areas which

92 are not clearly delineated by any gene included in the Allen Brain Atlas (ABA)

93 dataset. However, at least some of these areas can be delineated by looking

94 at combinations of genes (an example of an area for which multiple genes are

95 necessary and sufficient is provided in Preliminary Results).

96 Principle 2: Only look at combinations of small numbers of genes

97 When the classifier classifies a voxel, it is only allowed to look at the expression of

98 the genes which have been selected as features. The more data that is available

99 to a classifier, the better that it can do. For example, perhaps there are weak

100 correlations over many genes that add up to a strong signal. So, why not include

101 every gene as a feature? The reason is that we wish to employ the classifier in

102 situations in which it is not feasible to gather data about every gene. For

103 example, if we want to use the expression of marker genes as a trigger for some

104 regionally-targeted intervention, then our intervention must contain a molecular

105 mechanism to check the expression level of each marker gene before it triggers.

106 It is currently infeasible to design a molecular trigger that checks the level of

107 more than a handful of genes. Similarly, if the goal is to develop a procedure to

108 do ISH on tissue samples in order to label their anatomy, then it is infeasible

109 to label more than a few genes. Therefore, we must select only a few genes as

110 features.

111 Principle 3: Use geometry in feature selection

112 When doing feature selection with score-based methods, the simplest thing to do

113 would be to score the performance of each voxel by itself and then combine these

114 scores (pointwise scoring). A more powerful approach is to also use information

115 about the geometric relations between each voxel and its neighbors; this requires

116 non-pointwise, local scoring methods. See Preliminary Results for evidence of

117 the complementary nature of pointwise and local scoring methods.

118 Principle 4: Work in 2-D whenever possible

119 There are many anatomical structures which are commonly characterized in

120 terms of a two-dimensional manifold. When it is known that the structure that

121 one is looking for is two-dimensional, the results may be improved by allowing

122 the analysis algorithm to take advantage of this prior knowledge. In addition,

123 it is easier for humans to visualize and work with 2-D data.

124 Therefore, when possible, the instances should represent pixels, not voxels.

127 Aim 2

128 todo

129 Aim 3

130 Background

131 The cortex is divided into areas and layers. To a first approximation, the par-

132 cellation of the cortex into areas can be drawn as a 2-D map on the surface

133 of the cortex. In the third dimension, the boundaries between the areas con-

134 tinue downwards into the cortical depth, perpendicular to the surface. The layer

135 boundaries run parallel to the surface. One can picture an area of the cortex as

136 a slice of many-layered cake.

137 Although it is known that different cortical areas have distinct roles in both

138 normal functioning and in disease processes, there are no known marker genes

139 for many cortical areas. When it is necessary to divide a tissue sample into

140 cortical areas, this is a manual process that requires a skilled human to combine

141 multiple visual cues and interpret them in the context of their approximate

142 location upon the cortical surface.

143 Even the questions of how many areas should be recognized in cortex, and

144 what their arrangement is, are still not completely settled. A proposed division

145 of the cortex into areas is called a cortical map. In the rodent, the lack of a

146 single agreed-upon map can be seen by contrasting the recent maps given by

147 Swanson?? on the one hand, and Paxinos and Franklin?? on the other. While

148 the maps are certainly very similar in their general arrangement, significant

149 differences remain in the details.

150 Significance

151 The method developed in aim (1) will be applied to each cortical area to find

152 a set of marker genes such that the combinatorial expression pattern of those

153 genes uniquely picks out the target area. Finding marker genes will be useful

154 for drug discovery as well as for experimentation because marker genes can be

155 used to design interventions which selectively target individual cortical areas.

156 The application of the marker gene finding algorithm to the cortex will

157 also support the development of new neuroanatomical methods. In addition to

158 finding markers for each individual cortical area, we will find a small panel

159 of genes that can find many of the areal boundaries at once. This panel of

160 marker genes will allow the development of an ISH protocol that will allow

161 experimenters to more easily identify which anatomical areas are present in

162 small samples of cortex.

163 The method developed in aim (3) will provide a genoarchitectonic viewpoint

164 that will contribute to the creation of a better map. The development of present-

165 day cortical maps was driven by the application of histological stains. It is

166 conceivable that if a different set of stains had been available which identified

167 a different set of features, then today’s cortical maps would have come out

170 differently. Since the number of classes of stains is small compared to the number

171 of genes, it is likely that there are many repeated, salient spatial patterns in

172 the gene expression which have not yet been captured by any stain. Therefore,

173 current ideas about cortical anatomy need to incorporate what we can learn

174 from looking at the patterns of gene expression.

175 While we do not here propose to analyze human gene expression data, it is

176 conceivable that the methods we propose to develop could be used to suggest

177 modifications to the human cortical map as well.

178 Related work

179 todo

180 Preliminary work

181 Justification of principles 1 through 3

182 Principle 1: Combinatorial gene expression

183 Here we give an example of a cortical area which is not marked by any single

184 gene, but which can be identified combinatorially. According to logistic

185 regression, gene wwc1 (footnote 1) is the best-fit single gene for predicting whether or not a pixel

186 on the cortical surface belongs to the motor area (area MO). The upper-left

187 picture in Figure 1 shows wwc1’s spatial expression pattern over the cortex. The

188 lower-right boundary of MO is represented reasonably well by this gene; however,

189 the gene overshoots the upper-left boundary. This flattened 2-D representation

190 does not show it, but the area corresponding to the overshoot is the medial

191 surface of the cortex. MO is only found on the lateral surface (todo).

192 Gene mtif2 (footnote 2) is shown in the upper-right of Figure 1. Mtif2 captures MO’s

193 upper-left boundary, but not its lower-right boundary. Mtif2 does not express

194 very much on the medial surface. By adding together the values at each pixel

195 in these two pictures, we get the lower-left of Figure 1. This combination captures

196 area MO much better than any single gene.

197 Principle 2: Only look at combinations of small numbers of genes

198 In order to see how well one can do when looking at all genes at once, we ran

199 a support vector machine to classify cortical surface pixels based on their gene

200 expression profiles. We achieved classification accuracy of about 81% (footnote 3). As noted

201 above, however, a classifier that looks at all the genes at once isn’t practically

202 useful.

203 _____________________

204 1. “WW, C2 and coiled-coil domain containing 1”; EntrezGene ID 211652

205 2. “mitochondrial translational initiation factor 2”; EntrezGene ID 76784

206 3. Using the Shogun SVM package (todo:cite), with parameters type=GMNPSVM (multi-

207 class b-SVM), kernel = gaussian with sigma = 0.1, c = 10, epsilon = 1e-1. These are the

208 first parameters we tried, so presumably performance would improve with different choices of

209 parameters. Accuracy was estimated with 5-fold cross-validation.

214 Figure 1: Upper left: wwc1. Upper right: mtif2. Lower left: wwc1 + mtif2

215 (each pixel’s value on the lower left is the sum of the corresponding pixels in

216 the upper row). Within each picture, the vertical axis roughly corresponds to

217 anterior at the top and posterior at the bottom, and the horizontal axis roughly

218 corresponds to medial at the left and lateral at the right. The red outline is

219 the boundary of region MO. Pixels are colored approximately according to the

220 density of expressing cells underneath each pixel, with red meaning a lot of

221 expression and blue meaning little.

224 The requirement to find combinations of only a small number of genes prevents

225 us from straightforwardly applying many of the simplest techniques from

226 the field of supervised machine learning. In the parlance of machine learning,

227 our task combines feature selection with supervised learning.

228 Principle 3: Use geometry

229 To show that local geometry can provide useful information that cannot be

230 detected via pointwise analyses, consider Figure 2. The top row of Figure 2 displays

231 the 3 genes which most match area AUD, according to a pointwise method (footnote 4). The

232 bottom row displays the 3 genes which most match AUD according to a method

233 which considers local geometry (footnote 5). The pointwise method in the top row identifies

234 genes which express more strongly in AUD than outside of it; its weakness is that

235 this includes many areas which don’t have a salient border matching the areal

236 border. The geometric method identifies genes whose salient expression border

237 seems to partially line up with the border of AUD; its weakness is that this

238 includes genes which don’t express over the entire area. Genes which have high

239 rankings using both pointwise and border criteria, such as Aph1a in the example,

240 may be particularly good markers. None of these genes are, individually, a

241 perfect marker for AUD; we deliberately chose a “difficult” area in order to

242 better contrast pointwise with geometric methods.

243 Principle 4: Work in 2-D whenever possible

244 In anatomy, the manifold of interest is usually either defined by a combination

245 of two relevant anatomical axes (todo), or by the surface of the structure (as is

246 the case with the cortex). In the former case, the manifold of interest is a plane,

247 but in the latter case it is curved. If the manifold is curved, there are various

248 methods for mapping the manifold into a plane.

249 The method that we will develop will begin by mapping the data into a

250 2-D plane. Although the manifold that characterizes cortical areas is known

251 to be the cortical surface, it remains to be seen which method of mapping the

252 manifold into a plane is optimal for this application. We will compare mappings

253 which attempt to preserve size (such as the one used by Caret??) with mappings

254 which preserve angle (conformal maps).

255 Although there is much 2-D organization in anatomy, there are also struc-

256 tures whose shape is fundamentally 3-dimensional. If possible, we would like

257 the method we develop to include a statistical test that warns the user if the

258 assumption of 2-D structure seems to be wrong.

260 ____________________

261 4. For each gene, a logistic regression was fit in which the response variable was whether or not a

262 surface pixel was within area AUD, and the predictor variable was the value of the expression

263 of the gene underneath that pixel. The resulting scores were used to rank the genes in terms

264 of how well they predict area AUD.

265 5. For each gene, the gradient similarity (see section ??) between (a) a map of the expression

266 of that gene on the cortical surface and (b) the shape of area AUD was calculated, and this

267 was used to rank the genes.

272 Figure 2: The top row shows the three genes which (individually) best predict

273 area AUD, according to logistic regression. The bottom row shows the three

274 genes which (individually) best match area AUD, according to gradient similar-

275 ity. From left to right and top to bottom, the genes are Ssr1, Efcbp1, Aph1a,

276 Ptk7, Aph1a again, and Lepr.

277 Massive new datasets obtained with techniques such as in situ hybridization

278 (ISH) and BAC-transgenics allow the expression levels of many genes at many

279 locations to be compared. This can be used to find marker genes for specific

280 anatomical structures, as well as to draw new anatomical maps. Our goal is

281 to develop automated methods to relate spatial variation in gene expression to

282 anatomy. We have five specific aims:

283 (1) develop an algorithm to screen spatial gene expression data for combi-

284 nations of marker genes which selectively target individual anatomical

285 structures

286 (2) develop an algorithm to screen spatial gene expression data for combina-

287 tions of marker genes which can be used to delineate most of the bound-

288 aries between a number of anatomical structures at once

289 (3) develop an algorithm to suggest new ways of dividing a structure up into

290 anatomical subregions, based on spatial patterns in gene expression

291 (4) create a flat (2-D) map of the mouse cerebral cortex that contains a flat-

292 tened version of the Allen Mouse Brain Atlas ISH dataset, as well as the

293 boundaries of anatomical areas within the cortex. For each cortical layer,

294 a layer-specific flat dataset will be created. A single combined flat dataset

295 will be created which averages information from all of the layers. These

296 datasets will be made available in both MATLAB and Caret formats.

297 (5) validate the methods developed in (1), (2) and (3) by applying them to

298 the cerebral cortex datasets created in (4)

301 All algorithms that we develop will be implemented in an open-source soft-

302 ware toolkit. The toolkit, as well as the machine-readable datasets developed in

303 aim (4) and any other intermediate dataset we produce, will be published and

304 freely available for others to use.

305 In addition to developing generally useful methods, the application of these

306 methods to cerebral cortex will produce immediate benefits that are only one

307 step removed from clinical application, while also supporting the development

308 of new neuroanatomical techniques. The method developed in aim (1) will be

309 applied to each cortical area to find a set of marker genes. Currently, despite

310 the distinct roles of different cortical areas in both normal functioning and

311 disease processes, there are no known marker genes for many cortical areas.

312 Finding marker genes will be immediately useful for drug discovery as well as for

313 experimentation because once marker genes for an area are known, interventions

314 can be designed which selectively target that area.

315 The method developed in aim (2) will be used to find a small panel of genes

316 that can find most of the boundaries between areas in the cortex. Today, finding

317 cortical areal boundaries in a tissue sample is a manual process that requires a

318 skilled human to combine multiple visual cues over a large area of the cortical

319 surface. A panel of marker genes will allow the development of an ISH protocol

320 that will allow experimenters to more easily identify which anatomical areas are

321 present in small samples of cortex.

322 For each cortical layer, a layer-specific flat dataset will be created. A single

323 combined flat dataset will be created which averages information from all of

324 the layers. These datasets will be made available in both MATLAB and Caret

325 formats.

326 ___________________________________________________________

327 New techniques allow the expression levels of many genes at many locations

328 to be compared. It is thought that even neighboring anatomical structures have

329 different gene expression profiles. We propose to develop automated methods

330 to relate the spatial variation in gene expression to anatomy. We will develop

331 two kinds of techniques:

332 (a) techniques to screen for combinations of marker genes which selectively

333 target anatomical structures

334 (b) techniques to suggest new ways of dividing a structure up into anatomical

335 subregions, based on the shapes of contours in the gene expression

336 The first kind of technique will be helpful for finding marker genes associated

337 with known anatomical features. The second kind of technique will be helpful in

338 creating new anatomical maps, maps which reflect differences in gene expression

339 the same way that existing maps reflect differences in histology.

340 We intend to develop our techniques using the adult mouse cerebral cortex

341 as a testbed. The Allen Brain Atlas has collected a dataset containing the

342 expression level of about 4000 genes* over a set of over 150000 voxels, with a

343 spatial resolution of approximately 200 microns[?].

346 We expect to discover sets of marker genes that pick out specific cortical

347 areas. This will allow the development of drugs and other interventions that

348 selectively target individual cortical areas. Therefore our research will lead

349 to application in drug discovery, in the development of other targeted clinical

350 interventions, and in the development of new experimental techniques.

351 The best way to divide up rodent cortex into areas has not been completely

352 determined, as can be seen by the differences in the recent maps given by Swan-

353 son on the one hand, and Paxinos and Franklin on the other. It is likely that our

354 study, by showing which areal divisions naturally follow from gene expression

355 data, as opposed to traditional histological data, will contribute to the creation

356 of a better map. While we do not here propose to analyze human gene expres-

357 sion data, it is conceivable that the methods we propose to develop could be

358 used to suggest modifications to the human cortical map as well.

359 In the following, we will only be talking about coronal data.

360 The Allen Brain Atlas provides “Smoothed Energy Volumes”, which are

361 One type of artifact in the Allen Brain Atlas data is what we call a “slice

362 artifact”. We have noticed two types of slice artifacts in the dataset. The first

363 type, a “missing slice artifact”, occurs when the ISH procedure on a slice did

364 not come out well. In this case, the Allen Brain investigators excluded the slice

365 at issue from the dataset. This means that no gene expression information is

366 available for that gene for the region of space covered by that slice. This results

367 in an expression level of zero being assigned to voxels covered by the slice. This

368 is partially but not completely ameliorated by the smoothing that is applied to

369 create the Smoothed Energy Volumes. The usual end result is that a region of

370 space which is shaped and oriented like a coronal slice is marked as having less

371 gene expression than surrounding regions.
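
One simple way to look for candidate missing-slice artifacts is to flag coronal slices whose mean expression is much lower than that of their neighbors. The sketch below is only an illustration under stated assumptions: the volume is assumed to be indexed as volume[slice][row][column] with the first index running along the rostrocaudal axis, and the function name and the 0.5 threshold are hypothetical. A flagged slice could also reflect genuinely low expression, so flags would need manual review.

```python
from typing import List

Volume = List[List[List[float]]]  # assumed layout: volume[slice][row][column]


def slice_means(volume: Volume) -> List[float]:
    """Mean expression value of every coronal slice."""
    means = []
    for coronal in volume:
        values = [v for row in coronal for v in row]
        means.append(sum(values) / len(values))
    return means


def flag_low_slices(volume: Volume, ratio: float = 0.5) -> List[int]:
    """Indices of interior slices whose mean is below `ratio` times the
    average of the two adjacent slices' means (hypothetical criterion)."""
    means = slice_means(volume)
    flagged = []
    for i in range(1, len(means) - 1):
        neighbor_mean = (means[i - 1] + means[i + 1]) / 2.0
        if neighbor_mean > 0 and means[i] < ratio * neighbor_mean:
            flagged.append(i)
    return flagged


# Hypothetical volume: five 2x2 slices, with slice 2 dimmed as if it were missing.
normal = [[1.0, 1.0], [1.0, 1.0]]
dimmed = [[0.2, 0.2], [0.2, 0.2]]
toy_volume = [normal, normal, dimmed, normal, normal]
suspect = flag_low_slices(toy_volume)
```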

372 The second type of slice artifact is caused by the fact that all of the slices

373 have a consistent orientation. Since there may be artifacts (such as how well

374 the ISH worked) which are constant within each slice but which vary between

375 different slices, the result is that ceteris paribus, when one compares the genetic

376 data of a voxel to another voxel within the same coronal plane, one would expect

377 to find more similarity than if one compared a voxel to another voxel displaced

378 along the rostrocaudal axis.

379 We are enthusiastic about the sharing of methods, data, and results, and

380 at the conclusion of the project, we will make all of our data and computer

381 source code publicly available. Our goal is that replicating our results, or

382 applying the methods we develop to other targets, will be quick and easy for

383 other investigators. In order to aid in understanding and replicating our results,

384 we intend to include a software program which, when run, will take as input

385 the Allen Brain Atlas raw data, and produce as output all numbers and charts

386 found in publications resulting from the project.

387 To aid in the replication of our results, we will include a script which takes

388 as input the dataset in aim (3) and provides as output all of the tables and figures

389 in our publications.

390 We also expect to weigh in on the debate about how to best partition rodent

391 cortex

394 be useful for drug discovery as well

395 * Another 16000 genes are available, but they do not cover the entire cerebral

396 cortex with high spatial resolution.

397 User-definable ROIs Combinatorial gene expression Negative as well as pos-

398 itive signal Use geometry Search for local boundaries if necessary Flatmapped

399 Specific aims

400 Develop algorithms that find genetic markers for anatomical regions

401 1. Develop scoring measures for evaluating how good individual genes are at

402 marking areas: we will compare pointwise, geometric, and information-

403 theoretic measures.

404 2. Develop a procedure to find single marker genes for anatomical regions: for

405 each cortical area, by using or combining the scoring measures developed,

406 we will rank the genes by their ability to delineate each area.

407 3. Extend the procedure to handle difficult areas by using combinatorial cod-

408 ing: for areas that cannot be identified by any single gene, identify them

409 with a handful of genes. We will consider both (a) algorithms that incre-

410 mentally/greedily combine single gene markers into sets, such as forward

411 stepwise regression and decision trees, and also (b) supervised learning

412 techniques which use soft constraints to minimize the number of features,

413 such as sparse support vector machines.

414 4. Extend the procedure to handle difficult areas by combining or redrawing

415 the boundaries: An area may be difficult to identify because the bound-

416 aries are misdrawn, or because it does not “really” exist as a single area,

417 at least on the genetic level. We will develop extensions to our procedure

418 which (a) detect when a difficult area could be fit if its boundary were

419 redrawn slightly, and (b) detect when a difficult area could be combined

420 with adjacent areas to create a larger area which can be fit.

421 Apply these algorithms to the cortex

422 1. Create open source format conversion tools: we will create tools to bulk

423 download the ABA dataset and to convert between SEV, NIFTI and MAT-

424 LAB formats.

425 2. Flatmap the ABA cortex data: map the ABA data onto a plane and draw

426 the cortical area boundaries onto it.

427 3. Find layer boundaries: cluster similar voxels together in order to auto-

428 matically find the cortical layer boundaries.

429 4. Run the procedures that we developed on the cortex: we will present, for

430 each area, a short list of markers to identify that area; and we will also

433 present lists of “panels” of genes that can be used to delineate many areas

434 at once.

435 Develop algorithms to suggest a division of a structure into anatom-

436 ical parts

437 1. Explore dimensionality reduction algorithms applied to pixels: including

438 TODO

439 2. Explore dimensionality reduction algorithms applied to genes: including

440 TODO

441 3. Explore clustering algorithms applied to pixels: including TODO

442 4. Explore clustering algorithms applied to genes: including gene shaving,

443 TODO

444 5. Develop an algorithm to use dimensionality reduction and/or hierarchical

445 clustering to create anatomical maps

446 6. Run this algorithm on the cortex: present a hierarchical, genoarchitectonic

447 map of the cortex

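448 Gradient similarity is calculated as:

449 $$\sum_{\text{pixels}} \cos\left(\left|\angle\nabla_1 - \angle\nabla_2\right|\right) \cdot \frac{|\nabla_1| + |\nabla_2|}{2} \cdot \frac{\text{pixel\_value}_1 + \text{pixel\_value}_2}{2}$$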
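
The gradient similarity expression above can be sketched in code as follows. This is one reading of the formula, not a definitive implementation: it assumes that each pixel contributes the cosine of the angle between the two gradients, times the mean gradient magnitude, times the mean pixel value. The central-difference gradient, the restriction to interior pixels, and all function names are assumptions made for illustration.

```python
import math
from typing import List, Tuple

Grid = List[List[float]]


def gradient(img: Grid, r: int, c: int) -> Tuple[float, float]:
    """Central-difference gradient (d/dcolumn, d/drow) at an interior pixel."""
    gx = (img[r][c + 1] - img[r][c - 1]) / 2.0
    gy = (img[r + 1][c] - img[r - 1][c]) / 2.0
    return gx, gy


def gradient_similarity(a: Grid, b: Grid) -> float:
    """Sum over interior pixels of
    cos(|angle1 - angle2|) * mean gradient magnitude * mean pixel value."""
    total = 0.0
    for r in range(1, len(a) - 1):
        for c in range(1, len(a[0]) - 1):
            ax, ay = gradient(a, r, c)
            bx, by = gradient(b, r, c)
            # atan2(0, 0) is 0, so a zero gradient is treated as pointing along +x.
            angle_diff = abs(math.atan2(ay, ax) - math.atan2(by, bx))
            mean_magnitude = (math.hypot(ax, ay) + math.hypot(bx, by)) / 2.0
            mean_value = (a[r][c] + b[r][c]) / 2.0
            total += math.cos(angle_diff) * mean_magnitude * mean_value
    return total


# Hypothetical 3x4 images: a left-to-right ramp and its mirror image.
ramp = [[0.0, 1.0, 2.0, 3.0] for _ in range(3)]
reversed_ramp = [[3.0, 2.0, 1.0, 0.0] for _ in range(3)]

same_direction = gradient_similarity(ramp, ramp)
opposite_direction = gradient_similarity(ramp, reversed_ramp)
```

Gradients that point the same way give a positive contribution and opposed gradients give a negative one, so the ramp compared with itself scores positive and the ramp compared with its mirror image scores negative.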

453 (todo) Technically, we say that an anatomical structure has a fundamen-

454 tally 2-D organization when there exists a commonly used, generic, anatomical

455 structure-preserving map from 3-D space to a 2-D manifold.

456 Related work:

457 The Allen Brain Institute has developed an interactive web interface called

458 AGEA which allows an investigator (1) to calculate lists of genes which are

459 selectively overexpressed in certain anatomical regions (ABA calls this the “Gene

460 Finder” function), (2) to visualize the correlation between the genetic profiles of

461 voxels in the dataset, and (3) to visualize a hierarchical clustering of voxels in

462 the dataset [?]. AGEA is an impressive and useful tool; however, it does not

463 solve the same problems that we propose to solve with this project.

464 First we describe AGEA’s “Gene Finder”, and then compare it to our pro-

465 posed method for finding marker genes. AGEA’s Gene Finder first asks the

466 investigator to select a single “seed voxel” of interest. It then uses a clustering

467 method, combined with built-in knowledge of major anatomical structures, to

468 select two sets of voxels: an “ROI” and a “comparator region”*. The seed voxel

469 is always contained within the ROI, and the ROI is always contained within the

470 comparator region. The comparator region is similar but not identical to the

471 set of voxels making up the major anatomical region containing the ROI. Gene

472 Finder then looks for genes which can distinguish the ROI from the comparator

473 region. Specifically, it finds genes for which the ratio (expression energy in the

474 ROI) / (expression energy in the comparator region) is high.

477 Informally, the Gene Finder first infers an ROI based on clustering the seed

478 voxel with other voxels. Then, the Gene Finder finds genes which overexpress

479 in the ROI as compared to other voxels in the major anatomical region.
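The ratio test described above can be sketched as follows. This is a minimal illustration, not AGEA's implementation: it assumes that "expression energy in a region" means the mean energy over the region's voxels, and the names `mean_energy`, `energy_ratio`, and `rank_genes` are hypothetical.

```python
def mean_energy(energy, voxels):
    """Mean expression energy of one gene over a set of voxels."""
    return sum(energy[v] for v in voxels) / len(voxels)

def energy_ratio(energy, roi, comparator):
    """(expression energy in the ROI) / (expression energy in the comparator region)."""
    comparator_mean = mean_energy(energy, comparator)
    if comparator_mean == 0:
        return 0.0  # gene is silent in the whole comparator region, so it cannot rank
    return mean_energy(energy, roi) / comparator_mean

def rank_genes(expression, roi, comparator):
    """Return gene names sorted from highest to lowest ROI/comparator ratio."""
    ratios = {gene: energy_ratio(e, roi, comparator) for gene, e in expression.items()}
    return sorted(ratios, key=ratios.get, reverse=True)

# Toy data (hypothetical): four voxels, the ROI is contained in the comparator region.
roi = {0, 1}
comparator = {0, 1, 2, 3}
expression = {
    "geneA": {0: 4.0, 1: 4.0, 2: 0.0, 3: 0.0},  # overexpressed in the ROI
    "geneB": {0: 1.0, 1: 1.0, 2: 1.0, 3: 1.0},  # uniform expression
}
ranking = rank_genes(expression, roi, comparator)
```

Note that this score is pointwise: it looks only at per-voxel energies and ignores where the voxels sit relative to each other.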

480 There are three major differences between our approach and Gene Finder.

481 First, Gene Finder focuses on individual genes and individual ROIs in isola-

482 tion. This is great for regions which can be picked out from all other regions by a

483 single gene, but not all of them can (todo). There are at least two ways this can

484 miss out on useful genes. First, a gene might express in part of a region but not

485 throughout the whole region, while another gene expresses

486 in the rest of the region**. Second, a gene might express in a region but not in

487 any of its neighbors, while also expressing in other, non-neighboring regions.

488 To take advantage of these types of genes, we propose to find combinations of

489 genes which, together, can identify the boundaries of all subregions within the

490 containing region.
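A toy example makes the two cases concrete. It is only a sketch under assumptions: each gene is reduced to the set of voxels where its expression exceeds a threshold, agreement with the target region is scored by Jaccard overlap (a choice made here for illustration), and all names and data are hypothetical.

```python
def expressed_voxels(energy, threshold=0.5):
    """Set of voxels in which a gene's expression energy exceeds a threshold."""
    return {voxel for voxel, value in energy.items() if value > threshold}

def jaccard(predicted, target):
    """Overlap score in [0, 1]; 1.0 means the two voxel sets are identical."""
    union = predicted | target
    return len(predicted & target) / len(union) if union else 1.0

# Toy data (hypothetical): voxels 1-4 form the target region; 7 and 9 lie elsewhere.
region = {1, 2, 3, 4}
gene_a = {1: 1.0, 2: 1.0, 3: 0.0, 4: 0.0, 7: 0.0, 9: 0.0}  # covers half the region
gene_b = {1: 0.0, 2: 0.0, 3: 1.0, 4: 1.0, 7: 0.0, 9: 0.0}  # covers the other half
gene_c = {1: 1.0, 2: 1.0, 3: 1.0, 4: 1.0, 7: 0.0, 9: 1.0}  # region plus distant voxel 9
gene_d = {1: 1.0, 2: 1.0, 3: 1.0, 4: 1.0, 7: 1.0, 9: 0.0}  # region plus distant voxel 7

# Case 1: neither gene alone marks the region, but their union does.
union_ab = expressed_voxels(gene_a) | expressed_voxels(gene_b)
# Case 2: each gene also expresses elsewhere, but their intersection does not.
intersection_cd = expressed_voxels(gene_c) & expressed_voxels(gene_d)

single_gene_score = jaccard(expressed_voxels(gene_a), region)
union_score = jaccard(union_ab, region)
intersection_score = jaccard(intersection_cd, region)
```

Here no single gene matches the region exactly, while the union in the first case and the intersection in the second case both do.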

491 Second, Gene Finder uses a pointwise metric, namely expression energy ratio,

492 to decide whether a gene is good for picking out a region. We have found better

493 results by using metrics which take into account not just single voxels, but also

494 the local geometry of neighboring voxels, such as the local gradient (todo). In

495 addition, we have found that often the absence of gene expression can be used

496 as a marker, which will not be caught by Gene Finder’s expression energy ratio

497 (todo).

498 Third, Gene Finder chooses the ROI based only on the seed voxel. This

499 often does not permit the user to query the ROI that they are interested in. For

500 example, in all of our tests of Gene Finder in cortex, the ROIs chosen tended to

501 be cortical layers, rather than cortical areas.

502 In summary, when Gene Finder picks the ROI that you want, and when this

503 ROI can be easily picked out from neighboring regions by single genes which

504 selectively overexpress in the ROI compared to the entire major anatomical re-

505 gion, Gene Finder will work. However, Gene Finder will not pick cortical areas

506 as ROIs, and even if it could, many cortical areas cannot be uniquely picked out

507 by the overexpression of any single gene. By contrast, we will target cortical

508 areas, we will explore a variety of metrics which can compensate for the shortcom-

509 ings of expression energy ratio, and we will use the combinatorial expression of

510 genes to pick out cortical areas even when no individual gene will do.

511 * The terms “ROI” and “comparator region” are our own; the ABI calls

512 them the “local region” and the “larger anatomical context”. The ABI uses the

513 term “specificity comparator” to mean the major anatomic region containing

514 the ROI, which is not exactly identical to the comparator region.

515 ** In this case, the union of the area of expression of the two genes would

516 suffice; one could also imagine that there could be situations in which the in-

517 tersection of multiple genes would be needed, or a combination of unions and

518 intersections.

519 Now we describe AGEA’s hierarchical clustering, and compare it to our pro-

520 posal. The goal of AGEA’s hierarchical clustering is to generate a binary tree of

521 clusters, where a cluster is a collection of voxels. AGEA begins by computing

522 the Pearson correlation between each pair of voxels. They then employ a recursive

525 divisive (top-down) hierarchical clustering procedure on the voxels, which

526 means that they start with all of the voxels, and then they divide them into clus-

527 ters, and then within each cluster, they divide that cluster into smaller clusters,

528 etc***. At each step, the collection of voxels is partitioned into two smaller

529 clusters in a way that maximizes the following quantity: average correlation

530 between all possible pairs of voxels containing one voxel from each cluster.
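The two ingredients named above, pairwise Pearson correlation between voxels and the average correlation across a two-way split, can be sketched as follows. This is an illustration only, not AGEA's code: it evaluates the quantity for one given split and does not search over splits, and the function names are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length gene expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # assumption: treat a constant profile as uncorrelated
    return cov / (sd_x * sd_y)

def mean_cross_correlation(profiles, cluster_a, cluster_b):
    """Average correlation over all voxel pairs with one voxel from each cluster."""
    pairs = [(i, j) for i in cluster_a for j in cluster_b]
    return sum(pearson(profiles[i], profiles[j]) for i, j in pairs) / len(pairs)

# Toy data (hypothetical): three voxels, each with a three-gene expression profile.
profiles = [
    [1.0, 2.0, 3.0],  # voxel 0
    [2.0, 4.0, 6.0],  # voxel 1: same pattern as voxel 0
    [3.0, 2.0, 1.0],  # voxel 2: reversed pattern
]
split_value = mean_cross_correlation(profiles, [0], [1, 2])
```

A full divisive procedure would evaluate candidate splits of each cluster with a quantity like this and recurse on the two resulting clusters.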

531 There are three major differences between our approach and AGEA’s hier-

532 archical clustering. First, AGEA’s clustering method separates cortical layers

533 before it separates cortical areas.

534 following procedure is used for the purpose of dividing a collection of voxels

535 into smaller clusters: partition the voxels into two sets, such that the following

536 quantity is maximized:

537 *** depending on which level of the tree is being created, the voxels are

538 subsampled in order to save time

539 does not allow the user to input anything other than a seed voxel; this means

540 that for each seed voxel, there is only one

541 The role of the “local region” is to serve as a region of interest for which

542 marker genes are desired; the role of the “larger anatomical context” is to be

543 the structure

544 There are two kinds of differences between AGEA and our project; differ-

545 ences that relate to the treatment of the cortex, and differences in the type of

546 generalizable methods being developed. As relates

547 indicate an ROI

558 —

559 The Allen Brain Institute has developed interactive tools called AGEA which

560 allow an investigator to explore simple correlation-based relationships between

561 voxels, genes, and clusters of voxels. There have not yet been any studies

562 which describe the results of applying AGEA to the cerebral cortex; however,

563 we suspect that the AGEA metrics are not optimal for the task of relating

564 genes to cortical areas. A voxel’s gene expression profile depends upon both

565 its cortical area and its cortical layer; however, AGEA has no mechanism to

566 distinguish these two. As a result, voxels in the same layer but different areas

567 are often clustered together by AGEA. As part of the project, we will compare

568 the performance of our techniques against AGEA’s.

569 Another difference between our techniques and AGEA’s is that AGEA allows

570 the user to enter only a voxel location, and then to either explore the rest of

573 the brain’s relationship to that particular voxel, or explore a partitioning of

574 the brain based on pairwise voxel correlation. If the user is interested not in a

575 single voxel, but rather an entire anatomical structure, AGEA will only succeed

576 to the extent that the selected voxel is a typical representative of the structure.

577 As discussed in the previous paragraph, this poses problems for structures like

578 cortical areas, which (because of their division into cortical layers) do not have

579 a single “typical representative”.

580 By contrast, in our system, the user will start by selecting, not a single voxel,

581 but rather, an anatomical superstructure to be divided into pieces (for example,

582 the cerebral cortex). We expect that our methods will take into account not

583 just pairwise statistics between voxels, but also large-scale geometric features

584 (for example, the rapidity of change in gene expression as regional boundaries

585 are crossed) which optimize the discriminability of regions within the selected

586 superstructure.

587 —–

588 screen for combinations of marker genes which selectively target anatom-

589 ical structures pick delineate the boundaries between neighboring anatomical

590 structures. (b) techniques to screen for marker genes which pick out anatomical

591 structures of interest

592 , techniques which: (a) screen for marker genes , and (b) suggest new

593 anatomical maps based on

594 whose expression partitions the region of interest into its anatomical sub-

595 structures, and (b) use the natural contours of gene expression to suggest new

596 ways of dividing an organ into

597 The Allen Brain Atlas

598 –

599 to: brooksl@mail.nih.gov

600 Hi, I’m writing to confirm the applicability of a potential research project to

601 the challenge grant topic “New computational and statistical methods for the

602 analysis of large data sets from next-generation sequencing technologies”.

603 We want to develop methods for the analysis of gene expression datasets that

604 can be used to uncover the relationships between gene expression and anatomical

605 regions. Specifically, we want to develop techniques to (a) given a set of known

606 anatomical areas, identify genetic markers for each of these areas, and (b) given

607 an anatomical structure whose substructure is unknown, suggest a map, that

608 is, a division of the space into anatomical sub-structures, that represents the

609 boundaries inherent in the gene expression data.

610 We propose to develop our techniques on the Allen Brain Atlas mouse brain

611 gene expression dataset by finding genetic markers for anatomical areas within

612 the cerebral cortex. The Allen Brain Atlas contains a registered 3-D map of

613 gene expression data with 200-micron voxel resolution which was created from

614 in situ hybridization data. The dataset contains about 4000 genes which are

615 available at this resolution across the entire cerebral cortex.

616 Despite the distinct roles of different cortical areas in both normal function-

617 ing and disease processes, there are no known marker genes for many cortical

618 areas. This project will be immediately useful for both drug discovery and clini-

621 cal research because once the markers are known, interventions can be designed

622 which selectively target specific cortical areas.

623 The techniques we develop will be useful because they will be applicable to

624 the analysis of other anatomical areas, both in terms of finding marker genes

625 for known areas, and in terms of suggesting new anatomical subdivisions that

626 are based upon the gene expression data.

627 _______________________________

628 It is likely that our study, by showing which areal divisions naturally fol-

629 low from gene expression data, as opposed to traditional histological data, will

630 contribute to the creation of

631 there are clear genetic or chemical markers known for only a few cortical

632 areas. This makes it difficult to target drugs to specific

633 As part of aims (1) and (5), we will discover sets of marker genes that pick

634 out specific cortical areas. This will allow the development of drugs and other

635 interventions that selectively target individual cortical areas. As part of aims

636 (2) and (5), we will also discover small panels of marker genes that can be used

637 to delineate most of the cortical areal map.

638 With aims (2) and (4), we

639 There are five principles

640 In addition to validating the usefulness of the algorithms, the application of

641 these methods to cerebral cortex will produce immediate benefits that are only

642 one step removed from clinical application.

643 todo: remember to check gensat, etc for validation (mention bias/variance)

644 Why it is useful to apply these methods to cortex

645 There is still room for debate as to exactly how the cortex should be parcellated

646 into areas.

647 The best way to divide up rodent cortex into areas has not been completely

648 determined,

649 not yet been accounted for in

650 that the expression of some genes will contain novel spatial patterns which

651 are not account

652 that a genoarchitectonic map

653 This principle is only applicable to aim 1 (marker genes). For aim 2 (partition

654 a structure into anatomical subregions), we plan to work with many genes at

655 once.

656 todo: aim 2 b+s?

657 Principle 5: Interoperate with existing tools

658 In order for our software to be as useful as possible for our users, it will be

659 able to import and export data to standard formats so that users can use our

660 software in tandem with other software tools created by other teams. We will

661 support the following formats: NIFTI (Neuroimaging Informatics Technology

664 Initiative), SEV (Allen Brain Institute Smoothed Energy Volume), and MAT-

665 LAB. This ensures that our users will not have to exclusively rely on our tools

666 when analyzing data. For example, users will be able to use the data visualiza-

667 tion and analysis capabilities of MATLAB and Caret alongside our software.

668 To our knowledge, there is no currently available software to convert between

669 these formats, so we will also provide a format conversion tool. This may be

670 useful even for groups that don’t use any of our other software.

671 todo: is “marker gene” even a phrase that we should use at all?

672 note for aim 1 apps: combo of genes is for voxel, not within any single cell

673 , as when genetic markers allow the development of selective interventions;

674 the reason that one can be confident that the intervention is selective is that it

675 is only turned on when a certain combination of genes is turned on and off. The

676 result procedure is what assures us that when that combination is present, the

677 local tissue is probably part of a certain subregion.

678 The basic idea is that we want to find a procedure by

719 it is helpful for the classifier to look at the global “shape” of gene expression

720 patterns over the whole structure, rather than just nearby voxels.

721 there is some small bit of additional information that can be gleaned from

722 knowing the

723 Design choices for a supervised learning procedure

724 After all,

725 there is a small correlation between the gene expression levels from distant

726 voxels and

727 Depending on how we intend to use the classifier, we may want to design it

728 so that

729 It is possible for many things to

730 The choice of which data is made part of an instance

731 what we seek is a procedure

732 partition the tissue sample into subregions.

733 each part of the anatomical structure

734 must be One way to rephrase this task is to say that, instead of searching

735 for the location of the subregions, we are looking to partition the tissue sample

736 into subregions.

737 There are various ways to look for marker genes. We will define some terms,

738 and along the way we will describe a few design choices encountered in the

739 process of creating a marker gene finding method, and then we will present four

740 principles that describe which options we have chosen.

741 In developing a procedure for finding marker genes, we are developing a

742 procedure that takes a dataset of experimental observations and produces a

743 result. One can think of the result as merely a list of genes, but really the result

744 is an understanding of a predictive relationship between, on the one hand, the

745 expression levels of genes, and, on the other hand, anatomical subregions.

746 One way to more formally define this understanding is to look at it as a

747 procedure. In this view, the result of the learning procedure is itself a procedure.

748 The result procedure provides a way to use the gene expression profiles of voxels

749 in a tissue sample in order to determine where the subregions are.

750 This result procedure can be used directly, as when an experimenter has

751 a tissue sample and needs to know what subregions are present in it, and,

752 if multiple subregions are present, where they each are. Or it can be used

753 indirectly; imagine that the result procedure tells us that whenever a certain

754 combination of genes is expressed, the local tissue is probably part of a certain

757 subregion. This means that we can then confidently develop an intervention

758 which is triggered only when that combination of genes is expressed; and to

759 the extent that the result procedure is reliable, we know that the intervention

760 will only be triggered in the target subregion.

761 We said that the result procedure provides “a way to use the gene expression

762 profiles of voxels in a tissue sample” in order to “determine where the subregions

763 are”.

764 Does the result procedure get as input all of the gene expression profiles

765 of each voxel in the entire tissue sample, and produce as output all of the

766 subregional boundaries all at once?

767 Or are we given one voxel at a time,

768 In the jargon of the field of machine learning, the result procedure is called

769 a classifier.

770 The task of finding genes that mark anatomical areas can be phrased in

771 terms of what the field of machine learning calls a “supervised learning” task.

772 The goal of this task is to learn a function (the “classifier”) which

773 If a person knows a combination of genes that mark an area, that implies

774 that the person can be told how strongly those genes express in any voxel, and

775 the person can use this information to determine how

776 finding how to infer the areal identity of a voxel if given the gene expression

777 profile of that voxel.
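As an illustration of the supervised learning framing, the sketch below trains a classifier that maps a voxel's gene expression profile to an areal label. The nearest-centroid rule is a stand-in chosen here for brevity, since the text does not commit to a particular classifier, and the function names and toy data are hypothetical.

```python
def train_centroids(profiles, labels):
    """Learn one mean expression profile (centroid) per anatomical label."""
    sums, counts = {}, {}
    for profile, label in zip(profiles, labels):
        acc = sums.setdefault(label, [0.0] * len(profile))
        for k, value in enumerate(profile):
            acc[k] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}

def classify(centroids, profile):
    """Assign a voxel to the label whose centroid is closest (squared Euclidean distance)."""
    def sq_dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(centroid, profile))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))

# Toy training set (hypothetical): two-gene profiles for voxels in areas "A" and "B".
train_profiles = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
train_labels = ["A", "A", "B", "B"]
centroids = train_centroids(train_profiles, train_labels)

predicted_first = classify(centroids, [0.8, 0.2])
predicted_second = classify(centroids, [0.2, 0.8])
```

The training step plays the role of the learning procedure, and the returned `classify` rule plays the role of the result procedure applied to new voxels.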

778 For each voxel in the cortex, we want to start with data about the gene

779 expression

780 single voxels, but rather groups of voxels, such that the groups can be placed

781 in some 2-D space. We will call such instances “pixels”.

782 We have been speaking as if instances necessarily correspond to single voxels.

783 But it is possible for instances to be groupings of many voxels, in which case

784 each grouping must be assigned the same label (that is, each voxel grouping

785 must stay inside a single anatomical subregion).

786 In some but not all cases, the groups are either rows or columns of voxels.

787 This is the case with the cerebral cortex, in which one may assume that columns

788 of voxels which run perpendicular to the cortical surface all share the same areal

789 identity. In the cortex, we call such an instance a “surface pixel”, because such

790 an instance represents the data associated with all voxels underneath a specific

791 patch of the cortical surface.
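A surface pixel can be sketched as a summary of one column of voxels. The example assumes the data have already been arranged so that `volume[x][y]` lists the voxel values in the column beneath surface position (x, y), and it uses the column mean as the summary; both choices, and the function names, are assumptions made for illustration.

```python
def surface_pixels(volume):
    """Collapse each column of voxels into one surface pixel (mean over depth)."""
    return [[sum(column) / len(column) for column in row] for row in volume]

def column_label(labels_in_column):
    """Return the shared areal label of a column, or fail if the column mixes labels."""
    distinct = set(labels_in_column)
    if len(distinct) != 1:
        raise ValueError("column crosses an areal boundary: %r" % sorted(distinct))
    return distinct.pop()

# Toy data (hypothetical): a 2x2 patch of cortical surface, two voxels deep.
volume = [
    [[1.0, 3.0], [2.0, 2.0]],
    [[0.0, 0.0], [4.0, 8.0]],
]
flat_map = surface_pixels(volume)
label = column_label(["V1", "V1"])
```

The `column_label` check encodes the requirement that every voxel in a grouping carries the same label, so that each surface pixel can be treated as a single instance.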
