cg
changeset 106:ffa1390e4f39
.
author | bshanks@bshanks.dyndns.org |
---|---|
date | Wed Apr 22 14:51:24 2009 -0700 |
parents | 6c48f37d0f0c |
children | f26370dc719b |
files | grant.html grant.odt grant.pdf grant.txt |
line diff
1.1 --- a/grant.html Wed Apr 22 07:39:32 2009 -0700
1.2 +++ b/grant.html Wed Apr 22 14:51:24 2009 -0700
1.3 @@ -1,9 +1,13 @@
1.4 Specific aims
1.5 -Massive new datasets obtained with techniques such as in situ hybridization (ISH), immunohistochemistry, in
1.6 -situ transgenic reporter, microarray voxelation, and others, allow the expression levels of many genes at many
1.7 -locations to be compared. Our goal is to develop automated methods to relate spatial variation in gene expres-
1.8 -sion to anatomy. We want to find marker genes for specific anatomical regions, and also to draw new anatomical
1.9 -maps based on gene expression patterns. We have three specific aims:
1.10 +Massive new datasets obtained with techniques such as in situ hybridization (ISH), immunohistochemistry, in situ
1.11 +transgenic reporter, microarray voxelation, and others, allow the expression levels of many genes at many loca-
1.12 +tions to be compared. Our goal is to develop automated methods to relate spatial variation in gene expression to
1.13 +anatomy. We want to find marker genes for specific anatomical regions, and also to draw new anatomical maps
1.14 +based on gene expression patterns. We will validate these methods by applying them to 46 anatomical areas
1.15 +within the cerebral cortex, by using the Allen Mouse Brain Atlas coronal dataset (ABA). This gene expression
1.16 +dataset was generated using ISH, and contains over 4,000 genes. For each gene, a digitized 3-D raster of the
1.17 +expression pattern is available: for each gene, the level of expression at each of 51,533 voxels is recorded.
1.18 +We have three specific aims:
1.19 (1) develop an algorithm to screen spatial gene expression data for combinations of marker genes which
1.20 selectively target anatomical regions
1.21 (2) develop an algorithm to suggest new ways of carving up a structure into anatomically distinct regions,
1.22 @@ -126,23 +130,23 @@
1.23 as gradient similarity, which is discussed in Preliminary Studies) may be necessary in order to achieve the best
1.24 results in this application.
1.25 We now turn to efforts to find marker genes using spatial gene expression data using automated methods.
1.26 -GeneAtlas[5] and EMAGE [26] allow the user to construct a search query by demarcating regions and then
1.27 +GeneAtlas[3] and EMAGE [19] allow the user to construct a search query by demarcating regions and then
1.28 specifying either the strength of expression or the name of another gene or dataset whose expression pattern
1.29 is to be matched. Neither GeneAtlas nor EMAGE allow one to search for combinations of genes that define a
1.30 region in concert but not separately.
1.31 -[15 ] describes AGEA, ”Anatomic Gene Expression Atlas”. AGEA has three components. Gene Finder: The
1.32 +[12 ] describes AGEA, ”Anatomic Gene Expression Atlas”. AGEA has three components. Gene Finder: The
1.33 user selects a seed voxel and the system (1) chooses a cluster which includes the seed voxel, (2) yields a list of
1.34 genes which are overexpressed in that cluster. Correlation: The user selects a seed voxel and the system then
1.35 shows the user how much correlation there is between the gene expression profile of the seed voxel and every
1.36 -other voxel. Clusters: will be described later. [6] looks at the mean expression level of genes within anatomical
1.37 +other voxel. Clusters: will be described later. [4] looks at the mean expression level of genes within anatomical
1.38 regions, and applies a Student’s t-test with Bonferroni correction to determine whether the mean expression
1.39 -level of a gene is significantly higher in the target region. [15] and [6] differ from our Aim 1 in at least three
1.40 -ways. First, [15] and [6] find only single genes, whereas we will also look for combinations of genes. Second,
1.41 -[15 ] and [6] can only use overexpression as a marker, whereas we will also search for underexpression. Third,
1.42 -[15 ] and [6] use scores based on pointwise expression levels, whereas we will also use geometric scores such
1.43 +level of a gene is significantly higher in the target region. [12] and [4] differ from our Aim 1 in at least three
1.44 +ways. First, [12] and [4] find only single genes, whereas we will also look for combinations of genes. Second,
1.45 +[12 ] and [4] can only use overexpression as a marker, whereas we will also search for underexpression. Third,
1.46 +[12 ] and [4] use scores based on pointwise expression levels, whereas we will also use geometric scores such
1.47 as gradient similarity (described in Preliminary Studies). Figures 4, 2, and 3 in the Preliminary Studies section
1.48 contain evidence that each of our three choices is the right one.
1.49 -[10 ] describes a technique to find combinations of marker genes to pick out an anatomical region. They use
1.50 +[8 ] describes a technique to find combinations of marker genes to pick out an anatomical region. They use
1.51 an evolutionary algorithm to evolve logical operators which combine boolean (thresholded) images in order to
1.52 match a target image.
1.53 In summary, there has been fruitful work on finding marker genes, but only one of the previous projects
1.54 @@ -199,20 +203,20 @@
1.55 gene clusters in this fashion.
1.56 Related work
1.57 Some researchers have attempted to parcellate cortex on the basis of non-gene expression data. For example,
1.58 -[18 ], [2 ], [19], and [1] associate spots on the cortex with the radial profile5 of response to some stain ([12] uses
1.59 +[15 ], [2 ], [16], and [1] associate spots on the cortex with the radial profile5 of response to some stain ([10] uses
1.60 MRI), extract features from this profile, and then use similarity between surface pixels to cluster.
1.61 -[23 ] describes an analysis of the anatomy of the hippocampus using the ABA dataset. In addition to manual
1.62 +[18 ] describes an analysis of the anatomy of the hippocampus using the ABA dataset. In addition to manual
1.63 analysis, two clustering methods were employed, a modified Non-negative Matrix Factorization (NNMF), and
1.64 a hierarchical recursive bifurcation clustering scheme based on correlation as the similarity score. The paper
1.65 yielded impressive results, proving the usefulness of computational genomic anatomy. We have run NNMF on
1.66 the cortical dataset
1.67 -AGEA[15] includes a preset hierarchical clustering of voxels based on a recursive bifurcation algorithm with
1.68 -correlation as the similarity metric. EMAGE[26] allows the user to select a dataset from among a large number
1.69 +AGEA[12] includes a preset hierarchical clustering of voxels based on a recursive bifurcation algorithm with
1.70 +correlation as the similarity metric. EMAGE[19] allows the user to select a dataset from among a large number
1.71 of alternatives, or by running a search query, and then to cluster the genes within that dataset. EMAGE clusters
1.72 via hierarchical complete linkage clustering.
1.73 -[6 ] clusters genes. For each cluster, prototypical spatial expression patterns were created by averaging the
1.74 +[4 ] clusters genes. For each cluster, prototypical spatial expression patterns were created by averaging the
1.75 genes in the cluster. The prototypes were analyzed manually, without clustering voxels.
1.76 -[10 ] applies their technique for finding combinations of marker genes for the purpose of clustering genes
1.77 +[8 ] applies their technique for finding combinations of marker genes for the purpose of clustering genes
1.78 around a “seed gene”.
1.79 In summary, although these projects obtained clusterings, there has not been much comparison between
1.80 different algorithms or scoring methods, so it is likely that the best clustering method for this application has not
1.81 @@ -262,36 +266,36 @@
1.82 cortex, and what their arrangement is, are still not completely settled.
1.83 A proposed division of the cortex into areas is called a cortical map.
1.84 In the rodent, the lack of a single agreed-upon map can be seen by
1.85 - contrasting the recent maps given by Swanson[22] on the one hand,
1.86 - and Paxinos and Franklin[17] on the other. While the maps are cer-
1.87 + contrasting the recent maps given by Swanson[17] on the one hand,
1.88 + and Paxinos and Franklin[14] on the other. While the maps are cer-
1.89 tainly very similar in their general arrangement, significant differences
1.90 remain.
1.91 The Allen Mouse Brain Atlas dataset
1.92 - The Allen Mouse Brain Atlas (ABA) data were produced by doing in-
1.93 - situ hybridization on slices of male, 56-day-old C57BL/6J mouse brains.
1.94 - Pictures were taken of the processed slice, and these pictures were
1.95 - semi-automatically analyzed to create a digital measurement of gene
1.96 - expression levels at each location in each slice. Per slice, cellular spa-
1.97 - tial resolution is achieved. Using this method, a single physical slice
1.98 + The Allen Mouse Brain Atlas (ABA) data[11] were produced by do-
1.99 + ing in-situ hybridization on slices of male, 56-day-old C57BL/6J mouse
1.100 + brains. Pictures were taken of the processed slice, and these pictures
1.101 + were semi-automatically analyzed to create a digital measurement of
1.102 + gene expression levels at each location in each slice. Per slice, cellular
1.103 + spatial resolution is achieved. Using this method, a single physical slice
1.104 can only be used to measure one single gene; many different mouse brains were needed in order to measure
1.105 the expression of many genes.
1.106 -An automated nonlinear alignment procedure located the 2D data from the various slices in a single 3D
1.107 -coordinate system. In the final 3D coordinate system, voxels are cubes with 200 microns on a side. There are
1.108 -67x41x58 = 159,326 voxels in the 3D coordinate system, of which 51,533 are in the brain[15].
1.109 -Mus musculus is thought to contain about 22,000 protein-coding genes[28]. The ABA contains data on about
1.110 -20,000 genes in sagittal sections, out of which over 4,000 genes are also measured in coronal sections. Our
1.111 -dataset is derived from only the coronal subset of the ABA7.
1.112 +Mus musculus is thought to contain about 22,000 protein-coding genes[20]. The ABA contains data on
1.113 +about 20,000 genes in sagittal sections, out of which over 4,000 genes are also measured in coronal sections.
1.114 +Our dataset is derived from only the coronal subset of the ABA7. An automated nonlinear alignment procedure
1.115 +located the 2D data from the various slices in a single 3D coordinate system. In the final 3D coordinate system,
1.116 +voxels are cubes with 200 microns on a side. There are 67x41x58 = 159,326 voxels, of which 51,533 are in the
1.117 +brain[12]. For each voxel and each gene, the expression energy[11] within that voxel is made available.
1.118 The ABA is not the only large public spatial gene expression dataset. However, with the exception of the ABA,
1.119 GenePaint, and EMAGE, most of the other resources have not (yet) extracted the expression intensity from the
1.120 ISH images and registered the results into a single 3-D space.
1.121 Related work
1.122 -[15 ] describes the application of AGEA to the cortex. The paper describes interesting results on the structure
1.123 +[12 ] describes the application of AGEA to the cortex. The paper describes interesting results on the structure
1.124 of correlations between voxel gene expression profiles within a handful of cortical areas. However, this sort
1.125 _________________________________________
1.126 6Outside of isocortex, the number of layers varies.
1.127 - 7The sagittal data do not cover the entire cortex, and also have greater registration error[15]. Genes were selected by the Allen
1.128 + 7The sagittal data do not cover the entire cortex, and also have greater registration error[12]. Genes were selected by the Allen
1.129 Institute for coronal sectioning based on, “classes of known neuroscientific interest... or through post hoc identification of a marked
1.130 -non-ubiquitous expression pattern”[15].
1.131 +non-ubiquitous expression pattern”[12].
1.132 of analysis is not related to either of our aims, as it neither finds marker genes, nor does it suggest a cortical
1.133 map based on gene expression data. Neither of the other components of AGEA can be applied to cortical
1.134 areas; AGEA’s Gene Finder cannot be used to find marker genes for the cortical areas; and AGEA’s hierarchical
1.135 @@ -335,11 +339,11 @@
1.136 file formats.
1.137 Flatmap of cortex
1.138 We downloaded the ABA data and applied a mask to select only those voxels which belong to cerebral cortex.
1.139 -We divided the cortex into hemispheres. Using Caret[7], we created a mesh representation of the surface of the
1.140 +We divided the cortex into hemispheres. Using Caret[5], we created a mesh representation of the surface of the
1.141 selected voxels. For each gene, and for each node of the mesh, we calculated an average of the gene expression
1.142 of the voxels “underneath” that mesh node. We then flattened the cortex, creating a two-dimensional mesh. We
1.143 sampled the nodes of the irregular, flat mesh in order to create a regular grid of pixel values. We converted this
1.144 -grid into a MATLAB matrix. We manually traced the boundaries of each of 49 cortical areas from the ABA coronal
1.145 +grid into a MATLAB matrix. We manually traced the boundaries of each of 46 cortical areas from the ABA coronal
1.146 reference atlas slides. We then converted these manual traces into Caret-format regional boundary data on the
1.147 8In both cases, the cause is that pairwise correlations between the gene expression of voxels in different areas but the same layer
1.148 are often stronger than pairwise correlations between the gene expression of voxels in different layers but the same area. Therefore, a
1.149 @@ -517,7 +521,7 @@
1.150 Flatmap cortex and segment cortical layers
1.151 There are multiple ways to flatten 3-D data into 2-D. We will compare
1.152 mappings from manifolds to planes which attempt to preserve size
1.153 - (such as the one used by Caret[7]) with mappings which preserve an-
1.154 + (such as the one used by Caret[5]) with mappings which preserve an-
1.155 gle (conformal maps). Our method will include a statistical test that
1.156 warns the user if the assumption of 2-D structure seems to be wrong.
1.157 We have not yet made use of radial profiles. While the radial pro-
1.158 @@ -608,7 +612,7 @@
1.159 Classifiers We will explore and compare different classifiers. As noted above, this activity is not separate
1.160 from the previous one, because some supervised learning algorithms include feature selection, and any clas-
1.161 sifier can be combined with a stepwise wrapper for use as a feature selection method. We will explore logistic
1.162 -regression (including spatial models[16]), decision trees12, sparse SVMs, generative mixture models (including
1.163 +regression (including spatial models[13]), decision trees12, sparse SVMs, generative mixture models (including
1.164 naive bayes), kernel density estimation, instance-based learning methods (such as k-nearest neighbor), genetic
1.165 algorithms, and artificial neural networks.
1.166 Develop algorithms to suggest a division of a structure into anatomical parts
1.167 @@ -634,7 +638,7 @@
1.168 profiles, the same techniques can be applied instead to the pixels. It is possible that the features generated in
1.169 this way by some dimensionality reduction techniques will directly correspond to interesting spatial regions.
1.170 Clustering and segmentation on pixels We will explore clustering and segmentation algorithms in order to
1.171 -segment the pixels into regions. We will explore k-means, spectral clustering, gene shaving[9], recursive division
1.172 +segment the pixels into regions. We will explore k-means, spectral clustering, gene shaving[7], recursive division
1.173 clustering, multivariate generalizations of edge detectors, multivariate generalizations of watershed transforma-
1.174 tions, region growing, active contours, graph partitioning methods, and recursive agglomerative clustering with
1.175 various linkage functions. These methods can be combined with dimensionality reduction.
1.176 @@ -648,7 +652,7 @@
1.177 reduction step) in order to identify spatial regions. It remains to be seen whether removal of redundancy would
1.178 help or hurt the ultimate goal of identifying interesting spatial regions.
1.179 Co-clustering There are some algorithms which simultaneously incorporate clustering on instances and on
1.180 -features (in our case, genes and pixels), for example, IRM[11]. These are called co-clustering or biclustering
1.181 +features (in our case, genes and pixels), for example, IRM[9]. These are called co-clustering or biclustering
1.182 _________________________________________
1.183 12Actually, we have already begun to explore decision trees. For each cortical area, we have used the C4.5 algorithm to find a decision
1.184 tree for that area. We achieved good classification accuracy on our training set, but the number of genes that appeared in each tree was
1.185 @@ -673,7 +677,7 @@
1.186 combination of genes to seem to identify an area when in fact it is only coincidence. There are two ways we will
1.187 validate our marker genes to guard against this. First, we will confirm that putative combinations of marker genes
1.188 express the same pattern in both hemispheres. Second, we will manually validate our final results on other gene
1.189 -expression datasets such as EMAGE, GeneAtlas, and GENSAT[8].
1.190 +expression datasets such as EMAGE, GeneAtlas, and GENSAT[6].
1.191 Using the methods developed in Aim 2, we will present one or more hierarchical cortical maps. We will identify
1.192 and explain how the statistical structure in the gene expression data led to any unexpected or interesting features
1.193 of these maps, and we will provide biological hypotheses to interpret any new cortical areas, or groupings of
1.194 @@ -708,81 +712,72 @@
1.195 Science, pages 294–301. Springer Berlin / Heidelberg, 2005.
1.196 [2]J. Annese, A. Pitiot, I. D. Dinov, and A. W. Toga. A myelo-architectonic method for the structural classification
1.197 of cortical areas. NeuroImage, 21(1):15–26, 2004.
1.198 -[3]Tanya Barrett, Dennis B. Troup, Stephen E. Wilhite, Pierre Ledoux, Dmitry Rudnev, Carlos Evangelista,
1.199 -Irene F. Kim, Alexandra Soboleva, Maxim Tomashevsky, and Ron Edgar. NCBI GEO: mining tens of millions
1.200 -of expression profiles–database and tools update. Nucl. Acids Res., 35(suppl_1):D760–765, 2007.
1.201 -[4]George W. Bell, Tatiana A. Yatskievych, and Parker B. Antin. GEISHA, a whole-mount in situ hybridization
1.202 -gene expression screen in chicken embryos. Developmental Dynamics, 229(3):677–687, 2004.
1.203 -[5]James P Carson, Tao Ju, Hui-Chen Lu, Christina Thaller, Mei Xu, Sarah L Pallas, Michael C Crair, Joe
1.204 +[3]James P Carson, Tao Ju, Hui-Chen Lu, Christina Thaller, Mei Xu, Sarah L Pallas, Michael C Crair, Joe
1.205 Warren, Wah Chiu, and Gregor Eichele. A digital atlas to characterize the mouse brain transcriptome.
1.206 PLoS Comput Biol, 1(4):e41, 2005.
1.207 -[6]Mark H. Chin, Alex B. Geng, Arshad H. Khan, Wei-Jun Qian, Vladislav A. Petyuk, Jyl Boline, Shawn Levy,
1.208 +[4]Mark H. Chin, Alex B. Geng, Arshad H. Khan, Wei-Jun Qian, Vladislav A. Petyuk, Jyl Boline, Shawn Levy,
1.209 Arthur W. Toga, Richard D. Smith, Richard M. Leahy, and Desmond J. Smith. A genome-scale map of
1.210 expression for a mouse brain section obtained using voxelation. Physiol. Genomics, 30(3):313–321, August
1.211 2007.
1.212 -[7]D C Van Essen, H A Drury, J Dickson, J Harwell, D Hanlon, and C H Anderson. An integrated software suite
1.213 +[5]D C Van Essen, H A Drury, J Dickson, J Harwell, D Hanlon, and C H Anderson. An integrated software suite
1.214 for surface-based analyses of cerebral cortex. Journal of the American Medical Informatics Association:
1.215 JAMIA, 8(5):443–59, 2001. PMID: 11522765.
1.216 -[8]Shiaoching Gong, Chen Zheng, Martin L. Doughty, Kasia Losos, Nicholas Didkovsky, Uta B. Scham-
1.217 +[6]Shiaoching Gong, Chen Zheng, Martin L. Doughty, Kasia Losos, Nicholas Didkovsky, Uta B. Scham-
1.218 bra, Norma J. Nowak, Alexandra Joyner, Gabrielle Leblanc, Mary E. Hatten, and Nathaniel Heintz. A
1.219 gene expression atlas of the central nervous system based on bacterial artificial chromosomes. Nature,
1.220 425(6961):917–925, October 2003.
1.221 -[9]Trevor Hastie, Robert Tibshirani, Michael Eisen, Ash Alizadeh, Ronald Levy, Louis Staudt, Wing Chan,
1.222 +[7]Trevor Hastie, Robert Tibshirani, Michael Eisen, Ash Alizadeh, Ronald Levy, Louis Staudt, Wing Chan,
1.223 David Botstein, and Patrick Brown. ’Gene shaving’ as a method for identifying distinct sets of genes with
1.224 similar expression patterns. Genome Biology, 1(2):research0003.1–research0003.21, 2000.
1.225 -[10]Jano Hemert and Richard Baldock. Matching Spatial Regions with Combinations of Interacting Gene Ex-
1.226 +[8]Jano Hemert and Richard Baldock. Matching Spatial Regions with Combinations of Interacting Gene Ex-
1.227 pression Patterns, volume 13 of Communications in Computer and Information Science, pages 347–361.
1.228 Springer Berlin Heidelberg, 2008.
1.229 -[11]C Kemp, JB Tenenbaum, TL Griffiths, T Yamada, and N Ueda. Learning systems of concepts with an infinite
1.230 +[9]C Kemp, JB Tenenbaum, TL Griffiths, T Yamada, and N Ueda. Learning systems of concepts with an infinite
1.231 relational model. In AAAI, 2006.
1.232 -[12]F. Kruggel, M. K. Brckner, Th. Arendt, C. J. Wiggins, and D. Y. von Cramon. Analyzing the neocortical
1.233 +[10]F. Kruggel, M. K. Brckner, Th. Arendt, C. J. Wiggins, and D. Y. von Cramon. Analyzing the neocortical
1.234 fine-structure. Medical Image Analysis, 7(3):251–264, September 2003.
1.235 -[13]Erh-Fang Lee, Jyl Boline, and Arthur W. Toga. A High-Resolution anatomical framework of the neonatal
1.236 -mouse brain for managing gene expression data. Frontiers in Neuroinformatics, 1:6, 2007. PMC2525996.
1.237 -[14]Susan Magdaleno, Patricia Jensen, Craig L. Brumwell, Anna Seal, Karen Lehman, Andrew Asbury, Tony
1.238 -Cheung, Tommie Cornelius, Diana M. Batten, Christopher Eden, Shannon M. Norland, Dennis S. Rice,
1.239 -Nilesh Dosooye, Sundeep Shakya, Perdeep Mehta, and Tom Curran. BGEM: an in situ hybridization
1.240 -database of gene expression in the embryonic and adult mouse nervous system. PLoS Biology, 4(4):e86
1.241 -EP –, April 2006.
1.242 -[15]Lydia Ng, Amy Bernard, Chris Lau, Caroline C Overly, Hong-Wei Dong, Chihchau Kuan, Sayan Pathak, Su-
1.243 +[11]Ed S. Lein, Michael J. Hawrylycz, Nancy Ao, Mikael Ayres, Amy Bensinger, Amy Bernard, Andrew F. Boe,
1.244 +Mark S. Boguski, Kevin S. Brockway, Emi J. Byrnes, Lin Chen, Li Chen, Tsuey-Ming Chen, Mei Chi Chin,
1.245 +Jimmy Chong, Brian E. Crook, Aneta Czaplinska, Chinh N. Dang, Suvro Datta, Nick R. Dee, Aimee L.
1.246 +Desaki, Tsega Desta, Ellen Diep, Tim A. Dolbeare, Matthew J. Donelan, Hong-Wei Dong, Jennifer G.
1.247 +Dougherty, Ben J. Duncan, Amanda J. Ebbert, Gregor Eichele, Lili K. Estin, Casey Faber, Benjamin A.
1.248 +Facer, Rick Fields, Shanna R. Fischer, Tim P. Fliss, Cliff Frensley, Sabrina N. Gates, Katie J. Glattfelder,
1.249 +Kevin R. Halverson, Matthew R. Hart, John G. Hohmann, Maureen P. Howell, Darren P. Jeung, Rebecca A.
1.250 +Johnson, Patrick T. Karr, Reena Kawal, Jolene M. Kidney, Rachel H. Knapik, Chihchau L. Kuan, James H.
1.251 +Lake, Annabel R. Laramee, Kirk D. Larsen, Christopher Lau, Tracy A. Lemon, Agnes J. Liang, Ying Liu,
1.252 +Lon T. Luong, Jesse Michaels, Judith J. Morgan, Rebecca J. Morgan, Marty T. Mortrud, Nerick F. Mosqueda,
1.253 +Lydia L. Ng, Randy Ng, Geralyn J. Orta, Caroline C. Overly, Tu H. Pak, Sheana E. Parry, Sayan D. Pathak,
1.254 +Owen C. Pearson, Ralph B. Puchalski, Zackery L. Riley, Hannah R. Rockett, Stephen A. Rowland, Joshua J.
1.255 +Royall, Marcos J. Ruiz, Nadia R. Sarno, Katherine Schaffnit, Nadiya V. Shapovalova, Taz Sivisay, Clif-
1.256 +ford R. Slaughterbeck, Simon C. Smith, Kimberly A. Smith, Bryan I. Smith, Andy J. Sodt, Nick N. Stewart,
1.257 +Kenda-Ruth Stumpf, Susan M. Sunkin, Madhavi Sutram, Angelene Tam, Carey D. Teemer, Christina Thaller,
1.258 +Carol L. Thompson, Lee R. Varnam, Axel Visel, Ray M. Whitlock, Paul E. Wohnoutka, Crissa K. Wolkey,
1.259 +Victoria Y. Wong, Matthew Wood, Murat B. Yaylaoglu, Rob C. Young, Brian L. Youngstrom, Xu Feng Yuan,
1.260 +Bin Zhang, Theresa A. Zwingman, and Allan R. Jones. Genome-wide atlas of gene expression in the adult
1.261 +mouse brain. Nature, 445(7124):168–176, 2007.
1.262 +[12]Lydia Ng, Amy Bernard, Chris Lau, Caroline C Overly, Hong-Wei Dong, Chihchau Kuan, Sayan Pathak, Su-
1.263 san M Sunkin, Chinh Dang, Jason W Bohland, Hemant Bokil, Partha P Mitra, Luis Puelles, John Hohmann,
1.264 David J Anderson, Ed S Lein, Allan R Jones, and Michael Hawrylycz. An anatomic gene expression atlas
1.265 of the adult mouse brain. Nat Neurosci, 12(3):356–362, March 2009.
1.266 -[16]Christopher J. Paciorek. Computational techniques for spatial logistic regression with large data sets. Com-
1.267 +[13]Christopher J. Paciorek. Computational techniques for spatial logistic regression with large data sets. Com-
1.268 putational Statistics & Data Analysis, 51(8):3631–3653, May 2007.
1.269 -[17]George Paxinos and Keith B.J. Franklin. The Mouse Brain in Stereotaxic Coordinates. Academic Press, 2
1.270 +[14]George Paxinos and Keith B.J. Franklin. The Mouse Brain in Stereotaxic Coordinates. Academic Press, 2
1.271 edition, July 2001.
1.272 -[18]A. Schleicher, N. Palomero-Gallagher, P. Morosan, S. Eickhoff, T. Kowalski, K. Vos, K. Amunts, and
1.273 +[15]A. Schleicher, N. Palomero-Gallagher, P. Morosan, S. Eickhoff, T. Kowalski, K. Vos, K. Amunts, and
1.274 K. Zilles. Quantitative architectural analysis: a new approach to cortical mapping. Anatomy and Em-
1.275 bryology, 210(5):373–386, December 2005.
1.276 -[19]Oliver Schmitt, Lars Hmke, and Lutz Dmbgen. Detection of cortical transition regions utilizing statistical
1.277 +[16]Oliver Schmitt, Lars Hmke, and Lutz Dmbgen. Detection of cortical transition regions utilizing statistical
1.278 analyses of excess masses. NeuroImage, 19(1):42–63, May 2003.
1.279 -[20]Constance M. Smith, Jacqueline H. Finger, Terry F. Hayamizu, Ingeborg J. McCright, Janan T. Eppig,
1.280 -James A. Kadin, Joel E. Richardson, and Martin Ringwald. The mouse gene expression database (GXD):
1.281 -2007 update. Nucl. Acids Res., 35(suppl_1):D618–623, 2007.
1.282 -[21]Judy Sprague, Leyla Bayraktaroglu, Dave Clements, Tom Conlin, David Fashena, Ken Frazer, Melissa
1.283 -Haendel, Douglas G Howe, Prita Mani, Sridhar Ramachandran, Kevin Schaper, Erik Segerdell, Peiran
1.284 -Song, Brock Sprunger, Sierra Taylor, Ceri E Van Slyke, and Monte Westerfield. The zebrafish information
1.285 -network: the zebrafish model organism database. Nucleic Acids Research, 34(Database issue):D581–5,
1.286 -2006. PMID: 16381936.
1.287 -[22]Larry Swanson. Brain Maps: Structure of the Rat Brain. Academic Press, 3 edition, November 2003.
1.288 -[23]Carol L. Thompson, Sayan D. Pathak, Andreas Jeromin, Lydia L. Ng, Cameron R. MacPherson, Marty T.
1.289 +[17]Larry Swanson. Brain Maps: Structure of the Rat Brain. Academic Press, 3 edition, November 2003.
1.290 +[18]Carol L. Thompson, Sayan D. Pathak, Andreas Jeromin, Lydia L. Ng, Cameron R. MacPherson, Marty T.
1.291 Mortrud, Allison Cusick, Zackery L. Riley, Susan M. Sunkin, Amy Bernard, Ralph B. Puchalski, Fred H.
1.292 Gage, Allan R. Jones, Vladimir B. Bajic, Michael J. Hawrylycz, and Ed S. Lein. Genomic anatomy of the
1.293 hippocampus. Neuron, 60(6):1010–1021, December 2008.
1.294 -[24]Pavel Tomancak, Amy Beaton, Richard Weiszmann, Elaine Kwan, ShengQiang Shu, Suzanna E Lewis,
1.295 -Stephen Richards, Michael Ashburner, Volker Hartenstein, Susan E Celniker, and Gerald M Rubin. Sys-
1.296 -tematic determination of patterns of gene expression during drosophila embryogenesis. Genome Biology,
1.297 -3(12):research008818814, 2002. PMC151190.
1.298 -[25]Jano van Hemert and Richard Baldock. Mining Spatial Gene Expression Data for Association Rules, volume
1.299 -4414/2007 of Lecture Notes in Computer Science, pages 66–76. Springer Berlin / Heidelberg, 2007.
1.300 -[26]Shanmugasundaram Venkataraman, Peter Stevenson, Yiya Yang, Lorna Richardson, Nicholas Burton,
1.301 +[19]Shanmugasundaram Venkataraman, Peter Stevenson, Yiya Yang, Lorna Richardson, Nicholas Burton,
1.302 Thomas P. Perry, Paul Smith, Richard A. Baldock, Duncan R. Davidson, and Jeffrey H. Christiansen.
1.303 EMAGE edinburgh mouse atlas of gene expression: 2008 update. Nucl. Acids Res., 36(suppl_1):D860–
1.304 865, 2008.
1.305 -[27]Axel Visel, Christina Thaller, and Gregor Eichele. GenePaint.org: an atlas of gene expression patterns in
1.306 -the mouse embryo. Nucl. Acids Res., 32(suppl_1):D552–556, 2004.
1.307 -[28]Robert H Waterston, Kerstin Lindblad-Toh, Ewan Birney, Jane Rogers, Josep F Abril, Pankaj Agarwal, Richa
1.308 +[20]Robert H Waterston, Kerstin Lindblad-Toh, Ewan Birney, Jane Rogers, Josep F Abril, Pankaj Agarwal, Richa
1.309 Agarwala, Rachel Ainscough, Marina Alexandersson, Peter An, Stylianos E Antonarakis, John Attwood,
1.310 Robert Baertsch, Jonathon Bailey, Karen Barlow, Stephan Beck, Eric Berry, Bruce Birren, Toby Bloom, Peer
1.311 Bork, Marc Botcherby, Nicolas Bray, Michael R Brent, Daniel G Brown, Stephen D Brown, Carol Bult, John
2.1 Binary file grant.odt has changed
3.1 Binary file grant.pdf has changed
4.1 --- a/grant.txt Wed Apr 22 07:39:32 2009 -0700
4.2 +++ b/grant.txt Wed Apr 22 14:51:24 2009 -0700
4.3 @@ -28,7 +28,9 @@
4.4
4.5 == Specific aims ==
4.6
4.7 -Massive new datasets obtained with techniques such as in situ hybridization (ISH), immunohistochemistry, in situ transgenic reporter, microarray voxelation, and others, allow the expression levels of many genes at many locations to be compared. Our goal is to develop automated methods to relate spatial variation in gene expression to anatomy. We want to find marker genes for specific anatomical regions, and also to draw new anatomical maps based on gene expression patterns. We have three specific aims:\\
4.8 +Massive new datasets obtained with techniques such as in situ hybridization (ISH), immunohistochemistry, in situ transgenic reporter, microarray voxelation, and others, allow the expression levels of many genes at many locations to be compared. Our goal is to develop automated methods to relate spatial variation in gene expression to anatomy. We want to find marker genes for specific anatomical regions, and also to draw new anatomical maps based on gene expression patterns. We will validate these methods by applying them to 46 anatomical areas within the cerebral cortex, by using the Allen Mouse Brain Atlas coronal dataset (ABA). This gene expression dataset was generated using ISH, and contains over 4,000 genes. For each gene, a digitized 3-D raster of the expression pattern is available: for each gene, the level of expression at each of 51,533 voxels is recorded.
4.9 +
4.10 +We have three specific aims:\\
4.11
4.12 (1) develop an algorithm to screen spatial gene expression data for combinations of marker genes which selectively target anatomical regions\\
4.13
4.14 @@ -250,11 +252,11 @@
4.15
4.16 \vspace{0.3cm}**The Allen Mouse Brain Atlas dataset**
4.17
4.18 -The Allen Mouse Brain Atlas (ABA) data were produced by doing in-situ hybridization on slices of male, 56-day-old C57BL/6J mouse brains. Pictures were taken of the processed slice, and these pictures were semi-automatically analyzed to create a digital measurement of gene expression levels at each location in each slice. Per slice, cellular spatial resolution is achieved. Using this method, a single physical slice can only be used to measure one single gene; many different mouse brains were needed in order to measure the expression of many genes.
4.19 -
4.20 -An automated nonlinear alignment procedure located the 2D data from the various slices in a single 3D coordinate system. In the final 3D coordinate system, voxels are cubes with 200 microns on a side. There are 67x41x58 \= 159,326 voxels in the 3D coordinate system, of which 51,533 are in the brain\cite{ng_anatomic_2009}.
4.21 -
4.22 -Mus musculus is thought to contain about 22,000 protein-coding genes\cite{waterston_initial_2002}. The ABA contains data on about 20,000 genes in sagittal sections, out of which over 4,000 genes are also measured in coronal sections. Our dataset is derived from only the coronal subset of the ABA\footnote{The sagittal data do not cover the entire cortex, and also have greater registration error\cite{ng_anatomic_2009}. Genes were selected by the Allen Institute for coronal sectioning based on, "classes of known neuroscientific interest... or through post hoc identification of a marked non-ubiquitous expression pattern"\cite{ng_anatomic_2009}.}.
4.23 +The Allen Mouse Brain Atlas (ABA) data\cite{lein_genome-wide_2007} were produced by doing in-situ hybridization on slices of male, 56-day-old C57BL/6J mouse brains. Pictures were taken of the processed slice, and these pictures were semi-automatically analyzed to create a digital measurement of gene expression levels at each location in each slice. Per slice, cellular spatial resolution is achieved. Using this method, a single physical slice can only be used to measure one single gene; many different mouse brains were needed in order to measure the expression of many genes.
4.24 +
4.25 +Mus musculus is thought to contain about 22,000 protein-coding genes\cite{waterston_initial_2002}. The ABA contains data on about 20,000 genes in sagittal sections, out of which over 4,000 genes are also measured in coronal sections. Our dataset is derived from only the coronal subset of the ABA\footnote{The sagittal data do not cover the entire cortex, and also have greater registration error\cite{ng_anatomic_2009}. Genes were selected by the Allen Institute for coronal sectioning based on, "classes of known neuroscientific interest... or through post hoc identification of a marked non-ubiquitous expression pattern"\cite{ng_anatomic_2009}.}. An automated nonlinear alignment procedure located the 2D data from the various slices in a single 3D coordinate system. In the final 3D coordinate system, voxels are cubes with 200 microns on a side. There are 67x41x58 \= 159,326 voxels, of which 51,533 are in the brain\cite{ng_anatomic_2009}. For each voxel and each gene, the expression energy\cite{lein_genome-wide_2007} within that voxel is made available.
4.26 +
4.27 +
4.28
4.29 %%The ABA is not the only large public spatial gene expression dataset. Other such resources include GENSAT\cite{gong_gene_2003}, GenePaint\cite{visel_genepaint.org:atlas_2004}, its sister project GeneAtlas\cite{carson_digital_2005}, BGEM\cite{magdaleno_bgem:in_2006}, EMAGE\cite{venkataraman_emage_2008}, EurExpress\footnote{http://www.eurexpress.org/ee/; EurExpress data are also entered into EMAGE}, EADHB\footnote{http://www.ncl.ac.uk/ihg/EADHB/database/EADHB_database.html}, MAMEP\footnote{http://mamep.molgen.mpg.de/index.php}, Xenbase\footnote{http://xenbase.org/}, ZFIN\cite{sprague_zebrafish_2006}, Aniseed\footnote{http://aniseed-ibdm.univ-mrs.fr/}, VisiGene\footnote{http://genome.ucsc.edu/cgi-bin/hgVisiGene ; includes data from some the other listed data sources}, GEISHA\cite{bell_geishawhole-mount_2004}, Fruitfly.org\cite{tomancak_systematic_2002}, COMPARE\footnote{http://compare.ibdml.univ-mrs.fr/} GXD\cite{smith_mouse_2007}, GEO\cite{barrett_ncbi_2007}\footnote{GXD and GEO contain spatial data but also non-spatial data. All GXD spatial data are also in EMAGE.}. With the exception of the ABA, GenePaint, and EMAGE, most of these resources have not (yet) extracted the expression intensity from the ISH images and registered the results into a single 3-D space, and to our knowledge only ABA and EMAGE make this form of data available for public download from the website\footnote{without prior offline registration}. Many of these resources focus on developmental gene expression.
4.30
4.31 @@ -319,7 +321,7 @@
4.32 === Flatmap of cortex ===
4.33
4.34
4.35 -We downloaded the ABA data and applied a mask to select only those voxels which belong to cerebral cortex. We divided the cortex into hemispheres. Using Caret\cite{van_essen_integrated_2001}, we created a mesh representation of the surface of the selected voxels. For each gene, and for each node of the mesh, we calculated an average of the gene expression of the voxels "underneath" that mesh node. We then flattened the cortex, creating a two-dimensional mesh. We sampled the nodes of the irregular, flat mesh in order to create a regular grid of pixel values. We converted this grid into a MATLAB matrix. We manually traced the boundaries of each of 49 cortical areas from the ABA coronal reference atlas slides. We then converted these manual traces into Caret-format regional boundary data on the mesh surface. We projected the regions onto the 2-d mesh, and then onto the grid, and then we converted the region data into MATLAB format.
4.36 +We downloaded the ABA data and applied a mask to select only those voxels which belong to cerebral cortex. We divided the cortex into hemispheres. Using Caret\cite{van_essen_integrated_2001}, we created a mesh representation of the surface of the selected voxels. For each gene, and for each node of the mesh, we calculated an average of the gene expression of the voxels "underneath" that mesh node. We then flattened the cortex, creating a two-dimensional mesh. We sampled the nodes of the irregular, flat mesh in order to create a regular grid of pixel values. We converted this grid into a MATLAB matrix. We manually traced the boundaries of each of 46 cortical areas from the ABA coronal reference atlas slides. We then converted these manual traces into Caret-format regional boundary data on the mesh surface. We projected the regions onto the 2-d mesh, and then onto the grid, and then we converted the region data into MATLAB format.
4.37
4.38 At this point, the data are in the form of a number of 2-D matrices, all in registration, with the matrix entries representing a grid of points (pixels) over the cortical surface. There is one 2-D matrix whose entries represent the regional label associated with each surface pixel. And for each gene, there is a 2-D matrix whose entries represent the average expression level underneath each surface pixel. We created a normalized version of the gene expression data by subtracting each gene's mean expression level (over all surface pixels) and dividing the expression level of each gene by its standard deviation. The features and the target area are both functions on the surface pixels. They can be referred to as scalar fields over the space of surface pixels; alternately, they can be thought of as images which can be displayed on the flatmapped surface.
4.39
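The per-gene normalization described in the hunk above can be sketched as follows. This is a minimal illustration, not the authors' MATLAB code: the function name `normalize_gene`, the nested-list representation of a flatmap, the use of the population standard deviation, and the handling of a constant gene are all assumptions, since the text does not specify them.

```python
from statistics import fmean, pstdev


def normalize_gene(expression):
    """Z-score one gene's 2-D flatmap over all surface pixels.

    `expression` is a list of rows of floats (hypothetical layout).
    """
    pixels = [v for row in expression for v in row]
    mean = fmean(pixels)
    sd = pstdev(pixels)  # assumption: population SD; the text does not say which
    if sd == 0:
        # Assumption: a gene with constant expression maps to all zeros.
        return [[0.0 for _ in row] for row in expression]
    return [[(v - mean) / sd for v in row] for row in expression]


# Toy 2x2 "flatmap" for a single hypothetical gene.
toy_gene = [[1.0, 2.0], [3.0, 4.0]]
normalized = normalize_gene(toy_gene)
```

After normalization each gene has zero mean and unit variance over the surface pixels, so expression levels of different genes can be compared on a common scale.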
4.40 @@ -519,7 +521,7 @@
4.41 \label{dimReduc}\end{wrapfigure}
4.42
4.43 \vspace{0.3cm}**Scoring measures and feature selection**
4.44 -%%We will develop scoring methods for evaluating how good individual genes are at marking areas. We will compare pointwise, geometric, and information-theoretic measures. We already developed one entirely new scoring method (gradient similarity), but we may develop more. Scoring measures that we will explore will include the L1 norm, correlation, expression energy ratio, conditional entropy, gradient similarity, Jaccard similarity, Dice similarity, Hough transform, and statistical tests such as Hotelling's T-square test (a multivariate generalization of Student's t-test), ANOVA, and a multivariate version of the Mann-Whitney U test (a non-parametric test).
4.45 +%%We will develop scoring methods for evaluating how good individual genes are at marking areas. We will compare pointwise, geometric, and information-theoretic measures. We already developed one entirely new scoring method (gradient similarity), and we plan to develop more. Scoring measures that we will explore will include the L1 norm, correlation, expression energy ratio, conditional entropy, gradient similarity, Jaccard similarity, Dice similarity, Hough transform, and statistical tests such as Hotelling's T-square test (a multivariate generalization of Student's t-test), ANOVA, and a multivariate version of the Mann-Whitney U test (a non-parametric test).
4.46 We will develop scoring methods for evaluating how good individual genes are at marking areas. We will compare pointwise, geometric, and information-theoretic measures. We already developed one entirely new scoring method (gradient similarity), but we may develop more. Scoring measures that we will explore will include the L1 norm, correlation, expression energy ratio, conditional entropy, gradient similarity, Jaccard similarity, Dice similarity, Hough transform, and statistical tests such as Student's t-test, and the Mann-Whitney U test (a non-parametric test). In addition, any classifier induces a scoring measure on genes by taking the prediction error when using that gene to predict the target.
4.47
4.48 Using some combination of these measures, we will develop a procedure to find single marker genes for anatomical regions: for each cortical area, we will rank the genes by their ability to delineate each area. We will quantitatively compare the list of single genes generated by our method to the lists generated by previous methods which are mentioned in Aim 1 Related Work.
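As one illustration of the ranking procedure described above, the sketch below scores each gene against a target area with Dice or Jaccard similarity and sorts the genes by score. It is a hedged example under stated assumptions: binarizing normalized expression with a fixed threshold, the function names, and the gene names are hypothetical and are not taken from the proposal.

```python
def dice(a, b):
    """Dice similarity of two sets of pixel coordinates."""
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))


def jaccard(a, b):
    """Jaccard similarity of two sets of pixel coordinates."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def expressed_pixels(expression, threshold):
    """Set of (row, col) pixels whose expression exceeds `threshold`.

    Assumption: a simple threshold turns an expression image into a binary mask.
    """
    return {
        (r, c)
        for r, row in enumerate(expression)
        for c, value in enumerate(row)
        if value > threshold
    }


def rank_genes(genes, area_pixels, threshold=0.0, score=dice):
    """Return (score, gene name) pairs, best-scoring gene first."""
    scored = [
        (score(expressed_pixels(expr, threshold), area_pixels), name)
        for name, expr in genes.items()
    ]
    return sorted(scored, reverse=True)


# Toy data: a 2x2 flatmap whose target area is the top row.
area = {(0, 0), (0, 1)}
genes = {
    "geneA": [[1.0, 1.0], [-1.0, -1.0]],  # expressed exactly in the area
    "geneB": [[1.0, -1.0], [1.0, -1.0]],  # overlaps half of the area
    "geneC": [[-1.0, -1.0], [1.0, 1.0]],  # expressed only outside the area
}
ranking = rank_genes(genes, area)
```

Passing `score=jaccard` ranks by Jaccard similarity instead; any of the other listed measures could be substituted in the same way, provided it maps a gene and a target area to a single number.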