nsf
changeset 106:ffa1390e4f39
| author | bshanks@bshanks.dyndns.org | 
|---|---|
| date | Wed Apr 22 14:51:24 2009 -0700 | 
| parents | 6c48f37d0f0c | 
| children | f26370dc719b | 
| files | grant.html grant.odt grant.pdf grant.txt | 
     1.1 --- a/grant.html	Wed Apr 22 07:39:32 2009 -0700
     1.2 +++ b/grant.html	Wed Apr 22 14:51:24 2009 -0700
     1.3 @@ -1,9 +1,13 @@
     1.4  Specific aims
     1.5 -Massive new datasets obtained with techniques such as in situ hybridization (ISH), immunohistochemistry, in
     1.6 -situ transgenic reporter, microarray voxelation, and others, allow the expression levels of many genes at many
     1.7 -locations to be compared. Our goal is to develop automated methods to relate spatial variation in gene expres-
     1.8 -sion to anatomy. We want to find marker genes for specific anatomical regions, and also to draw new anatomical
     1.9 -maps based on gene expression patterns. We have three specific aims:
    1.10 +Massive new datasets obtained with techniques such as in situ hybridization (ISH), immunohistochemistry, in situ
    1.11 +transgenic reporter, microarray voxelation, and others, allow the expression levels of many genes at many loca-
    1.12 +tions to be compared. Our goal is to develop automated methods to relate spatial variation in gene expression to
    1.13 +anatomy. We want to find marker genes for specific anatomical regions, and also to draw new anatomical maps
    1.14 +based on gene expression patterns.  We will validate these methods by applying them to 46 anatomical areas
    1.15 +within the cerebral cortex, by using the Allen Mouse Brain Atlas coronal dataset (ABA). This gene expression
    1.16 +dataset was generated using ISH, and contains over 4,000 genes.  For each gene, a digitized 3-D raster of the
    1.17 +expression pattern is available: for each gene, the level of expression at each of 51,533 voxels is recorded.
    1.18 +We have three specific aims:
    1.19  (1) develop an algorithm to screen spatial gene expression data for combinations of marker genes which
    1.20  selectively target anatomical regions
    1.21  (2) develop an algorithm to suggest new ways of carving up a structure into anatomically distinct regions,
    1.22 @@ -126,23 +130,23 @@
    1.23  as gradient similarity, which is discussed in Preliminary Studies) may be necessary in order to achieve the best
    1.24  results in this application.
    1.25  We now turn to efforts to find marker genes using spatial gene expression data using automated methods.
    1.26 -GeneAtlas[5] and EMAGE [26] allow the user to construct a search query by demarcating regions and then
    1.27 +GeneAtlas[3] and EMAGE [19] allow the user to construct a search query by demarcating regions and then
    1.28  specifying either the strength of expression or the name of another gene or dataset whose expression pattern
    1.29  is to be matched.  Neither GeneAtlas nor EMAGE allow one to search for combinations of genes that define a
    1.30  region in concert but not separately.
    1.31 -[15 ] describes AGEA, ”Anatomic Gene Expression Atlas”. AGEA has three components. Gene Finder: The
    1.32 +[12 ] describes AGEA, ”Anatomic Gene Expression Atlas”. AGEA has three components. Gene Finder: The
    1.33  user selects a seed voxel and the system (1) chooses a cluster which includes the seed voxel, (2) yields a list of
    1.34  genes which are overexpressed in that cluster. Correlation: The user selects a seed voxel and the system then
    1.35  shows the user how much correlation there is between the gene expression profile of the seed voxel and every
    1.36 -other voxel. Clusters: will be described later. [6] looks at the mean expression level of genes within anatomical
    1.37 +other voxel. Clusters: will be described later. [4] looks at the mean expression level of genes within anatomical
    1.38  regions,  and applies a Student’s t-test with Bonferroni correction to determine whether the mean expression
    1.39 -level of a gene is significantly higher in the target region.  [15] and [6] differ from our Aim 1 in at least three
    1.40 -ways. First, [15] and [6] find only single genes, whereas we will also look for combinations of genes.  Second,
    1.41 -[15 ] and [6] can only use overexpression as a marker, whereas we will also search for underexpression.  Third,
    1.42 -[15 ] and [6] use scores based on pointwise expression levels, whereas we will also use geometric scores such
    1.43 +level of a gene is significantly higher in the target region.  [12] and [4] differ from our Aim 1 in at least three
    1.44 +ways. First, [12] and [4] find only single genes, whereas we will also look for combinations of genes.  Second,
    1.45 +[12 ] and [4] can only use overexpression as a marker, whereas we will also search for underexpression.  Third,
    1.46 +[12 ] and [4] use scores based on pointwise expression levels, whereas we will also use geometric scores such
    1.47  as gradient similarity (described in Preliminary Studies).  Figures 4, 2, and 3 in the Preliminary Studies section
    1.48  contain evidence that each of our three choices is the right one.
    1.49 -[10 ] describes a technique to find combinations of marker genes to pick out an anatomical region. They use
    1.50 +[8 ] describes a technique to find combinations of marker genes to pick out an anatomical region.  They use
    1.51  an evolutionary algorithm to evolve logical operators which combine boolean (thresholded) images in order to
    1.52  match a target image.
    1.53  In summary,  there has been fruitful work on finding marker genes,  but only one of the previous projects
    1.54 @@ -199,20 +203,20 @@
    1.55  gene clusters in this fashion.
    1.56  Related work
    1.57  Some researchers have attempted to parcellate cortex on the basis of non-gene expression data. For example,
    1.58 -[18 ], [2 ], [19], and [1] associate spots on the cortex with the radial profile5  of response to some stain ([12] uses
    1.59 +[15 ], [2 ], [16], and [1] associate spots on the cortex with the radial profile5  of response to some stain ([10] uses
    1.60  MRI), extract features from this profile, and then use similarity between surface pixels to cluster.
    1.61 -[23 ] describes an analysis of the anatomy of the hippocampus using the ABA dataset. In addition to manual
    1.62 +[18 ] describes an analysis of the anatomy of the hippocampus using the ABA dataset. In addition to manual
    1.63  analysis, two clustering methods were employed, a modified Non-negative Matrix Factorization (NNMF), and
    1.64  a hierarchical recursive bifurcation clustering scheme based on correlation as the similarity score.  The paper
    1.65  yielded impressive results, proving the usefulness of computational genomic anatomy.  We have run NNMF on
    1.66  the cortical dataset
    1.67 -AGEA[15] includes a preset hierarchical clustering of voxels based on a recursive bifurcation algorithm with
    1.68 -correlation as the similarity metric.  EMAGE[26] allows the user to select a dataset from among a large number
    1.69 +AGEA[12] includes a preset hierarchical clustering of voxels based on a recursive bifurcation algorithm with
    1.70 +correlation as the similarity metric.  EMAGE[19] allows the user to select a dataset from among a large number
    1.71  of alternatives, or by running a search query, and then to cluster the genes within that dataset. EMAGE clusters
    1.72  via hierarchical complete linkage clustering.
    1.73 -[6 ] clusters genes.  For each cluster, prototypical spatial expression patterns were created by averaging the
    1.74 +[4 ] clusters genes.  For each cluster, prototypical spatial expression patterns were created by averaging the
    1.75  genes in the cluster. The prototypes were analyzed manually, without clustering voxels.
    1.76 -[10 ] applies their technique for finding combinations of marker genes for the purpose of clustering genes
    1.77 +[8 ] applies  their  technique  for  finding  combinations  of  marker  genes  for  the  purpose  of  clustering  genes
    1.78  around a “seed gene”.
    1.79  In summary, although these projects obtained clusterings, there has not been much comparison between
    1.80  different algorithms or scoring methods, so it is likely that the best clustering method for this application has not
    1.81 @@ -262,36 +266,36 @@
    1.82                                 cortex, and what their arrangement is, are still not completely settled.
    1.83                                 A proposed division of the cortex into areas is called a cortical map.
    1.84                                 In the rodent, the lack of a single agreed-upon map can be seen by
    1.85 -                               contrasting the recent maps given by Swanson[22] on the one hand,
    1.86 -                               and Paxinos and Franklin[17] on the other.  While the maps are cer-
    1.87 +                               contrasting the recent maps given by Swanson[17] on the one hand,
    1.88 +                               and Paxinos and Franklin[14] on the other.  While the maps are cer-
    1.89                                 tainly very similar in their general arrangement, significant differences
    1.90                                 remain.
    1.91                                    The Allen Mouse Brain Atlas dataset
    1.92 -                                  The Allen Mouse Brain Atlas (ABA) data were produced by doing in-
    1.93 -                               situ hybridization on slices of male, 56-day-old C57BL/6J mouse brains.
    1.94 -                               Pictures were taken of the processed slice,  and these pictures were
    1.95 -                               semi-automatically analyzed to create a digital measurement of gene
    1.96 -                               expression levels at each location in each slice. Per slice, cellular spa-
    1.97 -                               tial resolution is achieved.  Using this method, a single physical slice
    1.98 +                                  The Allen Mouse Brain Atlas (ABA) data[11] were produced by do-
    1.99 +                               ing in-situ hybridization on slices of male, 56-day-old C57BL/6J mouse
   1.100 +                               brains.  Pictures were taken of the processed slice, and these pictures
   1.101 +                               were semi-automatically analyzed to create a digital measurement of
   1.102 +                               gene expression levels at each location in each slice. Per slice, cellular
   1.103 +                               spatial resolution is achieved. Using this method, a single physical slice
   1.104  can only be used to measure one single gene; many different mouse brains were needed in order to measure
   1.105  the expression of many genes.
   1.106 -An automated nonlinear alignment procedure located the 2D data from the various slices in a single 3D
   1.107 -coordinate system.  In the final 3D coordinate system, voxels are cubes with 200 microns on a side.  There are
   1.108 -67x41x58 = 159,326 voxels in the 3D coordinate system, of which 51,533 are in the brain[15].
   1.109 -Mus musculus is thought to contain about 22,000 protein-coding genes[28]. The ABA contains data on about
   1.110 -20,000 genes in sagittal sections, out of which over 4,000 genes are also measured in coronal sections.  Our
   1.111 -dataset is derived from only the coronal subset of the ABA7.
   1.112 +Mus  musculus  is  thought  to  contain  about  22,000  protein-coding  genes[20].   The  ABA  contains  data  on
   1.113 +about 20,000 genes in sagittal sections, out of which over 4,000 genes are also measured in coronal sections.
   1.114 +Our dataset is derived from only the coronal subset of the ABA7. An automated nonlinear alignment procedure
   1.115 +located the 2D data from the various slices in a single 3D coordinate system. In the final 3D coordinate system,
   1.116 +voxels are cubes with 200 microns on a side. There are 67x41x58 = 159,326 voxels, of which 51,533 are in the
   1.117 +brain[12]. For each voxel and each gene, the expression energy[11] within that voxel is made available.
   1.118  The ABA is not the only large public spatial gene expression dataset. However, with the exception of the ABA,
   1.119  GenePaint, and EMAGE, most of the other resources have not (yet) extracted the expression intensity from the
   1.120  ISH images and registered the results into a single 3-D space.
   1.121  Related work
   1.122 -[15 ] describes the application of AGEA to the cortex.  The paper describes interesting results on the structure
   1.123 +[12 ] describes the application of AGEA to the cortex.  The paper describes interesting results on the structure
   1.124  of correlations between voxel gene expression profiles within a handful of cortical areas.   However,  this sort
   1.125  _________________________________________
   1.126      6Outside of isocortex, the number of layers varies.
   1.127 -     7The sagittal data do not cover the entire cortex,  and also have greater registration error[15].  Genes were selected by the Allen
   1.128 +     7The sagittal data do not cover the entire cortex,  and also have greater registration error[12].  Genes were selected by the Allen
   1.129  Institute for coronal sectioning based on, “classes of known neuroscientific interest... or through post hoc identification of a marked
   1.130 -non-ubiquitous expression pattern”[15].
   1.131 +non-ubiquitous expression pattern”[12].
   1.132  of analysis is not related to either of our aims, as it neither finds marker genes, nor does it suggest a cortical
   1.133  map based on gene expression data.   Neither of the other components of AGEA can be applied to cortical
   1.134  areas; AGEA’s Gene Finder cannot be used to find marker genes for the cortical areas; and AGEA’s hierarchical
   1.135 @@ -335,11 +339,11 @@
   1.136  file formats.
   1.137   Flatmap of cortex
   1.138  We downloaded the ABA data and applied a mask to select only those voxels which belong to cerebral cortex.
   1.139 -We divided the cortex into hemispheres. Using Caret[7], we created a mesh representation of the surface of the
   1.140 +We divided the cortex into hemispheres. Using Caret[5], we created a mesh representation of the surface of the
   1.141  selected voxels. For each gene, and for each node of the mesh, we calculated an average of the gene expression
   1.142  of the voxels “underneath” that mesh node. We then flattened the cortex, creating a two-dimensional mesh. We
   1.143  sampled the nodes of the irregular, flat mesh in order to create a regular grid of pixel values. We converted this
   1.144 -grid into a MATLAB matrix. We manually traced the boundaries of each of 49 cortical areas from the ABA coronal
   1.145 +grid into a MATLAB matrix. We manually traced the boundaries of each of 46 cortical areas from the ABA coronal
   1.146  reference atlas slides. We then converted these manual traces into Caret-format regional boundary data on the
   1.147      8In both cases, the cause is that pairwise correlations between the gene expression of voxels in different areas but the same layer
   1.148  are often stronger than pairwise correlations between the gene expression of voxels in different layers but the same area.  Therefore, a
   1.149 @@ -517,7 +521,7 @@
   1.150                                 Flatmap cortex and segment cortical layers
   1.151                                 There are multiple ways to flatten 3-D data into 2-D. We will compare
   1.152                                 mappings  from  manifolds  to  planes  which  attempt  to  preserve  size
   1.153 -                               (such as the one used by Caret[7]) with mappings which preserve an-
   1.154 +                               (such as the one used by Caret[5]) with mappings which preserve an-
   1.155                                 gle (conformal maps).   Our method will include a statistical test that
   1.156                                 warns the user if the assumption of 2-D structure seems to be wrong.
   1.157                                    We have not yet made use of radial profiles.  While the radial pro-
   1.158 @@ -608,7 +612,7 @@
   1.159  Classifiers We will explore and compare different classifiers.  As noted above, this activity is not separate
   1.160  from the previous one, because some supervised learning algorithms include feature selection, and any clas-
   1.161  sifier can be combined with a stepwise wrapper for use as a feature selection method.  We will explore logistic
   1.162 -regression (including spatial models[16]), decision trees12, sparse SVMs, generative mixture models (including
   1.163 +regression (including spatial models[13]), decision trees12, sparse SVMs, generative mixture models (including
   1.164  naive bayes), kernel density estimation, instance-based learning methods (such as k-nearest neighbor), genetic
   1.165  algorithms, and artificial neural networks.
   1.166  Develop algorithms to suggest a division of a structure into anatomical parts
   1.167 @@ -634,7 +638,7 @@
   1.168  profiles, the same techniques can be applied instead to the pixels.  It is possible that the features generated in
   1.169  this way by some dimensionality reduction techniques will directly correspond to interesting spatial regions.
   1.170  Clustering and segmentation on pixels We will explore clustering and segmentation algorithms in order to
   1.171 -segment the pixels into regions. We will explore k-means, spectral clustering, gene shaving[9], recursive division
   1.172 +segment the pixels into regions. We will explore k-means, spectral clustering, gene shaving[7], recursive division
   1.173  clustering, multivariate generalizations of edge detectors, multivariate generalizations of watershed transforma-
   1.174  tions, region growing, active contours, graph partitioning methods, and recursive agglomerative clustering with
   1.175  various linkage functions. These methods can be combined with dimensionality reduction.
   1.176 @@ -648,7 +652,7 @@
   1.177  reduction step) in order to identify spatial regions.  It remains to be seen whether removal of redundancy would
   1.178  help or hurt the ultimate goal of identifying interesting spatial regions.
   1.179  Co-clustering There are some algorithms which simultaneously incorporate clustering on instances and on
   1.180 -features (in our case, genes and pixels), for example, IRM[11].  These are called co-clustering or biclustering
   1.181 +features (in our case,  genes and pixels),  for example,  IRM[9].  These are called co-clustering or biclustering
   1.182  _________________________________________
   1.183     12Actually, we have already begun to explore decision trees. For each cortical area, we have used the C4.5 algorithm to find a decision
   1.184  tree for that area. We achieved good classification accuracy on our training set, but the number of genes that appeared in each tree was
   1.185 @@ -673,7 +677,7 @@
   1.186  combination of genes to seem to identify an area when in fact it is only coincidence. There are two ways we will
   1.187  validate our marker genes to guard against this. First, we will confirm that putative combinations of marker genes
   1.188  express the same pattern in both hemispheres. Second, we will manually validate our final results on other gene
   1.189 -expression datasets such as EMAGE, GeneAtlas, and GENSAT[8].
   1.190 +expression datasets such as EMAGE, GeneAtlas, and GENSAT[6].
   1.191  Using the methods developed in Aim 2, we will present one or more hierarchical cortical maps. We will identify
   1.192  and explain how the statistical structure in the gene expression data led to any unexpected or interesting features
   1.193  of these maps, and we will provide biological hypotheses to interpret any new cortical areas, or groupings of
   1.194 @@ -708,81 +712,72 @@
   1.195  Science, pages 294–301. Springer Berlin / Heidelberg, 2005.
   1.196  [2]J. Annese, A. Pitiot, I. D. Dinov, and A. W. Toga. A myelo-architectonic method for the structural classification
   1.197  of cortical areas. NeuroImage, 21(1):15–26, 2004.
   1.198 -[3]Tanya Barrett,  Dennis B. Troup,  Stephen E. Wilhite,  Pierre Ledoux,  Dmitry Rudnev,  Carlos Evangelista,
   1.199 -Irene F. Kim, Alexandra Soboleva, Maxim Tomashevsky, and Ron Edgar. NCBI GEO: mining tens of millions
   1.200 -of expression profiles–database and tools update. Nucl. Acids Res., 35(suppl_1):D760–765, 2007.
   1.201 -[4]George W. Bell, Tatiana A. Yatskievych, and Parker B. Antin.  GEISHA, a whole-mount in situ hybridization
   1.202 -gene expression screen in chicken embryos. Developmental Dynamics, 229(3):677–687, 2004.
   1.203 -[5]James P Carson, Tao Ju, Hui-Chen Lu, Christina Thaller, Mei Xu, Sarah L Pallas, Michael C Crair, Joe
   1.204 +[3]James P Carson, Tao Ju, Hui-Chen Lu, Christina Thaller, Mei Xu, Sarah L Pallas, Michael C Crair, Joe
   1.205  Warren,  Wah Chiu,  and Gregor Eichele.   A digital atlas to characterize the mouse brain transcriptome.
   1.206  PLoS Comput Biol, 1(4):e41, 2005.
   1.207 -[6]Mark H. Chin, Alex B. Geng, Arshad H. Khan, Wei-Jun Qian, Vladislav A. Petyuk, Jyl Boline, Shawn Levy,
   1.208 +[4]Mark H. Chin, Alex B. Geng, Arshad H. Khan, Wei-Jun Qian, Vladislav A. Petyuk, Jyl Boline, Shawn Levy,
   1.209  Arthur W. Toga,  Richard D. Smith,  Richard M. Leahy,  and Desmond J. Smith.   A genome-scale map of
   1.210  expression for a mouse brain section obtained using voxelation. Physiol. Genomics, 30(3):313–321, August
   1.211  2007.
   1.212 -[7]D C Van Essen, H A Drury, J Dickson, J Harwell, D Hanlon, and C H Anderson. An integrated software suite
   1.213 +[5]D C Van Essen, H A Drury, J Dickson, J Harwell, D Hanlon, and C H Anderson. An integrated software suite
   1.214  for surface-based analyses of cerebral cortex.  Journal of the American Medical Informatics Association:
   1.215  JAMIA, 8(5):443–59, 2001. PMID: 11522765.
   1.216 -[8]Shiaoching  Gong,  Chen  Zheng,  Martin  L.  Doughty,  Kasia  Losos,  Nicholas  Didkovsky,  Uta  B.  Scham-
   1.217 +[6]Shiaoching  Gong,  Chen  Zheng,  Martin  L.  Doughty,  Kasia  Losos,  Nicholas  Didkovsky,  Uta  B.  Scham-
   1.218  bra,  Norma  J.  Nowak,  Alexandra  Joyner,  Gabrielle  Leblanc,  Mary  E.  Hatten,  and  Nathaniel  Heintz.   A
   1.219  gene expression atlas of the central nervous system based on bacterial artificial chromosomes.  Nature,
   1.220  425(6961):917–925, October 2003.
   1.221 -[9]Trevor Hastie,  Robert Tibshirani,  Michael Eisen,  Ash Alizadeh,  Ronald Levy,  Louis Staudt,  Wing Chan,
   1.222 +[7]Trevor Hastie,  Robert Tibshirani,  Michael Eisen,  Ash Alizadeh,  Ronald Levy,  Louis Staudt,  Wing Chan,
   1.223  David Botstein, and Patrick Brown.  ’Gene shaving’ as a method for identifying distinct sets of genes with
   1.224  similar expression patterns. Genome Biology, 1(2):research0003.1–research0003.21, 2000.
   1.225 -[10]Jano Hemert and Richard Baldock.  Matching Spatial Regions with Combinations of Interacting Gene Ex-
   1.226 +[8]Jano Hemert and Richard Baldock.  Matching Spatial Regions with Combinations of Interacting Gene Ex-
   1.227  pression Patterns, volume 13 of Communications in Computer and Information Science, pages 347–361.
   1.228  Springer Berlin Heidelberg, 2008.
   1.229 -[11]C Kemp, JB Tenenbaum, TL Griffiths, T Yamada, and N Ueda. Learning systems of concepts with an infinite
   1.230 +[9]C Kemp, JB Tenenbaum, TL Griffiths, T Yamada, and N Ueda. Learning systems of concepts with an infinite
   1.231  relational model. In AAAI, 2006.
   1.232 -[12]F. Kruggel,  M. K. Brckner,  Th. Arendt,  C. J. Wiggins,  and D. Y. von Cramon.   Analyzing the neocortical
   1.233 +[10]F. Kruggel,  M. K. Brckner,  Th. Arendt,  C. J. Wiggins,  and D. Y. von Cramon.   Analyzing the neocortical
   1.234  fine-structure. Medical Image Analysis, 7(3):251–264, September 2003.
   1.235 -[13]Erh-Fang Lee, Jyl Boline, and Arthur W. Toga.  A High-Resolution anatomical framework of the neonatal
   1.236 -mouse brain for managing gene expression data. Frontiers in Neuroinformatics, 1:6, 2007. PMC2525996.
   1.237 -[14]Susan Magdaleno, Patricia Jensen, Craig L. Brumwell, Anna Seal, Karen Lehman, Andrew Asbury, Tony
   1.238 -Cheung,  Tommie Cornelius,  Diana M. Batten,  Christopher Eden,  Shannon M. Norland,  Dennis S. Rice,
   1.239 -Nilesh  Dosooye,  Sundeep  Shakya,  Perdeep  Mehta,  and  Tom  Curran.   BGEM:  an  in  situ  hybridization
   1.240 -database of gene expression in the embryonic and adult mouse nervous system.  PLoS Biology, 4(4):e86
   1.241 -EP –, April 2006.
   1.242 -[15]Lydia Ng, Amy Bernard, Chris Lau, Caroline C Overly, Hong-Wei Dong, Chihchau Kuan, Sayan Pathak, Su-
   1.243 +[11]Ed S. Lein, Michael J. Hawrylycz, Nancy Ao, Mikael Ayres, Amy Bensinger, Amy Bernard, Andrew F. Boe,
   1.244 +Mark S. Boguski, Kevin S. Brockway, Emi J. Byrnes, Lin Chen, Li Chen, Tsuey-Ming Chen, Mei Chi Chin,
   1.245 +Jimmy Chong,  Brian E. Crook,  Aneta Czaplinska,  Chinh N. Dang,  Suvro Datta,  Nick R. Dee,  Aimee L.
   1.246 +Desaki,  Tsega  Desta,  Ellen  Diep,  Tim  A.  Dolbeare,  Matthew  J.  Donelan,  Hong-Wei  Dong,  Jennifer  G.
   1.247 +Dougherty,  Ben J. Duncan,  Amanda J. Ebbert,  Gregor Eichele,  Lili K. Estin,  Casey Faber,  Benjamin A.
   1.248 +Facer, Rick Fields, Shanna R. Fischer, Tim P. Fliss, Cliff Frensley, Sabrina N. Gates, Katie J. Glattfelder,
   1.249 +Kevin R. Halverson, Matthew R. Hart, John G. Hohmann, Maureen P. Howell, Darren P. Jeung, Rebecca A.
   1.250 +Johnson, Patrick T. Karr, Reena Kawal, Jolene M. Kidney, Rachel H. Knapik, Chihchau L. Kuan, James H.
   1.251 +Lake, Annabel R. Laramee, Kirk D. Larsen, Christopher Lau, Tracy A. Lemon, Agnes J. Liang, Ying Liu,
   1.252 +Lon T. Luong, Jesse Michaels, Judith J. Morgan, Rebecca J. Morgan, Marty T. Mortrud, Nerick F. Mosqueda,
   1.253 +Lydia L. Ng, Randy Ng, Geralyn J. Orta, Caroline C. Overly, Tu H. Pak, Sheana E. Parry, Sayan D. Pathak,
   1.254 +Owen C. Pearson, Ralph B. Puchalski, Zackery L. Riley, Hannah R. Rockett, Stephen A. Rowland, Joshua J.
   1.255 +Royall,  Marcos  J.  Ruiz,  Nadia  R.  Sarno,  Katherine  Schaffnit,  Nadiya  V.  Shapovalova,  Taz  Sivisay,  Clif-
   1.256 +ford R. Slaughterbeck, Simon C. Smith, Kimberly A. Smith, Bryan I. Smith, Andy J. Sodt, Nick N. Stewart,
   1.257 +Kenda-Ruth Stumpf, Susan M. Sunkin, Madhavi Sutram, Angelene Tam, Carey D. Teemer, Christina Thaller,
   1.258 +Carol L. Thompson, Lee R. Varnam, Axel Visel, Ray M. Whitlock, Paul E. Wohnoutka, Crissa K. Wolkey,
   1.259 +Victoria Y. Wong, Matthew Wood, Murat B. Yaylaoglu, Rob C. Young, Brian L. Youngstrom, Xu Feng Yuan,
   1.260 +Bin Zhang, Theresa A. Zwingman, and Allan R. Jones. Genome-wide atlas of gene expression in the adult
   1.261 +mouse brain. Nature, 445(7124):168–176, 2007.
   1.262 +[12]Lydia Ng, Amy Bernard, Chris Lau, Caroline C Overly, Hong-Wei Dong, Chihchau Kuan, Sayan Pathak, Su-
   1.263  san M Sunkin, Chinh Dang, Jason W Bohland, Hemant Bokil, Partha P Mitra, Luis Puelles, John Hohmann,
   1.264  David J Anderson, Ed S Lein, Allan R Jones, and Michael Hawrylycz.  An anatomic gene expression atlas
   1.265  of the adult mouse brain. Nat Neurosci, 12(3):356–362, March 2009.
   1.266 -[16]Christopher J. Paciorek. Computational techniques for spatial logistic regression with large data sets. Com-
   1.267 +[13]Christopher J. Paciorek. Computational techniques for spatial logistic regression with large data sets. Com-
   1.268  putational Statistics & Data Analysis, 51(8):3631–3653, May 2007.
   1.269 -[17]George Paxinos and Keith B.J. Franklin.  The Mouse Brain in Stereotaxic Coordinates.  Academic Press, 2
   1.270 +[14]George Paxinos and Keith B.J. Franklin.  The Mouse Brain in Stereotaxic Coordinates.  Academic Press, 2
   1.271  edition, July 2001.
   1.272 -[18]A.  Schleicher,  N.  Palomero-Gallagher,  P.  Morosan,  S.  Eickhoff,  T.  Kowalski,  K.  Vos,  K.  Amunts,  and
   1.273 +[15]A.  Schleicher,  N.  Palomero-Gallagher,  P.  Morosan,  S.  Eickhoff,  T.  Kowalski,  K.  Vos,  K.  Amunts,  and
   1.274  K.  Zilles.   Quantitative  architectural  analysis:  a  new  approach  to  cortical  mapping.   Anatomy  and  Em-
   1.275  bryology, 210(5):373–386, December 2005.
   1.276 -[19]Oliver Schmitt, Lars Hmke, and Lutz Dmbgen.   Detection of cortical transition regions utilizing statistical
   1.277 +[16]Oliver Schmitt, Lars Hmke, and Lutz Dmbgen.   Detection of cortical transition regions utilizing statistical
   1.278  analyses of excess masses. NeuroImage, 19(1):42–63, May 2003.
   1.279 -[20]Constance  M.  Smith,  Jacqueline  H.  Finger,  Terry  F.  Hayamizu,  Ingeborg  J.  McCright,  Janan  T.  Eppig,
   1.280 -James A. Kadin, Joel E. Richardson, and Martin Ringwald.  The mouse gene expression database (GXD):
   1.281 -2007 update. Nucl. Acids Res., 35(suppl_1):D618–623, 2007.
   1.282 -[21]Judy  Sprague,  Leyla  Bayraktaroglu,  Dave  Clements,  Tom  Conlin,  David  Fashena,  Ken  Frazer,  Melissa
   1.283 -Haendel,  Douglas  G  Howe,  Prita  Mani,  Sridhar  Ramachandran,  Kevin  Schaper,  Erik  Segerdell,  Peiran
   1.284 -Song, Brock Sprunger, Sierra Taylor, Ceri E Van Slyke, and Monte Westerfield.  The zebrafish information
   1.285 -network:  the zebrafish model organism database.  Nucleic Acids Research, 34(Database issue):D581–5,
   1.286 -2006. PMID: 16381936.
   1.287 -[22]Larry Swanson. Brain Maps: Structure of the Rat Brain. Academic Press, 3 edition, November 2003.
   1.288 -[23]Carol L. Thompson, Sayan D. Pathak, Andreas Jeromin, Lydia L. Ng, Cameron R. MacPherson, Marty T.
   1.289 +[17]Larry Swanson. Brain Maps: Structure of the Rat Brain. Academic Press, 3 edition, November 2003.
   1.290 +[18]Carol L. Thompson, Sayan D. Pathak, Andreas Jeromin, Lydia L. Ng, Cameron R. MacPherson, Marty T.
   1.291  Mortrud,  Allison Cusick,  Zackery L. Riley,  Susan M. Sunkin,  Amy Bernard,  Ralph B. Puchalski,  Fred H.
   1.292  Gage, Allan R. Jones, Vladimir B. Bajic, Michael J. Hawrylycz, and Ed S. Lein.  Genomic anatomy of the
   1.293  hippocampus. Neuron, 60(6):1010–1021, December 2008.
   1.294 -[24]Pavel Tomancak,  Amy Beaton,  Richard Weiszmann,  Elaine Kwan,  ShengQiang Shu,  Suzanna E Lewis,
   1.295 -Stephen Richards, Michael Ashburner, Volker Hartenstein, Susan E Celniker, and Gerald M Rubin.  Sys-
   1.296 -tematic determination of patterns of gene expression during drosophila embryogenesis.  Genome Biology,
   1.297 -3(12):research008818814, 2002. PMC151190.
   1.298 -[25]Jano van Hemert and Richard Baldock. Mining Spatial Gene Expression Data for Association Rules, volume
   1.299 -4414/2007 of Lecture Notes in Computer Science, pages 66–76. Springer Berlin / Heidelberg, 2007.
   1.300 -[26]Shanmugasundaram  Venkataraman,  Peter  Stevenson,  Yiya  Yang,  Lorna  Richardson,  Nicholas  Burton,
   1.301 +[19]Shanmugasundaram  Venkataraman,  Peter  Stevenson,  Yiya  Yang,  Lorna  Richardson,  Nicholas  Burton,
   1.302  Thomas  P.  Perry,  Paul  Smith,  Richard  A.  Baldock,  Duncan  R.  Davidson,  and  Jeffrey  H.  Christiansen.
   1.303  EMAGE edinburgh mouse atlas of gene expression:  2008 update.  Nucl. Acids Res., 36(suppl_1):D860–
   1.304  865, 2008.
   1.305 -[27]Axel Visel, Christina Thaller, and Gregor Eichele.  GenePaint.org:  an atlas of gene expression patterns in
   1.306 -the mouse embryo. Nucl. Acids Res., 32(suppl_1):D552–556, 2004.
   1.307 -[28]Robert H Waterston, Kerstin Lindblad-Toh, Ewan Birney, Jane Rogers, Josep F Abril, Pankaj Agarwal, Richa
   1.308 +[20]Robert H Waterston, Kerstin Lindblad-Toh, Ewan Birney, Jane Rogers, Josep F Abril, Pankaj Agarwal, Richa
   1.309  Agarwala,  Rachel Ainscough,  Marina Alexandersson,  Peter An,  Stylianos E Antonarakis,  John Attwood,
   1.310  Robert Baertsch, Jonathon Bailey, Karen Barlow, Stephan Beck, Eric Berry, Bruce Birren, Toby Bloom, Peer
   1.311  Bork, Marc Botcherby, Nicolas Bray, Michael R Brent, Daniel G Brown, Stephen D Brown, Carol Bult, John
     2.1 Binary file grant.odt has changed
     3.1 Binary file grant.pdf has changed
     4.1 --- a/grant.txt	Wed Apr 22 07:39:32 2009 -0700
     4.2 +++ b/grant.txt	Wed Apr 22 14:51:24 2009 -0700
     4.3 @@ -28,7 +28,9 @@
     4.4  
     4.5  == Specific aims ==
     4.6  
     4.7 -Massive new datasets obtained with techniques such as in situ hybridization (ISH), immunohistochemistry, in situ transgenic reporter, microarray voxelation, and others, allow the expression levels of many genes at many locations to be compared. Our goal is to develop automated methods to relate spatial variation in gene expression to anatomy. We want to find marker genes for specific anatomical regions, and also to draw new anatomical maps based on gene expression patterns. We have three specific aims:\\
     4.8 +Massive new datasets obtained with techniques such as in situ hybridization (ISH), immunohistochemistry, in situ transgenic reporter, microarray voxelation, and others, allow the expression levels of many genes at many locations to be compared. Our goal is to develop automated methods to relate spatial variation in gene expression to anatomy. We want to find marker genes for specific anatomical regions, and also to draw new anatomical maps based on gene expression patterns. We will validate these methods by applying them to 46 anatomical areas within the cerebral cortex, by using the Allen Mouse Brain Atlas coronal dataset (ABA). This gene expression dataset was generated using ISH, and contains over 4,000 genes. For each gene, a digitized 3-D raster of the expression pattern is available: for each gene, the level of expression at each of 51,533 voxels is recorded.
     4.9 +
    4.10 +We have three specific aims:\\
    4.11  
    4.12  (1) develop an algorithm to screen spatial gene expression data for combinations of marker genes which selectively target anatomical regions\\
    4.13  
    4.14 @@ -250,11 +252,11 @@
    4.15  
    4.16  \vspace{0.3cm}**The Allen Mouse Brain Atlas dataset**
    4.17  
    4.18 -The Allen Mouse Brain Atlas (ABA) data were produced by doing in-situ hybridization on slices of male, 56-day-old C57BL/6J mouse brains. Pictures were taken of the processed slice, and these pictures were semi-automatically analyzed to create a digital measurement of gene expression levels at each location in each slice. Per slice, cellular spatial resolution is achieved. Using this method, a single physical slice can only be used to measure one single gene; many different mouse brains were needed in order to measure the expression of many genes. 
    4.19 -
    4.20 -An automated nonlinear alignment procedure located the 2D data from the various slices in a single 3D coordinate system. In the final 3D coordinate system, voxels are cubes with 200 microns on a side. There are 67x41x58 \= 159,326 voxels in the 3D coordinate system, of which 51,533 are in the brain\cite{ng_anatomic_2009}.
    4.21 -
    4.22 -Mus musculus is thought to contain about 22,000 protein-coding genes\cite{waterston_initial_2002}. The ABA contains data on about 20,000 genes in sagittal sections, out of which over 4,000 genes are also measured in coronal sections. Our dataset is derived from only the coronal subset of the ABA\footnote{The sagittal data do not cover the entire cortex, and also have greater registration error\cite{ng_anatomic_2009}. Genes were selected by the Allen Institute for coronal sectioning based on, "classes of known neuroscientific interest... or through post hoc identification of a marked non-ubiquitous expression pattern"\cite{ng_anatomic_2009}.}.
    4.23 +The Allen Mouse Brain Atlas (ABA) data\cite{lein_genome-wide_2007} were produced by doing in-situ hybridization on slices of male, 56-day-old C57BL/6J mouse brains. Pictures were taken of the processed slice, and these pictures were semi-automatically analyzed to create a digital measurement of gene expression levels at each location in each slice. Per slice, cellular spatial resolution is achieved. Using this method, a single physical slice can only be used to measure one single gene; many different mouse brains were needed in order to measure the expression of many genes. 
    4.24 +
    4.25 +Mus musculus is thought to contain about 22,000 protein-coding genes\cite{waterston_initial_2002}. The ABA contains data on about 20,000 genes in sagittal sections, out of which over 4,000 genes are also measured in coronal sections. Our dataset is derived from only the coronal subset of the ABA\footnote{The sagittal data do not cover the entire cortex, and also have greater registration error\cite{ng_anatomic_2009}. Genes were selected by the Allen Institute for coronal sectioning based on, "classes of known neuroscientific interest... or through post hoc identification of a marked non-ubiquitous expression pattern"\cite{ng_anatomic_2009}.}. An automated nonlinear alignment procedure located the 2D data from the various slices in a single 3D coordinate system. In the final 3D coordinate system, voxels are cubes with 200 microns on a side. There are 67x41x58 \= 159,326 voxels, of which 51,533 are in the brain\cite{ng_anatomic_2009}. For each voxel and each gene, the expression energy\cite{lein_genome-wide_2007} within that voxel is made available.
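The voxel arithmetic above can be checked with a short sketch. The grid extents, voxel size, and brain-voxel count come from the text; the row-major storage order and the function names are assumptions made for illustration only, not a description of how the ABA files are laid out.

```python
# Grid extents and counts as stated in the text.
GRID_SHAPE = (67, 41, 58)   # voxels along each axis of the 3D coordinate system
VOXEL_SIZE_UM = 200         # each voxel is a cube, 200 microns on a side
BRAIN_VOXELS = 51_533       # voxels that fall inside the brain


def total_voxels(shape=GRID_SHAPE):
    """Number of voxels in the full bounding grid."""
    n = 1
    for extent in shape:
        n *= extent
    return n


def flat_index(i, j, k, shape=GRID_SHAPE):
    """Row-major linear index of voxel (i, j, k).

    Row-major order is an assumed convention, not taken from the source.
    """
    _, ny, nz = shape
    return (i * ny + j) * nz + k


def brain_fraction():
    """Fraction of the bounding grid occupied by brain voxels."""
    return BRAIN_VOXELS / total_voxels()
```

Under these assumptions, roughly a third of the bounding grid lies inside the brain, so a per-gene expression vector can be stored compactly by keeping only the in-brain voxels.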
    4.26 +
    4.27 +
    4.28  
    4.29  %%The ABA is not the only large public spatial gene expression dataset. Other such resources include GENSAT\cite{gong_gene_2003}, GenePaint\cite{visel_genepaint.org:atlas_2004}, its sister project GeneAtlas\cite{carson_digital_2005}, BGEM\cite{magdaleno_bgem:in_2006}, EMAGE\cite{venkataraman_emage_2008}, EurExpress\footnote{http://www.eurexpress.org/ee/; EurExpress data are also entered into EMAGE}, EADHB\footnote{http://www.ncl.ac.uk/ihg/EADHB/database/EADHB_database.html}, MAMEP\footnote{http://mamep.molgen.mpg.de/index.php}, Xenbase\footnote{http://xenbase.org/}, ZFIN\cite{sprague_zebrafish_2006}, Aniseed\footnote{http://aniseed-ibdm.univ-mrs.fr/}, VisiGene\footnote{http://genome.ucsc.edu/cgi-bin/hgVisiGene ; includes data from some of the other listed data sources}, GEISHA\cite{bell_geishawhole-mount_2004}, Fruitfly.org\cite{tomancak_systematic_2002}, COMPARE\footnote{http://compare.ibdml.univ-mrs.fr/}, GXD\cite{smith_mouse_2007}, GEO\cite{barrett_ncbi_2007}\footnote{GXD and GEO contain spatial data but also non-spatial data. All GXD spatial data are also in EMAGE.}. With the exception of the ABA, GenePaint, and EMAGE, most of these resources have not (yet) extracted the expression intensity from the ISH images and registered the results into a single 3-D space, and to our knowledge only ABA and EMAGE make this form of data available for public download from the website\footnote{without prior offline registration}. Many of these resources focus on developmental gene expression.
    4.30  
    4.31 @@ -319,7 +321,7 @@
    4.32  === Flatmap of cortex ===
    4.33  
    4.34  
    4.35 -We downloaded the ABA data and applied a mask to select only those voxels which belong to cerebral cortex. We divided the cortex into hemispheres. Using Caret\cite{van_essen_integrated_2001}, we created a mesh representation of the surface of the selected voxels. For each gene, and for each node of the mesh, we calculated an average of the gene expression of the voxels "underneath" that mesh node. We then flattened the cortex, creating a two-dimensional mesh. We sampled the nodes of the irregular, flat mesh in order to create a regular grid of pixel values. We converted this grid into a MATLAB matrix. We manually traced the boundaries of each of 49 cortical areas from the ABA coronal reference atlas slides. We then converted these manual traces into Caret-format regional boundary data on the mesh surface. We projected the regions onto the 2-d mesh, and then onto the grid, and then we converted the region data into MATLAB format.
    4.36 +We downloaded the ABA data and applied a mask to select only those voxels which belong to cerebral cortex. We divided the cortex into hemispheres. Using Caret\cite{van_essen_integrated_2001}, we created a mesh representation of the surface of the selected voxels. For each gene, and for each node of the mesh, we calculated an average of the gene expression of the voxels "underneath" that mesh node. We then flattened the cortex, creating a two-dimensional mesh. We sampled the nodes of the irregular, flat mesh in order to create a regular grid of pixel values. We converted this grid into a MATLAB matrix. We manually traced the boundaries of each of 46 cortical areas from the ABA coronal reference atlas slides. We then converted these manual traces into Caret-format regional boundary data on the mesh surface. We projected the regions onto the 2-d mesh, and then onto the grid, and then we converted the region data into MATLAB format.
    4.37  
    4.38  At this point, the data are in the form of a number of 2-D matrices, all in registration, with the matrix entries representing a grid of points (pixels) over the cortical surface. There is one 2-D matrix whose entries represent the regional label associated with each surface pixel. And for each gene, there is a 2-D matrix whose entries represent the average expression level underneath each surface pixel. We created a normalized version of the gene expression data by subtracting each gene's mean expression level (over all surface pixels) and dividing the expression level of each gene by its standard deviation. The features and the target area are both functions on the surface pixels. They can be referred to as scalar fields over the space of surface pixels; alternately, they can be thought of as images which can be displayed on the flatmapped surface. 
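The normalization described above (subtract each gene's mean over all surface pixels, then divide by its standard deviation) can be sketched as follows. This is a minimal illustration in plain Python rather than the MATLAB code used in the project. The function name, the list-of-lists representation, the use of the population standard deviation, and the handling of a constant image are all assumptions.

```python
from statistics import fmean, pstdev


def normalize_gene(grid):
    """Z-score one gene's 2-D expression grid over all surface pixels.

    `grid` is a list of rows of floats (hypothetical representation).
    The population standard deviation is an assumption; the text does not
    say which estimator was used.
    """
    values = [v for row in grid for v in row]
    mean = fmean(values)
    sd = pstdev(values)
    if sd == 0:
        # A constant image carries no spatial information; return zeros
        # instead of dividing by zero (an illustrative choice).
        return [[0.0 for _ in row] for row in grid]
    return [[(v - mean) / sd for v in row] for row in grid]
```

After this step every gene's image has mean zero and unit variance over the surface pixels, so genes with very different overall expression levels can be compared on the same scale.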
    4.39  
    4.40 @@ -519,7 +521,7 @@
    4.41  \label{dimReduc}\end{wrapfigure}
    4.42  
    4.43  \vspace{0.3cm}**Scoring measures and feature selection** 
    4.44 -%%We will develop scoring methods for evaluating how good individual genes are at marking areas. We will compare pointwise, geometric, and information-theoretic measures. We already developed one entirely new scoring method (gradient similarity), but we may develop more. Scoring measures that we will explore will include the L1 norm, correlation, expression energy ratio, conditional entropy, gradient similarity, Jaccard similarity, Dice similarity, Hough transform, and statistical tests such as Hotelling's T-square test (a multivariate generalization of Student's t-test), ANOVA, and a multivariate version of the Mann-Whitney U test (a non-parametric test). 
    4.45 +%%We will develop scoring methods for evaluating how good individual genes are at marking areas. We will compare pointwise, geometric, and information-theoretic measures. We already developed one entirely new scoring method (gradient similarity), and we plan to develop more. Scoring measures that we will explore will include the L1 norm, correlation, expression energy ratio, conditional entropy, gradient similarity, Jaccard similarity, Dice similarity, Hough transform, and statistical tests such as Hotelling's T-square test (a multivariate generalization of Student's t-test), ANOVA, and a multivariate version of the Mann-Whitney U test (a non-parametric test). 
    4.46  We will develop scoring methods for evaluating how good individual genes are at marking areas. We will compare pointwise, geometric, and information-theoretic measures. We already developed one entirely new scoring method (gradient similarity), but we may develop more. Scoring measures that we will explore will include the L1 norm, correlation, expression energy ratio, conditional entropy, gradient similarity, Jaccard similarity, Dice similarity, Hough transform, and statistical tests such as Student's t-test, and the Mann-Whitney U test (a non-parametric test). In addition, any classifier induces a scoring measure on genes by taking the prediction error when using that gene to predict the target. 
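Two of the pointwise measures listed above, Jaccard similarity and Dice similarity, can be sketched for a single gene and a single target area. This is an illustrative sketch only: thresholding the expression image to obtain a binary mask, the threshold value, and all function names are assumptions and are not taken from the proposal.

```python
def binarize(expr, threshold):
    """Turn an expression image into a boolean mask (assumed preprocessing)."""
    return [[v > threshold for v in row] for row in expr]


def _overlap_counts(mask_a, mask_b):
    """Return (|A and B|, |A|, |B|) for two boolean masks of equal shape."""
    inter = size_a = size_b = 0
    for row_a, row_b in zip(mask_a, mask_b):
        for a, b in zip(row_a, row_b):
            inter += a and b
            size_a += a
            size_b += b
    return inter, size_a, size_b


def jaccard(mask_a, mask_b):
    """|A and B| / |A or B|; defined here as 0.0 when both masks are empty."""
    inter, size_a, size_b = _overlap_counts(mask_a, mask_b)
    union = size_a + size_b - inter
    return inter / union if union else 0.0


def dice(mask_a, mask_b):
    """2 |A and B| / (|A| + |B|); defined here as 0.0 when both masks are empty."""
    inter, size_a, size_b = _overlap_counts(mask_a, mask_b)
    total = size_a + size_b
    return 2 * inter / total if total else 0.0
```

To rank genes for one cortical area under these assumptions, one would binarize each gene's normalized image, score it against the area's mask, and sort the genes by score.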
    4.47  
    4.48  Using some combination of these measures, we will develop a procedure to find single marker genes for anatomical regions: for each cortical area, we will rank the genes by their ability to delineate each area. We will quantitatively compare the list of single genes generated by our method to the lists generated by previous methods which are mentioned in Aim 1 Related Work.
