cg
diff grant.txt @ 106:ffa1390e4f39
.
author | bshanks@bshanks.dyndns.org |
---|---|
date | Wed Apr 22 14:51:24 2009 -0700 (16 years ago) |
parents | 6c48f37d0f0c |
children | a38cc9a46200 |
line diff
1.1 --- a/grant.txt Wed Apr 22 07:39:32 2009 -0700
1.2 +++ b/grant.txt Wed Apr 22 14:51:24 2009 -0700
1.3 @@ -28,7 +28,9 @@
1.4
1.5 == Specific aims ==
1.6
1.7 -Massive new datasets obtained with techniques such as in situ hybridization (ISH), immunohistochemistry, in situ transgenic reporter, microarray voxelation, and others, allow the expression levels of many genes at many locations to be compared. Our goal is to develop automated methods to relate spatial variation in gene expression to anatomy. We want to find marker genes for specific anatomical regions, and also to draw new anatomical maps based on gene expression patterns. We have three specific aims:\\
1.8 +Massive new datasets obtained with techniques such as in situ hybridization (ISH), immunohistochemistry, in situ transgenic reporter, microarray voxelation, and others, allow the expression levels of many genes at many locations to be compared. Our goal is to develop automated methods to relate spatial variation in gene expression to anatomy. We want to find marker genes for specific anatomical regions, and also to draw new anatomical maps based on gene expression patterns. We will validate these methods by applying them to 46 anatomical areas within the cerebral cortex, by using the Allen Mouse Brain Atlas coronal dataset (ABA). This gene expression dataset was generated using ISH, and contains over 4,000 genes. For each gene, a digitized 3-D raster of the expression pattern is available: for each gene, the level of expression at each of 51,533 voxels is recorded.
1.9 +
1.10 +We have three specific aims:\\
1.11
1.12 (1) develop an algorithm to screen spatial gene expression data for combinations of marker genes which selectively target anatomical regions\\
1.13
1.14 @@ -250,11 +252,11 @@
1.15
1.16 \vspace{0.3cm}**The Allen Mouse Brain Atlas dataset**
1.17
1.18 -The Allen Mouse Brain Atlas (ABA) data were produced by doing in-situ hybridization on slices of male, 56-day-old C57BL/6J mouse brains. Pictures were taken of the processed slice, and these pictures were semi-automatically analyzed to create a digital measurement of gene expression levels at each location in each slice. Per slice, cellular spatial resolution is achieved. Using this method, a single physical slice can only be used to measure one single gene; many different mouse brains were needed in order to measure the expression of many genes.
1.19 -
1.20 -An automated nonlinear alignment procedure located the 2D data from the various slices in a single 3D coordinate system. In the final 3D coordinate system, voxels are cubes with 200 microns on a side. There are 67x41x58 \= 159,326 voxels in the 3D coordinate system, of which 51,533 are in the brain\cite{ng_anatomic_2009}.
1.21 -
1.22 -Mus musculus is thought to contain about 22,000 protein-coding genes\cite{waterston_initial_2002}. The ABA contains data on about 20,000 genes in sagittal sections, out of which over 4,000 genes are also measured in coronal sections. Our dataset is derived from only the coronal subset of the ABA\footnote{The sagittal data do not cover the entire cortex, and also have greater registration error\cite{ng_anatomic_2009}. Genes were selected by the Allen Institute for coronal sectioning based on, "classes of known neuroscientific interest... or through post hoc identification of a marked non-ubiquitous expression pattern"\cite{ng_anatomic_2009}.}.
1.23 +The Allen Mouse Brain Atlas (ABA) data\cite{lein_genome-wide_2007} were produced by doing in-situ hybridization on slices of male, 56-day-old C57BL/6J mouse brains. Pictures were taken of the processed slice, and these pictures were semi-automatically analyzed to create a digital measurement of gene expression levels at each location in each slice. Per slice, cellular spatial resolution is achieved. Using this method, a single physical slice can only be used to measure one single gene; many different mouse brains were needed in order to measure the expression of many genes.
1.24 +
1.25 +Mus musculus is thought to contain about 22,000 protein-coding genes\cite{waterston_initial_2002}. The ABA contains data on about 20,000 genes in sagittal sections, out of which over 4,000 genes are also measured in coronal sections. Our dataset is derived from only the coronal subset of the ABA\footnote{The sagittal data do not cover the entire cortex, and also have greater registration error\cite{ng_anatomic_2009}. Genes were selected by the Allen Institute for coronal sectioning based on, "classes of known neuroscientific interest... or through post hoc identification of a marked non-ubiquitous expression pattern"\cite{ng_anatomic_2009}.}. An automated nonlinear alignment procedure located the 2D data from the various slices in a single 3D coordinate system. In the final 3D coordinate system, voxels are cubes with 200 microns on a side. There are 67x41x58 \= 159,326 voxels, of which 51,533 are in the brain\cite{ng_anatomic_2009}. For each voxel and each gene, the expression energy\cite{lein_genome-wide_2007} within that voxel is made available.
1.26 +
1.27 +
1.28
1.29 %%The ABA is not the only large public spatial gene expression dataset. Other such resources include GENSAT\cite{gong_gene_2003}, GenePaint\cite{visel_genepaint.org:atlas_2004}, its sister project GeneAtlas\cite{carson_digital_2005}, BGEM\cite{magdaleno_bgem:in_2006}, EMAGE\cite{venkataraman_emage_2008}, EurExpress\footnote{http://www.eurexpress.org/ee/; EurExpress data are also entered into EMAGE}, EADHB\footnote{http://www.ncl.ac.uk/ihg/EADHB/database/EADHB_database.html}, MAMEP\footnote{http://mamep.molgen.mpg.de/index.php}, Xenbase\footnote{http://xenbase.org/}, ZFIN\cite{sprague_zebrafish_2006}, Aniseed\footnote{http://aniseed-ibdm.univ-mrs.fr/}, VisiGene\footnote{http://genome.ucsc.edu/cgi-bin/hgVisiGene ; includes data from some the other listed data sources}, GEISHA\cite{bell_geishawhole-mount_2004}, Fruitfly.org\cite{tomancak_systematic_2002}, COMPARE\footnote{http://compare.ibdml.univ-mrs.fr/} GXD\cite{smith_mouse_2007}, GEO\cite{barrett_ncbi_2007}\footnote{GXD and GEO contain spatial data but also non-spatial data. All GXD spatial data are also in EMAGE.}. With the exception of the ABA, GenePaint, and EMAGE, most of these resources have not (yet) extracted the expression intensity from the ISH images and registered the results into a single 3-D space, and to our knowledge only ABA and EMAGE make this form of data available for public download from the website\footnote{without prior offline registration}. Many of these resources focus on developmental gene expression.
1.30
1.31 @@ -319,7 +321,7 @@
1.32 === Flatmap of cortex ===
1.33
1.34
1.35 -We downloaded the ABA data and applied a mask to select only those voxels which belong to cerebral cortex. We divided the cortex into hemispheres. Using Caret\cite{van_essen_integrated_2001}, we created a mesh representation of the surface of the selected voxels. For each gene, and for each node of the mesh, we calculated an average of the gene expression of the voxels "underneath" that mesh node. We then flattened the cortex, creating a two-dimensional mesh. We sampled the nodes of the irregular, flat mesh in order to create a regular grid of pixel values. We converted this grid into a MATLAB matrix. We manually traced the boundaries of each of 49 cortical areas from the ABA coronal reference atlas slides. We then converted these manual traces into Caret-format regional boundary data on the mesh surface. We projected the regions onto the 2-d mesh, and then onto the grid, and then we converted the region data into MATLAB format.
1.36 +We downloaded the ABA data and applied a mask to select only those voxels which belong to cerebral cortex. We divided the cortex into hemispheres. Using Caret\cite{van_essen_integrated_2001}, we created a mesh representation of the surface of the selected voxels. For each gene, and for each node of the mesh, we calculated an average of the gene expression of the voxels "underneath" that mesh node. We then flattened the cortex, creating a two-dimensional mesh. We sampled the nodes of the irregular, flat mesh in order to create a regular grid of pixel values. We converted this grid into a MATLAB matrix. We manually traced the boundaries of each of 46 cortical areas from the ABA coronal reference atlas slides. We then converted these manual traces into Caret-format regional boundary data on the mesh surface. We projected the regions onto the 2-d mesh, and then onto the grid, and then we converted the region data into MATLAB format.
1.37
1.38 At this point, the data are in the form of a number of 2-D matrices, all in registration, with the matrix entries representing a grid of points (pixels) over the cortical surface. There is one 2-D matrix whose entries represent the regional label associated with each surface pixel. And for each gene, there is a 2-D matrix whose entries represent the average expression level underneath each surface pixel. We created a normalized version of the gene expression data by subtracting each gene's mean expression level (over all surface pixels) and dividing the expression level of each gene by its standard deviation. The features and the target area are both functions on the surface pixels. They can be referred to as scalar fields over the space of surface pixels; alternately, they can be thought of as images which can be displayed on the flatmapped surface.
1.39
1.40 @@ -519,7 +521,7 @@
1.41 \label{dimReduc}\end{wrapfigure}
1.42
1.43 \vspace{0.3cm}**Scoring measures and feature selection**
1.44 -%%We will develop scoring methods for evaluating how good individual genes are at marking areas. We will compare pointwise, geometric, and information-theoretic measures. We already developed one entirely new scoring method (gradient similarity), but we may develop more. Scoring measures that we will explore will include the L1 norm, correlation, expression energy ratio, conditional entropy, gradient similarity, Jaccard similarity, Dice similarity, Hough transform, and statistical tests such as Hotelling's T-square test (a multivariate generalization of Student's t-test), ANOVA, and a multivariate version of the Mann-Whitney U test (a non-parametric test).
1.45 +%%We will develop scoring methods for evaluating how good individual genes are at marking areas. We will compare pointwise, geometric, and information-theoretic measures. We already developed one entirely new scoring method (gradient similarity), and we plan to develop more. Scoring measures that we will explore will include the L1 norm, correlation, expression energy ratio, conditional entropy, gradient similarity, Jaccard similarity, Dice similarity, Hough transform, and statistical tests such as Hotelling's T-square test (a multivariate generalization of Student's t-test), ANOVA, and a multivariate version of the Mann-Whitney U test (a non-parametric test).
1.46 We will develop scoring methods for evaluating how good individual genes are at marking areas. We will compare pointwise, geometric, and information-theoretic measures. We already developed one entirely new scoring method (gradient similarity), but we may develop more. Scoring measures that we will explore will include the L1 norm, correlation, expression energy ratio, conditional entropy, gradient similarity, Jaccard similarity, Dice similarity, Hough transform, and statistical tests such as Student's t-test, and the Mann-Whitney U test (a non-parametric test). In addition, any classifier induces a scoring measure on genes by taking the prediction error when using that gene to predict the target.
1.47
1.48 Using some combination of these measures, we will develop a procedure to find single marker genes for anatomical regions: for each cortical area, we will rank the genes by their ability to delineate each area. We will quantitatively compare the list of single genes generated by our method to the lists generated by previous methods which are mentioned in Aim 1 Related Work.