
diff grant.txt @ 108:a38cc9a46200

author bshanks@bshanks-salk.dyndns.org
date Wed Apr 22 22:24:24 2009 -0700
parents ffa1390e4f39
children a6b99bc50476
--- a/grant.txt Wed Apr 22 14:51:24 2009 -0700
+++ b/grant.txt Wed Apr 22 22:24:24 2009 -0700
@@ -47,7 +47,7 @@
 
 \newpage
 
-== The challenge topic ==
+== Analysis of high dimensional data for genomic anatomy in the brain ==
 
 This proposal addresses challenge topic 06-HG-101. Massive new datasets obtained with techniques such as in situ hybridization (ISH), immunohistochemistry, in situ transgenic reporter, microarray voxelation, and others, allow the expression levels of many genes at many locations to be compared. Our goal is to develop automated methods to relate spatial variation in gene expression to anatomy. We want to find marker genes for specific anatomical regions, and also to draw new anatomical maps based on gene expression patterns.
 
@@ -252,9 +252,15 @@
 
 \vspace{0.3cm}**The Allen Mouse Brain Atlas dataset**
 
-The Allen Mouse Brain Atlas (ABA) data\cite{lein_genome-wide_2007} were produced by doing in-situ hybridization on slices of male, 56-day-old C57BL/6J mouse brains. Pictures were taken of the processed slice, and these pictures were semi-automatically analyzed to create a digital measurement of gene expression levels at each location in each slice. Per slice, cellular spatial resolution is achieved. Using this method, a single physical slice can only be used to measure one single gene; many different mouse brains were needed in order to measure the expression of many genes.
-
-Mus musculus is thought to contain about 22,000 protein-coding genes\cite{waterston_initial_2002}. The ABA contains data on about 20,000 genes in sagittal sections, out of which over 4,000 genes are also measured in coronal sections. Our dataset is derived from only the coronal subset of the ABA\footnote{The sagittal data do not cover the entire cortex, and also have greater registration error\cite{ng_anatomic_2009}. Genes were selected by the Allen Institute for coronal sectioning based on, "classes of known neuroscientific interest... or through post hoc identification of a marked non-ubiquitous expression pattern"\cite{ng_anatomic_2009}.}. An automated nonlinear alignment procedure located the 2D data from the various slices in a single 3D coordinate system. In the final 3D coordinate system, voxels are cubes with 200 microns on a side. There are 67x41x58 \= 159,326 voxels, of which 51,533 are in the brain\cite{ng_anatomic_2009}. For each voxel and each gene, the expression energy\cite{lein_genome-wide_2007} within that voxel is made available.
+%%The Allen Mouse Brain Atlas (ABA) data\cite{lein_genome-wide_2007}
+
+The Allen Mouse Brain Atlas (ABA) data were produced by doing in-situ hybridization on slices of male, 56-day-old C57BL/6J mouse brains. Pictures were taken of the processed slice, and these pictures were semi-automatically analyzed to create a digital measurement of gene expression levels at each location in each slice. Per slice, cellular spatial resolution is achieved. Using this method, a single physical slice can only be used to measure one single gene; many different mouse brains were needed in order to measure the expression of many genes.
+
+%%Mus musculus is thought to contain about 22,000 protein-coding genes\cite{waterston_initial_2002}.
+
+Mus musculus is thought to contain about 22,000 protein-coding genes. The ABA contains data on about 20,000 genes in sagittal sections, out of which over 4,000 genes are also measured in coronal sections. Our dataset is derived from only the coronal subset of the ABA\footnote{The sagittal data do not cover the entire cortex, and also have greater registration error\cite{ng_anatomic_2009}. Genes were selected by the Allen Institute for coronal sectioning based on, "classes of known neuroscientific interest... or through post hoc identification of a marked non-ubiquitous expression pattern"\cite{ng_anatomic_2009}.}. An automated nonlinear alignment procedure located the 2D data from the various slices in a single 3D coordinate system. In the final 3D coordinate system, voxels are cubes with 200 microns on a side. There are 67x41x58 \= 159,326 voxels, of which 51,533 are in the brain\cite{ng_anatomic_2009}. For each voxel and each gene, the expression energy within that voxel is made available.
+
+%% For each voxel and each gene, the expression energy\cite{lein_genome-wide_2007} within that voxel is made available.
 
 
 
@@ -537,8 +543,10 @@
 
 A future publication on the method that we develop in Aim 1 will review the scoring measures and quantitatively compare their performance in order to provide a foundation for future research of methods of marker gene finding. We will measure the robustness of the scoring measures as well as their absolute performance on our dataset.
 
+%% (including spatial models\cite{paciorek_computational_2007})
+
 \vspace{0.3cm}**Classifiers**
-We will explore and compare different classifiers. As noted above, this activity is not separate from the previous one, because some supervised learning algorithms include feature selection, and any classifier can be combined with a stepwise wrapper for use as a feature selection method. We will explore logistic regression (including spatial models\cite{paciorek_computational_2007}), decision trees\footnote{Actually, we have already begun to explore decision trees. For each cortical area, we have used the C4.5 algorithm to find a decision tree for that area. We achieved good classification accuracy on our training set, but the number of genes that appeared in each tree was too large. We plan to implement a pruning procedure to generate trees that use fewer genes.}, sparse SVMs, generative mixture models (including naive bayes), kernel density estimation, instance-based learning methods (such as k-nearest neighbor), genetic algorithms, and artificial neural networks.
+We will explore and compare different classifiers. As noted above, this activity is not separate from the previous one, because some supervised learning algorithms include feature selection, and any classifier can be combined with a stepwise wrapper for use as a feature selection method. We will explore logistic regression (including spatial models), decision trees\footnote{Actually, we have already begun to explore decision trees. For each cortical area, we have used the C4.5 algorithm to find a decision tree for that area. We achieved good classification accuracy on our training set, but the number of genes that appeared in each tree was too large. We plan to implement a pruning procedure to generate trees that use fewer genes.}, sparse SVMs, generative mixture models (including naive bayes), kernel density estimation, instance-based learning methods (such as k-nearest neighbor), genetic algorithms, and artificial neural networks.
 
 
 
@@ -567,7 +575,9 @@
 In addition to using the cluster expression prototypes directly to identify spatial regions, this might be useful as a component of dimensionality reduction. For example, one could imagine clustering similar genes and then replacing their expression levels with a single average expression level, thereby removing some redundancy from the gene expression profiles. One could then perform clustering on pixels (possibly after a second dimensionality reduction step) in order to identify spatial regions. It remains to be seen whether removal of redundancy would help or hurt the ultimate goal of identifying interesting spatial regions.
 
 \vspace{0.3cm}**Co-clustering**
-There are some algorithms which simultaneously incorporate clustering on instances and on features (in our case, genes and pixels), for example, IRM\cite{kemp_learning_2006}. These are called co-clustering or biclustering algorithms.
+There are some algorithms which simultaneously incorporate clustering on instances and on features (in our case, genes and pixels), for example, IRM. These are called co-clustering or biclustering algorithms.
+
+%%IRM\cite{kemp_learning_2006}.
 
 \vspace{0.3cm}**Radial profiles**
 We wil explore the use of the radial profile of gene expression under each pixel.
@@ -583,7 +593,9 @@
 === Apply the new methods to the cortex ===
 Using the methods developed in Aim 1, we will present, for each cortical area, a short list of markers to identify that area; and we will also present lists of "panels" of genes that can be used to delineate many areas at once.
 
-Because in most cases the ABA coronal dataset only contains one ISH per gene, it is possible for an unrelated combination of genes to seem to identify an area when in fact it is only coincidence. There are two ways we will validate our marker genes to guard against this. First, we will confirm that putative combinations of marker genes express the same pattern in both hemispheres. Second, we will manually validate our final results on other gene expression datasets such as EMAGE, GeneAtlas, and GENSAT\cite{gong_gene_2003}.
+%% GENSAT\cite{gong_gene_2003}
+
+Because in most cases the ABA coronal dataset only contains one ISH per gene, it is possible for an unrelated combination of genes to seem to identify an area when in fact it is only coincidence. There are two ways we will validate our marker genes to guard against this. First, we will confirm that putative combinations of marker genes express the same pattern in both hemispheres. Second, we will manually validate our final results on other gene expression datasets such as EMAGE, GeneAtlas, and GENSAT.
 
 Using the methods developed in Aim 2, we will present one or more hierarchical cortical maps. We will identify and explain how the statistical structure in the gene expression data led to any unexpected or interesting features of these maps, and we will provide biological hypotheses to interpret any new cortical areas, or groupings of areas, which are discovered.
 