cg

diff grant.txt @ 44:c4a887af9b0b

.
author bshanks@bshanks.dyndns.org
date Wed Apr 15 03:19:01 2009 -0700
parents 8cce366da1e5
children a44e9ad61efa
--- a/grant.txt Wed Apr 15 00:50:34 2009 -0700
+++ b/grant.txt Wed Apr 15 03:19:01 2009 -0700
@@ -67,11 +67,11 @@
 
 
 === Related work ===
-There is a substantial body of work on the analysis of gene expression data, however, most of this concerns gene expression data which is not fundamentally spatial.
+There is a substantial body of work on the analysis of gene expression data, most of this concerns gene expression data which is not fundamentally spatial\footnote{By "__fundamentally__ spatial" we mean that there is information from a large number of spatial locations; not just data which has only a few different locations.}.
 
 As noted above, there has been much work on both supervised learning and there are many available algorithms for each. However, the algorithms require the scientist to provide a framework for representing the problem domain, and the way that this framework is set up has a large impact on performance. Creating a good framework can require creatively reconceptualizing the problem domain, and is not merely a mechanical "fine-tuning" of numerical parameters. For example, we believe that domain-specific scoring measures (such as gradient similarity, which is discussed in Preliminary Work) may be necessary in order to achieve the best results in this application.
 
-We are aware of three existing efforts to find marker genes using spatial gene expression data using automated methods.
+We are aware of four existing efforts to find marker genes using spatial gene expression data using automated methods.
 
 \cite{carson_data_2005} describes GeneAtlas. GeneAtlas allows the user to construct a search query by freely demarcating one or two 2-D regions on sagittal slices, and then to specify either the strength of expression or the name of another gene whose expression pattern is to be matched. GeneAtlas differs from our Aim 1 in at least two ways. First, GeneAtlas finds only single genes, whereas we will also look for combinations of genes\footnote{See Preliminary Data for an example of an area which cannot be marked by any single gene in the dataset, but which can be marked by a combination.}. Second, at least for the custom spatial search, Gene Atlas appears to use a simple pointwise scoring method (strength of expression), whereas we will also use geometric metrics such as gradient similarity.
 
@@ -91,6 +91,11 @@
 
 Gene Finder is different from our Aim 1 in at least three ways. First, Gene Finder finds only single genes, whereas we will also look for combinations of genes. Second, gene finder can only use overexpression as a marker, whereas we will also search for underexpression. Third, Gene Finder uses a simple pointwise score\footnote{"Expression energy ratio", which captures overexpression.}, whereas we will also use geometric scores such as gradient similarity. The Preliminary Data section contains evidence that each of our three choices is the right one.
 
+\cite{venkataraman_emage_2008} todo
+
+
+\cite{hemert_matching_2008} todo
+
 In summary, none of the previous projects explores combinations of marker genes, and none of their publications compare the results obtained by using different algorithms or scoring methods.
 
 
@@ -135,7 +140,7 @@
 
 Gene clusters could be used as part of dimensionality reduction: rather than have one feature for each gene, we could have one reduced feature for each gene cluster.
 
-Gene clusters could also be used to directly yield a clustering on instances. This is because many genes have an expression pattern which seems to pick out a single, spatially continguous region. Therefore, it seems likely that an anatomically interesting region will have multiple genes which each individually pick it out\footnote{This would seem to contradict our finding in aim 1 that some cortical areas are combinatorially coded by multiple genes. However, it is possible that the currently accepted cortical maps divide the cortex into regions which are unnatural from the point of view of gene expression; perhaps there is some other way to map the cortex for which each region can be identified by single genes.}. This suggests the following procedure: cluster together genes which pick out similar regions, and then to use the more popular common regions as the final clusters. In the Preliminary Data we show that a number of anatomically recognized cortical regions, as well as some "superregions" formed by lumping together a few regions, are associated with gene clusters in this fashion.
+Gene clusters could also be used to directly yield a clustering on instances. This is because many genes have an expression pattern which seems to pick out a single, spatially continguous region. Therefore, it seems likely that an anatomically interesting region will have multiple genes which each individually pick it out\footnote{This would seem to contradict our finding in aim 1 that some cortical areas are combinatorially coded by multiple genes. However, it is possible that the currently accepted cortical maps divide the cortex into regions which are unnatural from the point of view of gene expression; perhaps there is some other way to map the cortex for which each region can be identified by single genes. Another possibility is that, although the cluster prototype fits an anatomical region, the individual genes are each somewhat different from the prototype.}. This suggests the following procedure: cluster together genes which pick out similar regions, and then to use the more popular common regions as the final clusters. In the Preliminary Data we show that a number of anatomically recognized cortical regions, as well as some "superregions" formed by lumping together a few regions, are associated with gene clusters in this fashion.
 
 
 === Related work ===
@@ -145,7 +150,9 @@
 \cite{thompson_genomic_2008} describes an analysis of the anatomy of
 the hippocampus using the ABA dataset. In addition to manual analysis,
 two clustering methods were employed, a modified Non-negative Matrix
-Factorization (NNMF), and a hierarchial recursive bifurcation clustering scheme based on correlation as the similarity score. The paper yielded impressive results, proving the usefulness of such research. We have run NNMF on the cortical dataset\footnote{We ran "vanilla" NNMF, whereas the paper under discussion used a modified method. Their main modification consisted of adding a soft spatial contiguity constraint. However, on our dataset, NNMF naturally produced spatially contiguous clusters, so no additional constraint was needed. The paper under discussion also mentions that they tried a hierarchial variant of NNMF, which we have not yet tried.} and while the results are promising (see Preliminary Data), we think that it will be possible to find an even better method. In addition, this paper described a visual screening of the data, specifically, a visual analysis of 6000 genes with the primary purpose of observing how the spatial pattern of their expression coincided with the regions that had been identified by NNMF. We propose to do this sort of screening automatically, which would yield an objective, quantifiable result, rather than qualitative observations.
+Factorization (NNMF), and a hierarchial recursive bifurcation clustering scheme based on correlation as the similarity score. The paper yielded impressive results, proving the usefulness of computational genomic anatomy. We have run NNMF on the cortical dataset\footnote{We ran "vanilla" NNMF, whereas the paper under discussion used a modified method. Their main modification consisted of adding a soft spatial contiguity constraint. However, on our dataset, NNMF naturally produced spatially contiguous clusters, so no additional constraint was needed. The paper under discussion also mentions that they tried a hierarchial variant of NNMF, which we have not yet tried.} and while the results are promising (see Preliminary Data), we think that it will be possible to find an even better method.
+
+%% In addition, this paper described a visual screening of the data, specifically, a visual analysis of 6000 genes with the primary purpose of observing how the spatial pattern of their expression coincided with the regions that had been identified by NNMF. We propose to do this sort of screening automatically, which would yield an objective, quantifiable result, rather than qualitative observations.
 
 
 
@@ -181,7 +188,7 @@
 
 Mus musculus, the common house mouse, is thought to contain about 22,000 protein-coding genes\cite{waterston_initial_2002}. The ABA contains data on about 20,000 genes in sagittal sections, out of which over 4,000 genes are also measured in coronal sections. Our dataset is derived from only the coronal subset of the ABA, because the sagittal data does not cover the entire cortex, and has greater registration error\cite{ng_anatomic_2009}. Genes were selected by the Allen Institute for coronal sectioning based on, "classes of known neuroscientific interest... or through post hoc identification of a marked non-ubiquitous expression pattern"\cite{ng_anatomic_2009}.
 
-The ABA is not the only large public spatial gene expression dataset. Other such resources include GENSAT\cite{gong_gene_2003}, GenePaint\cite{visel_genepaint_2004}, its sister project GeneAtlas\cite{carson_data_2005}, BGEM\cite{magdaleno_bgem_2006}, EMAGE\cite{?}, EurExpress (http://www.eurexpress.org/ee/; EurExpress data is also entered into EMAGE), todo. With the exception of the ABA, GenePaint, and EMAGE, most of these resources, have not (yet) extracted the expression intensity from the ISH images and registered the results into a single 3-D space, and only ABA and EMAGE make this form of data available for public download from the website. Many of these resources focus on developmental gene expression.
+The ABA is not the only large public spatial gene expression dataset. Other such resources include GENSAT\cite{gong_gene_2003}, GenePaint\cite{visel_genepaint.org:atlas_2004}, its sister project GeneAtlas\cite{carson_data_2005}, BGEM\cite{magdaleno_bgem:in_2006}, EMAGE\cite{?}, EurExpress (http://www.eurexpress.org/ee/; EurExpress data is also entered into EMAGE), todo. With the exception of the ABA, GenePaint, and EMAGE, most of these resources, have not (yet) extracted the expression intensity from the ISH images and registered the results into a single 3-D space, and only ABA and EMAGE make this form of data available for public download from the website. Many of these resources focus on developmental gene expression.
 
 
 
@@ -198,15 +205,16 @@
 
 === Related work ===
 
-\cite{ng_anatomic_2009} describes the application of AGEA to the cortex. The paper describes interesting results on the structure of correlations between voxel gene expression profiles within a handful of cortical areas. However, this sort of analysis is not related to either of our aims, as it neither finds marker genes, nor does it suggest a cortical map based on gene expression data. Neither of the other components of AGEA can be applied to cortical areas; AGEA's Gene Finder cannot be used to find marker genes for cortical areas; and AGEA's hierarchial clustering does not produce clusters corresponding to cortical areas.
-
-In both cases, the root cause is that pairwise correlations between the gene expression of voxels in different areas but the same layer are stronger than pairwise correlations between the gene expression of voxels in different layers but the same area. Therefore a pairwise voxel correlation clustering algorithm will always create clusters representing cortical layers, not areas. This is why the hierarchial clustering does not find cortical areas\footnote{There are clusters which presumably correspond to the intersection of a layer and an area, but since one area will have many layer-area intersection clusters, further work is needed to make sense of these.}. The reason that Gene Finder cannot find marker genes for cortical areas is that in Gene Finder, although the user chooses a seed voxel, Gene Finder chooses the ROI for which genes will be found, and it creates that ROI by (pairwise voxel correlation) clustering around the seed.
+\cite{ng_anatomic_2009} describes the application of AGEA to the cortex. The paper describes interesting results on the structure of correlations between voxel gene expression profiles within a handful of cortical areas. However, this sort of analysis is not related to either of our aims, as it neither finds marker genes, nor does it suggest a cortical map based on gene expression data. Neither of the other components of AGEA can be applied to cortical areas; AGEA's Gene Finder cannot be used to find marker genes for most cortical areas; and AGEA's hierarchial clustering does not produce clusters corresponding to most cortical areas\footnote{In both cases, the root cause is that pairwise correlations between the gene expression of voxels in different areas but the same layer are often stronger than pairwise correlations between the gene expression of voxels in different layers but the same area. Therefore, a pairwise voxel correlation clustering algorithm will often create clusters representing cortical layers, not areas. This is why the hierarchial clustering does not find most cortical areas (there are clusters which presumably correspond to the intersection of a layer and an area, but since one area will have many layer-area intersection clusters, further work is needed to make sense of these). The reason that Gene Finder cannot find marker genes for most cortical areas is that in Gene Finder, although the user chooses a seed voxel, Gene Finder chooses the ROI for which genes will be found, and it creates that ROI by (pairwise voxel correlation) clustering around the seed.}.
+
 
 In summary, for all three aims, (a) none of the previous projects explores combinations of marker genes, (b) there has been almost no comparison of different algorithms or scoring methods, and (c) there has been no work on computationally finding marker genes for cortical areas, or on finding a hierarchial clustering that will yield a map of cortical areas de novo from gene expression data.
 
 Our project is guided by a concrete application with a well-specified criterion of success (how well we can find marker genes for \begin{latex}/\end{latex} reproduce the layout of cortical areas), which will provide a solid basis for comparing different methods.
 
 
+%% todo: poster; check AGEA cortical data
+
 \newpage
 
 == Preliminary work ==
@@ -446,5 +454,6 @@
 two hemis
 
 
-"genomic anatomy" is a name found in the titles of one of the cited papers which seems good
-
+%%"genomic anatomy" is a name found in the titles of one of the cited papers which seems good; maybe "computational genomic anatomy"
+
+%% todo: actually i'm pretty sure AGEA doesn't find ANY areas, but i said "most" and "often" to be cautious.