cg

diff grant.txt @ 46:a44e9ad61efa

.
author bshanks@bshanks.dyndns.org
date Wed Apr 15 13:57:53 2009 -0700 (16 years ago)
parents c4a887af9b0b
children 33c10c13f9a3
line diff
1.1 --- a/grant.txt Wed Apr 15 03:19:01 2009 -0700 1.2 +++ b/grant.txt Wed Apr 15 13:57:53 2009 -0700 1.3 @@ -94,9 +94,9 @@ 1.4 \cite{venkataraman_emage_2008} todo 1.5 1.6 1.7 -\cite{hemert_matching_2008} todo 1.8 - 1.9 -In summary, none of the previous projects explores combinations of marker genes, and none of their publications compare the results obtained by using different algorithms or scoring methods. 1.10 +\cite{hemert_matching_2008} describes a technique to find combinations of marker genes to pick out an anatomical region. They use an evolutionary algorithm to evolve logical operators which combine boolean (thresholded) images in order to match a target image. Their match score is Jaccard similarity, which is equal to the number of true pixels in the intersection of the two images, divided by the number of pixels in their union. 1.11 + 1.12 +In summary, only one of the previous projects explores combinations of marker genes, and none of their publications compare the results obtained by using different algorithms or scoring methods. 1.13 1.14 1.15 1.16 @@ -144,7 +144,7 @@ 1.17 1.18 1.19 === Related work === 1.20 -We are aware of three existing efforts to cluster spatial gene expression data. 1.21 +We are aware of four existing efforts to cluster spatial gene expression data. 1.22 1.23 1.24 \cite{thompson_genomic_2008} describes an analysis of the anatomy of 1.25 @@ -159,6 +159,7 @@ 1.26 1.27 %% todo \cite{thompson_genomic_2008} reports that both mNNMF and hierarchial mNNMF clustering were useful, and that hierarchial recursive bifurcation gave similar results. 1.28 1.29 +In an interesting twist, \cite{hemert_matching_2008} applies their technique for finding combinations of marker genes for the purpose of clustering genes around a "seed gene". The way they do this is by using the pattern of expression of the seed gene as the target image, and then searching for other genes which can be combined to reproduce this pattern. Those other genes which are found are considered to be related to the seed. The same team also describes a method\cite{van_hemert_mining_2007} for finding "association rules" such as, "if this voxel is expressed in by any gene, then that voxel is probably also expressed in by the same gene". This could be useful as part of a procedure for clustering voxels. 1.30 1.31 1.32 AGEA's\cite{ng_anatomic_2009} hierarchial clustering differs from our Aim 2 in at least two ways. First, AGEA uses perhaps the simplest possible similarity score (correlation), and does no dimensionality reduction before calculating similarity. While it is possible that a more complex system will not do any better than this, we believe further exploration of alternative methods of scoring and dimensionality reduction is warranted. Second, AGEA did not look at clusters of genes; in Preliminary Data we have shown that clusters of genes may identify interesting spatial regions such as cortical areas. 1.33 @@ -166,7 +167,7 @@ 1.34 \cite{venkataraman_emage_2008} todo 1.35 1.36 1.37 -In summary, although these projects obtained hierarchial clusterings, there has not been much comparison between different algorithms or scoring methods, so it is likely that the best clustering method for this application has not yet been found. 1.38 +In summary, although these projects obtained clusterings, there has not been much comparison between different algorithms or scoring methods, so it is likely that the best clustering method for this application has not yet been found. 1.39 1.40 1.41 1.42 @@ -186,9 +187,9 @@ 1.43 1.44 Next, an automated nonlinear alignment procedure located the 2D data from the various slices in a single 3D coordinate system. In the final 3D coordinate system, voxels are cubes with 200 microns on a side. There are 67x41x58 \= 159,326 voxels in the 3D coordinate system, of which 51,533 are in the brain\cite{ng_anatomic_2009}. 1.45 1.46 -Mus musculus, the common house mouse, is thought to contain about 22,000 protein-coding genes\cite{waterston_initial_2002}. The ABA contains data on about 20,000 genes in sagittal sections, out of which over 4,000 genes are also measured in coronal sections. Our dataset is derived from only the coronal subset of the ABA, because the sagittal data does not cover the entire cortex, and has greater registration error\cite{ng_anatomic_2009}. Genes were selected by the Allen Institute for coronal sectioning based on, "classes of known neuroscientific interest... or through post hoc identification of a marked non-ubiquitous expression pattern"\cite{ng_anatomic_2009}. 1.47 - 1.48 -The ABA is not the only large public spatial gene expression dataset. Other such resources include GENSAT\cite{gong_gene_2003}, GenePaint\cite{visel_genepaint.org:atlas_2004}, its sister project GeneAtlas\cite{carson_data_2005}, BGEM\cite{magdaleno_bgem:in_2006}, EMAGE\cite{?}, EurExpress (http://www.eurexpress.org/ee/; EurExpress data is also entered into EMAGE), todo. With the exception of the ABA, GenePaint, and EMAGE, most of these resources, have not (yet) extracted the expression intensity from the ISH images and registered the results into a single 3-D space, and only ABA and EMAGE make this form of data available for public download from the website. Many of these resources focus on developmental gene expression. 1.49 +Mus musculus, the common house mouse, is thought to contain about 22,000 protein-coding genes\cite{waterston_initial_2002}. The ABA contains data on about 20,000 genes in sagittal sections, out of which over 4,000 genes are also measured in coronal sections. Our dataset is derived from only the coronal subset of the ABA, because the sagittal data does not cover the entire cortex, and also has greater registration error\cite{ng_anatomic_2009}. Genes were selected by the Allen Institute for coronal sectioning based on, "classes of known neuroscientific interest... or through post hoc identification of a marked non-ubiquitous expression pattern"\cite{ng_anatomic_2009}. 1.50 + 1.51 +The ABA is not the only large public spatial gene expression dataset. Other such resources include GENSAT\cite{gong_gene_2003}, GenePaint\cite{visel_genepaint.org:atlas_2004}, its sister project GeneAtlas\cite{carson_data_2005}, BGEM\cite{magdaleno_bgem:in_2006}, EMAGE\cite{venkataraman_emage_2008}, EurExpress\footnote{http://www.eurexpress.org/ee/; EurExpress data is also entered into EMAGE}, todo. With the exception of the ABA, GenePaint, and EMAGE, most of these resources, have not (yet) extracted the expression intensity from the ISH images and registered the results into a single 3-D space, and only ABA and EMAGE make this form of data available for public download from the website. Many of these resources focus on developmental gene expression. 1.52 1.53 1.54 1.55 @@ -205,10 +206,12 @@ 1.56 1.57 === Related work === 1.58 1.59 -\cite{ng_anatomic_2009} describes the application of AGEA to the cortex. The paper describes interesting results on the structure of correlations between voxel gene expression profiles within a handful of cortical areas. However, this sort of analysis is not related to either of our aims, as it neither finds marker genes, nor does it suggest a cortical map based on gene expression data. Neither of the other components of AGEA can be applied to cortical areas; AGEA's Gene Finder cannot be used to find marker genes for most cortical areas; and AGEA's hierarchial clustering does not produce clusters corresponding to most cortical areas\footnote{In both cases, the root cause is that pairwise correlations between the gene expression of voxels in different areas but the same layer are often stronger than pairwise correlations between the gene expression of voxels in different layers but the same area. Therefore, a pairwise voxel correlation clustering algorithm will often create clusters representing cortical layers, not areas. This is why the hierarchial clustering does not find most cortical areas (there are clusters which presumably correspond to the intersection of a layer and an area, but since one area will have many layer-area intersection clusters, further work is needed to make sense of these). The reason that Gene Finder cannot find marker genes for most cortical areas is that in Gene Finder, although the user chooses a seed voxel, Gene Finder chooses the ROI for which genes will be found, and it creates that ROI by (pairwise voxel correlation) clustering around the seed.}. 1.60 - 1.61 - 1.62 -In summary, for all three aims, (a) none of the previous projects explores combinations of marker genes, (b) there has been almost no comparison of different algorithms or scoring methods, and (c) there has been no work on computationally finding marker genes for cortical areas, or on finding a hierarchial clustering that will yield a map of cortical areas de novo from gene expression data. 1.63 +\cite{ng_anatomic_2009} describes the application of AGEA to the cortex. The paper describes interesting results on the structure of correlations between voxel gene expression profiles within a handful of cortical areas. However, this sort of analysis is not related to either of our aims, as it neither finds marker genes, nor does it suggest a cortical map based on gene expression data. Neither of the other components of AGEA can be applied to cortical areas; AGEA's Gene Finder cannot be used to find marker genes for the cortical areas; and AGEA's hierarchial clustering does not produce clusters corresponding to the cortical areas\footnote{In both cases, the root cause is that pairwise correlations between the gene expression of voxels in different areas but the same layer are often stronger than pairwise correlations between the gene expression of voxels in different layers but the same area. Therefore, a pairwise voxel correlation clustering algorithm will tend to create clusters representing cortical layers, not areas. This is why the hierarchial clustering does not find most cortical areas (there are clusters which presumably correspond to the intersection of a layer and an area, but since one area will have many layer-area intersection clusters, further work is needed to make sense of these). The reason that Gene Finder cannot find marker genes for most cortical areas is that in Gene Finder, although the user chooses a seed voxel, Gene Finder chooses the ROI for which genes will be found, and it creates that ROI by (pairwise voxel correlation) clustering around the seed.}. 1.64 + 1.65 + 1.66 +%% Most of the projects which have been discussed have been done by the same groups that develop the public datasets. Although these projects make their algorithms available for use on their own website, none of them have released an open-source software toolkit; instead, users are restricted to using the provided algorithms only on their own dataset. 1.67 + 1.68 +In summary, for all three aims, (a) only one of the previous projects explores combinations of marker genes, (b) there has been almost no comparison of different algorithms or scoring methods, and (c) there has been no work on computationally finding marker genes for cortical areas, or on finding a hierarchial clustering that will yield a map of cortical areas de novo from gene expression data. 1.69 1.70 Our project is guided by a concrete application with a well-specified criterion of success (how well we can find marker genes for \begin{latex}/\end{latex} reproduce the layout of cortical areas), which will provide a solid basis for comparing different methods. 1.71 1.72 @@ -255,7 +258,7 @@ 1.73 1.74 1.75 \vspace{0.3cm}**Correlation** 1.76 -Recall that the instances are surface pixels, and consider the problem of attempting to classify each instance as either a member of a particular anatomical area, or not. The target area can be represented as a binary mask over the surface pixels. 1.77 +Recall that the instances are surface pixels, and consider the problem of attempting to classify each instance as either a member of a particular anatomical area, or not. The target area can be represented as a boolean mask over the surface pixels. 1.78 1.79 One class of feature selection scoring method are those which calculate some sort of "match" between each gene image and the target image. Those genes which match the best are good candidates for features. 1.80 1.81 @@ -266,9 +269,9 @@ 1.82 \vspace{0.3cm}**Conditional entropy** 1.83 An information-theoretic scoring method is to find features such that, if the features (gene expression levels) are known, uncertainty about the target (the regional identity) is reduced. Entropy measures uncertainty, so what we want is to find features such that the conditional distribution of the target has minimal entropy. The distribution to which we are referring is the probability distribution over the population of surface pixels. 1.84 1.85 -The simplest way to use information theory is on discrete data, so we discretized our gene expression data by creating, for each gene, five thresholded binary masks of the gene data. For each gene, we created a binary mask of its expression levels using each of these thresholds: the mean of that gene, the mean minus one standard deviation, the mean minus two standard deviations, the mean plus one standard deviation, the mean plus two standard deviations. 1.86 - 1.87 -Now, for each region, we created and ran a forward stepwise procedure which attempted to find pairs of gene expression binary masks such that the conditional entropy of the target area's binary mask, conditioned upon the pair of gene expression binary masks, is minimized. 1.88 +The simplest way to use information theory is on discrete data, so we discretized our gene expression data by creating, for each gene, five thresholded boolean masks of the gene data. For each gene, we created a boolean mask of its expression levels using each of these thresholds: the mean of that gene, the mean minus one standard deviation, the mean minus two standard deviations, the mean plus one standard deviation, the mean plus two standard deviations. 1.89 + 1.90 +Now, for each region, we created and ran a forward stepwise procedure which attempted to find pairs of gene expression boolean masks such that the conditional entropy of the target area's boolean mask, conditioned upon the pair of gene expression boolean masks, is minimized. 1.91 1.92 This finds pairs of genes which are most informative (at least at these discretization thresholds) relative to the question, "Is this surface pixel a member of the target area?". 1.93