cg

changeset 43:8cce366da1e5
.
author: bshanks@bshanks.dyndns.org
date: Wed Apr 15 00:50:34 2009 -0700
parents: 282ba15dcfbe
children: c4a887af9b0b
files: grant.doc grant.html grant.odt grant.pdf grant.txt
--- a/grant.html	Tue Apr 14 23:33:43 2009 -0700
+++ b/grant.html	Wed Apr 15 00:50:34 2009 -0700
@@ -81,6 +81,36 @@
+Related work
+There is a substantial body of work on the analysis of gene expression data; however, most of this concerns gene expression
+data which is not fundamentally spatial.
+As noted above, there has been much work on supervised learning, and there are many available algorithms for
+it. However, the algorithms require the scientist to provide a framework for representing the problem domain, and the
+way that this framework is set up has a large impact on performance.  Creating a good framework can require creatively
+reconceptualizing the problem domain, and is not merely a mechanical &#8220;fine-tuning&#8221; of numerical parameters. For example,
+we believe that domain-specific scoring measures (such as gradient similarity, which is discussed in Preliminary Work) may
+be necessary in order to achieve the best results in this application.
+We are aware of three existing efforts to find marker genes from spatial gene expression data using automated methods.
+[? ] describes GeneAtlas.  GeneAtlas allows the user to construct a search query by freely demarcating one or two 2-D
+regions on sagittal slices, and then to specify either the strength of expression or the name of another gene whose expression
+pattern is to be matched. GeneAtlas differs from our Aim 1 in at least two ways. First, GeneAtlas finds only single genes,
+whereas we will also look for combinations of genes2. Second, at least for the custom spatial search, GeneAtlas appears to
+use a simple pointwise scoring method (strength of expression), whereas we will also use geometric metrics such as gradient
+similarity.
+[2 ] describes AGEA, &#8220;Anatomic Gene Expression Atlas&#8221;. AGEA has three components:
+* Gene Finder:  The user selects a seed voxel and the system (1) chooses a cluster which includes the seed voxel, (2)
+yields a list of genes which are overexpressed in that cluster.  (note:  the ABA website also contains pre-prepared lists of
+overexpressed genes for selected structures)
+* Correlation:  The user selects a seed voxel and the system shows the user how much correlation there is between the gene
+expression profile of the seed voxel and every other voxel.
+* Clusters:  AGEA includes a precomputed hierarchical clustering of voxels based on a recursive bifurcation algorithm
+with correlation as the similarity metric.
+Gene Finder is different from our Aim 1 in at least three ways.  First, Gene Finder finds only single genes, whereas we
+will also look for combinations of genes. Second, Gene Finder can only use overexpression as a marker, whereas we will also
+search for underexpression.  Third, Gene Finder uses a simple pointwise score3, whereas we will also use geometric scores
+such as gradient similarity. The Preliminary Data section contains evidence that each of our three choices is the right one.
+In summary, none of the previous projects explores combinations of marker genes, and none of their publications compare
+the results obtained by using different algorithms or scoring methods.
@@ -92,8 +122,12 @@
+_________________
+   2See Preliminary Data for an example of an area which cannot be marked by any single gene in the dataset, but which can be marked by a
+combination.
+    3&#8220;Expression energy ratio&#8221;, which captures overexpression.
-some of the regions are more similar to each other than to the rest, suggesting that, although at a fine spatial scale they
+some ofthe regions are more similar to each other than to the rest, suggesting that, although at a fine spatial scale they
@@ -138,10 +172,33 @@
-interesting region will have multiple genes which each individually pick it out2.  This suggests the following procedure:
+interesting region will have multiple genes which each individually pick it out4.  This suggests the following procedure:
+Related work
+We are aware of three existing efforts to cluster spatial gene expression data.
+_________________________________________
+   4This would seem to contradict our finding in aim 1 that some cortical areas are combinatorially coded by multiple genes.  However, it is
+possible that the currently accepted cortical maps divide the cortex into regions which are unnatural from the point of view of gene expression;
+perhaps there is some other way to map the cortex for which each region can be identified by single genes.
+[5 ] describes an analysis of the anatomy of the hippocampus using the ABA dataset.  In addition to manual analysis,
+two clustering methods were employed, a modified Non-negative Matrix Factorization (NNMF), and a hierarchical recursive
+bifurcation clustering scheme based on correlation as the similarity score. The paper yielded impressive results, proving the
+usefulness of such research. We have run NNMF on the cortical dataset5 and while the results are promising (see Preliminary
+Data), we think that it will be possible to find an even better method. In addition, this paper described a visual screening
+of the data, specifically, a visual analysis of 6000 genes with the primary purpose of observing how the spatial pattern of
+their expression coincided with the regions that had been identified by NNMF. We propose to do this sort of screening
+automatically, which would yield an objective, quantifiable result, rather than qualitative observations.
+AGEA&#8217;s[2] hierarchical clustering differs from our Aim 2 in at least two ways.  First, AGEA uses perhaps the simplest
+possible similarity score (correlation), and does no dimensionality reduction before calculating similarity. While it is possible
+that a more complex system will not do any better than this, we believe further exploration of alternative methods of scoring
+and dimensionality reduction is warranted.  Second, AGEA did not look at clusters of genes; in Preliminary Data we have
+shown that clusters of genes may identify interesting spatial regions such as cortical areas.
+[? ] todo
+In summary, although these projects obtained hierarchical clusterings, there has not been much comparison between
+different algorithms or scoring methods, so it is likely that the best clustering method for this application has not yet been
+found.
@@ -177,12 +234,12 @@
-_________________________________________
-   2This would seem to contradict our finding in aim 1 that some cortical areas are combinatorially coded by multiple genes.  However, it is
-possible that the currently accepted cortical maps divide the cortex into regions which are unnatural from the point of view of gene expression;
-perhaps there is some other way to map the cortex for which each region can be identified by single genes.
-Themethod developed in aim (1) will be applied to each cortical area to find a set of marker genes such that the
+___________________________
+   5We ran &#8220;vanilla&#8221; NNMF, whereas the paper under discussion used a modified method.  Their main modification consisted of adding a soft
+spatial contiguity constraint.  However, on our dataset, NNMF naturally produced spatially contiguous clusters, so no additional constraint was
+needed. The paper under discussion also mentions that they tried a hierarchical variant of NNMF, which we have not yet tried.
+The method developed in aim (1) will be applied to each cortical area to find a set of marker genes such that the
@@ -200,72 +257,25 @@
-There is a substantial body of work on the analysis of gene expression data, however, most of this concerns gene expression
-data which is not fundamentally spatial.
-As noted above, there has been much work on both supervised learning and clustering, and there are many available
-algorithms for each.  However, the completion of Aims 1 and 2 involves more than just choosing between a set of existing
-algorithms, and will constitute a substantial contribution to biology.  The algorithms require the scientist to provide a
-framework for representing the problem domain, and the way that this framework is set up has a large impact on performance.
-Creating a good framework can require creatively reconceptualizing the problem domain, and is not merely a mechanical
-&#8220;fine-tuning&#8221; of numerical parameters.  For example, we believe that domain-specific scoring measures (such as gradient
-similarity, which is discussed in Preliminary Work) may be necessary in order to achieve the best results in this application.
-We are aware of four existing efforts to relate spatial gene expression data to anatomy through computational methods.
-[? ] refers to GeneAtlas.  GeneAtlas allows the user to construct a search query by freely demarcating one or two 2-D
-regions on sagittal slices, and then to specify either the strength of expression or the name of another gene whose expression
-pattern is to be matched. GeneAtlas differs from our Aim 1 in at least two ways. First, GeneAtlas finds only single genes,
-whereas we will also look for combinations of genes3. Second, at least for the custom spatial search, Gene Atlas appears to
-use a simple pointwise scoring method (strength of expression), whereas we will also use geometric metrics such as gradient
-similarity.
-[? ] todo
-[5 ] describes an analysis of the anatomy of the hippocampus using the ABA dataset.  In addition to manual analysis,
-two clustering methods were employed, a modified Non-negative Matrix Factorization (NNMF), and a hierarchial recursive
-bifurcation clustering scheme based on correlation as the similarity score. The paper yielded impressive results, proving the
-usefulness of such research. We have run NNMF on the cortical dataset4 and while the results are promising (see Preliminary
-Data), we think that it will be possible to find a better method (we also think that more automation of the parts that this
-paper&#8217;s authors did manually will be possible).
-[2 ] describes AGEA, &#8221;Anatomic Gene Expression Atlas&#8221;.  AGEA is an analysis tool for the ABA dataset.  AGEA has
-three components:
-* Gene Finder:  The user selects a seed voxel and the system (1) chooses a cluster which includes the seed voxel, (2)
-yields a list of genes which are overexpressed in that cluster.  (note:  the ABA website also contains pre-prepared lists of
-overexpressed genes for selected structures)
-* Correlation:  The user selects a seed voxel and the shows the user how much correlation there is between the gene
-expression profile of the seed voxel and every other voxel.
-* Clusters:  AGEA includes a precomputed hierarchial clustering of voxels based on a recursive bifurcation algorithm
-with correlation as the similarity metric.
-Gene Finder is different from our Aim 1 in at least four ways.  First, although the user chooses a seed voxel, Gene
-Finder, not the user, chooses the cluster for which genes will be found, and in our experience it never chooses cortical areas,
+[2 ] describes the application of AGEA to the cortex. The paper describes interesting results on the structure of correlations
+between voxel gene expression profiles within a handful of cortical areas.  However, this sort of analysis is not related to
+either of our aims, as it neither finds marker genes nor suggests a cortical map based on gene expression data. Neither
+of the other components of AGEA can be applied to cortical areas: AGEA&#8217;s Gene Finder cannot be used to find marker
+genes for cortical areas, and AGEA&#8217;s hierarchical clustering does not produce clusters corresponding to cortical areas.
+In both cases, the root cause is that pairwise correlations between the gene expression of voxels in different areas but
+the same layer are stronger than pairwise correlations between the gene expression of voxels in different layers but the same
+area. Therefore a pairwise voxel correlation clustering algorithm will always create clusters representing cortical layers, not
+areas. This is why the hierarchical clustering does not find cortical areas6. The reason that Gene Finder cannot find marker
+genes for cortical areas is that in Gene Finder, although the user chooses a seed voxel, Gene Finder chooses the ROI for
+which genes will be found, and it creates that ROI by (pairwise voxel correlation) clustering around the seed.
+In summary, for all three aims, (a) none of the previous projects explores combinations of marker genes, (b) there has
+been almost no comparison of different algorithms or scoring methods, and (c) there has been no work on computationally
+finding marker genes for cortical areas, or on finding a hierarchical clustering that will yield a map of cortical areas de novo
+from gene expression data.
+Our project is guided by a concrete application with a well-specified criterion of success (how well we can find marker
+genes for / reproduce the layout of cortical areas), which will provide a solid basis for comparing different methods.
-   3See Preliminary Data for an example of an area which cannot be marked by any single gene in the dataset, but which can be marked by a
-combination.
-    4We ran &#8220;vanilla&#8221; NNMF, whereas the paper under discussion used a modified method.  Their main modification consisted of adding a soft
-spatial contiguity constraint.  However, on our dataset, NNMF naturally produced spatially contiguous clusters, so no additional constraint was
-needed. The paper under discussion also mentions that they tried a hierarchial variant of NNMF, which we have not yet tried.
-instead preferring cortical layers5.  Therefore, Gene Finder cannot be used to find marker genes for cortical areas.  Second,
-Gene Finder finds only single genes, whereas we will also look for combinations of genes6.  Third, gene finder can only use
-overexpression as a marker, whereas in the Preliminary Data we show that underexpression can also be used. Fourth, Gene
-Finder uses a simple pointwise score7, whereas we will also use geometric metrics such as gradient similarity.
-The hierarchial clustering is different from our Aim 2 in at least three ways.  First, the clustering finds clusters corre-
-sponding to layers, but no clusters corresponding to cortical areas8 9 Our Aim 2 will not be accomplished until a clustering
-is produced which yields areas.  Second, AGEA uses perhaps the simplest possible similarity score (correlation), and does
-no dimensionality reduction before calculating similarity.  While it is possible that a more complex system will not do any
-better than this, we believe further exploration of alternative methods of scoring and dimensionality reduction is warranted.
-Third, AGEA did not look at clusters of genes; in Preliminary Data we have shown that clusters of genes may identify
-intersting spatial regions such as cortical areas.
-Finally, with the except of [5], none of the publications discussed above compare the results obtained by using different
-algorithms or scoring methods.  [5] reports that both mNNMF and hierarchial mNNMF clustering were useful, and that
-hierarchial recursive bifurcation gave similar results.
-To summarize, in comparison to our Aim 1, none of the previous projects explores combinations of marker genes, and
-w/r/t both aims, there has been almost no experimentation with or comparison of different algorithms or scoring methods.
-todo
-_________________________________________
-   5Because of the way in which Gene Finder chooses a cluster, layers will always be preferred to areas if pairwise correlations between the gene
-expression of voxels in different areas but the same layer are stronger than pairwise correlatios between the gene expression of voxels in different
-layers but the same area. This appears to be the case.
-    6See Preliminary Data for an example of an area which cannot be marked by any single gene in the dataset, but which can be marked by a
-combination.
-    7&#8220;Expression energy ratio&#8221;, which captures overexpression.
-    8This is for the same reason as in footnote 5.
-    9There are clusters which presumably correspond to the intersection of a layer and an area, but since one area will have many layer-area
+   6There are clusters which presumably correspond to the intersection of a layer and an area, but since one area will have many layer-area
@@ -345,26 +355,26 @@
-Geometric and pointwise scoring methods provide complementary information
+Gradient similarity provides information complementary to correlation
-Fig. . The top row of Fig.   displays the 3 genes which most match area AUD, according to a pointwise method10.  The
-bottom row displays the 3 genes which most match AUD according to a method which considers local geometry11  The
-pointwise method in the top row identifies genes which express more strongly in AUD than outside of it; its weakness is
-that this includes many areas which don&#8217;t have a salient border matching the areal border. The geometric method identifies
-genes whose salient expression border seems to partially line up with the border of AUD; its weakness is that this includes
-genes which don&#8217;t express over the entire area.  Genes which have high rankings using both pointwise and border criteria,
-such as Aph1a in the example, may be particularly good markers.  None of these genes are, individually, a perfect marker
-for AUD; we deliberately chose a &#8220;difficult&#8221; area in order to better contrast pointwise with geometric methods.
-Using combinations of multiple genes is necessary and sufficient to delineate some cortical areas
+Fig. . The top row of Fig.  displays the 3 genes which most match area AUD, according to a pointwise method7. The bottom
+row displays the 3 genes which most match AUD according to a method which considers local geometry8. The pointwise
+method in the top row identifies genes which express more strongly in AUD than outside of it; its weakness is that this
+includes many areas which don&#8217;t have a salient border matching the areal border.  The geometric method identifies genes
+whose salient expression border seems to partially line up with the border of AUD; its weakness is that this includes genes
+which don&#8217;t express over the entire area. Genes which have high rankings using both pointwise and border criteria, such as
+Aph1a in the example, may be particularly good markers. None of these genes are, individually, a perfect marker for AUD;
+we deliberately chose a &#8220;difficult&#8221; area in order to better contrast pointwise with geometric methods.
+Combinations of multiple genes are useful
-natorially.  according to logistic regression, gene wwc112  is the best fit single gene for predicting whether or not a pixel on
+natorially.  According to logistic regression, gene wwc19  is the best fit single gene for predicting whether or not a pixel on
-  10For each gene, a logistic regression in which the response variable was whether or not a surface pixel was within area AUD, and the predictor
+   7For each gene, a logistic regression in which the response variable was whether or not a surface pixel was within area AUD, and the predictor
-   11For each gene the gradient similarity (see section ??) between (a) a map of the expression of each gene on the cortical surface and (b) the
+    8For each gene the gradient similarity (see section ??) between (a) a map of the expression of each gene on the cortical surface and (b) the
-   12&#8220;WW, C2 and coiled-coil domain containing 1&#8221;; EntrezGene ID 211652
+    9&#8220;WW, C2 and coiled-coil domain containing 1&#8221;; EntrezGene ID 211652
@@ -377,18 +387,18 @@
-Gene mtif213 is shown in figure the upper-right of Fig. . Mtif2 captures MO&#8217;s upper-left boundary, but not its lower-right
+Gene mtif210 is shown in the upper-right of Fig. . Mtif2 captures MO&#8217;s upper-left boundary, but not its lower-right
-Areas can sometimes be marked by underexpression
+Underexpression of a gene can serve as a marker
-surface pixels based on their gene expression profiles. We achieved classification accuracy of about 81%14. As noted above,
+surface pixels based on their gene expression profiles. We achieved classification accuracy of about 81%11. As noted above,
@@ -400,8 +410,8 @@
-  13&#8220;mitochondrial translational initiation factor 2&#8221;; EntrezGene ID 76784
-   145-fold cross-validation.
+  10&#8220;mitochondrial translational initiation factor 2&#8221;; EntrezGene ID 76784
+   115-fold cross-validation.
--- a/grant.txt	Tue Apr 14 23:33:43 2009 -0700
+++ b/grant.txt	Wed Apr 15 00:50:34 2009 -0700
@@ -66,122 +66,146 @@
-=== Aim 2 ===
-
-\vspace{0.3cm}**Machine learning terminology: clustering**
-
-If one is given a dataset consisting merely of instances, with no class labels, then analysis of the dataset is referred to as __unsupervised learning__ in the jargon of machine learning. One thing that you can do with such a dataset is to group instances together. A set of similar instances is called a __cluster__, and the activity of finding grouping the data into clusters is called clustering or cluster analysis.
-
-The task of deciding how to carve up a structure into anatomical regions can be put into these terms. The instances are once again voxels (or pixels) along with their associated gene expression profiles. We make the assumption that voxels from the same region have similar gene expression profiles, at least compared to the other regions. This means that clustering voxels is the same as finding potential regions; we seek a partitioning of the voxels into regions, that is, into clusters of voxels with similar gene expression.
-
-It is desirable to determine not just one set of regions, but also how these regions relate to each other, if at all; perhaps some of the regions are more similar to each other than to the rest, suggesting that, although at a fine spatial scale they could be considered separate, on a coarser spatial scale they could be grouped together into one large region. This suggests the outcome of clustering may be a hierarchial tree of clusters, rather than a single set of clusters which partition the voxels. This is called hierarchial clustering.
-
-
-\vspace{0.3cm}**Similarity scores**
-
-A crucial choice when designing a clustering method is how to measure similarity, across either pairs of instances, or clusters, or both. There is much overlap between scoring methods for feature selection (discussed above under Aim 1) and scoring methods for similarity. 
-
-
-\vspace{0.3cm}**Spatially contiguous clusters; image segmentation**
-
-
-We have shown that aim 2 is a type of clustering task. In fact, it is a special type of clustering task because we have an additional constraint on clusters; voxels grouped together into a cluster must be spatially contiguous. In Preliminary Results, we show that one can get reasonable results without enforcing this constraint, however, we plan to compare these results against other methods which guarantee contiguous clusters.
-
-Perhaps the biggest source of continguous clustering algorithms is the field of computer vision, which has produced a variety of image segmentation algorithms. Image segmentation is the task of partitioning the pixels in a digital image into clusters, usually contiguous clusters. Aim 2 is similar to an image segmentation task. There are two main differences; in our task, there are thousands of color channels (one for each gene), rather than just three. There are imaging tasks which use more than three colors, however, for example multispectral imaging and hyperspectral imaging, which are often used to process satellite imagery. A more crucial difference is that there are various cues which are appropriate for detecting sharp object boundaries in a visual scene but which are not appropriate for segmenting abstract spatial data such as gene expression. Although many image segmentation algorithms can be expected to work well for segmenting other sorts of spatially arranged data, some of these algorithms are specialized for visual images.
-
-
-\vspace{0.3cm}**Dimensionality reduction**
-
-
-Unlike aim 1, there is no externally-imposed need to select only a handful of informative genes for inclusion in the instances. However, some clustering algorithms perform better on small numbers of features. There are techniques which "summarize" a larger number of features using a smaller number of features; these techniques go by the name of feature extraction or dimensionality reduction. The small set of features that such a technique yields is called the __reduced feature set__. After the reduced feature set is created, the instances may be replaced by __reduced instances__, which have as their features the reduced feature set rather than the original feature set of all gene expression levels. Note that the features in the reduced feature set do not necessarily correspond to genes; each feature in the reduced set may be any function of the set of gene expression levels.
-
-Another use for dimensionality reduction is to visualize the relationships between regions. For example, one might want to make a 2-D plot upon which each region is represented by a single point, and with the property that regions with similar gene expression profiles should be nearby on the plot (that is, the property that distance between pairs of points in the plot should be proportional to some measure of dissimilarity in gene expression). It is likely that no arrangement of the points on a 2-D plan will exactly satisfy this property -- however, dimensionality reduction techniques allow one to find arrangements of points that approximately satisfy that property. Note that in this application, dimensionality reduction is being applied after clustering; whereas in the previous paragraph, we were talking about using dimensionality reduction before clustering.
-
-
-\vspace{0.3cm}**Clustering genes rather than voxels**
-
-
-Although the ultimate goal is to cluster the instances (voxels or pixels), one strategy to achieve this goal is to first cluster the features (genes). There are two ways that clusters of genes could be used. 
-
-Gene clusters could be used as part of dimensionality reduction: rather than have one feature for each gene, we could have one reduced feature for each gene cluster.
-
-Gene clusters could also be used to directly yield a clustering on instances. This is because many genes have an expression pattern which seems to pick out a single, spatially continguous region. Therefore, it seems likely that an anatomically interesting region will have multiple genes which each individually pick it out\footnote{This would seem to contradict our finding in aim 1 that some cortical areas are combinatorially coded by multiple genes. However, it is possible that the currently accepted cortical maps divide the cortex into regions which are unnatural from the point of view of gene expression; perhaps there is some other way to map the cortex for which each region can be identified by single genes.}. This suggests the following procedure: cluster together genes which pick out similar regions, and then to use the more popular common regions as the final clusters. In the Preliminary Data we show that a number of anatomically recognized cortical regions, as well as some "superregions" formed by lumping together a few regions, are associated with gene clusters in this fashion.
-
-
-
-
-
-=== Aim 3 ===
-
-\vspace{0.3cm}**Background**
-
-The cortex is divided into areas and layers. To a first approximation, the parcellation of the cortex into areas can be drawn as a 2-D map on the surface of the cortex. In the third dimension, the boundaries between the areas continue downwards into the cortical depth, perpendicular to the surface. The layer boundaries run parallel to the surface. One can picture an area of the cortex as a slice of many-layered cake.
-
-Although it is known that different cortical areas have distinct roles in both normal functioning and in disease processes, there are no known marker genes for many cortical areas. When it is necessary to divide a tissue sample into cortical areas, this is a manual process that requires a skilled human to combine multiple visual cues and interpret them in the context of their approximate location upon the cortical surface. 
-
-Even the questions of how many areas should be recognized in cortex, and what their arrangement is, are still not completely settled. A proposed division of the cortex into areas is called a cortical map. In the rodent, the lack of a single agreed-upon map can be seen by contrasting the recent maps given by Swanson\cite{swanson_brain_2003} on the one hand, and Paxinos and Franklin\cite{paxinos_mouse_2001} on the other. While the maps are certainly very similar in their general arrangement, significant differences remain in the details.
-
-\vspace{0.3cm}**The Allen Mouse Brain Atlas dataset**
-
-The Allen Mouse Brain Atlas (ABA) data was produced by doing in-situ hybridization on slices of male, 56-day-old C57BL/6J mouse brains. Pictures were taken of the processed slice, and these pictures were semi-automatically analyzed in order to create a digital measurement of gene expression levels at each location in each slice. Per slice, cellular spatial resolution is achieved. Using this method, a single physical slice can only be used to measure one single gene; many different mouse brains were needed in order to measure the expression of many genes. 
-
-Next, an automated nonlinear alignment procedure located the 2D data from the various slices in a single 3D coordinate system. In the final 3D coordinate system, voxels are cubes with 200 microns on a side. There are 67x41x58 \= 159,326 voxels in the 3D coordinate system, of which 51,533 are in the brain\cite{ng_anatomic_2009}.
-
-Mus musculus, the common house mouse, is thought to contain about 22,000 protein-coding genes\cite{waterston_initial_2002}. The ABA contains data on about 20,000 genes in sagittal sections, out of which over 4,000 genes are also measured in coronal sections. Our dataset is derived from only the coronal subset of the ABA, because the sagittal data does not cover the entire cortex, and has greater registration error\cite{ng_anatomic_2009}. Genes were selected by the Allen Institute for coronal sectioning based on, "classes of known neuroscientific interest... or through post hoc identification of a marked non-ubiquitous expression pattern"\cite{ng_anatomic_2009}. 
-
-The ABA is not the only large public spatial gene expression dataset. Other such resources include GENSAT\cite{gong_gene_2003}, GenePaint\cite{visel_genepaint_2004}, its sister project GeneAtlas\cite{carson_data_2005}, BGEM\cite{magdaleno_bgem_2006}, EMAGE\cite{?}, EurExpress (http://www.eurexpress.org/ee/; EurExpress data is also entered into EMAGE), todo. With the exception of the ABA, GenePaint, and EMAGE, most of these resources, have not (yet) extracted the expression intensity from the ISH images and registered the results into a single 3-D space, and only ABA and EMAGE make this form of data available for public download from the website. Many of these resources focus on developmental gene expression.
-
-
-
-\vspace{0.3cm}**Significance**
-
-The method developed in aim (1) will be applied to each cortical area to find a set of marker genes such that the combinatorial expression pattern of those genes uniquely picks out the target area. Finding marker genes will be useful for drug discovery as well as for experimentation because marker genes can be used to design interventions which selectively target individual cortical areas.
-
-The application of the marker gene finding algorithm to the cortex will also support the development of new neuroanatomical methods. In addition to finding markers for each individual cortical areas, we will find a small panel of genes that can find many of the areal boundaries at once. This panel of marker genes will allow the development of an ISH protocol that will allow experimenters to more easily identify which anatomical areas are present in small samples of cortex.
-
-The method developed in aim (3) will provide a genoarchitectonic viewpoint that will contribute to the creation of a better map. The development of present-day cortical maps was driven by the application of histological stains. It is conceivable that if a different set of stains had been available which identified a different set of features, then the today's cortical maps would have come out differently. Since the number of classes of stains is small compared to the number of genes, it is likely that there are many repeated, salient spatial patterns in the gene expression which have not yet been captured by any stain. Therefore, current ideas about cortical anatomy need to incorporate what we can learn from looking at the patterns of gene expression.
-
-While we do not here propose to analyze human gene expression data, it is conceivable that the methods we propose to develop could be used to suggest modifications to the human cortical map as well.  
-
-
-As noted above, there has been much work on both supervised learning and clustering, and there are many available algorithms for each. However, the completion of Aims 1 and 2 involves more than just choosing between a set of existing algorithms, and will constitute a substantial contribution to biology. The algorithms require the scientist to provide a framework for representing the problem domain, and the way that this framework is set up has a large impact on performance. Creating a good framework can require creatively reconceptualizing the problem domain, and is not merely a mechanical "fine-tuning" of numerical parameters. For example, we believe that domain-specific scoring measures (such as gradient similarity, which is discussed in Preliminary Work) may be necessary in order to achieve the best results in this application.
-
-We are aware of four existing efforts to relate spatial gene expression data to anatomy through computational methods.
-
-\cite{carson_data_2005} refers to GeneAtlas. GeneAtlas allows the user to construct a search query by freely demarcating one or two 2-D regions on sagittal slices, and then to specify either the strength of expression or the name of another gene whose expression pattern is to be matched. GeneAtlas differs from our Aim 1 in at least two ways. First, GeneAtlas finds only single genes, whereas we will also look for combinations of genes\footnote{See Preliminary Data for an example of an area which cannot be marked by any single gene in the dataset, but which can be marked by a combination.}. Second, at least for the custom spatial search, Gene Atlas appears to use a simple pointwise scoring method (strength of expression), whereas we will also use geometric metrics such as gradient similarity. 
-
-\cite{venkataraman_emage_2008} todo
+As noted above, there has been much work on both supervised learning and clustering, and there are many available algorithms for each. However, the algorithms require the scientist to provide a framework for representing the problem domain, and the way that this framework is set up has a large impact on performance. Creating a good framework can require creatively reconceptualizing the problem domain, and is not merely a mechanical "fine-tuning" of numerical parameters. For example, we believe that domain-specific scoring measures (such as gradient similarity, which is discussed in Preliminary Work) may be necessary in order to achieve the best results in this application.
+
+We are aware of three existing efforts to find marker genes from spatial gene expression data using automated methods.
+
+\cite{carson_data_2005} describes GeneAtlas. GeneAtlas allows the user to construct a search query by freely demarcating one or two 2-D regions on sagittal slices, and then to specify either the strength of expression or the name of another gene whose expression pattern is to be matched. GeneAtlas differs from our Aim 1 in at least two ways. First, GeneAtlas finds only single genes, whereas we will also look for combinations of genes\footnote{See Preliminary Data for an example of an area which cannot be marked by any single gene in the dataset, but which can be marked by a combination.}. Second, at least for the custom spatial search, GeneAtlas appears to use a simple pointwise scoring method (strength of expression), whereas we will also use geometric metrics such as gradient similarity.
+
+\cite{ng_anatomic_2009} describes AGEA, "Anatomic Gene Expression
+Atlas". AGEA has three
+components:
+
+* Gene Finder: The user selects a seed voxel and the system (1) chooses a
+cluster which includes the seed voxel, and (2) yields a list of genes
+which are overexpressed in that cluster. (Note: the ABA website also contains pre-prepared lists of overexpressed genes for selected structures.)
+
+* Correlation: The user selects a seed voxel and
+the system shows the user how much correlation there is between the gene
+expression profile of the seed voxel and every other voxel.
+
+* Clusters: AGEA includes a precomputed hierarchical clustering of voxels based on a recursive bifurcation algorithm with correlation as the similarity metric.
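
The idea behind the Correlation component described above can be sketched in a few lines. This is only an illustration of a seed-voxel correlation map, not AGEA's actual implementation; the function names, the toy expression profiles, and the convention of returning 0.0 for a flat profile are all assumptions made for this example.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumed convention for profiles with no variation
    return cov / sqrt(var_x * var_y)


def correlation_map(profiles, seed):
    """Correlate the seed voxel's profile with the profile of every voxel."""
    seed_profile = profiles[seed]
    return {voxel: pearson(seed_profile, p) for voxel, p in profiles.items()}


# Hypothetical toy data: voxel coordinate -> expression level of four genes.
profiles = {
    (0, 0, 0): [1.0, 2.0, 3.0, 4.0],
    (0, 0, 1): [2.0, 4.0, 6.0, 8.0],
    (0, 1, 0): [4.0, 3.0, 2.0, 1.0],
}
corr = correlation_map(profiles, seed=(0, 0, 0))
```

Here `corr` maps every voxel to its correlation with the seed, so it can be rendered directly as a 3-D image.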
+
+Gene Finder is different from our Aim 1 in at least three ways. First, Gene Finder finds only single genes, whereas we will also look for combinations of genes. Second, Gene Finder can only use overexpression as a marker, whereas we will also search for underexpression. Third, Gene Finder uses a simple pointwise score\footnote{"Expression energy ratio", which captures overexpression.}, whereas we will also use geometric scores such as gradient similarity. The Preliminary Data section contains evidence that each of our three choices is the right one.
+
+In summary, none of the previous projects explores combinations of marker genes, and none of their publications compare the results obtained by using different algorithms or scoring methods.
+
+
+
+
+=== Aim 2 ===
+
+\vspace{0.3cm}**Machine learning terminology: clustering**
+
+If one is given a dataset consisting merely of instances, with no class labels, then analysis of the dataset is referred to as __unsupervised learning__ in the jargon of machine learning. One thing that can be done with such a dataset is to group instances together. A set of similar instances is called a __cluster__, and the activity of grouping the data into clusters is called clustering or cluster analysis.
+
+The task of deciding how to carve up a structure into anatomical regions can be put into these terms. The instances are once again voxels (or pixels) along with their associated gene expression profiles. We make the assumption that voxels from the same region have similar gene expression profiles, at least compared to the other regions. This means that clustering voxels is the same as finding potential regions; we seek a partitioning of the voxels into regions, that is, into clusters of voxels with similar gene expression.
+
+It is desirable to determine not just one set of regions, but also how these regions relate to each other, if at all; perhaps some of the regions are more similar to each other than to the rest, suggesting that, although at a fine spatial scale they could be considered separate, on a coarser spatial scale they could be grouped together into one large region. This suggests the outcome of clustering may be a hierarchical tree of clusters, rather than a single set of clusters which partition the voxels. This is called hierarchical clustering.
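
As a minimal sketch of what a hierarchical tree of clusters looks like, the following builds one bottom-up with average-linkage agglomerative clustering. The choice of algorithm, the Euclidean distance, the string voxel identifiers, and the toy profiles are all assumptions made for illustration; in particular, this is not the recursive bifurcation scheme used by AGEA.

```python
from itertools import combinations
from math import dist


def leaves(tree):
    """Flatten a nested-tuple tree into the list of voxel ids it contains."""
    if isinstance(tree, tuple):
        return leaves(tree[0]) + leaves(tree[1])
    return [tree]


def average_linkage(a, b, profiles):
    """Mean Euclidean distance between the members of two clusters."""
    total = sum(dist(profiles[i], profiles[j]) for i in a for j in b)
    return total / (len(a) * len(b))


def hierarchical_clustering(profiles):
    """Repeatedly merge the two closest clusters; return a nested-tuple tree."""
    trees = list(profiles)  # each voxel id starts out as its own cluster
    while len(trees) > 1:
        a, b = min(
            combinations(trees, 2),
            key=lambda pair: average_linkage(
                leaves(pair[0]), leaves(pair[1]), profiles
            ),
        )
        trees.remove(a)
        trees.remove(b)
        trees.append((a, b))  # the merged pair becomes one subtree
    return trees[0]


# Hypothetical toy data: voxel id -> expression level of two genes.
profiles = {
    "v0": [0.0, 0.0],
    "v1": [0.2, 0.0],
    "v2": [5.0, 5.0],
    "v3": [5.1, 5.0],
}
tree = hierarchical_clustering(profiles)
```

Cutting the tree at its root yields two coarse regions; descending one level further yields the individual voxels, which mirrors the fine-scale versus coarse-scale distinction drawn above.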
+
+
+\vspace{0.3cm}**Similarity scores**
+
+A crucial choice when designing a clustering method is how to measure similarity, across either pairs of instances, or clusters, or both. There is much overlap between scoring methods for feature selection (discussed above under Aim 1) and scoring methods for similarity. 
+
+
+\vspace{0.3cm}**Spatially contiguous clusters; image segmentation**
+
+
+We have shown that aim 2 is a type of clustering task. In fact, it is a special type of clustering task because we have an additional constraint on clusters: voxels grouped together into a cluster must be spatially contiguous. In Preliminary Results, we show that one can get reasonable results without enforcing this constraint; however, we plan to compare these results against other methods which guarantee contiguous clusters.
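
Whether a given cluster satisfies the contiguity constraint can be checked with a breadth-first search over the voxel grid. The sketch below assumes integer voxel coordinates and 6-connectivity (voxels sharing a face count as neighbors); both are illustrative choices, not requirements stated in this proposal.

```python
from collections import deque

# Offsets to the six face-sharing neighbors of a voxel (assumed connectivity).
NEIGHBOR_STEPS = [
    (1, 0, 0), (-1, 0, 0),
    (0, 1, 0), (0, -1, 0),
    (0, 0, 1), (0, 0, -1),
]


def is_contiguous(voxels):
    """Return True if the voxels form a single connected component."""
    voxels = set(voxels)
    if not voxels:
        return True
    start = next(iter(voxels))
    seen = {start}
    queue = deque([start])
    while queue:
        x, y, z = queue.popleft()
        for dx, dy, dz in NEIGHBOR_STEPS:
            neighbor = (x + dx, y + dy, z + dz)
            if neighbor in voxels and neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    # Contiguous exactly when the search reached every voxel in the cluster.
    return seen == voxels
```

A check like this could be used to score the output of an unconstrained clustering method, for instance by counting how many of its clusters are contiguous.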
+
+Perhaps the biggest source of contiguous clustering algorithms is the field of computer vision, which has produced a variety of image segmentation algorithms. Image segmentation is the task of partitioning the pixels in a digital image into clusters, usually contiguous clusters. Aim 2 is similar to an image segmentation task, with two main differences. First, in our task there are thousands of color channels (one for each gene), rather than just three. This difference is not unprecedented, however: some imaging tasks use more than three colors, for example multispectral and hyperspectral imaging, which are often used to process satellite imagery. Second, and more crucially, there are various cues which are appropriate for detecting sharp object boundaries in a visual scene but which are not appropriate for segmenting abstract spatial data such as gene expression. Although many image segmentation algorithms can be expected to work well for segmenting other sorts of spatially arranged data, some of these algorithms are specialized for visual images.
+
+
+\vspace{0.3cm}**Dimensionality reduction**
+
+
+Unlike aim 1, there is no externally-imposed need to select only a handful of informative genes for inclusion in the instances. However, some clustering algorithms perform better on small numbers of features. There are techniques which "summarize" a larger number of features using a smaller number of features; these techniques go by the name of feature extraction or dimensionality reduction. The small set of features that such a technique yields is called the __reduced feature set__. After the reduced feature set is created, the instances may be replaced by __reduced instances__, which have as their features the reduced feature set rather than the original feature set of all gene expression levels. Note that the features in the reduced feature set do not necessarily correspond to genes; each feature in the reduced set may be any function of the set of gene expression levels.
+
+Another use for dimensionality reduction is to visualize the relationships between regions. For example, one might want to make a 2-D plot upon which each region is represented by a single point, and with the property that regions with similar gene expression profiles should be nearby on the plot (that is, the property that distance between pairs of points in the plot should be proportional to some measure of dissimilarity in gene expression). It is likely that no arrangement of the points on a 2-D plane will exactly satisfy this property -- however, dimensionality reduction techniques allow one to find arrangements of points that approximately satisfy that property. Note that in this application, dimensionality reduction is being applied after clustering, whereas in the previous paragraph, we were talking about using dimensionality reduction before clustering.
+
+
+\vspace{0.3cm}**Clustering genes rather than voxels**
+
+
+Although the ultimate goal is to cluster the instances (voxels or pixels), one strategy to achieve this goal is to first cluster the features (genes). There are two ways that clusters of genes could be used. 
+
+Gene clusters could be used as part of dimensionality reduction: rather than have one feature for each gene, we could have one reduced feature for each gene cluster.
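
As a small sketch of this use of gene clusters, each reduced feature can be computed as the mean expression level of the genes in one cluster. Taking the mean is an assumption made here for illustration (any function of the cluster's expression levels would qualify), and the gene names are hypothetical.

```python
def reduce_instance(expression, gene_clusters):
    """Replace per-gene features with one mean-expression feature per cluster."""
    return [
        sum(expression[gene] for gene in cluster) / len(cluster)
        for cluster in gene_clusters
    ]


# Hypothetical voxel with three genes, reduced to two cluster-level features.
voxel_expression = {"geneA": 1.0, "geneB": 3.0, "geneC": 10.0}
gene_clusters = [["geneA", "geneB"], ["geneC"]]
reduced = reduce_instance(voxel_expression, gene_clusters)
```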
+
+Gene clusters could also be used to directly yield a clustering on instances. This is because many genes have an expression pattern which seems to pick out a single, spatially contiguous region. Therefore, it seems likely that an anatomically interesting region will have multiple genes which each individually pick it out\footnote{This would seem to contradict our finding in aim 1 that some cortical areas are combinatorially coded by multiple genes. However, it is possible that the currently accepted cortical maps divide the cortex into regions which are unnatural from the point of view of gene expression; perhaps there is some other way to map the cortex for which each region can be identified by single genes.}. This suggests the following procedure: cluster together genes which pick out similar regions, and then use the regions picked out by the most genes as the final clusters. In the Preliminary Data we show that a number of anatomically recognized cortical regions, as well as some "superregions" formed by lumping together a few regions, are associated with gene clusters in this fashion.
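
One way such a procedure might look is sketched below. It makes several assumptions that do not come from this proposal: a gene "picks out" the voxels where its expression meets a fixed threshold, two regions are "similar" when their Jaccard overlap meets a cutoff, and genes are grouped greedily against the first region seen in each group. The gene names, threshold, and cutoff are hypothetical.

```python
def region_of(expression_by_voxel, threshold):
    """Set of voxels where a gene's expression meets the threshold."""
    return frozenset(
        voxel for voxel, level in expression_by_voxel.items() if level >= threshold
    )


def jaccard(a, b):
    """Overlap between two voxel sets: |intersection| / |union|."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def group_genes_by_region(gene_maps, threshold=0.5, min_overlap=0.8):
    """Greedily group genes whose regions overlap; most popular regions first."""
    groups = []  # list of (representative region, [gene names])
    for gene, expression in gene_maps.items():
        region = region_of(expression, threshold)
        for representative, members in groups:
            if jaccard(region, representative) >= min_overlap:
                members.append(gene)
                break
        else:
            groups.append((region, [gene]))
    return sorted(groups, key=lambda group: len(group[1]), reverse=True)


# Hypothetical toy data: gene -> {voxel id: expression level}.
gene_maps = {
    "geneA": {0: 0.9, 1: 0.8, 2: 0.1, 3: 0.0},
    "geneB": {0: 0.7, 1: 0.9, 2: 0.2, 3: 0.1},
    "geneC": {0: 0.0, 1: 0.1, 2: 0.9, 3: 0.8},
}
groups = group_genes_by_region(gene_maps)
```

The regions at the front of `groups` are the ones picked out by the most genes, and so would be the first candidates for final clusters.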
+
+
+=== Related work ===
+We are aware of three existing efforts to cluster spatial gene expression data.
+
-Factorization (NNMF), and a hierarchial recursive bifurcation clustering scheme based on correlation as the similarity score. The paper yielded impressive results, proving the usefulness of such research. We have run NNMF on the cortical dataset\footnote{We ran "vanilla" NNMF, whereas the paper under discussion used a modified method. Their main modification consisted of adding a soft spatial contiguity constraint. However, on our dataset, NNMF naturally produced spatially contiguous clusters, so no additional constraint was needed. The paper under discussion also mentions that they tried a hierarchial variant of NNMF, which we have not yet tried.} and while the results are promising (see Preliminary Data), we think that it will be possible to find a better method (we also think that more automation of the parts that this paper's authors did manually will be possible).
-
-
-\cite{ng_anatomic_2009} describes AGEA, "Anatomic Gene Expression
-Atlas". AGEA is an analysis tool for the ABA dataset. AGEA has three
-components:
-
-* Gene Finder: The user selects a seed voxel and the system (1) chooses a
-cluster which includes the seed voxel, (2) yields a list of genes
-which are overexpressed in that cluster. (note: the ABA website also contains pre-prepared lists of overexpressed genes for selected structures)
-
-* Correlation: The user selects a seed voxel and
-the shows the user how much correlation there is between the gene
-expression profile of the seed voxel and every other voxel.
-
-* Clusters: AGEA includes a precomputed hierarchial clustering of voxels based on a recursive bifurcation algorithm with correlation as the similarity metric. 
-
-Gene Finder is different from our Aim 1 in at least four ways. First, although the user chooses a seed voxel, Gene Finder, not the user, chooses the cluster for which genes will be found, and in our experience it never chooses cortical areas, instead preferring cortical layers\footnote{\label{layersNotAreas}Because of the way in which Gene Finder chooses a cluster, layers will always be preferred to areas if pairwise correlations between the gene expression of voxels in different areas but the same layer are stronger than pairwise correlatios between the gene expression of voxels in different layers but the same area. This appears to be the case.}. Therefore, Gene Finder cannot be used to find marker genes for cortical areas. Second, Gene Finder finds only single genes, whereas we will also look for combinations of genes\footnote{See Preliminary Data for an example of an area which cannot be marked by any single gene in the dataset, but which can be marked by a combination.}. Third, gene finder can only use overexpression as a marker, whereas in the Preliminary Data we show that underexpression can also be used. Fourth, Gene Finder uses a simple pointwise score\footnote{"Expression energy ratio", which captures overexpression.}, whereas we will also use geometric metrics such as gradient similarity. 
-
-The hierarchial clustering is different from our Aim 2 in at least three ways. First, the clustering finds clusters corresponding to layers, but no clusters corresponding to cortical areas\footnote{This is for the same reason as in footnote \ref{layersNotAreas}.} \footnote{There are clusters which presumably correspond to the intersection of a layer and an area, but since one area will have many layer-area intersection clusters, further work is needed to make sense of these.} Our Aim 2 will not be accomplished until a clustering is produced which yields areas. Second, AGEA uses perhaps the simplest possible similarity score (correlation), and does no dimensionality reduction before calculating similarity. While it is possible that a more complex system will not do any better than this, we believe further exploration of alternative methods of scoring and dimensionality reduction is warranted. Third, AGEA did not look at clusters of genes; in Preliminary Data we have shown that clusters of genes may identify intersting spatial regions such as cortical areas.
-
-Finally, with the except of \cite{thompson_genomic_2008}, none of the publications discussed above compare the results obtained by using different algorithms or scoring methods. \cite{thompson_genomic_2008} reports that both mNNMF and hierarchial mNNMF clustering were useful, and that hierarchial recursive bifurcation gave similar results.
-
-To summarize, in comparison to our Aim 1, none of the previous projects explores combinations of marker genes, and w/r/t both aims, there has been almost no experimentation with or comparison of different algorithms or scoring methods. todo
+Factorization (NNMF), and a hierarchical recursive bifurcation clustering scheme based on correlation as the similarity score. The paper yielded impressive results, proving the usefulness of such research. We have run NNMF on the cortical dataset\footnote{We ran "vanilla" NNMF, whereas the paper under discussion used a modified method. Their main modification consisted of adding a soft spatial contiguity constraint. However, on our dataset, NNMF naturally produced spatially contiguous clusters, so no additional constraint was needed. The paper under discussion also mentions that they tried a hierarchical variant of NNMF, which we have not yet tried.} and while the results are promising (see Preliminary Data), we think that it will be possible to find an even better method. In addition, this paper described a visual screening of the data, specifically, a visual analysis of 6000 genes with the primary purpose of observing how the spatial pattern of their expression coincided with the regions that had been identified by NNMF. We propose to do this sort of screening automatically, which would yield an objective, quantifiable result, rather than qualitative observations.
+
+
+
+
+%% todo \cite{thompson_genomic_2008} reports that both mNNMF and hierarchial mNNMF clustering were useful, and that hierarchial recursive bifurcation gave similar results.
+
+
+
+AGEA's\cite{ng_anatomic_2009} hierarchical clustering differs from our Aim 2 in at least two ways. First, AGEA uses perhaps the simplest possible similarity score (correlation), and does no dimensionality reduction before calculating similarity. While it is possible that a more complex system will not do any better than this, we believe further exploration of alternative methods of scoring and dimensionality reduction is warranted. Second, AGEA did not look at clusters of genes; in Preliminary Data we have shown that clusters of genes may identify interesting spatial regions such as cortical areas.
+
+\cite{venkataraman_emage_2008} todo
+
+
+In summary, although these projects obtained hierarchical clusterings, there has not been much comparison between different algorithms or scoring methods, so it is likely that the best clustering method for this application has not yet been found.
+
+
+
+=== Aim 3 ===
+
+\vspace{0.3cm}**Background**
+
+The cortex is divided into areas and layers. To a first approximation, the parcellation of the cortex into areas can be drawn as a 2-D map on the surface of the cortex. In the third dimension, the boundaries between the areas continue downwards into the cortical depth, perpendicular to the surface. The layer boundaries run parallel to the surface. One can picture an area of the cortex as a slice of many-layered cake.
+
+Although it is known that different cortical areas have distinct roles in both normal functioning and in disease processes, there are no known marker genes for many cortical areas. When it is necessary to divide a tissue sample into cortical areas, this is a manual process that requires a skilled human to combine multiple visual cues and interpret them in the context of their approximate location upon the cortical surface. 
+
+Even the questions of how many areas should be recognized in cortex, and what their arrangement is, are still not completely settled. A proposed division of the cortex into areas is called a cortical map. In the rodent, the lack of a single agreed-upon map can be seen by contrasting the recent maps given by Swanson\cite{swanson_brain_2003} on the one hand, and Paxinos and Franklin\cite{paxinos_mouse_2001} on the other. While the maps are certainly very similar in their general arrangement, significant differences remain in the details.
+
+\vspace{0.3cm}**The Allen Mouse Brain Atlas dataset**
+
+The Allen Mouse Brain Atlas (ABA) data was produced by doing in-situ hybridization on slices of male, 56-day-old C57BL/6J mouse brains. Pictures were taken of the processed slice, and these pictures were semi-automatically analyzed in order to create a digital measurement of gene expression levels at each location in each slice. Per slice, cellular spatial resolution is achieved. Using this method, a single physical slice can only be used to measure one single gene; many different mouse brains were needed in order to measure the expression of many genes. 
+
+Next, an automated nonlinear alignment procedure located the 2D data from the various slices in a single 3D coordinate system. In the final 3D coordinate system, voxels are cubes 200 microns on a side. There are 67x41x58 \= 159,326 voxels in the 3D coordinate system, of which 51,533 are in the brain\cite{ng_anatomic_2009}.
+
+Mus musculus, the common house mouse, is thought to contain about 22,000 protein-coding genes\cite{waterston_initial_2002}. The ABA contains data on about 20,000 genes in sagittal sections, out of which over 4,000 genes are also measured in coronal sections. Our dataset is derived from only the coronal subset of the ABA, because the sagittal data does not cover the entire cortex, and has greater registration error\cite{ng_anatomic_2009}. Genes were selected by the Allen Institute for coronal sectioning based on "classes of known neuroscientific interest... or through post hoc identification of a marked non-ubiquitous expression pattern"\cite{ng_anatomic_2009}.
+
+The ABA is not the only large public spatial gene expression dataset. Other such resources include GENSAT\cite{gong_gene_2003}, GenePaint\cite{visel_genepaint_2004}, its sister project GeneAtlas\cite{carson_data_2005}, BGEM\cite{magdaleno_bgem_2006}, EMAGE\cite{?}, EurExpress (http://www.eurexpress.org/ee/; EurExpress data is also entered into EMAGE), todo. With the exception of the ABA, GenePaint, and EMAGE, most of these resources have not (yet) extracted the expression intensity from the ISH images and registered the results into a single 3-D space, and only ABA and EMAGE make this form of data available for public download from the website. Many of these resources focus on developmental gene expression.
+
+
+
+\vspace{0.3cm}**Significance**
+
+The method developed in aim (1) will be applied to each cortical area to find a set of marker genes such that the combinatorial expression pattern of those genes uniquely picks out the target area. Finding marker genes will be useful for drug discovery as well as for experimentation because marker genes can be used to design interventions which selectively target individual cortical areas.
+
+The application of the marker gene finding algorithm to the cortex will also support the development of new neuroanatomical methods. In addition to finding markers for each individual cortical area, we will find a small panel of genes that can find many of the areal boundaries at once. This panel of marker genes will enable an ISH protocol with which experimenters can more easily identify which anatomical areas are present in small samples of cortex.
+
+The method developed in aim (3) will provide a genoarchitectonic viewpoint that will contribute to the creation of a better map. The development of present-day cortical maps was driven by the application of histological stains. It is conceivable that if a different set of stains had been available which identified a different set of features, then today's cortical maps would have come out differently. Since the number of classes of stains is small compared to the number of genes, it is likely that there are many repeated, salient spatial patterns in the gene expression which have not yet been captured by any stain. Therefore, current ideas about cortical anatomy need to incorporate what we can learn from looking at the patterns of gene expression.
+
+While we do not here propose to analyze human gene expression data, it is conceivable that the methods we propose to develop could be used to suggest modifications to the human cortical map as well.  
+
+
+=== Related work ===
+
+\cite{ng_anatomic_2009} describes the application of AGEA to the cortex. The paper describes interesting results on the structure of correlations between voxel gene expression profiles within a handful of cortical areas. However, this sort of analysis is not related to either of our aims, as it neither finds marker genes, nor does it suggest a cortical map based on gene expression data. Neither of the other components of AGEA can be applied to cortical areas; AGEA's Gene Finder cannot be used to find marker genes for cortical areas; and AGEA's hierarchical clustering does not produce clusters corresponding to cortical areas.
+
+In both cases, the root cause is that pairwise correlations between the gene expression of voxels in different areas but the same layer are stronger than pairwise correlations between the gene expression of voxels in different layers but the same area. Therefore a pairwise voxel correlation clustering algorithm will always create clusters representing cortical layers, not areas. This is why the hierarchical clustering does not find cortical areas\footnote{There are clusters which presumably correspond to the intersection of a layer and an area, but since one area will have many layer-area intersection clusters, further work is needed to make sense of these.}. The reason that Gene Finder cannot find marker genes for cortical areas is that in Gene Finder, although the user chooses a seed voxel, Gene Finder chooses the ROI for which genes will be found, and it creates that ROI by (pairwise voxel correlation) clustering around the seed.
+
+In summary, for all three aims, (a) none of the previous projects explores combinations of marker genes, (b) there has been almost no comparison of different algorithms or scoring methods, and (c) there has been no work on computationally finding marker genes for cortical areas, or on finding a hierarchical clustering that will yield a map of cortical areas de novo from gene expression data.
+
+Our project is guided by a concrete application with a well-specified criterion of success (how well we can find marker genes for \begin{latex}/\end{latex} reproduce the layout of cortical areas), which will provide a solid basis for comparing different methods.
+
@@ -255,7 +279,7 @@
-\vspace{0.3cm}**Geometric and pointwise scoring methods provide complementary information**
+\vspace{0.3cm}**Gradient similarity provides information complementary to correlation**
@@ -272,7 +296,7 @@
-\vspace{0.3cm}**Using combinations of multiple genes is necessary and sufficient to delineate some cortical areas**
+\vspace{0.3cm}**Combinations of multiple genes are useful**
@@ -294,7 +318,7 @@
-\vspace{0.3cm}**Areas can sometimes be marked by underexpression**
+\vspace{0.3cm}**Underexpression of a gene can serve as a marker**