cg

changeset 46:a44e9ad61efa
.
author: bshanks@bshanks.dyndns.org
date: Wed Apr 15 13:57:53 2009 -0700
parents: 354ea5edb5f6
children: 33c10c13f9a3
files: grant.html grant.odt grant.pdf grant.txt
--- a/grant.html	Wed Apr 15 03:20:19 2009 -0700
+++ b/grant.html	Wed Apr 15 13:57:53 2009 -0700
@@ -109,24 +109,27 @@
-[10 ] todo
-[4 ] todo
-In summary, none of the previous projects explores combinations of marker genes, and none of their publications compare
-the results obtained by using different algorithms or scoring methods.
+[11 ] todo
+[4 ] describes a technique to find combinations of marker genes to pick out an anatomical region. They use an evolutionary
+algorithm to evolve logical operators which combine boolean (thresholded) images in order to match a target image. Their
+match score is Jaccard similarity, which is equal to the number of true pixels in the intersection of the two images, divided
+by the number of pixels in their union.
+In summary, only one of the previous projects explores combinations of marker genes, and none of their publications
+compare the results obtained by using different algorithms or scoring methods.
-together.  A set of similar instances is called a cluster, and the activity of finding grouping the data into clusters is called
-clustering or cluster analysis.
-The task of deciding how to carve up a structure into anatomical regions can be put into these terms. The instances are
-once again voxels (or pixels) along with their associated gene expression profiles. We make the assumption that voxels from
+together.  A set of similar instances is called a cluster, and the activity of finding grouping the data into clusters is called
+clustering or cluster analysis.
+The task of deciding how to carve up a structure into anatomical regions can be put into these terms. The instances are
+once again voxels (or pixels) along with their associated gene expression profiles. We make the assumption that voxels from
@@ -177,30 +180,34 @@
+_________________________________________
+   5This would seem to contradict our finding in aim 1 that some cortical areas are combinatorially coded by multiple genes.  However, it is
+possible that the currently accepted cortical maps divide the cortex into regions which are unnatural from the point of view of gene expression;
+perhaps there is some other way to map the cortex for which each region can be identified by single genes. Another possibility is that, although
-_________________________________________
-   5This would seem to contradict our finding in aim 1 that some cortical areas are combinatorially coded by multiple genes.  However, it is
-possible that the currently accepted cortical maps divide the cortex into regions which are unnatural from the point of view of gene expression;
-perhaps there is some other way to map the cortex for which each region can be identified by single genes. Another possibility is that, although
-the cluster prototype fits an anatomical region, the individual genes are each somewhat different from the prototype.
-We are aware of three existing efforts to cluster spatial gene expression data.
+We are aware of four existing efforts to cluster spatial gene expression data.
+In an interesting twist, [4] applies their technique for finding combinations of marker genes for the purpose of clustering
+genes around a &#8220;seed gene&#8221;. The way they do this is by using the pattern of expression of the seed gene as the target image,
+and then searching for other genes which can be combined to reproduce this pattern.  Those other genes which are found
+are considered to be related to the seed. The same team also describes a method[10] for finding &#8220;association rules&#8221; such as,
+&#8220;if this voxel is expressed in by any gene, then that voxel is probably also expressed in by the same gene&#8221;.  This could be
+useful as part of a procedure for clustering voxels.
-[10 ] todo
-In summary, although these projects obtained hierarchial clusterings, there has not been much comparison between
-different algorithms or scoring methods, so it is likely that the best clustering method for this application has not yet been
-found.
+[11 ] todo
+In summary, although these projects obtained clusterings, there has not been much comparison between different algo-
+rithms or scoring methods, so it is likely that the best clustering method for this application has not yet been found.
@@ -225,22 +232,22 @@
-Mus musculus, the common house mouse, is thought to contain about 22,000 protein-coding genes[12]. The ABA contains
+Mus musculus, the common house mouse, is thought to contain about 22,000 protein-coding genes[13]. The ABA contains
-dataset is derived from only the coronal subset of the ABA, because the sagittal data does not cover the entire cortex,
-and has greater registration error[6]. Genes were selected by the Allen Institute for coronal sectioning based on, &#8220;classes of
-known neuroscientific interest... or through post hoc identification of a marked non-ubiquitous expression pattern&#8221;[6].
-The ABA is not the only large public spatial gene expression dataset.   Other such resources include GENSAT[3],
-GenePaint[11], its sister project GeneAtlas[1], BGEM[5], EMAGE[?], EurExpress (http://www.eurexpress.org/ee/; Eu-
-rExpress data is also entered into EMAGE), todo. With the exception of the ABA, GenePaint, and EMAGE, most of these
-resources, have not (yet) extracted the expression intensity from the ISH images and registered the results into a single 3-D
-space, and only ABA and EMAGE make this form of data available for public download from the website.  Many of these
-resources focus on developmental gene expression.
-Significance
-___________________________
-   6We ran &#8220;vanilla&#8221; NNMF, whereas the paper under discussion used a modified method.  Their main modification consisted of adding a soft
+dataset is derived from only the coronal subset of the ABA, because the sagittal data does not cover the entire cortex, and
+_________________________________________
+the cluster prototype fits an anatomical region, the individual genes are each somewhat different from the prototype.
+    6We ran &#8220;vanilla&#8221; NNMF, whereas the paper under discussion used a modified method.  Their main modification consisted of adding a soft
+also has greater registration error[6]. Genes were selected by the Allen Institute for coronal sectioning based on, &#8220;classes of
+known neuroscientific interest... or through post hoc identification of a marked non-ubiquitous expression pattern&#8221;[6].
+TheABA is not the only large public spatial gene expression dataset.   Other such resources include GENSAT[3],
+GenePaint[12], its sister project GeneAtlas[1], BGEM[5], EMAGE[11], EurExpress7, todo. With the exception of the ABA,
+GenePaint, and EMAGE, most of these resources, have not (yet) extracted the expression intensity from the ISH images
+and registered the results into a single 3-D space, and only ABA and EMAGE make this form of data available for public
+download from the website. Many of these resources focus on developmental gene expression.
+Significance
@@ -260,21 +267,21 @@
-between voxel gene expression profiles within a handful of cortical areas.  However, this sort of analysis is not related to
-either of our aims, as it neither finds marker genes, nor does it suggest a cortical map based on gene expression data. Neither
-of the other components of AGEA can be applied to cortical areas; AGEA&#8217;s Gene Finder cannot be used to find marker
-genes for most cortical areas; and AGEA&#8217;s hierarchial clustering does not produce clusters corresponding to most cortical
-areas7 .
-In summary, for all three aims, (a) none of the previous projects explores combinations of marker genes, (b) there has
+between voxel gene expression profiles within a handful of cortical areas. However, this sort of analysis is not related to either
+of our aims, as it neither finds marker genes, nor does it suggest a cortical map based on gene expression data.  Neither of
+the other components of AGEA can be applied to cortical areas; AGEA&#8217;s Gene Finder cannot be used to find marker genes
+for the cortical areas; and AGEA&#8217;s hierarchial clustering does not produce clusters corresponding to the cortical areas8.
+In summary, for all three aims, (a) only one of the previous projects explores combinations of marker genes, (b) there has
-   7In both cases, the root cause is that pairwise correlations between the gene expression of voxels in different areas but the same layer are
+   7http://www.eurexpress.org/ee/; EurExpress data is also entered into EMAGE
+    8In both cases, the root cause is that pairwise correlations between the gene expression of voxels in different areas but the same layer are
-correlation clustering algorithm will often create clusters representing cortical layers, not areas.  This is why the hierarchial clustering does not
+correlation clustering algorithm will tend to create clusters representing cortical layers, not areas. This is why the hierarchial clustering does not
@@ -310,7 +317,7 @@
-as either a member of a particular anatomical area, or not.  The target area can be represented as a binary mask over the
+as either a member of a particular anatomical area, or not. The target area can be represented as a boolean mask over the
@@ -322,12 +329,12 @@
-for each gene, five thresholded binary masks of the gene data.  For each gene, we created a binary mask of its expression
+for each gene, five thresholded boolean masks of the gene data. For each gene, we created a boolean mask of its expression
-binary masks such that the conditional entropy of the target area&#8217;s binary mask, conditioned upon the pair of gene expression
-binary masks, is minimized.
+boolean masks such that the conditional entropy of the target area&#8217;s boolean mask, conditioned upon the pair of gene
+expression boolean masks, is minimized.
@@ -359,24 +366,24 @@
-Fig. . The top row of Fig.  displays the 3 genes which most match area AUD, according to a pointwise method8. The bottom
-row displays the 3 genes which most match AUD according to a method which considers local geometry9  The pointwise
-method in the top row identifies genes which express more strongly in AUD than outside of it; its weakness is that this
-includes many areas which don&#8217;t have a salient border matching the areal border.  The geometric method identifies genes
-whose salient expression border seems to partially line up with the border of AUD; its weakness is that this includes genes
-which don&#8217;t express over the entire area. Genes which have high rankings using both pointwise and border criteria, such as
-Aph1a in the example, may be particularly good markers. None of these genes are, individually, a perfect marker for AUD;
-we deliberately chose a &#8220;difficult&#8221; area in order to better contrast pointwise with geometric methods.
+Fig. . The top row of Fig.   displays the 3 genes which most match area AUD, according to a pointwise method9.  The
+bottom row displays the 3 genes which most match AUD according to a method which considers local geometry10  The
+pointwise method in the top row identifies genes which express more strongly in AUD than outside of it; its weakness is
+that this includes many areas which don&#8217;t have a salient border matching the areal border. The geometric method identifies
+genes whose salient expression border seems to partially line up with the border of AUD; its weakness is that this includes
+genes which don&#8217;t express over the entire area.  Genes which have high rankings using both pointwise and border criteria,
+such as Aph1a in the example, may be particularly good markers.  None of these genes are, individually, a perfect marker
+for AUD; we deliberately chose a &#8220;difficult&#8221; area in order to better contrast pointwise with geometric methods.
-natorially.  according to logistic regression, gene wwc110  is the best fit single gene for predicting whether or not a pixel on
-_________________________________________
-   8For each gene, a logistic regression in which the response variable was whether or not a surface pixel was within area AUD, and the predictor
+natorially.  according to logistic regression, gene wwc111  is the best fit single gene for predicting whether or not a pixel on
+_________________________________________
+   9For each gene, a logistic regression in which the response variable was whether or not a surface pixel was within area AUD, and the predictor
-    9For each gene the gradient similarity (see section ??) between (a) a map of the expression of each gene on the cortical surface and (b) the
+   10For each gene the gradient similarity (see section ??) between (a) a map of the expression of each gene on the cortical surface and (b) the
-   10&#8220;WW, C2 and coiled-coil domain containing 1&#8221;; EntrezGene ID 211652
+   11&#8220;WW, C2 and coiled-coil domain containing 1&#8221;; EntrezGene ID 211652
@@ -389,7 +396,7 @@
-Gene mtif211 is shown in figure the upper-right of Fig. . Mtif2 captures MO&#8217;s upper-left boundary, but not its lower-right
+Gene mtif212 is shown in figure the upper-right of Fig. . Mtif2 captures MO&#8217;s upper-left boundary, but not its lower-right
@@ -400,7 +407,7 @@
-surface pixels based on their gene expression profiles. We achieved classification accuracy of about 81%12. As noted above,
+surface pixels based on their gene expression profiles. We achieved classification accuracy of about 81%13. As noted above,
@@ -412,8 +419,8 @@
-  11&#8220;mitochondrial translational initiation factor 2&#8221;; EntrezGene ID 76784
-   125-fold cross-validation.
+  12&#8220;mitochondrial translational initiation factor 2&#8221;; EntrezGene ID 76784
+   135-fold cross-validation.
@@ -469,8 +476,9 @@
-[4]Jano Hemert and Richard Baldock.   Matching Spatial Regions with Combinations of Interacting Gene Expression
-Patterns, pages 347&#8211;361. 2008.
+[4]Jano Hemert and Richard Baldock. Matching Spatial Regions with Combinations of Interacting Gene Expression Pat-
+terns, volume 13 of Communications in Computer and Information Science, pages 347&#8211;361. Springer Berlin Heidelberg,
+2008.
@@ -486,12 +494,13 @@
-[10]Shanmugasundaram Venkataraman, Peter Stevenson, Yiya Yang, Lorna Richardson, Nicholas Burton, Thomas P. Perry,
+[10]Jano van Hemert and Richard Baldock. Mining Spatial Gene Expression Data for Association Rules, pages 66&#8211;76. 2007.
+[11]Shanmugasundaram Venkataraman, Peter Stevenson, Yiya Yang, Lorna Richardson, Nicholas Burton, Thomas P. Perry,
-[11]Axel Visel, Christina Thaller, and Gregor Eichele.  GenePaint.org:  an atlas of gene expression patterns in the mouse
+[12]Axel Visel, Christina Thaller, and Gregor Eichele.  GenePaint.org:  an atlas of gene expression patterns in the mouse
-[12]Robert H Waterston, Kerstin Lindblad-Toh, Ewan Birney, Jane Rogers, Josep F Abril, Pankaj Agarwal, Richa Agar-
+[13]Robert H Waterston, Kerstin Lindblad-Toh, Ewan Birney, Jane Rogers, Josep F Abril, Pankaj Agarwal, Richa Agar-
--- a/grant.txt	Wed Apr 15 03:20:19 2009 -0700
+++ b/grant.txt	Wed Apr 15 13:57:53 2009 -0700
@@ -94,9 +94,9 @@
-\cite{hemert_matching_2008} todo
-
-In summary, none of the previous projects explores combinations of marker genes, and none of their publications compare the results obtained by using different algorithms or scoring methods.
+\cite{hemert_matching_2008} describes a technique to find combinations of marker genes to pick out an anatomical region. They use an evolutionary algorithm to evolve logical operators which combine boolean (thresholded) images in order to match a target image. Their match score is Jaccard similarity, which is equal to the number of true pixels in the intersection of the two images, divided by the number of pixels in their union.
+
+In summary, only one of the previous projects explores combinations of marker genes, and none of their publications compare the results obtained by using different algorithms or scoring methods.
@@ -144,7 +144,7 @@
-We are aware of three existing efforts to cluster spatial gene expression data.
+We are aware of four existing efforts to cluster spatial gene expression data.
@@ -159,6 +159,7 @@
+In an interesting twist, \cite{hemert_matching_2008} applies their technique for finding combinations of marker genes for the purpose of clustering genes around a "seed gene". The way they do this is by using the pattern of expression of the seed gene as the target image, and then searching for other genes which can be combined to reproduce this pattern. Those other genes which are found are considered to be related to the seed. The same team also describes a method\cite{van_hemert_mining_2007} for finding "association rules" such as, "if this voxel is expressed in by any gene, then that voxel is probably also expressed in by the same gene". This could be useful as part of a procedure for clustering voxels.
@@ -166,7 +167,7 @@
-In summary, although these projects obtained hierarchial clusterings, there has not been much comparison between different algorithms or scoring methods, so it is likely that the best clustering method for this application has not yet been found.
+In summary, although these projects obtained clusterings, there has not been much comparison between different algorithms or scoring methods, so it is likely that the best clustering method for this application has not yet been found.
@@ -186,9 +187,9 @@
-Mus musculus, the common house mouse, is thought to contain about 22,000 protein-coding genes\cite{waterston_initial_2002}. The ABA contains data on about 20,000 genes in sagittal sections, out of which over 4,000 genes are also measured in coronal sections. Our dataset is derived from only the coronal subset of the ABA, because the sagittal data does not cover the entire cortex, and has greater registration error\cite{ng_anatomic_2009}. Genes were selected by the Allen Institute for coronal sectioning based on, "classes of known neuroscientific interest... or through post hoc identification of a marked non-ubiquitous expression pattern"\cite{ng_anatomic_2009}. 
-
-The ABA is not the only large public spatial gene expression dataset. Other such resources include GENSAT\cite{gong_gene_2003}, GenePaint\cite{visel_genepaint.org:atlas_2004}, its sister project GeneAtlas\cite{carson_data_2005}, BGEM\cite{magdaleno_bgem:in_2006}, EMAGE\cite{?}, EurExpress (http://www.eurexpress.org/ee/; EurExpress data is also entered into EMAGE), todo. With the exception of the ABA, GenePaint, and EMAGE, most of these resources, have not (yet) extracted the expression intensity from the ISH images and registered the results into a single 3-D space, and only ABA and EMAGE make this form of data available for public download from the website. Many of these resources focus on developmental gene expression.
+Mus musculus, the common house mouse, is thought to contain about 22,000 protein-coding genes\cite{waterston_initial_2002}. The ABA contains data on about 20,000 genes in sagittal sections, out of which over 4,000 genes are also measured in coronal sections. Our dataset is derived from only the coronal subset of the ABA, because the sagittal data does not cover the entire cortex, and also has greater registration error\cite{ng_anatomic_2009}. Genes were selected by the Allen Institute for coronal sectioning based on, "classes of known neuroscientific interest... or through post hoc identification of a marked non-ubiquitous expression pattern"\cite{ng_anatomic_2009}. 
+
+The ABA is not the only large public spatial gene expression dataset. Other such resources include GENSAT\cite{gong_gene_2003}, GenePaint\cite{visel_genepaint.org:atlas_2004}, its sister project GeneAtlas\cite{carson_data_2005}, BGEM\cite{magdaleno_bgem:in_2006}, EMAGE\cite{venkataraman_emage_2008}, EurExpress\footnote{http://www.eurexpress.org/ee/; EurExpress data is also entered into EMAGE}, todo. With the exception of the ABA, GenePaint, and EMAGE, most of these resources, have not (yet) extracted the expression intensity from the ISH images and registered the results into a single 3-D space, and only ABA and EMAGE make this form of data available for public download from the website. Many of these resources focus on developmental gene expression.
@@ -205,10 +206,12 @@
-\cite{ng_anatomic_2009} describes the application of AGEA to the cortex. The paper describes interesting results on the structure of correlations between voxel gene expression profiles within a handful of cortical areas. However, this sort of analysis is not related to either of our aims, as it neither finds marker genes, nor does it suggest a cortical map based on gene expression data. Neither of the other components of AGEA can be applied to cortical areas; AGEA's Gene Finder cannot be used to find marker genes for most cortical areas; and AGEA's hierarchial clustering does not produce clusters corresponding to most cortical areas\footnote{In both cases, the root cause is that pairwise correlations between the gene expression of voxels in different areas but the same layer are often stronger than pairwise correlations between the gene expression of voxels in different layers but the same area. Therefore, a pairwise voxel correlation clustering algorithm will often create clusters representing cortical layers, not areas. This is why the hierarchial clustering does not find most cortical areas (there are clusters which presumably correspond to the intersection of a layer and an area, but since one area will have many layer-area intersection clusters, further work is needed to make sense of these). The reason that Gene Finder cannot find marker genes for most cortical areas is that in Gene Finder, although the user chooses a seed voxel, Gene Finder chooses the ROI for which genes will be found, and it creates that ROI by (pairwise voxel correlation) clustering around the seed.}.
-
-
-In summary, for all three aims, (a) none of the previous projects explores combinations of marker genes, (b) there has been almost no comparison of different algorithms or scoring methods, and (c) there has been no work on computationally finding marker genes for cortical areas, or on finding a hierarchial clustering that will yield a map of cortical areas de novo from gene expression data.
+\cite{ng_anatomic_2009} describes the application of AGEA to the cortex. The paper describes interesting results on the structure of correlations between voxel gene expression profiles within a handful of cortical areas. However, this sort of analysis is not related to either of our aims, as it neither finds marker genes, nor does it suggest a cortical map based on gene expression data. Neither of the other components of AGEA can be applied to cortical areas; AGEA's Gene Finder cannot be used to find marker genes for the cortical areas; and AGEA's hierarchial clustering does not produce clusters corresponding to the cortical areas\footnote{In both cases, the root cause is that pairwise correlations between the gene expression of voxels in different areas but the same layer are often stronger than pairwise correlations between the gene expression of voxels in different layers but the same area. Therefore, a pairwise voxel correlation clustering algorithm will tend to create clusters representing cortical layers, not areas. This is why the hierarchial clustering does not find most cortical areas (there are clusters which presumably correspond to the intersection of a layer and an area, but since one area will have many layer-area intersection clusters, further work is needed to make sense of these). The reason that Gene Finder cannot find marker genes for most cortical areas is that in Gene Finder, although the user chooses a seed voxel, Gene Finder chooses the ROI for which genes will be found, and it creates that ROI by (pairwise voxel correlation) clustering around the seed.}.
+
+
+%% Most of the projects which have been discussed have been done by the same groups that develop the public datasets. Although these projects make their algorithms available for use on their own website, none of them have released an open-source software toolkit; instead, users are restricted to using the provided algorithms only on their own dataset.
+
+In summary, for all three aims, (a) only one of the previous projects explores combinations of marker genes, (b) there has been almost no comparison of different algorithms or scoring methods, and (c) there has been no work on computationally finding marker genes for cortical areas, or on finding a hierarchial clustering that will yield a map of cortical areas de novo from gene expression data.
@@ -255,7 +258,7 @@
-Recall that the instances are surface pixels, and consider the problem of attempting to classify each instance as either a member of a particular anatomical area, or not. The target area can be represented as a binary mask over the surface pixels. 
+Recall that the instances are surface pixels, and consider the problem of attempting to classify each instance as either a member of a particular anatomical area, or not. The target area can be represented as a boolean mask over the surface pixels. 
@@ -266,9 +269,9 @@
-The simplest way to use information theory is on discrete data, so we discretized our gene expression data by creating, for each gene, five thresholded binary masks of the gene data. For each gene, we created a binary mask of its expression levels using each of these thresholds: the mean of that gene, the mean minus one standard deviation, the mean minus two standard deviations, the mean plus one standard deviation, the mean plus two standard deviations.
-
-Now, for each region, we created and ran a forward stepwise procedure which attempted to find pairs of gene expression binary masks such that the conditional entropy of the target area's binary mask, conditioned upon the pair of gene expression binary masks, is minimized.
+The simplest way to use information theory is on discrete data, so we discretized our gene expression data by creating, for each gene, five thresholded boolean masks of the gene data. For each gene, we created a boolean mask of its expression levels using each of these thresholds: the mean of that gene, the mean minus one standard deviation, the mean minus two standard deviations, the mean plus one standard deviation, the mean plus two standard deviations.
+
+Now, for each region, we created and ran a forward stepwise procedure which attempted to find pairs of gene expression boolean masks such that the conditional entropy of the target area's boolean mask, conditioned upon the pair of gene expression boolean masks, is minimized.
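
Reviewer note on the hunks above: the added grant.txt text describes three computations (a Jaccard match score, five thresholded boolean masks per gene, and a conditional entropy minimized over pairs of masks) without showing them. The following is a minimal sketch of how they could be computed, not the author's code. It assumes NumPy, 1-D arrays of surface pixels, and that a mask is true where expression is strictly above the threshold; the changeset does not state the comparison direction. All function names are hypothetical, and the stepwise search is only one plausible reading of "forward stepwise procedure".

```python
import numpy as np


def jaccard(x, y):
    """True pixels in the intersection divided by true pixels in the union."""
    union = np.logical_or(x, y).sum()
    # Assumption: two empty masks are treated as a perfect match.
    return float(np.logical_and(x, y).sum() / union) if union else 1.0


def threshold_masks(expr):
    """Five boolean masks of one gene, thresholded at mean + k*std for k = -2..2.

    Assumption: a pixel is True when its expression is strictly above the threshold.
    """
    mean, std = expr.mean(), expr.std()
    return [expr > mean + k * std for k in (-2, -1, 0, 1, 2)]


def entropy(labels):
    """Shannon entropy in bits of a 1-D array of non-negative integer labels."""
    counts = np.bincount(labels)
    p = counts[counts > 0] / labels.size
    return float(-(p * np.log2(p)).sum())


def conditional_entropy(target, mask_a, mask_b):
    """H(target | mask_a, mask_b) = H(target, a, b) - H(a, b), in bits."""
    t = np.ravel(target).astype(np.int64)
    a = np.ravel(mask_a).astype(np.int64)
    b = np.ravel(mask_b).astype(np.int64)
    joint_ab = 2 * a + b          # encode the pair (a, b) as one label in 0..3
    joint_tab = 4 * t + joint_ab  # encode the triple (t, a, b) as one label in 0..7
    return entropy(joint_tab) - entropy(joint_ab)


def best_pair_stepwise(target, masks):
    """Greedy two-step search: best single mask first, then its best partner."""
    # Conditioning on (m, m) is the same as conditioning on m alone.
    first = min(range(len(masks)),
                key=lambda i: conditional_entropy(target, masks[i], masks[i]))
    others = [i for i in range(len(masks)) if i != first]
    second = min(others,
                 key=lambda i: conditional_entropy(target, masks[first], masks[i]))
    return (first, second), conditional_entropy(target, masks[first], masks[second])


if __name__ == "__main__":
    target = np.array([True, True, False, False])
    candidates = [np.array([True, False, True, False]),
                  target.copy(),
                  np.zeros(4, dtype=bool)]
    print(best_pair_stepwise(target, candidates))
    print(jaccard(target, candidates[0]))
```

A greedy search evaluates on the order of N candidate masks per step rather than all N² pairs, at the cost of possibly missing a pair whose members are individually uninformative.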