cg

changeset 51:3ebb8f4ea921
author: bshanks@bshanks.dyndns.org
date: Fri Apr 17 12:47:51 2009 -0700
parents: 0669519bc685
children: 074e2be60b38
files: grant.html grant.odt grant.pdf grant.txt
--- a/grant.html	Thu Apr 16 14:50:46 2009 -0700
+++ b/grant.html	Fri Apr 17 12:47:51 2009 -0700
@@ -90,13 +90,12 @@
-We are aware of four existing efforts to find marker genes using spatial gene expression data using automated methods.
-[1 ] describes GeneAtlas.  GeneAtlas allows the user to construct a search query by freely demarcating one or two 2-D
-regions on sagittal slices, and then to specify either the strength of expression or the name of another gene whose expression
-pattern is to be matched. GeneAtlas differs from our Aim 1 in at least two ways. First, GeneAtlas finds only single genes,
-whereas we will also look for combinations of genes3. Second, at least for the custom spatial search, Gene Atlas appears to
-use a simple pointwise scoring method (strength of expression), whereas we will also use geometric metrics such as gradient
-similarity.
+We are aware of five existing efforts to find marker genes using spatial gene expression data using automated methods.
+GeneAtlas[1] and EMAGE [11] allow the user to construct a search query by demarcating regions and then specifying
+either the strength of expression or the name of another gene or dataset whose expression pattern is to be matched.  For
+the similarity score (match score), GeneAtlas appears to use strength of expression, and EMAGE uses Jaccard similarity,
+which is equal to the number of true pixels in the intersection of the two images, divided by the number of pixels in their
+union. Neither GeneAtlas nor EMAGE allows one to search for combinations of genes that together match a region.
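The Jaccard similarity defined in the paragraph above can be sketched in a few lines. This is a minimal illustration and not EMAGE's actual code: the function name, the flattened boolean pixel lists, and the convention for two empty images are all assumptions made for the example.

```python
def jaccard_similarity(image_a, image_b):
    """Jaccard similarity of two boolean images given as flat pixel lists.

    Number of pixels that are true in both images, divided by the number
    of pixels that are true in at least one of them.
    """
    if len(image_a) != len(image_b):
        raise ValueError("images must have the same number of pixels")
    intersection = sum(1 for a, b in zip(image_a, image_b) if a and b)
    union = sum(1 for a, b in zip(image_a, image_b) if a or b)
    if union == 0:
        # Assumed convention: two all-false images count as identical.
        return 1.0
    return intersection / union


# Hypothetical thresholded expression pattern and target region (5 pixels).
expression = [True, True, False, False, True]
region = [True, False, False, True, True]
score = jaccard_similarity(expression, region)
print(score)
```

In this toy example two pixels are true in both images and four are true in at least one, so the score is 2/4.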
@@ -107,15 +106,18 @@
-search for underexpression.  Third, Gene Finder uses a simple pointwise score4, whereas we will also use geometric scores
+search for underexpression.  Third, Gene Finder uses a simple pointwise score3, whereas we will also use geometric scores
-[11 ] todo
+[? ] looks at the mean expression level of genes within anatomical regions, and applies a Student&#8217;s t-test with Bonferroni
+correction to determine whether the mean expression level of a gene is significantly higher in the target region. Like AGEA,
+this is a pointwise measure (only the mean expression level per pixel is being analyzed), it is not being used to look for
+underexpression, and does not look for combinations of genes.
-match score is Jaccard similarity, which is equal to the number of true pixels in the intersection of the two images, divided
-by the number of pixels in their union.
-In summary, only one of the previous projects explores combinations of marker genes, and none of their publications
-compare the results obtained by using different algorithms or scoring methods.
+match score is Jaccard similarity.
+In summary, there has been fruitful work on finding marker genes; however, only one of the previous projects explores
+combinations of marker genes, and none of these publications compares the results obtained by using different algorithms or
+scoring methods.
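The pointwise test described earlier in this hunk (a Student's t-test on mean expression, with Bonferroni correction) can be sketched as follows. This is only an illustration under stated assumptions: the function names, the gene names, and every number are made up, and the p-values are taken as given rather than computed from the t distribution.

```python
import math
import statistics


def student_t_statistic(inside, outside):
    """Two-sample Student's t statistic with pooled variance.

    Positive values mean the mean expression inside the target region
    is higher than outside it.
    """
    n1, n2 = len(inside), len(outside)
    pooled_var = (
        (n1 - 1) * statistics.variance(inside)
        + (n2 - 1) * statistics.variance(outside)
    ) / (n1 + n2 - 2)
    std_err = math.sqrt(pooled_var * (1.0 / n1 + 1.0 / n2))
    return (statistics.mean(inside) - statistics.mean(outside)) / std_err


def bonferroni_significant(p_values, alpha=0.05):
    """Return the genes whose p-value passes alpha / (number of tests)."""
    threshold = alpha / len(p_values)
    return [gene for gene, p in p_values.items() if p < threshold]


# Hypothetical expression levels of one gene inside and outside a region.
t_stat = student_t_statistic([2.0, 4.0, 6.0], [1.0, 2.0, 3.0])

# Hypothetical one-sided p-values for three genes tested against one region.
significant = bonferroni_significant(
    {"geneA": 0.001, "geneB": 0.02, "geneC": 0.04}
)
print(t_stat, significant)
```

With three tests the Bonferroni threshold is 0.05 / 3, so a gene that would pass an uncorrected 0.05 cutoff (such as the hypothetical geneB) is no longer reported.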
@@ -123,9 +125,7 @@
-    3See Preliminary Data for an example of an area which cannot be marked by any single gene in the dataset, but which can be marked by a
-combination.
-    4&#8220;Expression energy ratio&#8221;, which captures overexpression.
+    3&#8220;Expression energy ratio&#8221;, which captures overexpression.
@@ -156,7 +156,8 @@
-Dimensionality reduction
+Dimensionality reduction In this section, we discuss reducing the length of the per-pixel gene expression feature
+vector. By &#8220;dimension&#8221;, we mean the dimension of this vector, not the spatial dimension of the underlying data.
@@ -165,49 +166,51 @@
-Another use for dimensionality reduction is to visualize the relationships between regions. For example, one might want
-to make a 2-D plot upon which each region is represented by a single point, and with the property that regions with similar
-gene expression profiles should be nearby on the plot (that is, the property that distance between pairs of points in the plot
-should be proportional to some measure of dissimilarity in gene expression). It is likely that no arrangement of the points on
-a 2-D plan will exactly satisfy this property &#8211; however, dimensionality reduction techniques allow one to find arrangements
-of points that approximately satisfy that property. Note that in this application, dimensionality reduction is being applied
-after clustering; whereas in the previous paragraph, we were talking about using dimensionality reduction before clustering.
+Dimensionality reduction before clustering is useful on large datasets.  First, because the number of features in the
+reduced data set is less than in the original data set, the running time of clustering algorithms may be much less. Second,
+it is thought that some clustering algorithms may give better results on reduced data.
+Another use for dimensionality reduction is to visualize the relationships between regions after clustering. For example,
+one might want to make a 2-D plot upon which each region is represented by a single point, and with the property that regions
+with similar gene expression profiles should be nearby on the plot (that is, the property that distance between pairs of points
+in the plot should be proportional to some measure of dissimilarity in gene expression). It is likely that no arrangement of
+the points on a 2-D plane will exactly satisfy this property &#8211; however, dimensionality reduction techniques allow one to find
+arrangements of points that approximately satisfy that property.  Note that in this application, dimensionality reduction
+is being applied after clustering; whereas in the previous paragraph, we were talking about using dimensionality reduction
+before clustering.
-pattern which seems to pick out a single, spatially continguous region.  Therefore, it seems likely that an anatomically
-interesting region will have multiple genes which each individually pick it out5.  This suggests the following procedure:
-_________________________________________
-   5This would seem to contradict our finding in aim 1 that some cortical areas are combinatorially coded by multiple genes.  However, it is
-possible that the currently accepted cortical maps divide the cortex into regions which are unnatural from the point of view of gene expression;
-perhaps there is some other way to map the cortex for which each region can be identified by single genes. Another possibility is that, although
+pattern which seems to pick out a single, spatially contiguous region.  Therefore, it seems likely that an anatomically
+interesting region will have multiple genes which each individually pick it out4.  This suggests the following procedure:
+The task of clustering both the instances and the features is called co-clustering, and there are a number of co-clustering
+algorithms.
-We are aware of four existing efforts to cluster spatial gene expression data.
+We are aware of five existing efforts to cluster spatial gene expression data.
-the usefulness of computational genomic anatomy.  We have run NNMF on the cortical dataset6  and while the results are
+the usefulness of computational genomic anatomy.  We have run NNMF on the cortical dataset5  and while the results are
+AGEA&#8217;s[6] hierarchical clustering was described above.  EMAGE[11] allows the user to select a dataset from among a
+large number of alternatives, or by running a search query, and then to cluster the genes within that dataset. Clustering is
+hierarchical complete linkage clustering with un-centred correlation as the similarity score.
+todo [?]
-AGEA&#8217;s[6] hierarchial clustering differs from our Aim 2 in at least two ways.  First, AGEA uses perhaps the simplest
-possible similarity score (correlation), and does no dimensionality reduction before calculating similarity. While it is possible
-that a more complex system will not do any better than this, we believe further exploration of alternative methods of scoring
-and dimensionality reduction is warranted.  Second, AGEA did not look at clusters of genes; in Preliminary Data we have
-shown that clusters of genes may identify interesting spatial regions such as cortical areas.
-[11 ] todo
-rithms or scoring methods, so it is likely that the best clustering method for this application has not yet been found.
+rithms or scoring methods, so it is likely that the best clustering method for this application has not yet been found. Also,
+none of these projects did a separate dimensionality reduction step before clustering pixels, or tried to cluster genes first in
+order to guide the clustering of pixels into spatial regions, or used co-clustering algorithms.
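The clustering scheme attributed to EMAGE in this hunk (hierarchical complete linkage with un-centred correlation) can be sketched as below. This is a naive illustration, not EMAGE's implementation: the function names, the use of 1 minus the correlation as a distance, the stopping rule (a fixed number of clusters), and the toy profiles are all assumptions.

```python
import math


def uncentered_correlation(x, y):
    """Correlation computed without subtracting the means (cosine similarity)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))
    return dot / norm if norm else 0.0


def complete_linkage(profiles, n_clusters):
    """Agglomerative clustering; returns lists of indices into `profiles`."""
    clusters = [[i] for i in range(len(profiles))]

    def distance(c1, c2):
        # Complete linkage: cluster distance is the largest pairwise distance.
        return max(
            1.0 - uncentered_correlation(profiles[i], profiles[j])
            for i in c1
            for j in c2
        )

    while len(clusters) > n_clusters:
        # Find the closest pair of clusters and merge it.
        _, i, j = min(
            (distance(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Hypothetical expression profiles of four genes over three pixels.
profiles = [[1, 2, 3], [2, 4, 6.5], [3, 1, 0], [6, 2, 0.5]]
clusters = complete_linkage(profiles, n_clusters=2)
print(clusters)
```

Because the correlation is un-centred, profiles that differ only by a scale factor (such as the first two toy genes, approximately) are treated as nearly identical.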
@@ -227,6 +230,14 @@
+_________________________________________
+   4This would seem to contradict our finding in aim 1 that some cortical areas are combinatorially coded by multiple genes.  However, it is
+possible that the currently accepted cortical maps divide the cortex into regions which are unnatural from the point of view of gene expression;
+perhaps there is some other way to map the cortex for which each region can be identified by single genes. Another possibility is that, although
+the cluster prototype fits an anatomical region, the individual genes are each somewhat different from the prototype.
+    5We ran &#8220;vanilla&#8221; NNMF, whereas the paper under discussion used a modified method.  Their main modification consisted of adding a soft
+spatial contiguity constraint.  However, on our dataset, NNMF naturally produced spatially contiguous clusters, so no additional constraint was
+needed. The paper under discussion also mentions that they tried a hierarchical variant of NNMF, which we have not yet tried.
@@ -235,19 +246,14 @@
-_________________________________________
-the cluster prototype fits an anatomical region, the individual genes are each somewhat different from the prototype.
-    6We ran &#8220;vanilla&#8221; NNMF, whereas the paper under discussion used a modified method.  Their main modification consisted of adding a soft
-spatial contiguity constraint.  However, on our dataset, NNMF naturally produced spatially contiguous clusters, so no additional constraint was
-needed. The paper under discussion also mentions that they tried a hierarchial variant of NNMF, which we have not yet tried.
-TheABA is not the only large public spatial gene expression dataset.   Other such resources include GENSAT[3],
-GenePaint[12],  its  sister  project  GeneAtlas[1],  BGEM[5],  EMAGE[11],  EurExpress7,  EADHB8,  MAMEP9,  Xenbase10,
-ZFIN[? ], Aniseed11, VisiGene12, GEISHA[?], Fruitfly.org[?], COMPARE[?] todo.  With the exception of the ABA, Gene-
-Paint, and EMAGE, most of these resources, have not (yet) extracted the expression intensity from the ISH images and
-registered the results into a single 3-D space, and only ABA and EMAGE make this form of data available for public
-download from the website. Many of these resources focus on developmental gene expression.
+The ABA is not the only large public spatial gene expression dataset.   Other such resources include GENSAT[3],
+GenePaint[12], its sister project GeneAtlas[1], BGEM[5], EMAGE[11], EurExpress6, EADHB7, MAMEP8, Xenbase9, ZFIN[?],
+Aniseed10, VisiGene11, GEISHA[?], Fruitfly.org[?], COMPARE[?] todo.  With the exception of the ABA, GenePaint, and
+EMAGE, most of these resources have not (yet) extracted the expression intensity from the ISH images and registered the
+results into a single 3-D space, and only ABA and EMAGE make this form of data available for public download from the
+website12. Many of these resources focus on developmental gene expression.
@@ -276,15 +282,14 @@
-Our project is guided by a concrete application with a well-specified criterion of success (how well we can find marker
-genes for / reproduce the layout of cortical areas), which will provide a solid basis for comparing different methods.
-_________________________________________
-   7http://www.eurexpress.org/ee/; EurExpress data is also entered into EMAGE
-    8http://www.ncl.ac.uk/ihg/EADHB/database/EADHB_database.html
-   9http://mamep.molgen.mpg.de/index.php
-  10http://xenbase.org/
-  11http://aniseed-ibdm.univ-mrs.fr/
-  12http://genome.ucsc.edu/cgi-bin/hgVisiGene ; includes data from some the other listed data sources
+___________________
+   6http://www.eurexpress.org/ee/; EurExpress data is also entered into EMAGE
+    7http://www.ncl.ac.uk/ihg/EADHB/database/EADHB_database.html
+   8http://mamep.molgen.mpg.de/index.php
+   9http://xenbase.org/
+  10http://aniseed-ibdm.univ-mrs.fr/
+  11http://genome.ucsc.edu/cgi-bin/hgVisiGene ; includes data from some of the other listed data sources
+   12without prior offline registration
@@ -292,6 +297,8 @@
+Our project is guided by a concrete application with a well-specified criterion of success (how well we can find marker
+genes for / reproduce the layout of cortical areas), which will provide a solid basis for comparing different methods.
@@ -458,6 +465,7 @@
+# Linear discriminant analysis
@@ -472,6 +480,9 @@
+# Linear discriminant analysis
+# jbt, coclustering
+# self-organizing map
--- a/grant.txt	Thu Apr 16 14:50:46 2009 -0700
+++ b/grant.txt	Fri Apr 17 12:47:51 2009 -0700
@@ -71,9 +71,11 @@
-We are aware of four existing efforts to find marker genes using spatial gene expression data using automated methods.
-
-\cite{carson_data_2005} describes GeneAtlas. GeneAtlas allows the user to construct a search query by freely demarcating one or two 2-D regions on sagittal slices, and then to specify either the strength of expression or the name of another gene whose expression pattern is to be matched. GeneAtlas differs from our Aim 1 in at least two ways. First, GeneAtlas finds only single genes, whereas we will also look for combinations of genes\footnote{See Preliminary Data for an example of an area which cannot be marked by any single gene in the dataset, but which can be marked by a combination.}. Second, at least for the custom spatial search, Gene Atlas appears to use a simple pointwise scoring method (strength of expression), whereas we will also use geometric metrics such as gradient similarity. 
+We are aware of five existing efforts to find marker genes using spatial gene expression data using automated methods.
+
+%%GeneAtlas\cite{carson_data_2005} allows the user to construct a search query by freely demarcating one or two 2-D regions on sagittal slices, and then to specify either the strength of expression or the name of another gene whose expression pattern is to be matched. 
+
+GeneAtlas\cite{carson_data_2005} and EMAGE \cite{venkataraman_emage_2008} allow the user to construct a search query by demarcating regions and then specifying either the strength of expression or the name of another gene or dataset whose expression pattern is to be matched. For the similarity score (match score), GeneAtlas appears to use strength of expression, and EMAGE uses Jaccard similarity, which is equal to the number of true pixels in the intersection of the two images, divided by the number of pixels in their union. Neither GeneAtlas nor EMAGE allows one to search for combinations of genes that together match a region.
@@ -91,14 +93,11 @@
-\cite{venkataraman_emage_2008} todo
-
-
-\cite{chin_genome-scale_2007} uses a Student's t-test with Bonferroni correction to determine whether a gene is overexpressed in a specific anatomical region.
-
-\cite{hemert_matching_2008} describes a technique to find combinations of marker genes to pick out an anatomical region. They use an evolutionary algorithm to evolve logical operators which combine boolean (thresholded) images in order to match a target image. Their match score is Jaccard similarity, which is equal to the number of true pixels in the intersection of the two images, divided by the number of pixels in their union.
-
-In summary, only one of the previous projects explores combinations of marker genes, and none of their publications compare the results obtained by using different algorithms or scoring methods.
+\cite{chin_genome-scale_2007} looks at the mean expression level of genes within anatomical regions, and applies a Student's t-test with Bonferroni correction to determine whether the mean expression level of a gene is significantly higher in the target region. Like AGEA, this is a pointwise measure (only the mean expression level per pixel is being analyzed), it is not being used to look for underexpression, and does not look for combinations of genes.
+
+\cite{hemert_matching_2008} describes a technique to find combinations of marker genes to pick out an anatomical region. They use an evolutionary algorithm to evolve logical operators which combine boolean (thresholded) images in order to match a target image. Their match score is Jaccard similarity.
+
+In summary, there has been fruitful work on finding marker genes; however, only one of the previous projects explores combinations of marker genes, and none of these publications compares the results obtained by using different algorithms or scoring methods.
@@ -128,11 +127,13 @@
-
+In this section, we discuss reducing the length of the per-pixel gene expression feature vector. By "dimension", we mean the dimension of this vector, not the spatial dimension of the underlying data. 
-Another use for dimensionality reduction is to visualize the relationships between regions. For example, one might want to make a 2-D plot upon which each region is represented by a single point, and with the property that regions with similar gene expression profiles should be nearby on the plot (that is, the property that distance between pairs of points in the plot should be proportional to some measure of dissimilarity in gene expression). It is likely that no arrangement of the points on a 2-D plan will exactly satisfy this property -- however, dimensionality reduction techniques allow one to find arrangements of points that approximately satisfy that property. Note that in this application, dimensionality reduction is being applied after clustering; whereas in the previous paragraph, we were talking about using dimensionality reduction before clustering.
+Dimensionality reduction before clustering is useful on large datasets. First, because the number of features in the reduced data set is less than in the original data set, the running time of clustering algorithms may be much less. Second, it is thought that some clustering algorithms may give better results on reduced data.
+
+Another use for dimensionality reduction is to visualize the relationships between regions after clustering. For example, one might want to make a 2-D plot upon which each region is represented by a single point, and with the property that regions with similar gene expression profiles should be nearby on the plot (that is, the property that distance between pairs of points in the plot should be proportional to some measure of dissimilarity in gene expression). It is likely that no arrangement of the points on a 2-D plane will exactly satisfy this property -- however, dimensionality reduction techniques allow one to find arrangements of points that approximately satisfy that property. Note that in this application, dimensionality reduction is being applied after clustering; whereas in the previous paragraph, we were talking about using dimensionality reduction before clustering.
@@ -144,10 +145,10 @@
+The task of clustering both the instances and the features is called co-clustering, and there are a number of co-clustering algorithms. 
-We are aware of four existing efforts to cluster spatial gene expression data.
-
+We are aware of five existing efforts to cluster spatial gene expression data.
@@ -156,20 +157,16 @@
-
-
-
-%% todo \cite{thompson_genomic_2008} reports that both mNNMF and hierarchial mNNMF clustering were useful, and that hierarchial recursive bifurcation gave similar results.
+%% \cite{thompson_genomic_2008} reports that both mNNMF and hierarchical mNNMF clustering were useful, and that hierarchical recursive bifurcation gave similar results.
+
+
+AGEA's\cite{ng_anatomic_2009} hierarchical clustering was described above. EMAGE\cite{venkataraman_emage_2008} allows the user to select a dataset from among a large number of alternatives, or by running a search query, and then to cluster the genes within that dataset. Clustering is hierarchical complete linkage clustering with un-centred correlation as the similarity score.
+
+todo \cite{chin_genome-scale_2007} 
-
-AGEA's\cite{ng_anatomic_2009} hierarchial clustering differs from our Aim 2 in at least two ways. First, AGEA uses perhaps the simplest possible similarity score (correlation), and does no dimensionality reduction before calculating similarity. While it is possible that a more complex system will not do any better than this, we believe further exploration of alternative methods of scoring and dimensionality reduction is warranted. Second, AGEA did not look at clusters of genes; in Preliminary Data we have shown that clusters of genes may identify interesting spatial regions such as cortical areas. 
-
-\cite{venkataraman_emage_2008} todo
-
-
-In summary, although these projects obtained clusterings, there has not been much comparison between different algorithms or scoring methods, so it is likely that the best clustering method for this application has not yet been found.
+In summary, although these projects obtained clusterings, there has not been much comparison between different algorithms or scoring methods, so it is likely that the best clustering method for this application has not yet been found. Also, none of these projects did a separate dimensionality reduction step before clustering pixels, or tried to cluster genes first in order to guide the clustering of pixels into spatial regions, or used co-clustering algorithms.
@@ -191,7 +188,7 @@
-The ABA is not the only large public spatial gene expression dataset. Other such resources include GENSAT\cite{gong_gene_2003}, GenePaint\cite{visel_genepaint.org:atlas_2004}, its sister project GeneAtlas\cite{carson_data_2005}, BGEM\cite{magdaleno_bgem:in_2006}, EMAGE\cite{venkataraman_emage_2008}, EurExpress\footnote{http://www.eurexpress.org/ee/; EurExpress data is also entered into EMAGE}, EADHB\footnote{http://www.ncl.ac.uk/ihg/EADHB/database/EADHB_database.html}, MAMEP\footnote{http://mamep.molgen.mpg.de/index.php}, Xenbase\footnote{http://xenbase.org/}, ZFIN\cite{sprague_zebrafish_2006}, Aniseed\footnote{http://aniseed-ibdm.univ-mrs.fr/}, VisiGene\footnote{http://genome.ucsc.edu/cgi-bin/hgVisiGene ; includes data from some  the other listed data sources}, GEISHA\cite{bell_geisha_2004}, Fruitfly.org\cite{tomancak_systematic_2002}, COMPARE\cite{http://compare.ibdml.univ-mrs.fr/} todo. With the exception of the ABA, GenePaint, and EMAGE, most of these resources, have not (yet) extracted the expression intensity from the ISH images and registered the results into a single 3-D space, and only ABA and EMAGE make this form of data available for public download from the website. Many of these resources focus on developmental gene expression.
+The ABA is not the only large public spatial gene expression dataset. Other such resources include GENSAT\cite{gong_gene_2003}, GenePaint\cite{visel_genepaint.org:atlas_2004}, its sister project GeneAtlas\cite{carson_data_2005}, BGEM\cite{magdaleno_bgem:in_2006}, EMAGE\cite{venkataraman_emage_2008}, EurExpress\footnote{http://www.eurexpress.org/ee/; EurExpress data is also entered into EMAGE}, EADHB\footnote{http://www.ncl.ac.uk/ihg/EADHB/database/EADHB_database.html}, MAMEP\footnote{http://mamep.molgen.mpg.de/index.php}, Xenbase\footnote{http://xenbase.org/}, ZFIN\cite{sprague_zebrafish_2006}, Aniseed\footnote{http://aniseed-ibdm.univ-mrs.fr/}, VisiGene\footnote{http://genome.ucsc.edu/cgi-bin/hgVisiGene ; includes data from some of the other listed data sources}, GEISHA\cite{bell_geisha_2004}, Fruitfly.org\cite{tomancak_systematic_2002}, COMPARE\footnote{http://compare.ibdml.univ-mrs.fr/} todo. With the exception of the ABA, GenePaint, and EMAGE, most of these resources have not (yet) extracted the expression intensity from the ISH images and registered the results into a single 3-D space, and only ABA and EMAGE make this form of data available for public download from the website\footnote{without prior offline registration}. Many of these resources focus on developmental gene expression.
@@ -411,7 +408,6 @@
-
@@ -432,6 +428,9 @@
+# jbt, coclustering
+
+# self-organizing map