
diff grant.txt @ 51:3ebb8f4ea921

.
author bshanks@bshanks.dyndns.org
date Fri Apr 17 12:47:51 2009 -0700
parents 0669519bc685
children 304d07e0ac94
--- a/grant.txt Thu Apr 16 14:50:46 2009 -0700
+++ b/grant.txt Fri Apr 17 12:47:51 2009 -0700
@@ -71,9 +71,11 @@
 
 As noted above, there has been much work on both supervised learning and there are many available algorithms for each. However, the algorithms require the scientist to provide a framework for representing the problem domain, and the way that this framework is set up has a large impact on performance. Creating a good framework can require creatively reconceptualizing the problem domain, and is not merely a mechanical "fine-tuning" of numerical parameters. For example, we believe that domain-specific scoring measures (such as gradient similarity, which is discussed in Preliminary Work) may be necessary in order to achieve the best results in this application.
 
-We are aware of four existing efforts to find marker genes using spatial gene expression data using automated methods.
-
-\cite{carson_data_2005} describes GeneAtlas. GeneAtlas allows the user to construct a search query by freely demarcating one or two 2-D regions on sagittal slices, and then to specify either the strength of expression or the name of another gene whose expression pattern is to be matched. GeneAtlas differs from our Aim 1 in at least two ways. First, GeneAtlas finds only single genes, whereas we will also look for combinations of genes\footnote{See Preliminary Data for an example of an area which cannot be marked by any single gene in the dataset, but which can be marked by a combination.}. Second, at least for the custom spatial search, Gene Atlas appears to use a simple pointwise scoring method (strength of expression), whereas we will also use geometric metrics such as gradient similarity.
+We are aware of five existing efforts to find marker genes using spatial gene expression data using automated methods.
+
+%%GeneAtlas\cite{carson_data_2005} allows the user to construct a search query by freely demarcating one or two 2-D regions on sagittal slices, and then to specify either the strength of expression or the name of another gene whose expression pattern is to be matched.
+
+GeneAtlas\cite{carson_data_2005} and EMAGE \cite{venkataraman_emage_2008} allow the user to construct a search query by demarcating regions and then specifing either the strength of expression or the name of another gene or dataset whose expression pattern is to be matched. For the similiarity score (match score), GeneAtlas appears to use strength of expression, and EMAGE uses Jaccard similarity, which is equal to the number of true pixels in the intersection of the two images, divided by the number of pixels in their union. Neither GeneAtlas nor EMAGE allow one to search for combinations of genes that together match a region.
 
 \cite{ng_anatomic_2009} describes AGEA, "Anatomic Gene Expression
 Atlas". AGEA has three
@@ -91,14 +93,11 @@
 
 Gene Finder is different from our Aim 1 in at least three ways. First, Gene Finder finds only single genes, whereas we will also look for combinations of genes. Second, gene finder can only use overexpression as a marker, whereas we will also search for underexpression. Third, Gene Finder uses a simple pointwise score\footnote{"Expression energy ratio", which captures overexpression.}, whereas we will also use geometric scores such as gradient similarity. The Preliminary Data section contains evidence that each of our three choices is the right one.
 
-\cite{venkataraman_emage_2008} todo
-
-
-\cite{chin_genome-scale_2007} uses a Student's t-test with Bonferroni correction to determine whether a gene is overexpressed in a specific anatomical region.
-
-\cite{hemert_matching_2008} describes a technique to find combinations of marker genes to pick out an anatomical region. They use an evolutionary algorithm to evolve logical operators which combine boolean (thresholded) images in order to match a target image. Their match score is Jaccard similarity, which is equal to the number of true pixels in the intersection of the two images, divided by the number of pixels in their union.
-
-In summary, only one of the previous projects explores combinations of marker genes, and none of their publications compare the results obtained by using different algorithms or scoring methods.
+\cite{chin_genome-scale_2007} looks at the mean expression level of genes within anatomical regions, and applies a Student's t-test with Bonferroni correction to determine whether the mean expression level of a gene is significantly higher in the target region. Like AGEA, this is a pointwise measure (only the mean expression level per pixel is being analyzed), it is not being used to look for underexpression, and does not look for combinations of genes.
+
+\cite{hemert_matching_2008} describes a technique to find combinations of marker genes to pick out an anatomical region. They use an evolutionary algorithm to evolve logical operators which combine boolean (thresholded) images in order to match a target image. Their match score is Jaccard similarity.
+
+In summary, there has been fruitful work on finding marker genes, however, only one of the previous projects explores combinations of marker genes, and none of these publications compare the results obtained by using different algorithms or scoring methods.
 
 
 
@@ -128,11 +127,13 @@
 
 
 \vspace{0.3cm}**Dimensionality reduction**
-
+In this section, we discuss reducing the length of the per-pixel gene expression feature vector. By "dimension", we mean the dimension of this vector, not the spatial dimension of the underlying data.
 
 Unlike aim 1, there is no externally-imposed need to select only a handful of informative genes for inclusion in the instances. However, some clustering algorithms perform better on small numbers of features. There are techniques which "summarize" a larger number of features using a smaller number of features; these techniques go by the name of feature extraction or dimensionality reduction. The small set of features that such a technique yields is called the __reduced feature set__. After the reduced feature set is created, the instances may be replaced by __reduced instances__, which have as their features the reduced feature set rather than the original feature set of all gene expression levels. Note that the features in the reduced feature set do not necessarily correspond to genes; each feature in the reduced set may be any function of the set of gene expression levels.
 
-Another use for dimensionality reduction is to visualize the relationships between regions. For example, one might want to make a 2-D plot upon which each region is represented by a single point, and with the property that regions with similar gene expression profiles should be nearby on the plot (that is, the property that distance between pairs of points in the plot should be proportional to some measure of dissimilarity in gene expression). It is likely that no arrangement of the points on a 2-D plan will exactly satisfy this property -- however, dimensionality reduction techniques allow one to find arrangements of points that approximately satisfy that property. Note that in this application, dimensionality reduction is being applied after clustering; whereas in the previous paragraph, we were talking about using dimensionality reduction before clustering.
+Dimensionality reduction before clustering is useful on large datasets. First, because the number of features in the reduced data set is less than in the original data set, the running time of clustering algorithms may be much less. Second, it is thought that some clustering algorithms may give better results on reduced data.
+
+Another use for dimensionality reduction is to visualize the relationships between regions after clustering. For example, one might want to make a 2-D plot upon which each region is represented by a single point, and with the property that regions with similar gene expression profiles should be nearby on the plot (that is, the property that distance between pairs of points in the plot should be proportional to some measure of dissimilarity in gene expression). It is likely that no arrangement of the points on a 2-D plan will exactly satisfy this property -- however, dimensionality reduction techniques allow one to find arrangements of points that approximately satisfy that property. Note that in this application, dimensionality reduction is being applied after clustering; whereas in the previous paragraph, we were talking about using dimensionality reduction before clustering.
 
 
 \vspace{0.3cm}**Clustering genes rather than voxels**
@@ -144,10 +145,10 @@
 
 Gene clusters could also be used to directly yield a clustering on instances. This is because many genes have an expression pattern which seems to pick out a single, spatially continguous region. Therefore, it seems likely that an anatomically interesting region will have multiple genes which each individually pick it out\footnote{This would seem to contradict our finding in aim 1 that some cortical areas are combinatorially coded by multiple genes. However, it is possible that the currently accepted cortical maps divide the cortex into regions which are unnatural from the point of view of gene expression; perhaps there is some other way to map the cortex for which each region can be identified by single genes. Another possibility is that, although the cluster prototype fits an anatomical region, the individual genes are each somewhat different from the prototype.}. This suggests the following procedure: cluster together genes which pick out similar regions, and then to use the more popular common regions as the final clusters. In the Preliminary Data we show that a number of anatomically recognized cortical regions, as well as some "superregions" formed by lumping together a few regions, are associated with gene clusters in this fashion.
 
+The task of clustering both the instances and the features is called co-clustering, and there are a number of co-clustering algorithms.
 
 === Related work ===
-We are aware of four existing efforts to cluster spatial gene expression data.
-
+We are aware of five existing efforts to cluster spatial gene expression data.
 
 \cite{thompson_genomic_2008} describes an analysis of the anatomy of
 the hippocampus using the ABA dataset. In addition to manual analysis,
@@ -156,20 +157,16 @@
 
 %% In addition, this paper described a visual screening of the data, specifically, a visual analysis of 6000 genes with the primary purpose of observing how the spatial pattern of their expression coincided with the regions that had been identified by NNMF. We propose to do this sort of screening automatically, which would yield an objective, quantifiable result, rather than qualitative observations.
 
-
-
-
-%% todo \cite{thompson_genomic_2008} reports that both mNNMF and hierarchial mNNMF clustering were useful, and that hierarchial recursive bifurcation gave similar results.
+%% \cite{thompson_genomic_2008} reports that both mNNMF and hierarchial mNNMF clustering were useful, and that hierarchial recursive bifurcation gave similar results.
+
+
+AGEA's\cite{ng_anatomic_2009} hierarchial clustering was described above. EMAGE\cite{venkataraman_emage_2008} allows the user to select a dataset from among a large number of alternatives, or by running a search query, and then to cluster the genes within that dataset. Clustering is hierarchial complete linkage clustering with un-centred correlation as the similarity score.
+
+todo \cite{chin_genome-scale_2007}
 
 In an interesting twist, \cite{hemert_matching_2008} applies their technique for finding combinations of marker genes for the purpose of clustering genes around a "seed gene". The way they do this is by using the pattern of expression of the seed gene as the target image, and then searching for other genes which can be combined to reproduce this pattern. Those other genes which are found are considered to be related to the seed. The same team also describes a method\cite{van_hemert_mining_2007} for finding "association rules" such as, "if this voxel is expressed in by any gene, then that voxel is probably also expressed in by the same gene". This could be useful as part of a procedure for clustering voxels.
 
-
-AGEA's\cite{ng_anatomic_2009} hierarchial clustering differs from our Aim 2 in at least two ways. First, AGEA uses perhaps the simplest possible similarity score (correlation), and does no dimensionality reduction before calculating similarity. While it is possible that a more complex system will not do any better than this, we believe further exploration of alternative methods of scoring and dimensionality reduction is warranted. Second, AGEA did not look at clusters of genes; in Preliminary Data we have shown that clusters of genes may identify interesting spatial regions such as cortical areas.
-
-\cite{venkataraman_emage_2008} todo
-
-
-In summary, although these projects obtained clusterings, there has not been much comparison between different algorithms or scoring methods, so it is likely that the best clustering method for this application has not yet been found.
+In summary, although these projects obtained clusterings, there has not been much comparison between different algorithms or scoring methods, so it is likely that the best clustering method for this application has not yet been found. Also, none of these projects did a separate dimensionality reduction step before clustering pixels, or tried to cluster genes first in order to guide the clustering of pixels into spatial regions, or used co-clustering algorithms.
 
 
 
@@ -191,7 +188,7 @@
 
 Mus musculus, the common house mouse, is thought to contain about 22,000 protein-coding genes\cite{waterston_initial_2002}. The ABA contains data on about 20,000 genes in sagittal sections, out of which over 4,000 genes are also measured in coronal sections. Our dataset is derived from only the coronal subset of the ABA, because the sagittal data does not cover the entire cortex, and also has greater registration error\cite{ng_anatomic_2009}. Genes were selected by the Allen Institute for coronal sectioning based on, "classes of known neuroscientific interest... or through post hoc identification of a marked non-ubiquitous expression pattern"\cite{ng_anatomic_2009}.
 
-The ABA is not the only large public spatial gene expression dataset. Other such resources include GENSAT\cite{gong_gene_2003}, GenePaint\cite{visel_genepaint.org:atlas_2004}, its sister project GeneAtlas\cite{carson_data_2005}, BGEM\cite{magdaleno_bgem:in_2006}, EMAGE\cite{venkataraman_emage_2008}, EurExpress\footnote{http://www.eurexpress.org/ee/; EurExpress data is also entered into EMAGE}, EADHB\footnote{http://www.ncl.ac.uk/ihg/EADHB/database/EADHB_database.html}, MAMEP\footnote{http://mamep.molgen.mpg.de/index.php}, Xenbase\footnote{http://xenbase.org/}, ZFIN\cite{sprague_zebrafish_2006}, Aniseed\footnote{http://aniseed-ibdm.univ-mrs.fr/}, VisiGene\footnote{http://genome.ucsc.edu/cgi-bin/hgVisiGene ; includes data from some the other listed data sources}, GEISHA\cite{bell_geisha_2004}, Fruitfly.org\cite{tomancak_systematic_2002}, COMPARE\cite{http://compare.ibdml.univ-mrs.fr/} todo. With the exception of the ABA, GenePaint, and EMAGE, most of these resources, have not (yet) extracted the expression intensity from the ISH images and registered the results into a single 3-D space, and only ABA and EMAGE make this form of data available for public download from the website. Many of these resources focus on developmental gene expression.
+The ABA is not the only large public spatial gene expression dataset. Other such resources include GENSAT\cite{gong_gene_2003}, GenePaint\cite{visel_genepaint.org:atlas_2004}, its sister project GeneAtlas\cite{carson_data_2005}, BGEM\cite{magdaleno_bgem:in_2006}, EMAGE\cite{venkataraman_emage_2008}, EurExpress\footnote{http://www.eurexpress.org/ee/; EurExpress data is also entered into EMAGE}, EADHB\footnote{http://www.ncl.ac.uk/ihg/EADHB/database/EADHB_database.html}, MAMEP\footnote{http://mamep.molgen.mpg.de/index.php}, Xenbase\footnote{http://xenbase.org/}, ZFIN\cite{sprague_zebrafish_2006}, Aniseed\footnote{http://aniseed-ibdm.univ-mrs.fr/}, VisiGene\footnote{http://genome.ucsc.edu/cgi-bin/hgVisiGene ; includes data from some the other listed data sources}, GEISHA\cite{bell_geisha_2004}, Fruitfly.org\cite{tomancak_systematic_2002}, COMPARE\footnote{http://compare.ibdml.univ-mrs.fr/} todo. With the exception of the ABA, GenePaint, and EMAGE, most of these resources have not (yet) extracted the expression intensity from the ISH images and registered the results into a single 3-D space, and only ABA and EMAGE make this form of data available for public download from the website\footnote{without prior offline registration}. Many of these resources focus on developmental gene expression.
 
 
 
@@ -411,7 +408,6 @@
 # Linear discriminant analysis
 
 
-
 \vspace{0.3cm}**Apply these algorithms to the cortex**
 
 # Create open source format conversion tools: we will create tools to bulk download the ABA dataset and to convert between SEV, NIFTI and MATLAB formats.
@@ -432,6 +428,9 @@
 
 # Linear discriminant analysis
 
+# jbt, coclustering
+
+# self-organizing map
 
 \newpage
 