cg
diff grant.txt @ 53:304d07e0ac94
.
author | bshanks@bshanks.dyndns.org |
---|---|
date | Sat Apr 18 16:52:41 2009 -0700 |
parents | 3ebb8f4ea921 |
children | 1a2a8d08b7c3 |
line diff
1.1 --- a/grant.txt Fri Apr 17 12:47:51 2009 -0700
1.2 +++ b/grant.txt Sat Apr 18 16:52:41 2009 -0700
1.3 @@ -3,7 +3,7 @@
1.4
1.5 == Specific aims ==
1.6
1.7 -Massive new datasets obtained with techniques such as in situ hybridization (ISH), immunohistochemistry, or in situ transgenic reporter allow the expression levels of many genes at many locations to be compared. Our goal is to develop automated methods to relate spatial variation in gene expression to anatomy. We want to find marker genes for specific anatomical regions, and also to draw new anatomical maps based on gene expression patterns. We have three specific aims:\\
1.8 +Massive new datasets obtained with techniques such as in situ hybridization (ISH), immunohistochemistry, in situ transgenic reporter, microarray voxelation, and others, allow the expression levels of many genes at many locations to be compared. Our goal is to develop automated methods to relate spatial variation in gene expression to anatomy. We want to find marker genes for specific anatomical regions, and also to draw new anatomical maps based on gene expression patterns. We have three specific aims:\\
1.9
1.10 (1) develop an algorithm to screen spatial gene expression data for combinations of marker genes which selectively target anatomical regions\\
1.11
1.12 @@ -13,7 +13,7 @@
1.13
1.14 In addition to validating the usefulness of the algorithms, the application of these methods to cerebral cortex will produce immediate benefits, because there are currently no known genetic markers for many cortical areas. The results of the project will support the development of new ways to selectively target cortical areas, and it will support the development of a method for identifying the cortical areal boundaries present in small tissue samples.
1.15
1.16 -All algorithms that we develop will be implemented in an open-source software toolkit. The toolkit, as well as the machine-readable datasets developed in aim (3), will be published and freely available for others to use.
1.17 +All algorithms that we develop will be implemented in a GPL open-source software toolkit. The toolkit, as well as the machine-readable datasets developed in aim (3), will be published and freely available for others to use.
1.18
1.19
1.20 \newpage
1.21 @@ -38,7 +38,7 @@
1.22
1.23 One class of feature selection methods assigns some sort of score to each candidate gene. The top-ranked genes are then chosen. Some scoring measures can assign a score to a set of selected genes, not just to a single gene; in this case, a dynamic procedure may be used in which features are added and subtracted from the selected set depending on how much they raise the score. Such procedures are called "stepwise" or "greedy".
1.24
1.25 -Although the classifier itself may only look at the gene expression data within each voxel before classifying that voxel, the learning algorithm which constructs the classifier may look over the entire dataset. We can categorize score-based feature selection methods depending on how the score is calculated. Often the score calculation consists of assigning a sub-score to each voxel, and then aggregating these sub-scores into a final score (the aggregation is often a sum or a sum of squares). If only information from nearby voxels is used to calculate a voxel's sub-score, then we say it is a __local scoring method__. If only information from the voxel itself is used to calculate a voxel's sub-score, then we say it is a __pointwise scoring method__.
1.26 +Although the classifier itself may only look at the gene expression data within each voxel before classifying that voxel, the learning algorithm which constructs the classifier may look over the entire dataset. We can categorize score-based feature selection methods depending on how the score is calculated. Often the score calculation consists of assigning a sub-score to each voxel, and then aggregating these sub-scores into a final score (the aggregation is often a sum, a sum of squares, or an average). If only information from nearby voxels is used to calculate a voxel's sub-score, then we say it is a __local scoring method__. If only information from the voxel itself is used to calculate a voxel's sub-score, then we say it is a __pointwise scoring method__.
1.27
1.28 Key questions when choosing a learning method are: What are the instances? What are the features? How are the features chosen? Here are four principles that outline our answers to these questions.
1.29
1.30 @@ -71,11 +71,13 @@
1.31
1.32 As noted above, there has been much work on both supervised learning and there are many available algorithms for each. However, the algorithms require the scientist to provide a framework for representing the problem domain, and the way that this framework is set up has a large impact on performance. Creating a good framework can require creatively reconceptualizing the problem domain, and is not merely a mechanical "fine-tuning" of numerical parameters. For example, we believe that domain-specific scoring measures (such as gradient similarity, which is discussed in Preliminary Work) may be necessary in order to achieve the best results in this application.
1.33
1.34 -We are aware of five existing efforts to find marker genes using spatial gene expression data using automated methods.
1.35 -
1.36 -%%GeneAtlas\cite{carson_data_2005} allows the user to construct a search query by freely demarcating one or two 2-D regions on sagittal slices, and then to specify either the strength of expression or the name of another gene whose expression pattern is to be matched.
1.37 -
1.38 -GeneAtlas\cite{carson_data_2005} and EMAGE \cite{venkataraman_emage_2008} allow the user to construct a search query by demarcating regions and then specifying either the strength of expression or the name of another gene or dataset whose expression pattern is to be matched. For the similarity score (match score), GeneAtlas appears to use strength of expression, and EMAGE uses Jaccard similarity, which is equal to the number of true pixels in the intersection of the two images, divided by the number of true pixels in their union. Neither GeneAtlas nor EMAGE allows one to search for combinations of genes that together match a region.
1.39 +We are aware of six existing efforts to find marker genes using spatial gene expression data using automated methods.
1.40 +
1.41 +%%GeneAtlas\cite{carson_digital_2005} allows the user to construct a search query by freely demarcating one or two 2-D regions on sagittal slices, and then to specify either the strength of expression or the name of another gene whose expression pattern is to be matched.
1.42 +
1.43 +\cite{lee_high-resolution_2007} mentions the possibility of constructing a spatial region for each gene, and then, for each anatomical structure of interest, computing what proportion of this structure is covered by the gene's spatial region.
1.44 +
1.45 +GeneAtlas\cite{carson_digital_2005} and EMAGE \cite{venkataraman_emage_2008} allow the user to construct a search query by demarcating regions and then specifying either the strength of expression or the name of another gene or dataset whose expression pattern is to be matched. For the similarity score (match score) between two images (in this case, the query and the gene expression images), GeneAtlas uses the sum of a weighted L1-norm distance between vectors whose components represent the number of cells within a pixel\footnote{Actually, many of these projects use quadrilaterals instead of square pixels; but we will refer to them as pixels for simplicity.} whose expression is within four discretization levels. EMAGE uses Jaccard similarity, which is equal to the number of true pixels in the intersection of the two images, divided by the number of true pixels in their union. Neither GeneAtlas nor EMAGE allows one to search for combinations of genes that define a region in concert but not separately.
1.46
1.47 \cite{ng_anatomic_2009} describes AGEA, "Anatomic Gene Expression
1.48 Atlas". AGEA has three
1.49 @@ -89,7 +91,7 @@
1.50 the shows the user how much correlation there is between the gene
1.51 expression profile of the seed voxel and every other voxel.
1.52
1.53 -* Clusters: AGEA includes a precomputed hierarchical clustering of voxels based on a recursive bifurcation algorithm with correlation as the similarity metric.
1.54 +* Clusters: will be described later
1.55
1.56 Gene Finder is different from our Aim 1 in at least three ways. First, Gene Finder finds only single genes, whereas we will also look for combinations of genes. Second, Gene Finder can only use overexpression as a marker, whereas we will also search for underexpression. Third, Gene Finder uses a simple pointwise score\footnote{"Expression energy ratio", which captures overexpression.}, whereas we will also use geometric scores such as gradient similarity. The Preliminary Data section contains evidence that each of our three choices is the right one.
1.57
1.58 @@ -160,13 +162,13 @@
1.59 %% \cite{thompson_genomic_2008} reports that both mNNMF and hierarchical mNNMF clustering were useful, and that hierarchical recursive bifurcation gave similar results.
1.60
1.61
1.62 -AGEA's\cite{ng_anatomic_2009} hierarchical clustering was described above. EMAGE\cite{venkataraman_emage_2008} allows the user to select a dataset from among a large number of alternatives, or by running a search query, and then to cluster the genes within that dataset. Clustering is hierarchical complete linkage clustering with un-centred correlation as the similarity score.
1.63 -
1.64 -todo \cite{chin_genome-scale_2007}
1.65 +AGEA\cite{ng_anatomic_2009} includes a preset hierarchical clustering of voxels based on a recursive bifurcation algorithm with correlation as the similarity metric. EMAGE\cite{venkataraman_emage_2008} allows the user to select a dataset from among a large number of alternatives, or by running a search query, and then to cluster the genes within that dataset. EMAGE clusters via hierarchical complete linkage clustering with un-centred correlation as the similarity score.
1.66 +
1.67 +\cite{chin_genome-scale_2007} clustered genes, starting out by selecting 135 genes out of 20,000 which had high variance over voxels and which were highly correlated with many other genes. They computed the matrix of (rank) correlations between pairs of these genes, and ordered the rows of this matrix as follows: "the first row of the matrix was chosen to show the strongest contrast between the highest and lowest correlation coefficient for that row. The remaining rows were then arranged in order of decreasing similarity using a least squares metric". The resulting matrix showed four clusters. For each cluster, prototypical spatial expression patterns were created by averaging the genes in the cluster. The prototypes were analyzed manually, without clustering voxels.
1.68
1.69 In an interesting twist, \cite{hemert_matching_2008} applies their technique for finding combinations of marker genes for the purpose of clustering genes around a "seed gene". The way they do this is by using the pattern of expression of the seed gene as the target image, and then searching for other genes which can be combined to reproduce this pattern. Those other genes which are found are considered to be related to the seed. The same team also describes a method\cite{van_hemert_mining_2007} for finding "association rules" such as, "if this voxel is expressed in by any gene, then that voxel is probably also expressed in by the same gene". This could be useful as part of a procedure for clustering voxels.
1.70
1.71 -In summary, although these projects obtained clusterings, there has not been much comparison between different algorithms or scoring methods, so it is likely that the best clustering method for this application has not yet been found. Also, none of these projects did a separate dimensionality reduction step before clustering pixels, or tried to cluster genes first in order to guide the clustering of pixels into spatial regions, or used co-clustering algorithms.
1.72 +In summary, although these projects obtained clusterings, there has not been much comparison between different algorithms or scoring methods, so it is likely that the best clustering method for this application has not yet been found. Also, none of these projects did a separate dimensionality reduction step before clustering pixels, none tried to cluster genes first in order to guide automated clustering of pixels into spatial regions, and none used co-clustering algorithms.
1.73
1.74
1.75
1.76 @@ -188,7 +190,7 @@
1.77
1.78 Mus musculus, the common house mouse, is thought to contain about 22,000 protein-coding genes\cite{waterston_initial_2002}. The ABA contains data on about 20,000 genes in sagittal sections, out of which over 4,000 genes are also measured in coronal sections. Our dataset is derived from only the coronal subset of the ABA, because the sagittal data does not cover the entire cortex, and also has greater registration error\cite{ng_anatomic_2009}. Genes were selected by the Allen Institute for coronal sectioning based on, "classes of known neuroscientific interest... or through post hoc identification of a marked non-ubiquitous expression pattern"\cite{ng_anatomic_2009}.
1.79
1.80 -The ABA is not the only large public spatial gene expression dataset. Other such resources include GENSAT\cite{gong_gene_2003}, GenePaint\cite{visel_genepaint.org:atlas_2004}, its sister project GeneAtlas\cite{carson_data_2005}, BGEM\cite{magdaleno_bgem:in_2006}, EMAGE\cite{venkataraman_emage_2008}, EurExpress\footnote{http://www.eurexpress.org/ee/; EurExpress data is also entered into EMAGE}, EADHB\footnote{http://www.ncl.ac.uk/ihg/EADHB/database/EADHB_database.html}, MAMEP\footnote{http://mamep.molgen.mpg.de/index.php}, Xenbase\footnote{http://xenbase.org/}, ZFIN\cite{sprague_zebrafish_2006}, Aniseed\footnote{http://aniseed-ibdm.univ-mrs.fr/}, VisiGene\footnote{http://genome.ucsc.edu/cgi-bin/hgVisiGene ; includes data from some of the other listed data sources}, GEISHA\cite{bell_geisha_2004}, Fruitfly.org\cite{tomancak_systematic_2002}, COMPARE\footnote{http://compare.ibdml.univ-mrs.fr/} todo. With the exception of the ABA, GenePaint, and EMAGE, most of these resources have not (yet) extracted the expression intensity from the ISH images and registered the results into a single 3-D space, and only ABA and EMAGE make this form of data available for public download from the website\footnote{without prior offline registration}. Many of these resources focus on developmental gene expression.
1.81 +The ABA is not the only large public spatial gene expression dataset. Other such resources include GENSAT\cite{gong_gene_2003}, GenePaint\cite{visel_genepaint.org:atlas_2004}, its sister project GeneAtlas\cite{carson_digital_2005}, BGEM\cite{magdaleno_bgem:in_2006}, EMAGE\cite{venkataraman_emage_2008}, EurExpress\footnote{http://www.eurexpress.org/ee/; EurExpress data is also entered into EMAGE}, EADHB\footnote{http://www.ncl.ac.uk/ihg/EADHB/database/EADHB_database.html}, MAMEP\footnote{http://mamep.molgen.mpg.de/index.php}, Xenbase\footnote{http://xenbase.org/}, ZFIN\cite{sprague_zebrafish_2006}, Aniseed\footnote{http://aniseed-ibdm.univ-mrs.fr/}, VisiGene\footnote{http://genome.ucsc.edu/cgi-bin/hgVisiGene ; includes data from some of the other listed data sources}, GEISHA\cite{bell_geishawhole-mount_2004}, Fruitfly.org\cite{tomancak_systematic_2002}, COMPARE\footnote{http://compare.ibdml.univ-mrs.fr/}, GXD\cite{smith_mouse_2007}, GEO\cite{barrett_ncbi_2007}\footnote{GXD and GEO contain spatial data but also non-spatial data. All GXD spatial data are also in EMAGE.}. With the exception of the ABA, GenePaint, and EMAGE, most of these resources have not (yet) extracted the expression intensity from the ISH images and registered the results into a single 3-D space, and to our knowledge only ABA and EMAGE make this form of data available for public download from the website\footnote{without prior offline registration}. Many of these resources focus on developmental gene expression.
1.82
1.83
1.84
1.85 @@ -198,7 +200,8 @@
1.86
1.87 The application of the marker gene finding algorithm to the cortex will also support the development of new neuroanatomical methods. In addition to finding markers for each individual cortical area, we will find a small panel of genes that can find many of the areal boundaries at once. This panel of marker genes will allow the development of an ISH protocol that will allow experimenters to more easily identify which anatomical areas are present in small samples of cortex.
1.88
1.89 -The method developed in aim (3) will provide a genoarchitectonic viewpoint that will contribute to the creation of a better map. The development of present-day cortical maps was driven by the application of histological stains. It is conceivable that if a different set of stains had been available which identified a different set of features, then today's cortical maps would have come out differently. Since the number of classes of stains is small compared to the number of genes, it is likely that there are many repeated, salient spatial patterns in the gene expression which have not yet been captured by any stain. Therefore, current ideas about cortical anatomy need to incorporate what we can learn from looking at the patterns of gene expression.
1.90 +The method developed in aim (2) will provide a genoarchitectonic viewpoint that will contribute to the creation of a better map. The development of present-day cortical maps was driven by the application of histological stains. It is conceivable that if a different set of stains had been available which identified a different set of features, then today's cortical maps would have come out differently. Since the number of classes of stains is small compared to the number of genes, it is likely that there are many repeated, salient spatial patterns in the gene expression which have not yet been captured by any stain. Therefore, current ideas about cortical anatomy need to incorporate what we can learn from looking at the patterns of gene expression.
1.91 +
1.92
1.93 While we do not here propose to analyze human gene expression data, it is conceivable that the methods we propose to develop could be used to suggest modifications to the human cortical map as well.
1.94
1.95 @@ -215,7 +218,6 @@
1.96 Our project is guided by a concrete application with a well-specified criterion of success (how well we can find marker genes for \begin{latex}/\end{latex} reproduce the layout of cortical areas), which will provide a solid basis for comparing different methods.
1.97
1.98
1.99 -%% todo: poster; check AGEA cortical data
1.100
1.101 \newpage
1.102
1.103 @@ -291,7 +293,7 @@
1.104
1.105 \vspace{0.3cm}**Gradient similarity provides information complementary to correlation**
1.106
1.107 -To show that gradient similarity can provide useful information that cannot be detected via pointwise analyses, consider Fig. \ref{AUDgeometry}. The top row of Fig. \ref{AUDgeometry} displays the 3 genes which most match area AUD, according to a pointwise method\footnote{For each gene, a logistic regression in which the response variable was whether or not a surface pixel was within area AUD, and the predictor variable was the value of the expression of the gene underneath that pixel. The resulting scores were used to rank the genes in terms of how well they predict area AUD.}. The bottom row displays the 3 genes which most match AUD according to a method which considers local geometry\footnote{For each gene the gradient similarity (see section \ref{gradientSim}) between (a) a map of the expression of each gene on the cortical surface and (b) the shape of area AUD, was calculated, and this was used to rank the genes.}. The pointwise method in the top row identifies genes which express more strongly in AUD than outside of it; its weakness is that this includes many areas which don't have a salient border matching the areal border. The geometric method identifies genes whose salient expression border seems to partially line up with the border of AUD; its weakness is that this includes genes which don't express over the entire area. Genes which have high rankings using both pointwise and border criteria, such as $Aph1a$ in the example, may be particularly good markers. None of these genes are, individually, a perfect marker for AUD; we deliberately chose a "difficult" area in order to better contrast pointwise with geometric methods.
1.108 +To show that gradient similarity can provide useful information that cannot be detected via pointwise analyses, consider Fig. \ref{AUDgeometry}. The top row of Fig. \ref{AUDgeometry} displays the 3 genes which most match area AUD, according to a pointwise method\footnote{For each gene, a logistic regression in which the response variable was whether or not a surface pixel was within area AUD, and the predictor variable was the value of the expression of the gene underneath that pixel. The resulting scores were used to rank the genes in terms of how well they predict area AUD.}. The bottom row displays the 3 genes which most match AUD according to a method which considers local geometry\footnote{For each gene the gradient similarity between (a) a map of the expression of each gene on the cortical surface and (b) the shape of area AUD, was calculated, and this was used to rank the genes.}. The pointwise method in the top row identifies genes which express more strongly in AUD than outside of it; its weakness is that this includes many areas which don't have a salient border matching the areal border. The geometric method identifies genes whose salient expression border seems to partially line up with the border of AUD; its weakness is that this includes genes which don't express over the entire area. Genes which have high rankings using both pointwise and border criteria, such as $Aph1a$ in the example, may be particularly good markers. None of these genes are, individually, a perfect marker for AUD; we deliberately chose a "difficult" area in order to better contrast pointwise with geometric methods.
1.109
1.110
1.111 \begin{figure}\label{AUDgeometry}
1.112 @@ -432,6 +434,11 @@
1.113
1.114 # self-organizing map
1.115
1.116 +# confirm with EMAGE, GeneAtlas, GENSAT, etc, to fight overfitting
1.117 +
1.118 +# compare using clustering scores
1.119 +
1.120 +
1.121 \newpage
1.122
1.123 \bibliographystyle{plain}
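
Note on the Jaccard similarity mentioned in this changeset (the EMAGE match score): the text defines it as the number of true pixels in the intersection of two binary images divided by the number of true pixels in their union. The following is a minimal Python sketch of that definition only. It is not EMAGE's implementation; the function name `jaccard_similarity`, the toy images, and the convention of returning 1.0 for two empty images are assumptions made for illustration.

```python
from typing import Sequence


def jaccard_similarity(a: Sequence[Sequence[int]],
                       b: Sequence[Sequence[int]]) -> float:
    """Jaccard similarity of two same-shaped binary images (hypothetical helper)."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("images must have the same shape")
    intersection = 0  # pixels that are true in both images
    union = 0         # pixels that are true in at least one image
    for row_a, row_b in zip(a, b):
        for pa, pb in zip(row_a, row_b):
            intersection += bool(pa and pb)
            union += bool(pa or pb)
    # Assumed convention: two all-false images count as a perfect match.
    return intersection / union if union else 1.0


# Toy 2x3 images: a query region and a thresholded expression pattern.
query = [[1, 1, 0],
         [0, 1, 0]]
expression = [[1, 0, 0],
              [0, 1, 1]]
score = jaccard_similarity(query, expression)
print(score)  # 2 shared true pixels / 4 true pixels in the union = 0.5
```

Because each pixel contributes independently to the two counts, this is a pointwise score in the sense used in the grant text: it does not look at neighboring pixels or at the geometry of the region border.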