diff grant.txt @ 97:1849a5bd1ce9

author bshanks@bshanks.dyndns.org
date Wed Apr 22 05:27:25 2009 -0700
parents a25a60a4bf43
children a75c226cbdd6
--- a/grant.txt Tue Apr 21 18:53:40 2009 -0700
+++ b/grant.txt Wed Apr 22 05:27:25 2009 -0700
@@ -1,9 +1,30 @@
-\documentclass{nih-blank}
+\documentclass[11pt]{nih-blank}
+
+
 %%\piname{Stevens, Charles F.}

 %%\usepackage{floatflt}
 \usepackage{wrapfig}

+%%\renewcommand{\rmdefault}{phv} %% Arial
+%%\renewcommand{\sfdefault}{phv} %% Arial
+
+%%\usepackage[T1]{fontenc}
+%%\usepackage[scaled]{uarial}
+
+%% \fontencoding{T1}
+%% \fontfamily{garamond}
+
+%% \fontseries{m}
+%% \fontshape{it}
+
+%% \fontfamily{arial}
+%% \fontsize{11}{15}
+%% \selectfont
+
+\begin{document}
+
+
 == Specific aims ==

 Massive new datasets obtained with techniques such as in situ hybridization (ISH), immunohistochemistry, in situ transgenic reporter, microarray voxelation, and others, allow the expression levels of many genes at many locations to be compared. Our goal is to develop automated methods to relate spatial variation in gene expression to anatomy. We want to find marker genes for specific anatomical regions, and also to draw new anatomical maps based on gene expression patterns. We have three specific aims:\\
@@ -463,7 +484,7 @@



-We have applied the following dimensionality reduction algorithms to reduce the dimensionality of the gene expression profile associated with each voxel: Principal Components Analysis (PCA), Simple PCA (SPCA), Multi-Dimensional Scaling (MDS), Isomap, Landmark Isomap, Laplacian eigenmaps, Local Tangent Space Alignment (LTSA), Hessian locally linear embedding, Diffusion maps, Stochastic Neighbor Embedding (SNE), Stochastic Proximity Embedding (SPE), Fast Maximum Variance Unfolding (FastMVU), Non-negative Matrix Factorization (NNMF). Space constraints prevent us from showing many of the results, but as a sample, PCA, NNMF, and landmark Isomap are shown in the first, second, and third rows of Figure \ref{dimReduc}.
+We have applied the following dimensionality reduction algorithms to reduce the dimensionality of the gene expression profile associated with each pixel: Principal Components Analysis (PCA), Simple PCA (SPCA), Multi-Dimensional Scaling (MDS), Isomap, Landmark Isomap, Laplacian eigenmaps, Local Tangent Space Alignment (LTSA), Stochastic Proximity Embedding (SPE), Fast Maximum Variance Unfolding (FastMVU), Non-negative Matrix Factorization (NNMF). Space constraints prevent us from showing many of the results, but as a sample, PCA, NNMF, and landmark Isomap are shown in the first, second, and third rows of Figure \ref{dimReduc}.

 After applying the dimensionality reduction, we ran clustering algorithms on the reduced data. To date we have tried k-means and spectral clustering. The results of k-means after PCA, NNMF, and landmark Isomap are shown in the last row of Figure \ref{dimReduc}. To compare, the leftmost picture on the bottom row of Figure \ref{dimReduc} shows some of the major subdivisions of cortex. These results clearly show that different dimensionality reduction techniques capture different aspects of the data and lead to different clusterings, indicating the utility of our proposal to produce a detailed comparion of these techniques as applied to the domain of genomic anatomy.

@@ -475,7 +496,7 @@
 \label{geneClusters}\end{wrapfigure}

 \vspace{0.3cm}**Many areas are captured by clusters of genes**
-We also clustered the genes using gradient similarity to see if the spatial regions defined by any clusters matched known anatomical regions. Figure \ref{geneClusters} shows, for ten sample gene clusters, each cluster's average expression pattern, compared to a known anatomical boundary. This suggests that it is worth attempting to cluster genes, and then to use the results to cluster voxels.
+We also clustered the genes using gradient similarity to see if the spatial regions defined by any clusters matched known anatomical regions. Figure \ref{geneClusters} shows, for ten sample gene clusters, each cluster's average expression pattern, compared to a known anatomical boundary. This suggests that it is worth attempting to cluster genes, and then to use the results to cluster pixels.



@@ -504,10 +525,8 @@

 === Develop algorithms that find genetic markers for anatomical regions ===

-%%\vspace{0.3cm}**Scoring measures and feature selection**
-
+\vspace{0.3cm}**Scoring measures and feature selection**
 %%We will develop scoring methods for evaluating how good individual genes are at marking areas. We will compare pointwise, geometric, and information-theoretic measures. We already developed one entirely new scoring method (gradient similarity), but we may develop more. Scoring measures that we will explore will include the L1 norm, correlation, expression energy ratio, conditional entropy, gradient similarity, Jaccard similarity, Dice similarity, Hough transform, and statistical tests such as Hotelling's T-square test (a multivariate generalization of Student's t-test), ANOVA, and a multivariate version of the Mann-Whitney U test (a non-parametric test).
-
 We will develop scoring methods for evaluating how good individual genes are at marking areas. We will compare pointwise, geometric, and information-theoretic measures. We already developed one entirely new scoring method (gradient similarity), but we may develop more. Scoring measures that we will explore will include the L1 norm, correlation, expression energy ratio, conditional entropy, gradient similarity, Jaccard similarity, Dice similarity, Hough transform, and statistical tests such as Student's t-test, and the Mann-Whitney U test (a non-parametric test). In addition, any classifier induces a scoring measure on genes by taking the prediction error when using that gene to predict the target.

 Using some combination of these measures, we will develop a procedure to find single marker genes for anatomical regions: for each cortical area, we will rank the genes by their ability to delineate each area. We will quantitatively compare the list of single genes generated by our method to the lists generated by previous methods which are mentioned in Aim 1 Related Work.
@@ -519,50 +538,55 @@

 Since errors of displacement and of shape may cause genes and target areas to match less than they should, we will consider the robustness of feature selection methods in the presence of error. Some of these methods, such as the Hough transform, are designed to be resistant in the presence of error, but many are not. We will consider extensions to scoring measures that may improve their robustness; for example, a wrapper that runs a scoring method on small displacements and distortions of the data adds robustness to registration error at the expense of computation time.

-An area may be difficult to identify because the boundaries are misdrawn in the atlas, or because the shape of the natural domain of gene expression corresponding to the area is different from the shape of the area as recognized by anatomists. We will extend our procedure to handle difficult areas by combining areas or redrawing their boundaries. We will develop extensions to our procedure which (a) detect when a difficult area could be fit if its boundary were redrawn slightly, and (b) detect when a difficult area could be combined with adjacent areas to create a larger area which can be fit.
+An area may be difficult to identify because the boundaries are misdrawn in the atlas, or because the shape of the natural domain of gene expression corresponding to the area is different from the shape of the area as recognized by anatomists. We will extend our procedure to handle difficult areas by combining areas or redrawing their boundaries. We will develop extensions to our procedure which (a) detect when a difficult area could be fit if its boundary were redrawn slightly\footnote{Not just any redrawing is acceptable, only those which appear to be justified as a natural spatial domain of gene expression by multiple sources of evidence. Interestingly, the need to detect "natural spatial domains of gene expression" in a data-driven fashion means that the methods of Aim 2 might be useful in achieving Aim 1, as well -- particularly discriminative dimensionality reduction.}, and (b) detect when a difficult area could be combined with adjacent areas to create a larger area which can be fit.

 A future publication on the method that we develop in Aim 1 will review the scoring measures and quantitatively compare their performance in order to provide a foundation for future research of methods of marker gene finding. We will measure the robustness of the scoring measures as well as their absolute performance on our dataset.

 \vspace{0.3cm}**Classifiers**
-
 We will explore and compare different classifiers. As noted above, this activity is not separate from the previous one, because some supervised learning algorithms include feature selection, and any classifier can be combined with a stepwise wrapper for use as a feature selection method. We will explore logistic regression (including spatial models\cite{paciorek_computational_2007}), decision trees\footnote{Actually, we have already begun to explore decision trees. For each cortical area, we have used the C4.5 algorithm to find a decision tree for that area. We achieved good classification accuracy on our training set, but the number of genes that appeared in each tree was too large. We plan to implement a pruning procedure to generate trees that use fewer genes.}, sparse SVMs, generative mixture models (including naive bayes), kernel density estimation, instance-based learning methods (such as k-nearest neighbor), genetic algorithms, and artificial neural networks.

-\vspace{0.3cm}**Application to cortical areas**
-
-
-
-# confirm with EMAGE, GeneAtlas, GENSAT, etc, to fight overfitting, two hemis
-
-
-\vspace{0.3cm}**Develop algorithms to suggest a division of a structure into anatomical parts**
-
-# Explore dimensionality reduction algorithms applied to pixels: including TODO
-# Explore dimensionality reduction algorithms applied to genes: including TODO
-# Explore clustering algorithms applied to pixels: including TODO
-# Explore clustering algorithms applied to genes: including gene shaving\cite{hastie_gene_2000}, TODO
-# Develop an algorithm to use dimensionality reduction and/or hierarchial clustering to create anatomical maps
-# Run this algorithm on the cortex: present a hierarchial, genoarchitectonic map of the cortex
-
-# Linear discriminant analysis
-
-# jbt, coclustering
-
-# self-organizing map
-
-# Linear discriminant analysis
-
-
-# compare using clustering scores
-
-# multivariate gradient similarity
-
-# deep belief nets
-
-
-
-\vspace{0.3cm}**Apply these algorithms to the cortex**
-
-Using the methods developed in Aim 1, we will present, for each cortical area, a short list of markers to identify that area; and we will also present lists of "panels" of genes that can be used to delineate many areas at once. Using the methods developed in Aim 2, we will present one or more hierarchial cortical maps. We will identify and explain how the statistical structure in the gene expression data led to any unexpected or interesting features of these maps, and we will provide biological hypotheses to interpret any new cortical areas, or groupings of areas, which are discovered.
+
+
+=== Develop algorithms to suggest a division of a structure into anatomical parts ===
+
+\vspace{0.3cm}**Explore dimensionality reduction on gene expression profiles**
+We have already described the application of ten dimensionality reduction algorithms for the purpose of replacing the gene expression profiles, which are vectors of about 4000 gene expression levels, with a smaller number of features. We plan to further explore and interpret these results, as well as to apply other unsupervised learning algorithms, including independent components analysis, self-organizing maps, and generative models such as deep Boltzmann machines. We will explore ways to quantitatively compare the relevance of the different dimensionality reduction methods for identifying cortical areal boundaries.
+
+\vspace{0.3cm}**Explore dimensionality reduction on pixels**
+Instead of applying dimensionality reduction to the gene expression profiles, the same techniques can be applied instead to the pixels\footnote{Consider a matrix whose rows represent pixel locations, and whose columns represent genes. An entry in this matrix represents the gene expression level at a given pixel. One can look at this matrix as a collection of pixels, each corresponding to a vector of many gene expression levels; or one can look at it as a collection of genes, each corresponding to a vector giving that gene's expression at each pixel. Similarly, dimensionality reduction can be used to replace a large number of genes with a small number of features, or it can be used to replace a large number of pixels with a small number of features.}. It is possible that the features generated in this way by some dimensionality reduction techniques will directly correspond to interesting spatial regions.
+
+
+\vspace{0.3cm}**Explore clustering and segmentation algorithms on pixels**
+We will explore clustering and segmentation algorithms in order to segment the pixels into regions. We will explore k-means, spectral clustering, gene shaving\cite{hastie_gene_2000}, recursive division clustering, multivariate generalizations of edge detectors, multivariate generalizations of watershed transformations, region growing, active contours, graph partitioning methods, and recursive agglomerative clustering with various linkage functions. These methods can be combined with dimensionality reduction.
+
+\vspace{0.3cm}**Explore clustering on genes**
+We have already shown that the procedure of clustering genes according to gradient similarity, and then creating an averaged prototype of each cluster's expression pattern, yields some spatial patterns which match cortical areas. We will further explore the clustering of genes.
+
+In addition to using the cluster expression prototypes directly to identify spatial regions, this might be useful as a component of dimensionality reduction. For example, one could imagine clustering similar genes and then replacing their expression levels with a single average expression level, thereby removing some redundancy from the gene expression profiles. One could then perform clustering on pixels (possibly after a second dimensionality reduction step) in order to identify spatial regions. It remains to be seen whether removal of redundancy would help or hurt the ultimate goal of identifying interesting spatial regions.
+
+\vspace{0.3cm}**Explore co-clustering**
+There are some algorithms which simultaineously incorporate clustering on instances and on features (in our case, genes and pixels), for example, IRM\cite{kemp_learning_2006}. These are called co-clustering or biclustering algorithms.
+
+
+
+
+\vspace{0.3cm}**Quantitatively compare different methods**
+In order to tell which method is best for genomic anatomy, for each experimental method we will compare the cortical map found by unsupervised learning to a cortical map derived from the Allen Reference Atlas. In order to compare the experimental clustering with the reference clustering, we will explore various quantitative metrics that purport to measure how similar two clusterings are, such as Jaccard, Rand index, Fowlkes-Mallows, variation of information, Larsen, Van Dongen, and others.
+
+
+\vspace{0.3cm}**Discriminative dimensionality reduction**
+In addition to using a purely data-driven approach to identify spatial regions, it might be useful to see how well the known regions can be reconstructed from a small number of features, even if those features are chosen by using knowledge of the regions. For example, linear discriminant analysis could be used as a dimensionality reduction technique in order to identify a few features which are the best linear summary of gene expression profiles for the purpose of discriminating between regions. This reduced feature set could then be used to cluster pixels into regions. Perhaps the resulting clusters will be similar to the reference atlas, yet more faithful to natural spatial domains of gene expression than the reference atlas is.
+
+
+=== Apply the new methods to the cortex ===
+Using the methods developed in Aim 1, we will present, for each cortical area, a short list of markers to identify that area; and we will also present lists of "panels" of genes that can be used to delineate many areas at once.
+
+Because in most cases the ABA coronal dataset only contains one ISH per gene, it is possible for an unrelated combination of genes to seem to identify an area when in fact it is only coincidence. There are two ways we will validate our marker genes to guard against this. First, we will confirm that putative combinations of marker genes express the same pattern in both hemispheres. Second, we will manually validate our final results on other gene expression datasets such as EMAGE, GeneAtlas, and GENSAT.
+
+Using the methods developed in Aim 2, we will present one or more hierarchial cortical maps. We will identify and explain how the statistical structure in the gene expression data led to any unexpected or interesting features of these maps, and we will provide biological hypotheses to interpret any new cortical areas, or groupings of areas, which are discovered.
+
+
+


 %%# note: slice artifact
@@ -573,22 +597,21 @@

 == Timeline and milestones ==

-=== Finding marker genes ===
-
-* September-November 2009: Develop an automated mechanism for segmenting the cortical voxels into layers
-* November 2009 (milestone): Have completed construction of a flatmapped, cortical dataset with information for each layer
-* October 2009-April 2010: Develop scoring methods and to test them in various supervised learning frameworks. Also test out various dimensionality reduction schemes in combination with supervised learning. create or extend supervised learning frameworks which use multivariate versions of the best scoring methods.
-* January 2010 (milestone): Submit a publication on single marker genes for cortical areas
-* February-July 2010: Continue to develop scoring methods and supervised learning frameworks. Explore the best way to integrate radial profiles with supervised learning. Explore the best way to make supervised learning techniques robust against incorrect labels (i.e. when the areas drawn on the input cortical map are slightly off). Quantitatively compare the performance of different supervised learning techniques. Validate marker genes found in the ABA dataset by checking against other gene expression datasets. Create documentation and unit tests for software toolbox for Aim 1. Respond to user bug reports for Aim 1 software toolbox.
-* June 2010 (milestone): Submit a paper describing a method fulfilling Aim 1. Release toolbox.
-* July 2010 (milestone): Submit a paper describing combinations of marker genes for each cortical area, and a small number of marker genes that can, in combination, define most of the areas at once
-
-=== Revealing new ways to parcellate a structure into regions ===
-* June 2010-March 2011: Explore dimensionality reduction algorithms for Aim 2. Explore standard hierarchial clustering algorithms, used in combination with dimensionality reduction, for Aim 2. Explore co-clustering algorithms. Think about how radial profile information can be used for Aim 2. Adapt clustering algorithms to use radial profile information. Quantitatively compare the performance of different dimensionality reduction and clustering techniques. Quantitatively compare the value of different flatmapping methods and ways of representing radial profiles.
-* March 2011 (milestone): Submit a paper describing a method fulfilling Aim 2. Release toolbox.
-* February-May 2011: Using the methods developed for Aim 2, explore the genomic anatomy of the cortex. If new ways of organizing the cortex into areas are discovered, read the literature and talk to people to learn about research related to interpreting our results. Create documentation and unit tests for software toolbox for Aim 2. Respond to user bug reports for Aim 2 software toolbox.
-* May 2011 (milestone): Submit a paper on the genomic anatomy of the cortex, using the methods developed in Aim 2
-* May-August 2011: Revisit Aim 1 to see if what was learned during Aim 2 can improve the methods for Aim 1. Follow up on responses to our papers. Possibly submit another paper.
+\vspace{0.3cm}**Finding marker genes**
+\\ **September-November 2009**: Develop an automated mechanism for segmenting the cortical voxels into layers
+\\ **November 2009 (milestone)**: Have completed construction of a flatmapped, cortical dataset with information for each layer
+\\ **October 2009-April 2010**: Develop scoring methods and to test them in various supervised learning frameworks. Also test out various dimensionality reduction schemes in combination with supervised learning. create or extend supervised learning frameworks which use multivariate versions of the best scoring methods.
+\\ **January 2010 (milestone)**: Submit a publication on single marker genes for cortical areas
+\\ **February-July 2010**: Continue to develop scoring methods and supervised learning frameworks. Explore the best way to integrate radial profiles with supervised learning. Explore the best way to make supervised learning techniques robust against incorrect labels (i.e. when the areas drawn on the input cortical map are slightly off). Quantitatively compare the performance of different supervised learning techniques. Validate marker genes found in the ABA dataset by checking against other gene expression datasets. Create documentation and unit tests for software toolbox for Aim 1. Respond to user bug reports for Aim 1 software toolbox.
+\\ **June 2010 (milestone)**: Submit a paper describing a method fulfilling Aim 1. Release toolbox.
+\\ **July 2010 (milestone)**: Submit a paper describing combinations of marker genes for each cortical area, and a small number of marker genes that can, in combination, define most of the areas at once
+
+\vspace{0.3cm}**Revealing new ways to parcellate a structure into regions**
+\\ **June 2010-March 2011**: Explore dimensionality reduction algorithms for Aim 2. Explore standard hierarchial clustering algorithms, used in combination with dimensionality reduction, for Aim 2. Explore co-clustering algorithms. Think about how radial profile information can be used for Aim 2. Adapt clustering algorithms to use radial profile information. Quantitatively compare the performance of different dimensionality reduction and clustering techniques. Quantitatively compare the value of different flatmapping methods and ways of representing radial profiles.
+\\ **March 2011 (milestone)**: Submit a paper describing a method fulfilling Aim 2. Release toolbox.
+\\ **February-May 2011**: Using the methods developed for Aim 2, explore the genomic anatomy of the cortex. If new ways of organizing the cortex into areas are discovered, read the literature and talk to people to learn about research related to interpreting our results. Create documentation and unit tests for software toolbox for Aim 2. Respond to user bug reports for Aim 2 software toolbox.
+\\ **May 2011 (milestone)**: Submit a paper on the genomic anatomy of the cortex, using the methods developed in Aim 2
+\\ **May-August 2011**: Revisit Aim 1 to see if what was learned during Aim 2 can improve the methods for Aim 1. Follow up on responses to our papers. Possibly submit another paper.

 \newpage

@@ -603,3 +626,4 @@



+\end{document}
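
Illustrative note on the "Quantitatively compare different methods" paragraph added in this changeset: the paragraph proposes comparing an experimental clustering of pixels against a reference clustering with measures such as the Rand index, Jaccard, and Fowlkes-Mallows. The following is a minimal sketch of the pair-counting versions of those three measures, not code from the grant or its toolbox. The function names (`pair_counts`, `rand_index`, `jaccard_index`, `fowlkes_mallows`) and the toy label lists are hypothetical, and each clustering is assumed to be given as one integer label per pixel.

```python
from itertools import combinations
from math import sqrt


def pair_counts(labels_a, labels_b):
    """Classify every pair of pixels by whether each clustering groups it together.

    Returns (both, only_a, only_b, neither). Hypothetical helper for illustration.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both clusterings must label the same pixels")
    both = only_a = only_b = neither = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        if same_a and same_b:
            both += 1
        elif same_a:
            only_a += 1
        elif same_b:
            only_b += 1
        else:
            neither += 1
    return both, only_a, only_b, neither


def rand_index(labels_a, labels_b):
    """Fraction of pixel pairs on which the two clusterings agree."""
    both, only_a, only_b, neither = pair_counts(labels_a, labels_b)
    total = both + only_a + only_b + neither
    return (both + neither) / total if total else 1.0


def jaccard_index(labels_a, labels_b):
    """Pairs grouped together by both, over pairs grouped together by either."""
    both, only_a, only_b, _ = pair_counts(labels_a, labels_b)
    denom = both + only_a + only_b
    return both / denom if denom else 1.0


def fowlkes_mallows(labels_a, labels_b):
    """Geometric mean of pairwise precision and recall."""
    both, only_a, only_b, _ = pair_counts(labels_a, labels_b)
    denom = sqrt((both + only_a) * (both + only_b))
    return both / denom if denom else 0.0


if __name__ == "__main__":
    # Toy data: a "reference atlas" labelling and an "experimental" labelling
    # of four pixels. Cluster ids themselves carry no meaning, only groupings.
    reference = [0, 0, 1, 1]
    experimental = [0, 1, 1, 1]
    print(rand_index(reference, experimental))
    print(jaccard_index(reference, experimental))
    print(fowlkes_mallows(reference, experimental))
```

All three measures depend only on which pixels are grouped together, so renaming the clusters of either map leaves the scores unchanged. The pairwise loop is quadratic in the number of pixels; a contingency-table formulation would be the usual choice for a full cortical map.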