cg

diff grant.txt @ 92:b4b79f107b2a

.
author bshanks@bshanks.dyndns.org
date Tue Apr 21 14:28:12 2009 -0700
parents 7c5d98f0cd5a
children 9f36acf8d9a8
--- a/grant.txt Tue Apr 21 06:11:15 2009 -0700
+++ b/grant.txt Tue Apr 21 14:28:12 2009 -0700
@@ -231,7 +231,27 @@
 
 The method developed in aim (2) will provide a genoarchitectonic viewpoint that will contribute to the creation of a better map. The development of present-day cortical maps was driven by the application of histological stains. If a different set of stains had been available which identified a different set of features, then today's cortical maps may have come out differently. It is likely that there are many repeated, salient spatial patterns in the gene expression which have not yet been captured by any stain. Therefore, cortical anatomy needs to incorporate what we can learn from looking at the patterns of gene expression.
 
-While we do not here propose to analyze human gene expression data, it is conceivable that the methods we propose to develop could be used to suggest modifications to the human cortical map as well. In fact, the methods we will develop will be applicable to other datasets beyond the brain. We will provide an open-source toolbox to allow other researchers to easily use our methods. With these methods, researchers with gene expression for any area of the body will be able to efficiently find marker genes for anatomical regions, or to use gene expression to discover new anatomical patterning. As described above, marker genes have a variety of uses in the development of drugs and experimental manipulations, and in the anatomical characterization of tissue samples. The discovery of new ways to carve up anatomical structures into regions will widely impact all areas of biology.
+While we do not here propose to analyze human gene expression data, it
+is conceivable that the methods we propose to develop could be used to
+suggest modifications to the human cortical map as well. In fact, the
+methods we will develop will be applicable to other datasets beyond
+the brain. We will provide an open-source toolbox to allow other
+researchers to easily use our methods. With these methods, researchers
+with gene expression for any area of the body will be able to
+efficiently find marker genes for anatomical regions, or to use gene
+expression to discover new anatomical patterning. As described above,
+marker genes have a variety of uses in the development of drugs and
+experimental manipulations, and in the anatomical characterization of
+tissue samples. The discovery of new ways to carve up anatomical
+structures into regions may lead to the discovery of
+new anatomical subregions in various structures, which will widely
+impact all areas of biology.
+
+Although our particular application involves the 3D spatial
+distribution of gene expression, we anticipate that the methods
+developed in aims (1) and (2) will not be limited to gene expression
+data, but rather will generalize to any sort of
+high-dimensional data over points located in a low-dimensional space.
 
 
 
@@ -431,19 +451,19 @@
 
 === Data-driven redrawing of the cortical map ===
 
+
+
+
+We have applied the following dimensionality reduction algorithms to reduce the dimensionality of the gene expression profile associated with each voxel: Principal Components Analysis (PCA), Simple PCA (SPCA), Multi-Dimensional Scaling (MDS), Isomap, Landmark Isomap, Laplacian eigenmaps, Local Tangent Space Alignment (LTSA), Hessian locally linear embedding, Diffusion maps, Stochastic Neighbor Embedding (SNE), Stochastic Proximity Embedding (SPE), Fast Maximum Variance Unfolding (FastMVU), Non-negative Matrix Factorization (NNMF). Space constraints prevent us from showing many of the results, but as a sample, PCA, NNMF, and landmark Isomap are shown in the first, second, and third rows of Figure \ref{dimReduc}.
+
+After applying the dimensionality reduction, we ran clustering algorithms on the reduced data. To date we have tried k-means and spectral clustering. The results of k-means after PCA, NNMF, and landmark Isomap are shown in the last row of Figure \ref{dimReduc}. To compare, the leftmost picture on the bottom row of Figure \ref{dimReduc} shows some of the major subdivisions of cortex. These results clearly show that different dimensionality reduction techniques capture different aspects of the data and lead to different clusterings, indicating the utility of our proposal to produce a detailed comparion of these techniques as applied to the domain of genomic anatomy.
+
+
 \begin{wrapfigure}{L}{0.5\textwidth}\centering
 \includegraphics[scale=.2]{cosine_similarity1_rearrange_colorize.eps}
 \caption{Prototypes corresponding to sample gene clusters, clustered by gradient similarity. Region boundaries for the region that most matches each prototype are overlayed.}
 \label{geneClusters}\end{wrapfigure}
 
-
-
-We have applied the following dimensionality reduction algorithms to reduce the dimensionality of the gene expression profile associated with each voxel: Principal Components Analysis (PCA), Simple PCA (SPCA), Multi-Dimensional Scaling (MDS), Isomap, Landmark Isomap, Laplacian eigenmaps, Local Tangent Space Alignment (LTSA), Hessian locally linear embedding, Diffusion maps, Stochastic Neighbor Embedding (SNE), Stochastic Proximity Embedding (SPE), Fast Maximum Variance Unfolding (FastMVU), Non-negative Matrix Factorization (NNMF). Space constraints prevent us from showing many of the results, but as a sample, PCA, NNMF, and landmark Isomap are shown in the first, second, and third rows of Figure \ref{dimReduc}.
-
-After applying the dimensionality reduction, we ran clustering algorithms on the reduced data. To date we have tried k-means and spectral clustering. The results of k-means after PCA, NNMF, and landmark Isomap are shown in the last row of Figure \ref{dimReduc}. To compare, the leftmost picture on the bottom row of Figure \ref{dimReduc} shows some of the major subdivisions of cortex. These results clearly show that different dimensionality reduction techniques capture different aspects of the data and lead to different clusterings, indicating the utility of our proposal to produce a detailed comparion of these techniques as applied to the domain of genomic anatomy.
-
-
-
 \vspace{0.3cm}**Many areas are captured by clusters of genes**
 We also clustered the genes using gradient similarity to see if the spatial regions defined by any clusters matched known anatomical regions. Figure \ref{geneClusters} shows, for ten sample gene clusters, each cluster's average expression pattern, compared to a known anatomical boundary. This suggests that it is worth attempting to cluster genes, and then to use the results to cluster voxels.
 
@@ -453,7 +473,9 @@
 == The approach: what we plan to do ==
 
 
-\vspace{0.3cm}**Flatmap and segment cortical layers**
+%%\vspace{0.3cm}**Flatmap cortex and segment cortical layers**
+
+=== Flatmap cortex and segment cortical layers ===
 
 %%In anatomy, the manifold of interest is usually either defined by a combination of two relevant anatomical axes (todo), or by the surface of the structure (as is the case with the cortex). In the former case, the manifold of interest is a plane, but in the latter case it is curved. If the manifold is curved, there are various methods for mapping the manifold into a plane.
 
@@ -466,20 +488,35 @@
 
 We have not yet made use of radial profiles. While the radial profiles may be used "raw", for laminar structures like the cortex another strategy is to group together voxels in the same cortical layer; each surface pixel would then be associated with one expression level per gene per layer. We will develop a segmentation algorithm to automatically identify the layer boundaries.
 
-\vspace{0.3cm}**Develop algorithms that find genetic markers for anatomical regions**
-
-
-
-
-
-
-
-# Develop scoring measures for evaluating how good individual genes are at marking areas: we will compare pointwise, geometric, and information-theoretic measures.
-# Develop a procedure to find single marker genes for anatomical regions: for each cortical area, by using or combining the scoring measures developed, we will rank the genes by their ability to delineate each area.
-# Extend the procedure to handle difficult areas by using combinatorial coding: for areas that cannot be identified by any single gene, identify them with a handful of genes. We will consider both (a) algorithms that incrementally/greedily combine single gene markers into sets, such as forward stepwise regression and decision trees, and also (b) supervised learning techniques which use soft constraints to minimize the number of features, such as sparse support vector machines.
+%%\vspace{0.3cm}**Develop algorithms that find genetic markers for anatomical regions**
+%%\vspace{0.3cm}****
+
+
+=== Develop algorithms that find genetic markers for anatomical regions ===
+
+%%\vspace{0.3cm}**Scoring measures and feature selection**
+
+%%We will develop scoring methods for evaluating how good individual genes are at marking areas. We will compare pointwise, geometric, and information-theoretic measures. We already developed one entirely new scoring method (gradient similarity), but we may develop more. Scoring measures that we will explore will include the L1 norm, correlation, expression energy ratio, conditional entropy, gradient similarity, Jaccard similarity, Dice similarity, Hough transform, and statistical tests such as Hotelling's T-square test (a multivariate generalization of Student's t-test), ANOVA, and a multivariate version of the Mann-Whitney U test (a non-parametric test).
+
+We will develop scoring methods for evaluating how good individual genes are at marking areas. We will compare pointwise, geometric, and information-theoretic measures. We already developed one entirely new scoring method (gradient similarity), but we may develop more. Scoring measures that we will explore will include the L1 norm, correlation, expression energy ratio, conditional entropy, gradient similarity, Jaccard similarity, Dice similarity, Hough transform, and statistical tests such as Student's t-test, and the Mann-Whitney U test (a non-parametric test). In addition, any predictive procedure induces a scoring measure on genes by taking the prediction error when using that gene to predict the target.
+
+Using some combination of these measures, we will develop a procedure to find single marker genes for anatomical regions: for each cortical area, we will rank the genes by their ability to delineate each area.
+
+Some cortical areas have no single marker genes but can be identified by combinatorial coding. This requires multivariate scoring measures and feature selection procedures. Many of the measures, such as expression energy, gradient similarity, Jaccard, Dice, Hough, Student's t, and Mann-Whitney U are univariate. We will extend these scoring measures for use in multivariate feature selection, that is, for scoring how well combinations of genes, rather than individual genes, can distinguish a target area. There are existing multivariate forms of some of the univariate scoring measures, for example, Hotelling's T-square is a multivariate analog of Student's t.
+
+We will develop a feature selection procedure for choosing the best small set of marker genes for a given anatomical area. In addition to using the scoring measures that we develop, we will also explore (a) feature selection using a stepwise wrapper over "vanilla" predictive methods such as logistic regression, (b) predictive methods such as decision trees which incrementally/greedily combine single gene markers into sets, and (c) predictive methods which use soft constraints to minimize number of features used, such as sparse support vector machines.
+
+todo
+
+Some of these methods, such as the Hough transform, are designed to be resistant to registration error and error in the anatomical map.
+
+We will also consider extensions to scoring measures that may improve their robustness to registration error and to error in the anatomical map; for example, a wrapper that runs a scoring method on small displacements and distortions of the data adds robustness to registration error at the expense of computation time. It is possible that some areas in the anatomical map do not correspond to natural domains of gene expression.
+
 # Extend the procedure to handle difficult areas by combining or redrawing the boundaries: An area may be difficult to identify because the boundaries are misdrawn, or because it does not "really" exist as a single area, at least on the genetic level. We will develop extensions to our procedure which (a) detect when a difficult area could be fit if its boundary were redrawn slightly, and (b) detect when a difficult area could be combined with adjacent areas to create a larger area which can be fit.
 
-# Linear discriminant analysis
+
+A future publication on the method that we develop in Aim 1 will review the scoring measures and quantitatively compare their performance in order to provide a foundation for future research of methods of marker gene finding. We will measure the robustness of the scoring measures as well as their absolute performance on our dataset.
+
 
 
 \vspace{0.3cm}**Decision trees**
@@ -493,6 +530,7 @@
 
 
 
+
 \vspace{0.3cm}**Develop algorithms to suggest a division of a structure into anatomical parts**
 
 # Explore dimensionality reduction algorithms applied to pixels: including TODO
@@ -508,6 +546,7 @@
 
 # self-organizing map
 
+# Linear discriminant analysis
 
 
 # compare using clustering scores
@@ -520,7 +559,7 @@
 
 \vspace{0.3cm}**Apply these algorithms to the cortex**
 
-Using the methods developed in Aim 1, we will present, for each cortical area, a short list of markers to identify that area; and we will also present lists of "panels" of genes that can be used to delineate many areas at once. Using the methods developed in Aim 2, we will present one or more hierarchial cortical maps. We will identify and explain how the statistical structure in the gene expression data led to any unexpected or interesting features of these maps.
+Using the methods developed in Aim 1, we will present, for each cortical area, a short list of markers to identify that area; and we will also present lists of "panels" of genes that can be used to delineate many areas at once. Using the methods developed in Aim 2, we will present one or more hierarchial cortical maps. We will identify and explain how the statistical structure in the gene expression data led to any unexpected or interesting features of these maps, and we will provide biological hypotheses to interpret any new cortical areas, or groupings of areas, which are discovered.
 
 
 %%# note: slice artifact