diff grant.txt @ 94:e460569c21d4
| author   | bshanks@bshanks-salk.dyndns.org |
|----------|---------------------------------|
| date     | Tue Apr 21 17:35:00 2009 -0700  |
| parents  | 9f36acf8d9a8                    |
| children | a25a60a4bf43                    |
line diff
1.1 --- a/grant.txt Tue Apr 21 14:50:10 2009 -0700
1.2 +++ b/grant.txt Tue Apr 21 17:35:00 2009 -0700
1.3 @@ -14,7 +14,7 @@
1.4
1.5 (3) create a 2-D "flat map" dataset of the mouse cerebral cortex that contains a flattened version of the Allen Mouse Brain Atlas ISH data, as well as the boundaries of cortical anatomical areas. This will involve extending the functionality of Caret, an existing open-source scientific imaging program. Use this dataset to validate the methods developed in (1) and (2).\\
1.6
1.7 -Although our particular application involves the 3D spatial distribution of gene expression, we anticipate that the methods developed in aims (1) and (2) will generalize to any sort of high-dimensional data over points located in a low-dimensional space.
1.8 +Although our particular application involves the 3D spatial distribution of gene expression, we anticipate that the methods developed in aims (1) and (2) will generalize to any sort of high-dimensional data over points located in a low-dimensional space. In particular, our method could be applied to genome-wide sequencing data derived from sets of tissues and disease states.
1.9
1.10 In terms of the application of the methods to cerebral cortex, aim (1) is to go from cortical areas to marker genes, and aim (2) is to let the gene profile define the cortical areas. In addition to validating the usefulness of the algorithms, the application of these methods to cortex will produce immediate benefits, because there are currently no known genetic markers for most cortical areas. The results of the project will support the development of new ways to selectively target cortical areas, and it will support the development of a method for identifying the cortical areal boundaries present in small tissue samples.
1.11
1.12 @@ -29,11 +29,11 @@
1.13
1.14 == The Challenge and Potential impact ==
1.15
1.16 -Now we will discuss each of our three aims in turn. For each aim, we will develop a conceptual framework for thinking about the task, and we will present our strategy for solving it. Next we will discuss related work. At the conclusion of each section, we will summarize why our strategy is different from what has been done before. At the end of this section, we will describe the potential impact.
1.17 +Each of our three aims will be discussed in turn. For each aim, we will develop a conceptual framework for thinking about the task, and we will present our strategy for solving it. Next we will discuss related work. At the conclusion of each section, we will summarize why our strategy is different from what has been done before. At the end of this section, we will describe the potential impact.
1.18
1.19 === Aim 1: Given a map of regions, find genes that mark the regions ===
1.20
1.21 -\vspace{0.3cm}**Machine learning terminology** The task of looking for marker genes for known anatomical regions means that one is looking for a set of genes such that, if the expression level of those genes is known, then the locations of the regions can be inferred.
1.22 +\vspace{0.3cm}**Machine learning terminology: classifiers** The task of looking for marker genes for known anatomical regions means that one is looking for a set of genes such that, if the expression level of those genes is known, then the locations of the regions can be inferred.
1.23
1.24 %% then instead of saying that we are using gene expression to find the locations of the regions,
1.25
1.26 @@ -41,7 +41,7 @@
1.27
1.28 %%Therefore, an understanding of the relationship between the combination of their expression levels and the locations of the regions may be expressed as a function. The input to this function is a voxel, along with the gene expression levels within that voxel; the output is the regional identity of the target voxel, that is, the region to which the target voxel belongs. We call this function a __classifier__. In general, the input to a classifier is called an __instance__, and the output is called a __label__ (or a __class label__).
1.29
1.30 -If we define the regions so that they cover the entire anatomical structure to be divided, we may say that we are using gene expression to determine to which region each voxel within the structure belongs. We call this a __classification task__, because each voxel is being assigned to a class (namely, its region). An understanding of the relationship between the combination of their expression levels and the locations of the regions may be expressed as a function. The input to this function is a voxel, along with the gene expression levels within that voxel; the output is the regional identity of the target voxel, that is, the region to which the target voxel belongs. We call this function a __classifier__. In general, the input to a classifier is called an __instance__, and the output is called a __label__ (or a __class label__).
1.31 +If we define the regions so that they cover the entire anatomical structure to be subdivided, we may say that we are using gene expression in each voxel to assign that voxel to the proper area. We call this a __classification task__, because each voxel is being assigned to a class (namely, its region). An understanding of the relationship between the genes' combined expression levels and the locations of the regions may be expressed as a function. The input to this function is a voxel, along with the gene expression levels within that voxel; the output is the regional identity of the target voxel, that is, the region to which the target voxel belongs. We call this function a __classifier__. In general, the input to a classifier is called an __instance__, and the output is called a __label__ (or a __class label__).
1.32
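To make this terminology concrete, here is a minimal sketch (synthetic data; scikit-learn's logistic regression chosen only for illustration, not as our method) of training a classifier whose instances are voxels with gene expression features and whose labels are region identities:

```python
# Sketch: a classifier mapping a voxel's gene expression (instance) to a
# region label. The data and the labeling rule below are toy assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_voxels, n_genes = 1000, 20
X = rng.normal(size=(n_voxels, n_genes))   # instances: expression per voxel
y = (X[:, 0] + X[:, 3] > 0).astype(int)    # labels: region identity (toy rule)

clf = LogisticRegression().fit(X, y)       # training (learning) the classifier
predicted_regions = clf.predict(X[:5])     # classify five voxels
```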
1.33 %% The construction of the classifier is called __training__ (also __learning__), and
1.34
1.35 @@ -53,6 +53,9 @@
1.36
1.37 Although the classifier itself may only look at the gene expression data within each voxel before classifying that voxel, the algorithm which constructs the classifier may look over the entire dataset. We can categorize score-based feature selection methods depending on how the score is calculated. Often the score calculation consists of assigning a sub-score to each voxel, and then aggregating these sub-scores into a final score (the aggregation is often a sum, a sum of squares, or an average). If only information from nearby voxels is used to calculate a voxel's sub-score, then we say it is a __local scoring method__. If only information from the voxel itself is used to calculate a voxel's sub-score, then we say it is a __pointwise scoring method__.
1.38
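A hedged sketch of this sub-score/aggregation pattern on toy 2-D image data; the per-pixel agreement score and the gradient-based score below are illustrative stand-ins for pointwise and local methods, not the project's actual measures:

```python
# Sketch: pointwise vs. local scoring on a flattened 2-D image (toy data).
import numpy as np

gene = np.random.default_rng(1).random((64, 64))   # one gene's expression image
target = np.zeros((64, 64), dtype=bool)
target[20:40, 20:40] = True                        # target area mask

# Pointwise: each pixel's sub-score uses only that pixel.
pointwise_subscores = (gene > 0.5) == target       # per-pixel agreement
pointwise_score = pointwise_subscores.mean()       # aggregate (here, an average)

# Local: each pixel's sub-score uses nearby pixels (here, via gradients).
gy, gx = np.gradient(gene)
ty, tx = np.gradient(target.astype(float))
local_subscores = gx * tx + gy * ty                # gradient agreement per pixel
local_score = local_subscores.sum()                # aggregate (here, a sum)
```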
1.39 +Both gene expression data and anatomical atlases have errors, due to a variety of factors. Individual subjects have idiosyncratic anatomy. Subjects may be improperly registered to the atlas. The method used to measure gene expression may be noisy. The atlas may have errors. It is even possible that some areas in the anatomical atlas are "wrong" in that they do not have the same shape as the natural domains of gene expression to which they correspond. These sources of error can affect the displacement and the shape of both the gene expression data and the anatomical target areas. Therefore, it is important to use feature selection methods which are robust to these kinds of errors.
1.40 +
1.41 +
1.42 === Our strategy for Aim 1 ===
1.43
1.44 Key questions when choosing a learning method are: What are the instances? What are the features? How are the features chosen? Here are four principles that outline our answers to these questions.
1.45 @@ -290,19 +293,7 @@
1.46
1.47 We downloaded the ABA data and applied a mask to select only those voxels which belong to cerebral cortex. We divided the cortex into hemispheres.
1.48
1.49 -Using Caret\cite{van_essen_integrated_2001}, we created a mesh representation of the surface of the selected voxels. For each gene, for each node of the mesh, we calculated an average of the gene expression of the voxels "underneath" that mesh node. We then flattened the cortex, creating a two-dimensional mesh.
1.50 -
1.51 -We sampled the nodes of the irregular, flat mesh in order to create a regular grid of pixel values. We converted this grid into a MATLAB matrix.
1.52 -
1.53 -We manually traced the boundaries of each of 49 cortical areas from the ABA coronal reference atlas slides. We then converted these manual traces into Caret-format regional boundary data on the mesh surface. We projected the regions onto the 2-d mesh, and then onto the grid, and then we converted the region data into MATLAB format.
1.54 -
1.55 -At this point, the data are in the form of a number of 2-D matrices, all in registration, with the matrix entries representing a grid of points (pixels) over the cortical surface:
1.56 -
1.57 -
1.58 -
1.59 -* A 2-D matrix whose entries represent the regional label associated with each surface pixel
1.60 -* For each gene, a 2-D matrix whose entries represent the average expression level underneath each surface pixel
1.61 -
1.62 +Using Caret\cite{van_essen_integrated_2001}, we created a mesh representation of the surface of the selected voxels. For each gene, and for each node of the mesh, we calculated an average of the gene expression of the voxels "underneath" that mesh node. We then flattened the cortex, creating a two-dimensional mesh.
1.63
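As a rough sketch of the "average the voxels underneath each mesh node" step (a nearest-node assignment on synthetic coordinates; Caret's actual surface projection is more sophisticated than this):

```python
# Sketch: average each gene's voxel values onto the nearest mesh node.
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(2)
voxel_xyz = rng.random((5000, 3))     # cortical voxel coordinates (toy)
voxel_expr = rng.random(5000)         # one gene's expression per voxel
node_xyz = rng.random((800, 3))       # mesh node coordinates (toy)

# Assign each voxel to its nearest mesh node, then average per node.
_, nearest_node = cKDTree(node_xyz).query(voxel_xyz)
sums = np.bincount(nearest_node, weights=voxel_expr, minlength=len(node_xyz))
counts = np.bincount(nearest_node, minlength=len(node_xyz))
node_expr = sums / np.maximum(counts, 1)   # mean expression at each node
```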
1.64
1.65 \begin{wrapfigure}{L}{0.35\textwidth}\centering
1.66 @@ -316,6 +307,19 @@
1.67 \caption{The top row shows the two genes which (individually) best predict area AUD, according to logistic regression. The bottom row shows the two genes which (individually) best match area AUD, according to gradient similarity. From left to right and top to bottom, the genes are $Ssr1$, $Efcbp1$, $Ptk7$, and $Aph1a$.}
1.68 \label{AUDgeometry}\end{wrapfigure}
1.69
1.70 +We sampled the nodes of the irregular, flat mesh in order to create a regular grid of pixel values. We converted this grid into a MATLAB matrix.
1.71 +
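A sketch of resampling the irregular flat mesh onto a regular pixel grid and exporting a MATLAB matrix, with scipy's griddata and savemat assumed as stand-ins for the actual tooling:

```python
# Sketch: irregular mesh nodes -> regular grid -> MATLAB matrix (toy data).
import numpy as np
from scipy.interpolate import griddata
from scipy.io import savemat

rng = np.random.default_rng(3)
node_xy = rng.random((800, 2))        # flattened mesh node positions
node_expr = rng.random(800)           # one gene's value at each node

xs, ys = np.meshgrid(np.linspace(0, 1, 100), np.linspace(0, 1, 100))
grid = griddata(node_xy, node_expr, (xs, ys), method='linear')

savemat('gene_grid.mat', {'expr': grid})   # hand the grid off to MATLAB
```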
1.72 +We manually traced the boundaries of each of 49 cortical areas from the ABA coronal reference atlas slides. We then converted these manual traces into Caret-format regional boundary data on the mesh surface. We projected the regions onto the 2-d mesh, and then onto the grid, and then we converted the region data into MATLAB format.
1.73 +
1.74 +At this point, the data are in the form of a number of 2-D matrices, all in registration, with the matrix entries representing a grid of points (pixels) over the cortical surface:
1.75 +
1.76 +
1.77 +
1.78 +* A 2-D matrix whose entries represent the regional label associated with each surface pixel
1.79 +* For each gene, a 2-D matrix whose entries represent the average expression level underneath each surface pixel
1.80 +
1.81 +
1.82 +
1.83 We created a normalized version of the gene expression data by subtracting each gene's mean expression level (over all surface pixels) and dividing the expression level of each gene by its standard deviation.
1.84
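The normalization just described is per-gene z-scoring over surface pixels; a minimal numpy rendering (the matrix shape is an assumption for illustration):

```python
# Sketch: subtract each gene's mean and divide by its standard deviation,
# computed over all surface pixels.
import numpy as np

expr = np.random.default_rng(4).random((200, 4000))   # genes x surface pixels
normalized = (expr - expr.mean(axis=1, keepdims=True)) \
             / expr.std(axis=1, keepdims=True)
```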
1.85 The features and the target area are both functions on the surface pixels. They can be referred to as scalar fields over the space of surface pixels; alternately, they can be thought of as images which can be displayed on the flatmapped surface.
1.86 @@ -339,15 +343,6 @@
1.87
1.88
1.89
1.90 -\vspace{0.3cm}**Correlation**
1.91 -Recall that the instances are surface pixels, and consider the problem of attempting to classify each instance as either a member of a particular anatomical area, or not. The target area can be represented as a boolean mask over the surface pixels.
1.92 -
1.93 -One class of feature selection scoring methods contains methods which calculate some sort of "match" between each gene image and the target image. Those genes which match the best are good candidates for features.
1.94 -
1.95 -One of the simplest methods in this class is to use correlation as the match score. We calculated the correlation between each gene and each cortical area. The top row of Figure \ref{SScorrLr} shows the three genes most correlated with area SS.
1.96 -
1.97 -
1.98 -
1.99 \begin{wrapfigure}{L}{0.35\textwidth}\centering
1.100 \includegraphics[scale=.27]{MO_vs_Wwc1_jet.eps}\includegraphics[scale=.27]{MO_vs_Mtif2_jet.eps}
1.101
1.102 @@ -355,6 +350,15 @@
1.103 \caption{Upper left: $Wwc1$. Upper right: $Mtif2$. Lower left: $Wwc1$ + $Mtif2$ (each pixel's value in the lower-left panel is the sum of the corresponding pixels in the upper row).}
1.104 \label{MOcombo}\end{wrapfigure}
1.105
1.106 +\vspace{0.3cm}**Correlation**
1.107 +Recall that the instances are surface pixels, and consider the problem of attempting to classify each instance as either a member of a particular anatomical area, or not. The target area can be represented as a boolean mask over the surface pixels.
1.108 +
1.110 +One class of feature selection scoring methods calculates some sort of "match" between each gene image and the target image. Those genes which match best are good candidates for features.
1.110 +
1.111 +One of the simplest methods in this class is to use correlation as the match score. We calculated the correlation between each gene and each cortical area. The top row of Figure \ref{SScorrLr} shows the three genes most correlated with area SS.
1.112 +
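A minimal sketch of correlation as a match score (toy data; gene images flattened to vectors, the target represented as the boolean mask described above):

```python
# Sketch: rank genes by Pearson correlation between each gene image and
# the boolean target mask.
import numpy as np

rng = np.random.default_rng(5)
genes = rng.random((200, 4000))        # 200 gene images, flattened
target = rng.random(4000) > 0.7        # boolean mask over surface pixels

scores = np.array([np.corrcoef(g, target)[0, 1] for g in genes])
top3 = np.argsort(scores)[::-1][:3]    # the three best-matching genes
```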
1.113 +
1.114 +
1.115 \vspace{0.3cm}**Conditional entropy**
1.116 An information-theoretic scoring method is to find features such that, if the features (gene expression levels) are known, uncertainty about the target (the regional identity) is reduced. Entropy measures uncertainty, so what we want is to find features such that the conditional distribution of the target has minimal entropy. The distribution to which we are referring is the probability distribution over the population of surface pixels.
1.117
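A hedged sketch of a conditional-entropy score: discretize one gene's expression into quantile bins and compute H(region | gene) over the population of surface pixels (the binning scheme below is an assumption for illustration):

```python
# Sketch: conditional entropy of the region label given one gene's
# (discretized) expression level; lower = more informative gene.
import numpy as np

def conditional_entropy(gene, target, bins=4):
    edges = np.quantile(gene, np.linspace(0, 1, bins + 1)[1:-1])
    g = np.digitize(gene, edges)               # quantile-bin the expression
    h = 0.0
    for v in np.unique(g):
        mask = g == v
        p_v = mask.mean()                      # P(gene bin = v)
        p_t = np.bincount(target[mask].astype(int), minlength=2) / mask.sum()
        p_t = p_t[p_t > 0]
        h += p_v * -(p_t * np.log2(p_t)).sum() # P(v) * H(target | v)
    return h

rng = np.random.default_rng(6)
score = conditional_entropy(rng.random(4000), rng.random(4000) > 0.7)
```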
1.118 @@ -424,10 +428,10 @@
1.119
1.120
1.121 \vspace{0.3cm}**Feature selection integrated with prediction**
1.122 -As noted earlier, in general, any predictive method can be used for feature selection by running it inside a stepwise wrapper. Also, some predictive methods integrate soft constraints on number of features used. Examples of both of these will be seen in the section "Multivariate Predictive methods".
1.123 -
1.124 -
1.125 -=== Multivariate Predictive methods ===
1.126 +As noted earlier, in general, any classifier can be used for feature selection by running it inside a stepwise wrapper. Also, some learning algorithms integrate soft constraints on the number of features used. Examples of both of these will be seen in the section "Multivariate supervised learning".
1.127 +
1.128 +
1.129 +=== Multivariate supervised learning ===
1.130
1.131
1.132 \begin{wrapfigure}{L}{0.6\textwidth}\centering
1.133 @@ -504,47 +508,39 @@
1.134
1.135 %%We will develop scoring methods for evaluating how good individual genes are at marking areas. We will compare pointwise, geometric, and information-theoretic measures. We already developed one entirely new scoring method (gradient similarity), but we may develop more. Scoring measures that we will explore will include the L1 norm, correlation, expression energy ratio, conditional entropy, gradient similarity, Jaccard similarity, Dice similarity, Hough transform, and statistical tests such as Hotelling's T-square test (a multivariate generalization of Student's t-test), ANOVA, and a multivariate version of the Mann-Whitney U test (a non-parametric test).
1.136
1.137 -We will develop scoring methods for evaluating how good individual genes are at marking areas. We will compare pointwise, geometric, and information-theoretic measures. We already developed one entirely new scoring method (gradient similarity), but we may develop more. Scoring measures that we will explore will include the L1 norm, correlation, expression energy ratio, conditional entropy, gradient similarity, Jaccard similarity, Dice similarity, Hough transform, and statistical tests such as Student's t-test, and the Mann-Whitney U test (a non-parametric test). In addition, any predictive procedure induces a scoring measure on genes by taking the prediction error when using that gene to predict the target.
1.138 -
1.139 -
1.140 -
1.141 -Using some combination of these measures, we will develop a procedure to find single marker genes for anatomical regions: for each cortical area, we will rank the genes by their ability to delineate each area.
1.142 +We will develop scoring methods for evaluating how good individual genes are at marking areas. We will compare pointwise, geometric, and information-theoretic measures. We have already developed one entirely new scoring method (gradient similarity), and we may develop more. Scoring measures that we will explore include the L1 norm, correlation, expression energy ratio, conditional entropy, gradient similarity, Jaccard similarity, Dice similarity, the Hough transform, and statistical tests such as Student's t-test and the Mann-Whitney U test (a non-parametric test). In addition, any classifier induces a scoring measure on genes by taking the prediction error when using that gene to predict the target.
1.143 +
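For two of the set-overlap scores listed above, Jaccard and Dice similarity between a thresholded gene image and the target mask can be computed directly (toy data; the 0.5 threshold is an arbitrary illustration):

```python
# Sketch: Jaccard and Dice similarity between a binarized gene image and
# the target area mask.
import numpy as np

rng = np.random.default_rng(7)
gene_mask = rng.random(4000) > 0.5   # binarized gene expression
target = rng.random(4000) > 0.7      # target area mask

inter = (gene_mask & target).sum()
jaccard = inter / (gene_mask | target).sum()
dice = 2 * inter / (gene_mask.sum() + target.sum())
```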
1.144 +Using some combination of these measures, we will develop a procedure to find single marker genes for anatomical regions: for each cortical area, we will rank the genes by their ability to delineate that area. We will quantitatively compare the list of single genes generated by our method to the lists generated by the previous methods mentioned in the Aim 1 Related Work section.
1.145 +
1.146
1.147 Some cortical areas have no single marker genes but can be identified by combinatorial coding. This requires multivariate scoring measures and feature selection procedures. Many of the measures, such as expression energy, gradient similarity, Jaccard, Dice, Hough, Student's t, and Mann-Whitney U, are univariate. We will extend these scoring measures for use in multivariate feature selection, that is, for scoring how well combinations of genes, rather than individual genes, can distinguish a target area. There are existing multivariate forms of some of the univariate scoring measures; for example, Hotelling's T-square is a multivariate analog of Student's t.
1.148
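A sketch of Hotelling's T-square, the multivariate analog of Student's t mentioned above, applied to in-area versus out-of-area pixels over a small gene set (synthetic data):

```python
# Sketch: does a set of genes jointly separate in-area pixels from
# out-of-area pixels? Hotelling's T-square with a pooled covariance.
import numpy as np

def hotelling_t2(X_in, X_out):
    n1, n2 = len(X_in), len(X_out)
    d = X_in.mean(axis=0) - X_out.mean(axis=0)
    S = ((n1 - 1) * np.cov(X_in, rowvar=False) +
         (n2 - 1) * np.cov(X_out, rowvar=False)) / (n1 + n2 - 2)
    return (n1 * n2 / (n1 + n2)) * d @ np.linalg.solve(S, d)

rng = np.random.default_rng(8)
t2 = hotelling_t2(rng.normal(0.5, 1, (300, 3)),   # pixels inside the area
                  rng.normal(0.0, 1, (700, 3)))   # pixels outside the area
```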
1.149 -We will develop a feature selection procedure for choosing the best small set of marker genes for a given anatomical area. In addition to using the scoring measures that we develop, we will also explore (a) feature selection using a stepwise wrapper over "vanilla" predictive methods such as logistic regression, (b) predictive methods such as decision trees which incrementally/greedily combine single gene markers into sets, and (c) predictive methods which use soft constraints to minimize number of features used, such as sparse support vector machines.
1.150 -
1.151 -todo
1.152 -
1.153 -Some of these methods, such as the Hough transform, are designed to be resistant to registration error and error in the anatomical map.
1.154 -
1.155 -We will also consider extensions to scoring measures that may improve their robustness to registration error and to error in the anatomical map; for example, a wrapper that runs a scoring method on small displacements and distortions of the data adds robustness to registration error at the expense of computation time. It is possible that some areas in the anatomical map do not correspond to natural domains of gene expression.
1.156 -
1.157 -# Extend the procedure to handle difficult areas by combining or redrawing the boundaries: An area may be difficult to identify because the boundaries are misdrawn, or because it does not "really" exist as a single area, at least on the genetic level. We will develop extensions to our procedure which (a) detect when a difficult area could be fit if its boundary were redrawn slightly, and (b) detect when a difficult area could be combined with adjacent areas to create a larger area which can be fit.
1.158 -
1.159 +We will develop a feature selection procedure for choosing the best small set of marker genes for a given anatomical area. In addition to using the scoring measures that we develop, we will also explore (a) feature selection using a stepwise wrapper over "vanilla" classifiers such as logistic regression, (b) supervised learning methods such as decision trees which incrementally/greedily combine single gene markers into sets, and (c) supervised learning methods which use soft constraints to minimize number of features used, such as sparse support vector machines.
1.160 +
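A minimal sketch of strategy (a), a greedy forward wrapper around a "vanilla" classifier (scikit-learn's logistic regression and cross-validation used for illustration; synthetic data):

```python
# Sketch: stepwise (greedy forward) wrapper feature selection. At each
# step, add the gene that most improves cross-validated accuracy.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def forward_select(X, y, k=3):
    chosen = []
    for _ in range(k):
        remaining = [g for g in range(X.shape[1]) if g not in chosen]
        scores = [cross_val_score(LogisticRegression(), X[:, chosen + [g]],
                                  y, cv=3).mean() for g in remaining]
        chosen.append(remaining[int(np.argmax(scores))])
    return chosen   # indices of the selected marker genes

rng = np.random.default_rng(9)
X = rng.normal(size=(500, 30))
y = (X[:, 2] - X[:, 7] > 0).astype(int)   # toy target area
markers = forward_select(X, y)
```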
1.161 +Since errors of displacement and of shape may cause genes and target areas to match less well than they should, we will consider the robustness of feature selection methods in the presence of error. Some of these methods, such as the Hough transform, are designed to be robust to such errors, but many are not. We will consider extensions to scoring measures that may improve their robustness; for example, a wrapper that runs a scoring method on small displacements and distortions of the data adds robustness to registration error at the expense of computation time.
1.162 +
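The displacement wrapper described above might look like the following sketch, which re-scores small shifts of the gene image and keeps the best value (shifts only; distortions would be handled analogously):

```python
# Sketch: wrap any scoring function so that small registration errors do
# not penalize a gene; costs (2*max_shift+1)^2 extra score evaluations.
import numpy as np

def robust_score(score_fn, gene_img, target_img, max_shift=2):
    best = -np.inf
    for dy in range(-max_shift, max_shift + 1):
        for dx in range(-max_shift, max_shift + 1):
            shifted = np.roll(np.roll(gene_img, dy, axis=0), dx, axis=1)
            best = max(best, score_fn(shifted, target_img))
    return best

corr = lambda g, t: np.corrcoef(g.ravel(), t.ravel())[0, 1]
rng = np.random.default_rng(10)
score = robust_score(corr, rng.random((64, 64)), rng.random((64, 64)) > 0.7)
```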
1.163 +An area may be difficult to identify because the boundaries are misdrawn in the atlas, or because the shape of the natural domain of gene expression corresponding to the area is different from the shape of the area as recognized by anatomists. We will extend our procedure to handle difficult areas by combining areas or redrawing their boundaries. We will develop extensions to our procedure which (a) detect when a difficult area could be fit if its boundary were redrawn slightly, and (b) detect when a difficult area could be combined with adjacent areas to create a larger area which can be fit.
1.164
1.165 A future publication on the method that we develop in Aim 1 will review the scoring measures and quantitatively compare their performance in order to provide a foundation for future research of methods of marker gene finding. We will measure the robustness of the scoring measures as well as their absolute performance on our dataset.
1.166
1.167 +\vspace{0.3cm}**Classifiers**
1.168 +
1.169 +We will explore and compare different classifiers. As noted above, this activity is not separate from the previous one, because some supervised learning algorithms include feature selection, and any classifier can be combined with a stepwise wrapper for use as a feature selection method. We will explore logistic regression (including spatial models\cite{paciorek_computational_2007}), decision trees\footnote{Actually, we have already begun to explore decision trees. For each cortical area, we have used the C4.5 algorithm to find a decision tree for that area. We achieved good classification accuracy on our training set, but the number of genes that appeared in each tree was too large. We plan to implement a pruning procedure to generate trees that use fewer genes.}, sparse SVMs, generative mixture models (including naive Bayes), kernel density estimation, genetic algorithms, and artificial neural networks.
1.170 +
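As a sketch of two of the approaches named above: an L1-penalized (sparse) linear SVM, which softly limits the number of genes used, and a cost-complexity-pruned decision tree (scikit-learn's CART rather than C4.5; synthetic data):

```python
# Sketch: classifiers that themselves limit the number of genes used.
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(11)
X = rng.normal(size=(500, 30))
y = (X[:, 4] + X[:, 9] > 0).astype(int)       # toy target area

svm = LinearSVC(penalty='l1', dual=False, C=0.1).fit(X, y)
genes_used_svm = np.flatnonzero(svm.coef_[0])  # soft constraint -> sparse coefs

tree = DecisionTreeClassifier(ccp_alpha=0.01).fit(X, y)   # pruning shrinks tree
genes_used_tree = np.flatnonzero(tree.feature_importances_)
```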
1.171
1.172
1.173 \vspace{0.3cm}**Decision trees**
1.174 -todo
1.175 -
1.176 -\footnote{Already, for each cortical area, we have used the C4.5 algorithm to find a decision tree for that area. We achieved good classification accuracy on our training set, but the number of genes that appeared in each tree was too large. We plan to implement a pruning procedure to generate trees that use fewer genes}.
1.177 +
1.178
1.180 # confirm with EMAGE, GeneAtlas, GENSAT, etc., to fight overfitting; validate using the two hemispheres
1.180
1.181 -# mixture models, etc
1.182 -
1.183 -
1.184 -
1.185
1.186 \vspace{0.3cm}**Develop algorithms to suggest a division of a structure into anatomical parts**
1.187
1.188 # Explore dimensionality reduction algorithms applied to pixels: including TODO
1.189 # Explore dimensionality reduction algorithms applied to genes: including TODO
1.190 # Explore clustering algorithms applied to pixels: including TODO
1.191 -# Explore clustering algorithms applied to genes: including gene shaving, TODO
1.192 +# Explore clustering algorithms applied to genes: including gene shaving\cite{hastie_gene_2000}, TODO
1.193 # Develop an algorithm to use dimensionality reduction and/or hierarchical clustering to create anatomical maps (a minimal sketch follows this list)
1.194 # Run this algorithm on the cortex: present a hierarchical, genoarchitectonic map of the cortex
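A minimal sketch of item 5 under stated assumptions (synthetic expression matrix; PCA plus k-means as illustrative choices, where a hierarchical method would yield the genoarchitectonic hierarchy of item 6):

```python
# Sketch: reduce per-pixel gene expression vectors with PCA, then cluster
# pixels so that each cluster suggests a putative anatomical area.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(12)
expr = rng.random((4000, 200))                # surface pixels x genes (toy)

reduced = PCA(n_components=10).fit_transform(expr)
areas = KMeans(n_clusters=12, n_init=10).fit_predict(reduced)
# 'areas' assigns each surface pixel to one of 12 suggested regions; a
# hierarchical clusterer (e.g., AgglomerativeClustering) would yield a
# nested parcellation rather than a flat one.
```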
1.195