diff grant.txt @ 94:e460569c21d4
| author   | bshanks@bshanks-salk.dyndns.org |
|----------|---------------------------------|
| date     | Tue Apr 21 17:35:00 2009 -0700  |
| parents  | 9f36acf8d9a8                    |
| children | a25a60a4bf43                    |
line diff
1.1 --- a/grant.txt Tue Apr 21 14:50:10 2009 -0700
1.2 +++ b/grant.txt Tue Apr 21 17:35:00 2009 -0700
1.3 @@ -14,7 +14,7 @@
1.4
1.5 (3) create a 2-D "flat map" dataset of the mouse cerebral cortex that contains a flattened version of the Allen Mouse Brain Atlas ISH data, as well as the boundaries of cortical anatomical areas. This will involve extending the functionality of Caret, an existing open-source scientific imaging program. Use this dataset to validate the methods developed in (1) and (2).\\
1.6
1.7 -Although our particular application involves the 3D spatial distribution of gene expression, we anticipate that the methods developed in aims (1) and (2) will generalize to any sort of high-dimensional data over points located in a low-dimensional space.
1.8 +Although our particular application involves the 3D spatial distribution of gene expression, we anticipate that the methods developed in aims (1) and (2) will generalize to any sort of high-dimensional data over points located in a low-dimensional space. In particular, our method could be applied to genome-wide sequencing data derived from sets of tissues and disease states.
1.9
1.10 In terms of the application of the methods to cerebral cortex, aim (1) is to go from cortical areas to marker genes, and aim (2) is to let the gene profile define the cortical areas. In addition to validating the usefulness of the algorithms, the application of these methods to cortex will produce immediate benefits, because there are currently no known genetic markers for most cortical areas. The results of the project will support the development of new ways to selectively target cortical areas, and it will support the development of a method for identifying the cortical areal boundaries present in small tissue samples.
1.11
1.12 @@ -29,11 +29,11 @@
1.13
1.14 == The Challenge and Potential impact ==
1.15
1.16 -Now we will discuss each of our three aims in turn. For each aim, we will develop a conceptual framework for thinking about the task, and we will present our strategy for solving it. Next we will discuss related work. At the conclusion of each section, we will summarize why our strategy is different from what has been done before. At the end of this section, we will describe the potential impact.
1.17 +Each of our three aims will be discussed in turn. For each aim, we will develop a conceptual framework for thinking about the task, and we will present our strategy for solving it. Next we will discuss related work. At the conclusion of each section, we will summarize why our strategy is different from what has been done before. At the end of this section, we will describe the potential impact.
1.18
1.19 === Aim 1: Given a map of regions, find genes that mark the regions ===
1.20
1.21 -\vspace{0.3cm}**Machine learning terminology** The task of looking for marker genes for known anatomical regions means that one is looking for a set of genes such that, if the expression level of those genes is known, then the locations of the regions can be inferred.
1.22 +\vspace{0.3cm}**Machine learning terminology: classifiers** The task of looking for marker genes for known anatomical regions means that one is looking for a set of genes such that, if the expression level of those genes is known, then the locations of the regions can be inferred.
1.23
1.24 %% then instead of saying that we are using gene expression to find the locations of the regions,
1.25
1.26 @@ -41,7 +41,7 @@
1.27
1.28 %%Therefore, an understanding of the relationship between the combination of their expression levels and the locations of the regions may be expressed as a function. The input to this function is a voxel, along with the gene expression levels within that voxel; the output is the regional identity of the target voxel, that is, the region to which the target voxel belongs. We call this function a __classifier__. In general, the input to a classifier is called an __instance__, and the output is called a __label__ (or a __class label__).
1.29
1.30 -If we define the regions so that they cover the entire anatomical structure to be divided, we may say that we are using gene expression to determine to which region each voxel within the structure belongs. We call this a __classification task__, because each voxel is being assigned to a class (namely, its region). An understanding of the relationship between the combination of their expression levels and the locations of the regions may be expressed as a function. The input to this function is a voxel, along with the gene expression levels within that voxel; the output is the regional identity of the target voxel, that is, the region to which the target voxel belongs. We call this function a __classifier__. In general, the input to a classifier is called an __instance__, and the output is called a __label__ (or a __class label__).
1.31 +If we define the regions so that they cover the entire anatomical structure to be subdivided, we may say that we are using gene expression in each voxel to assign that voxel to the proper area. We call this a __classification task__, because each voxel is being assigned to a class (namely, its region). An understanding of the relationship between the genes' combined expression levels and the locations of the regions may be expressed as a function. The input to this function is a voxel, along with the gene expression levels within that voxel; the output is the regional identity of the target voxel, that is, the region to which the target voxel belongs. We call this function a __classifier__. In general, the input to a classifier is called an __instance__, and the output is called a __label__ (or a __class label__).
1.32
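To make this terminology concrete, here is a minimal sketch (synthetic data; scikit-learn's logistic regression chosen only for illustration, not as our method) of training a classifier whose instances are voxels with gene expression features and whose labels are region identities:

```python
# Sketch: a classifier mapping a voxel's gene expression (instance) to a
# region label. The data and the labeling rule below are toy assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_voxels, n_genes = 1000, 20
X = rng.normal(size=(n_voxels, n_genes))   # instances: expression per voxel
y = (X[:, 0] + X[:, 3] > 0).astype(int)    # labels: region identity (toy rule)

clf = LogisticRegression().fit(X, y)       # training (learning) the classifier
predicted_regions = clf.predict(X[:5])     # classify five voxels
```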
1.33 %% The construction of the classifier is called __training__ (also __learning__), and
1.34
1.35 @@ -53,6 +53,9 @@
1.36
1.37 Although the classifier itself may only look at the gene expression data within each voxel before classifying that voxel, the algorithm which constructs the classifier may look over the entire dataset. We can categorize score-based feature selection methods depending on how the score is calculated. Often the score calculation consists of assigning a sub-score to each voxel, and then aggregating these sub-scores into a final score (the aggregation is often a sum, a sum of squares, or an average). If only information from nearby voxels is used to calculate a voxel's sub-score, then we say it is a __local scoring method__. If only information from the voxel itself is used to calculate a voxel's sub-score, then we say it is a __pointwise scoring method__.
1.38
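A hedged sketch of this sub-score/aggregation pattern on toy 2-D image data; the per-pixel agreement score and the gradient-based score below are illustrative stand-ins for pointwise and local methods, not the project's actual measures:

```python
# Sketch: pointwise vs. local scoring on a flattened 2-D image (toy data).
import numpy as np

gene = np.random.default_rng(1).random((64, 64))   # one gene's expression image
target = np.zeros((64, 64), dtype=bool)
target[20:40, 20:40] = True                        # target area mask

# Pointwise: each pixel's sub-score uses only that pixel.
pointwise_subscores = (gene > 0.5) == target       # per-pixel agreement
pointwise_score = pointwise_subscores.mean()       # aggregate (here, an average)

# Local: each pixel's sub-score uses nearby pixels (here, via gradients).
gy, gx = np.gradient(gene)
ty, tx = np.gradient(target.astype(float))
local_subscores = gx * tx + gy * ty                # gradient agreement per pixel
local_score = local_subscores.sum()                # aggregate (here, a sum)
```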
1.39 +Both gene expression data and anatomical atlases have errors, due to a variety of factors. Individual subjects have idiosyncratic anatomy. Subjects may be improperly registered to the atlas. The method used to measure gene expression may be noisy. The atlas may have errors. It is even possible that some areas in the anatomical atlas are "wrong" in that they do not have the same shape as the natural domains of gene expression to which they correspond. These sources of error can affect the displacement and the shape of both the gene expression data and the anatomical target areas. Therefore, it is important to use feature selection methods which are robust to these kinds of errors.
1.40 +
1.41 +
1.42 === Our strategy for Aim 1 ===
1.43
1.44 Key questions when choosing a learning method are: What are the instances? What are the features? How are the features chosen? Here are four principles that outline our answers to these questions.
1.45 @@ -290,19 +293,7 @@
1.46
1.47 We downloaded the ABA data and applied a mask to select only those voxels which belong to cerebral cortex. We divided the cortex into hemispheres.
1.48
1.49 -Using Caret\cite{van_essen_integrated_2001}, we created a mesh representation of the surface of the selected voxels. For each gene, for each node of the mesh, we calculated an average of the gene expression of the voxels "underneath" that mesh node. We then flattened the cortex, creating a two-dimensional mesh.
1.50 -
1.51 -We sampled the nodes of the irregular, flat mesh in order to create a regular grid of pixel values. We converted this grid into a MATLAB matrix.
1.52 -
1.53 -We manually traced the boundaries of each of 49 cortical areas from the ABA coronal reference atlas slides. We then converted these manual traces into Caret-format regional boundary data on the mesh surface. We projected the regions onto the 2-d mesh, and then onto the grid, and then we converted the region data into MATLAB format.
1.54 -
1.55 -At this point, the data are in the form of a number of 2-D matrices, all in registration, with the matrix entries representing a grid of points (pixels) over the cortical surface:
1.56 -
1.57 -
1.58 -
1.59 -* A 2-D matrix whose entries represent the regional label associated with each surface pixel
1.60 -* For each gene, a 2-D matrix whose entries represent the average expression level underneath each surface pixel
1.61 -
1.62 +Using Caret\cite{van_essen_integrated_2001}, we created a mesh representation of the surface of the selected voxels. For each gene, and for each node of the mesh, we calculated an average of the gene expression of the voxels "underneath" that mesh node. We then flattened the cortex, creating a two-dimensional mesh.
1.63
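As a rough sketch of the "average the voxels underneath each mesh node" step (a nearest-node assignment on synthetic coordinates; Caret's actual surface projection is more sophisticated than this):

```python
# Sketch: average each gene's voxel values onto the nearest mesh node.
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(2)
voxel_xyz = rng.random((5000, 3))     # cortical voxel coordinates (toy)
voxel_expr = rng.random(5000)         # one gene's expression per voxel
node_xyz = rng.random((800, 3))       # mesh node coordinates (toy)

# Assign each voxel to its nearest mesh node, then average per node.
_, nearest_node = cKDTree(node_xyz).query(voxel_xyz)
sums = np.bincount(nearest_node, weights=voxel_expr, minlength=len(node_xyz))
counts = np.bincount(nearest_node, minlength=len(node_xyz))
node_expr = sums / np.maximum(counts, 1)   # mean expression at each node
```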
1.64
1.65 \begin{wrapfigure}{L}{0.35\textwidth}\centering
1.66 @@ -316,6 +307,19 @@
1.67 \caption{The top row shows the two genes which (individually) best predict area AUD, according to logistic regression. The bottom row shows the two genes which (individually) best match area AUD, according to gradient similarity. From left to right and top to bottom, the genes are $Ssr1$, $Efcbp1$, $Ptk7$, and $Aph1a$.}
1.68 \label{AUDgeometry}\end{wrapfigure}
1.69
1.70 +We sampled the nodes of the irregular, flat mesh in order to create a regular grid of pixel values. We converted this grid into a MATLAB matrix.
1.71 +
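A sketch of resampling the irregular flat mesh onto a regular pixel grid and exporting a MATLAB matrix, with scipy's griddata and savemat assumed as stand-ins for the actual tooling:

```python
# Sketch: irregular mesh nodes -> regular grid -> MATLAB matrix (toy data).
import numpy as np
from scipy.interpolate import griddata
from scipy.io import savemat

rng = np.random.default_rng(3)
node_xy = rng.random((800, 2))        # flattened mesh node positions
node_expr = rng.random(800)           # one gene's value at each node

xs, ys = np.meshgrid(np.linspace(0, 1, 100), np.linspace(0, 1, 100))
grid = griddata(node_xy, node_expr, (xs, ys), method='linear')

savemat('gene_grid.mat', {'expr': grid})   # hand the grid off to MATLAB
```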
1.72 +We manually traced the boundaries of each of 49 cortical areas from the ABA coronal reference atlas slides. We then converted these manual traces into Caret-format regional boundary data on the mesh surface. We projected the regions onto the 2-d mesh, and then onto the grid, and then we converted the region data into MATLAB format.
1.73 +
1.74 +At this point, the data are in the form of a number of 2-D matrices, all in registration, with the matrix entries representing a grid of points (pixels) over the cortical surface:
1.75 +
1.76 +
1.77 +
1.78 +* A 2-D matrix whose entries represent the regional label associated with each surface pixel
1.79 +* For each gene, a 2-D matrix whose entries represent the average expression level underneath each surface pixel
1.80 +
1.81 +
1.82 +
1.83 We created a normalized version of the gene expression data by subtracting each gene's mean expression level (over all surface pixels) and dividing the expression level of each gene by its standard deviation.
1.84
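The normalization just described is per-gene z-scoring over surface pixels; a minimal numpy rendering (the matrix shape is an assumption for illustration):

```python
# Sketch: subtract each gene's mean and divide by its standard deviation,
# computed over all surface pixels.
import numpy as np

expr = np.random.default_rng(4).random((200, 4000))   # genes x surface pixels
normalized = (expr - expr.mean(axis=1, keepdims=True)) \
             / expr.std(axis=1, keepdims=True)
```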
1.85 The features and the target area are both functions on the surface pixels. They can be referred to as scalar fields over the space of surface pixels; alternately, they can be thought of as images which can be displayed on the flatmapped surface.
1.86 @@ -339,15 +343,6 @@
1.87
1.88
1.89
1.90 -\vspace{0.3cm}**Correlation**
1.91 -Recall that the instances are surface pixels, and consider the problem of attempting to classify each instance as either a member of a particular anatomical area, or not. The target area can be represented as a boolean mask over the surface pixels.
1.92 -
1.93 -One class of feature selection scoring methods contains methods which calculate some sort of "match" between each gene image and the target image. Those genes which match the best are good candidates for features.
1.94 -
1.95 -One of the simplest methods in this class is to use correlation as the match score. We calculated the correlation between each gene and each cortical area. The top row of Figure \ref{SScorrLr} shows the three genes most correlated with area SS.
1.96 -
1.97 -
1.98 -
1.99 \begin{wrapfigure}{L}{0.35\textwidth}\centering
1.100 \includegraphics[scale=.27]{MO_vs_Wwc1_jet.eps}\includegraphics[scale=.27]{MO_vs_Mtif2_jet.eps}
1.101
1.102 @@ -355,6 +350,15 @@
1.103 \caption{Upper left: $Wwc1$. Upper right: $Mtif2$. Lower left: $Wwc1$ + $Mtif2$ (each pixel's value in the lower-left panel is the sum of the corresponding pixels in the upper row).}
1.104 \label{MOcombo}\end{wrapfigure}
1.105
1.106 +\vspace{0.3cm}**Correlation**
1.107 +Recall that the instances are surface pixels, and consider the problem of attempting to classify each instance as either a member of a particular anatomical area, or not. The target area can be represented as a boolean mask over the surface pixels.
1.108 +
1.110 +One class of feature selection scoring methods calculates some sort of "match" between each gene image and the target image. Those genes which match best are good candidates for features.
1.110 +
1.111 +One of the simplest methods in this class is to use correlation as the match score. We calculated the correlation between each gene and each cortical area. The top row of Figure \ref{SScorrLr} shows the three genes most correlated with area SS.
1.112 +
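A minimal sketch of correlation as a match score (toy data; gene images flattened to vectors, the target represented as the boolean mask described above):

```python
# Sketch: rank genes by Pearson correlation between each gene image and
# the boolean target mask.
import numpy as np

rng = np.random.default_rng(5)
genes = rng.random((200, 4000))        # 200 gene images, flattened
target = rng.random(4000) > 0.7        # boolean mask over surface pixels

scores = np.array([np.corrcoef(g, target)[0, 1] for g in genes])
top3 = np.argsort(scores)[::-1][:3]    # the three best-matching genes
```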
1.113 +
1.114 +
1.115 \vspace{0.3cm}**Conditional entropy**
1.116 An information-theoretic scoring method is to find features such that, if the features (gene expression levels) are known, uncertainty about the target (the regional identity) is reduced. Entropy measures uncertainty, so what we want is to find features such that the conditional distribution of the target has minimal entropy. The distribution to which we are referring is the probability distribution over the population of surface pixels.
1.117
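A hedged sketch of a conditional-entropy score: discretize one gene's expression into quantile bins and compute H(region | gene) over the population of surface pixels (the binning scheme below is an assumption for illustration):

```python
# Sketch: conditional entropy of the region label given one gene's
# (discretized) expression level; lower = more informative gene.
import numpy as np

def conditional_entropy(gene, target, bins=4):
    edges = np.quantile(gene, np.linspace(0, 1, bins + 1)[1:-1])
    g = np.digitize(gene, edges)               # quantile-bin the expression
    h = 0.0
    for v in np.unique(g):
        mask = g == v
        p_v = mask.mean()                      # P(gene bin = v)
        p_t = np.bincount(target[mask].astype(int), minlength=2) / mask.sum()
        p_t = p_t[p_t > 0]
        h += p_v * -(p_t * np.log2(p_t)).sum() # P(v) * H(target | v)
    return h

rng = np.random.default_rng(6)
score = conditional_entropy(rng.random(4000), rng.random(4000) > 0.7)
```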
1.118 @@ -424,10 +428,10 @@
1.119
1.120
1.121 \vspace{0.3cm}**Feature selection integrated with prediction**
1.122 -As noted earlier, in general, any predictive method can be used for feature selection by running it inside a stepwise wrapper. Also, some predictive methods integrate soft constraints on number of features used. Examples of both of these will be seen in the section "Multivariate Predictive methods".
1.123 -
1.124 -
1.125 -=== Multivariate Predictive methods ===
1.126 +As noted earlier, in general, any classifier can be used for feature selection by running it inside a stepwise wrapper. Also, some learning algorithms integrate soft constraints on the number of features used. Examples of both of these will be seen in the section "Multivariate supervised learning".
1.127 +
1.128 +
1.129 +=== Multivariate supervised learning ===
1.130
1.131
1.132 \begin{wrapfigure}{L}{0.6\textwidth}\centering
1.133 @@ -504,47 +508,39 @@
1.134
1.135 %%We will develop scoring methods for evaluating how good individual genes are at marking areas. We will compare pointwise, geometric, and information-theoretic measures. We already developed one entirely new scoring method (gradient similarity), but we may develop more. Scoring measures that we will explore will include the L1 norm, correlation, expression energy ratio, conditional entropy, gradient similarity, Jaccard similarity, Dice similarity, Hough transform, and statistical tests such as Hotelling's T-square test (a multivariate generalization of Student's t-test), ANOVA, and a multivariate version of the Mann-Whitney U test (a non-parametric test).
1.136
1.137 -We will develop scoring methods for evaluating how good individual genes are at marking areas. We will compare pointwise, geometric, and information-theoretic measures. We already developed one entirely new scoring method (gradient similarity), but we may develop more. Scoring measures that we will explore will include the L1 norm, correlation, expression energy ratio, conditional entropy, gradient similarity, Jaccard similarity, Dice similarity, Hough transform, and statistical tests such as Student's t-test, and the Mann-Whitney U test (a non-parametric test). In addition, any predictive procedure induces a scoring measure on genes by taking the prediction error when using that gene to predict the target.
1.138 -
1.139 -
1.140 -
1.141 -Using some combination of these measures, we will develop a procedure to find single marker genes for anatomical regions: for each cortical area, we will rank the genes by their ability to delineate each area.
1.142 +We will develop scoring methods for evaluating how good individual genes are at marking areas. We will compare pointwise, geometric, and information-theoretic measures. We have already developed one entirely new scoring method (gradient similarity), and we may develop more. Scoring measures that we will explore include the L1 norm, correlation, expression energy ratio, conditional entropy, gradient similarity, Jaccard similarity, Dice similarity, the Hough transform, and statistical tests such as Student's t-test and the Mann-Whitney U test (a non-parametric test). In addition, any classifier induces a scoring measure on genes by taking the prediction error when using that gene to predict the target.
1.143 +
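For two of the set-overlap scores listed above, Jaccard and Dice similarity between a thresholded gene image and the target mask can be computed directly (toy data; the 0.5 threshold is an arbitrary illustration):

```python
# Sketch: Jaccard and Dice similarity between a binarized gene image and
# the target area mask.
import numpy as np

rng = np.random.default_rng(7)
gene_mask = rng.random(4000) > 0.5   # binarized gene expression
target = rng.random(4000) > 0.7      # target area mask

inter = (gene_mask & target).sum()
jaccard = inter / (gene_mask | target).sum()
dice = 2 * inter / (gene_mask.sum() + target.sum())
```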
1.144 +Using some combination of these measures, we will develop a procedure to find single marker genes for anatomical regions: for each cortical area, we will rank the genes by their ability to delineate that area. We will quantitatively compare the list of single genes generated by our method to the lists generated by the previous methods mentioned in the Aim 1 Related Work section.
1.145 +
1.146
1.147 Some cortical areas have no single marker genes but can be identified by combinatorial coding. This requires multivariate scoring measures and feature selection procedures. Many of the measures, such as expression energy, gradient similarity, Jaccard, Dice, Hough, Student's t, and Mann-Whitney U, are univariate. We will extend these scoring measures for use in multivariate feature selection, that is, for scoring how well combinations of genes, rather than individual genes, can distinguish a target area. There are existing multivariate forms of some of the univariate scoring measures; for example, Hotelling's T-square is a multivariate analog of Student's t.
1.148
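A sketch of Hotelling's T-square, the multivariate analog of Student's t mentioned above, applied to in-area versus out-of-area pixels over a small gene set (synthetic data):

```python
# Sketch: does a set of genes jointly separate in-area pixels from
# out-of-area pixels? Hotelling's T-square with a pooled covariance.
import numpy as np

def hotelling_t2(X_in, X_out):
    n1, n2 = len(X_in), len(X_out)
    d = X_in.mean(axis=0) - X_out.mean(axis=0)
    S = ((n1 - 1) * np.cov(X_in, rowvar=False) +
         (n2 - 1) * np.cov(X_out, rowvar=False)) / (n1 + n2 - 2)
    return (n1 * n2 / (n1 + n2)) * d @ np.linalg.solve(S, d)

rng = np.random.default_rng(8)
t2 = hotelling_t2(rng.normal(0.5, 1, (300, 3)),   # pixels inside the area
                  rng.normal(0.0, 1, (700, 3)))   # pixels outside the area
```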
1.149 -We will develop a feature selection procedure for choosing the best small set of marker genes for a given anatomical area. In addition to using the scoring measures that we develop, we will also explore (a) feature selection using a stepwise wrapper over "vanilla" predictive methods such as logistic regression, (b) predictive methods such as decision trees which incrementally/greedily combine single gene markers into sets, and (c) predictive methods which use soft constraints to minimize number of features used, such as sparse support vector machines.
1.150 -
1.151 -todo
1.152 -
1.153 -Some of these methods, such as the Hough transform, are designed to be resistant to registration error and error in the anatomical map.
1.154 -
1.155 -We will also consider extensions to scoring measures that may improve their robustness to registration error and to error in the anatomical map; for example, a wrapper that runs a scoring method on small displacements and distortions of the data adds robustness to registration error at the expense of computation time. It is possible that some areas in the anatomical map do not correspond to natural domains of gene expression.
1.156 -
1.157 -# Extend the procedure to handle difficult areas by combining or redrawing the boundaries: An area may be difficult to identify because the boundaries are misdrawn, or because it does not "really" exist as a single area, at least on the genetic level. We will develop extensions to our procedure which (a) detect when a difficult area could be fit if its boundary were redrawn slightly, and (b) detect when a difficult area could be combined with adjacent areas to create a larger area which can be fit.
1.158 -
1.159 +We will develop a feature selection procedure for choosing the best small set of marker genes for a given anatomical area. In addition to using the scoring measures that we develop, we will also explore (a) feature selection using a stepwise wrapper over "vanilla" classifiers such as logistic regression, (b) supervised learning methods such as decision trees which incrementally/greedily combine single gene markers into sets, and (c) supervised learning methods which use soft constraints to minimize number of features used, such as sparse support vector machines.
1.160 +
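A minimal sketch of strategy (a), a greedy forward wrapper around a "vanilla" classifier (scikit-learn's logistic regression and cross-validation used for illustration; synthetic data):

```python
# Sketch: stepwise (greedy forward) wrapper feature selection. At each
# step, add the gene that most improves cross-validated accuracy.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def forward_select(X, y, k=3):
    chosen = []
    for _ in range(k):
        remaining = [g for g in range(X.shape[1]) if g not in chosen]
        scores = [cross_val_score(LogisticRegression(), X[:, chosen + [g]],
                                  y, cv=3).mean() for g in remaining]
        chosen.append(remaining[int(np.argmax(scores))])
    return chosen   # indices of the selected marker genes

rng = np.random.default_rng(9)
X = rng.normal(size=(500, 30))
y = (X[:, 2] - X[:, 7] > 0).astype(int)   # toy target area
markers = forward_select(X, y)
```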
1.161 +Since errors of displacement and of shape may cause genes and target areas to match less well than they should, we will consider the robustness of feature selection methods in the presence of error. Some of these methods, such as the Hough transform, are designed to be robust to such errors, but many are not. We will consider extensions to scoring measures that may improve their robustness; for example, a wrapper that runs a scoring method on small displacements and distortions of the data adds robustness to registration error at the expense of computation time.
1.162 +
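The displacement wrapper described above might look like the following sketch, which re-scores small shifts of the gene image and keeps the best value (shifts only; distortions would be handled analogously):

```python
# Sketch: wrap any scoring function so that small registration errors do
# not penalize a gene; costs (2*max_shift+1)^2 extra score evaluations.
import numpy as np

def robust_score(score_fn, gene_img, target_img, max_shift=2):
    best = -np.inf
    for dy in range(-max_shift, max_shift + 1):
        for dx in range(-max_shift, max_shift + 1):
            shifted = np.roll(np.roll(gene_img, dy, axis=0), dx, axis=1)
            best = max(best, score_fn(shifted, target_img))
    return best

corr = lambda g, t: np.corrcoef(g.ravel(), t.ravel())[0, 1]
rng = np.random.default_rng(10)
score = robust_score(corr, rng.random((64, 64)), rng.random((64, 64)) > 0.7)
```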
1.163 +An area may be difficult to identify because the boundaries are misdrawn in the atlas, or because the shape of the natural domain of gene expression corresponding to the area is different from the shape of the area as recognized by anatomists. We will extend our procedure to handle difficult areas by combining areas or redrawing their boundaries. We will develop extensions to our procedure which (a) detect when a difficult area could be fit if its boundary were redrawn slightly, and (b) detect when a difficult area could be combined with adjacent areas to create a larger area which can be fit.
1.164
1.165 A future publication on the method that we develop in Aim 1 will review the scoring measures and quantitatively compare their performance in order to provide a foundation for future research of methods of marker gene finding. We will measure the robustness of the scoring measures as well as their absolute performance on our dataset.
1.166
1.167 +\vspace{0.3cm}**Classifiers**
1.168 +
1.169 +We will explore and compare different classifiers. As noted above, this activity is not separate from the previous one, because some supervised learning algorithms include feature selection, and any classifier can be combined with a stepwise wrapper for use as a feature selection method. We will explore logistic regression (including spatial models\cite{paciorek_computational_2007}), decision trees\footnote{Actually, we have already begun to explore decision trees. For each cortical area, we have used the C4.5 algorithm to find a decision tree for that area. We achieved good classification accuracy on our training set, but the number of genes that appeared in each tree was too large. We plan to implement a pruning procedure to generate trees that use fewer genes.}, sparse SVMs, generative mixture models (including naive Bayes), kernel density estimation, genetic algorithms, and artificial neural networks.
1.170 +
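As a sketch of two of the approaches named above: an L1-penalized (sparse) linear SVM, which softly limits the number of genes used, and a cost-complexity-pruned decision tree (scikit-learn's CART rather than C4.5; synthetic data):

```python
# Sketch: classifiers that themselves limit the number of genes used.
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(11)
X = rng.normal(size=(500, 30))
y = (X[:, 4] + X[:, 9] > 0).astype(int)       # toy target area

svm = LinearSVC(penalty='l1', dual=False, C=0.1).fit(X, y)
genes_used_svm = np.flatnonzero(svm.coef_[0])  # soft constraint -> sparse coefs

tree = DecisionTreeClassifier(ccp_alpha=0.01).fit(X, y)   # pruning shrinks tree
genes_used_tree = np.flatnonzero(tree.feature_importances_)
```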
1.171
1.172
1.173 \vspace{0.3cm}**Decision trees**
1.174 -todo
1.175 -
1.176 -\footnote{Already, for each cortical area, we have used the C4.5 algorithm to find a decision tree for that area. We achieved good classification accuracy on our training set, but the number of genes that appeared in each tree was too large. We plan to implement a pruning procedure to generate trees that use fewer genes}.
1.177 +
1.178
1.180 # confirm with EMAGE, GeneAtlas, GENSAT, etc., to fight overfitting; validate using the two hemispheres
1.180
1.181 -# mixture models, etc
1.182 -
1.183 -
1.184 -
1.185
1.186 \vspace{0.3cm}**Develop algorithms to suggest a division of a structure into anatomical parts**
1.187
1.188 # Explore dimensionality reduction algorithms applied to pixels: including TODO
1.189 # Explore dimensionality reduction algorithms applied to genes: including TODO
1.190 # Explore clustering algorithms applied to pixels: including TODO
1.191 -# Explore clustering algorithms applied to genes: including gene shaving, TODO
1.192 +# Explore clustering algorithms applied to genes: including gene shaving\cite{hastie_gene_2000}, TODO
1.193 # Develop an algorithm to use dimensionality reduction and/or hierarchical clustering to create anatomical maps (a minimal sketch follows this list)
1.194 # Run this algorithm on the cortex: present a hierarchical, genoarchitectonic map of the cortex
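A minimal sketch of item 5 under stated assumptions (synthetic expression matrix; PCA plus k-means as illustrative choices, where a hierarchical method would yield the genoarchitectonic hierarchy of item 6):

```python
# Sketch: reduce per-pixel gene expression vectors with PCA, then cluster
# pixels so that each cluster suggests a putative anatomical area.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(12)
expr = rng.random((4000, 200))                # surface pixels x genes (toy)

reduced = PCA(n_components=10).fit_transform(expr)
areas = KMeans(n_clusters=12, n_init=10).fit_predict(reduced)
# 'areas' assigns each surface pixel to one of 12 suggested regions; a
# hierarchical clusterer (e.g., AgglomerativeClustering) would yield a
# nested parcellation rather than a flat one.
```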
1.195