cg

changeset 63:af5fd52f453f

.
author bshanks@bshanks.dyndns.org
date Sun Apr 19 15:23:53 2009 -0700
parents ecf330fcfba3
children 54ac7984b164
files grant.doc grant.html grant.odt grant.pdf grant.txt
Binary file grant.doc has changed
--- a/grant.html Sun Apr 19 14:50:20 2009 -0700
+++ b/grant.html Sun Apr 19 15:23:53 2009 -0700
@@ -70,9 +70,12 @@
 level of more than a handful of genes. Similarly, if the goal is to develop a procedure to do ISH on tissue samples in order
 to label their anatomy, then it is infeasible to label more than a few genes. Therefore, we must select only a few genes as
 features.
+__________________________________
+ 1Strictly speaking, the features are gene expression levels, but we’ll call them genes.
+The requirement to find combinations of only a small number of genes limits us from straightforwardly applying many
+of the most simple techniques from the field of supervised machine learning. In the parlance of machine learning, our task
+combines feature selection with supervised learning.
 Principle 3: Use geometry in feature selection
-_________________________________________
- 1Strictly speaking, the features are gene expression levels, but we’ll call them genes.
 When doing feature selection with score-based methods, the simplest thing to do would be to score the performance of
 each voxel by itself and then combine these scores (pointwise scoring). A more powerful approach is to also use information
 about the geometric relations between each voxel and its neighbors; this requires non-pointwise, local scoring methods. See
@@ -117,17 +120,17 @@
 correction to determine whether the mean expression level of a gene is significantly higher in the target region. Like AGEA,
 this is a pointwise measure (only the mean expression level per pixel is being analyzed), it is not being used to look for
 underexpression, and does not look for combinations of genes.
+_________________________________________
+ 2By “fundamentally spatial” we mean that there is information from a large number of spatial locations indexed by spatial coordinates; not
+just data which has only a few different locations or which is indexed by anatomical label.
+ 3Actually, many of these projects use quadrilaterals instead of square pixels; but we will refer to them as pixels for simplicity.
+ 4“Expression energy ratio”, which captures overexpression.
 [7 ] describes a technique to find combinations of marker genes to pick out an anatomical region. They use an evolutionary
 algorithm to evolve logical operators which combine boolean (thresholded) images in order to match a target image. Their
 match score is Jaccard similarity.
 In summary, there has been fruitful work on finding marker genes, however, only one of the previous projects explores
 combinations of marker genes, and none of these publications compare the results obtained by using different algorithms or
 scoring methods.
-___________________________
- 2By “fundamentally spatial” we mean that there is information from a large number of spatial locations indexed by spatial coordinates; not
-just data which has only a few different locations or which is indexed by anatomical label.
- 3Actually, many of these projects use quadrilaterals instead of square pixels; but we will refer to them as pixels for simplicity.
- 4“Expression energy ratio”, which captures overexpression.
 Aim 2
 Machine learning terminology: clustering
 If one is given a dataset consisting merely of instances, with no class labels, then analysis of the dataset is referred to as
@@ -224,6 +227,14 @@
 rithms or scoring methods, so it is likely that the best clustering method for this application has not yet been found. Also,
 none of these projects did a separate dimensionality reduction step before clustering pixels, none tried to cluster genes first
 in order to guide automated clustering of pixels into spatial regions, and none used co-clustering algorithms.
+_________________________________________
+ 5This would seem to contradict our finding in aim 1 that some cortical areas are combinatorially coded by multiple genes. However, it is
+possible that the currently accepted cortical maps divide the cortex into regions which are unnatural from the point of view of gene expression;
+perhaps there is some other way to map the cortex for which each region can be identified by single genes. Another possibility is that, although
+the cluster prototype fits an anatomical region, the individual genes are each somewhat different from the prototype.
+ 6We ran “vanilla” NNMF, whereas the paper under discussion used a modified method. Their main modification consisted of adding a soft
+spatial contiguity constraint. However, on our dataset, NNMF naturally produced spatially contiguous clusters, so no additional constraint was
+needed. The paper under discussion also mentions that they tried a hierarchial variant of NNMF, which we have not yet tried.
 Aim 3
 Background
 The cortex is divided into areas and layers. To a first approximation, the parcellation of the cortex into areas can
@@ -232,14 +243,6 @@
 picture an area of the cortex as a slice of many-layered cake.
 Although it is known that different cortical areas have distinct roles in both normal functioning and in disease processes,
 there are no known marker genes for many cortical areas. When it is necessary to divide a tissue sample into cortical areas,
-_________________________________________
- 5This would seem to contradict our finding in aim 1 that some cortical areas are combinatorially coded by multiple genes. However, it is
-possible that the currently accepted cortical maps divide the cortex into regions which are unnatural from the point of view of gene expression;
-perhaps there is some other way to map the cortex for which each region can be identified by single genes. Another possibility is that, although
-the cluster prototype fits an anatomical region, the individual genes are each somewhat different from the prototype.
- 6We ran “vanilla” NNMF, whereas the paper under discussion used a modified method. Their main modification consisted of adding a soft
-spatial contiguity constraint. However, on our dataset, NNMF naturally produced spatially contiguous clusters, so no additional constraint was
-needed. The paper under discussion also mentions that they tried a hierarchial variant of NNMF, which we have not yet tried.
 this is a manual process that requires a skilled human to combine multiple visual cues and interpret them in the context of
 their approximate location upon the cortical surface.
 Even the questions of how many areas should be recognized in cortex, and what their arrangement is, are still not
@@ -281,13 +284,6 @@
 conceivable that if a different set of stains had been available which identified a different set of features, then the today’s
 cortical maps would have come out differently. Since the number of classes of stains is small compared to the number of
 genes, it is likely that there are many repeated, salient spatial patterns in the gene expression which have not yet been
-captured by any stain. Therefore, current ideas about cortical anatomy need to incorporate what we can learn from looking
-at the patterns of gene expression.
-While we do not here propose to analyze human gene expression data, it is conceivable that the methods we propose to
-develop could be used to suggest modifications to the human cortical map as well.
-Related work
-[10 ] describes the application of AGEA to the cortex. The paper describes interesting results on the structure of correlations
-between voxel gene expression profiles within a handful of cortical areas. However, this sort of analysis is not related to either
 _________________________________________
 7http://www.eurexpress.org/ee/; EurExpress data is also entered into EMAGE
 8http://www.ncl.ac.uk/ihg/EADHB/database/EADHB_database.html
@@ -298,6 +294,13 @@
 13http://compare.ibdml.univ-mrs.fr/
 14GXD and GEO contain spatial data but also non-spatial data. All GXD spatial data are also in EMAGE.
 15without prior offline registration
+captured by any stain. Therefore, current ideas about cortical anatomy need to incorporate what we can learn from looking
+at the patterns of gene expression.
+While we do not here propose to analyze human gene expression data, it is conceivable that the methods we propose to
+develop could be used to suggest modifications to the human cortical map as well.
+Related work
+[10 ] describes the application of AGEA to the cortex. The paper describes interesting results on the structure of correlations
+between voxel gene expression profiles within a handful of cortical areas. However, this sort of analysis is not related to either
 of our aims, as it neither finds marker genes, nor does it suggest a cortical map based on gene expression data. Neither of
 the other components of AGEA can be applied to cortical areas; AGEA’s Gene Finder cannot be used to find marker genes
 for the cortical areas; and AGEA’s hierarchial clustering does not produce clusters corresponding to the cortical areas16.
@@ -364,7 +367,7 @@
 boolean masks such that the conditional entropy of the target area’s boolean mask, conditioned upon the pair of gene
 expression boolean masks, is minimized.
 This finds pairs of genes which are most informative (at least at these discretization thresholds) relative to the question,
-“Is this surface pixel a member of the target area?”.
+“Is this surface pixel a member of the target area?”. Its advantage over linear methods such as logistic regression is that it
 
 
 
@@ -373,6 +376,8 @@
 each picture, the vertical axis roughly corresponds to anterior at the top and posterior at the bottom, and the horizontal
 axis roughly corresponds to medial at the left and lateral at the right. The red outline is the boundary of region MO. Pixels
 are colored according to correlation, with red meaning high correlation and blue meaning low.
+takes account of arbitrarily nonlinear relationships; for example, if the XOR of two variables predicts the target, conditional
+entropy would notice, whereas linear methods would not.
 Gradient similarity We noticed that the previous two scoring methods, which are pointwise, often found genes whose
 pattern of expression did not look similar in shape to the target region. Fort his reason we designed a non-pointwise local
 scoring method to detect when a gene had a pattern of expression which looked like it had a boundary whose shape is similar
@@ -404,9 +409,7 @@
 such as Aph1a in the example, may be particularly good markers. None of these genes are, individually, a perfect marker
 for AUD; we deliberately chose a “difficult” area in order to better contrast pointwise with geometric methods.
 Combinations of multiple genes are useful and necessary for some areas
-In Figure 3, we give an example of a cortical area which is not marked by any single gene, but which can be identified
-combinatorially.
-____________________________
+_________________________________________
 17For each gene, a logistic regression in which the response variable was whether or not a surface pixel was within area AUD, and the predictor
 variable was the value of the expression of the gene underneath that pixel. The resulting scores were used to rank the genes in terms of how well
 they predict area AUD.
@@ -418,6 +421,38 @@
 Figure 2: The top row shows the three genes which (individually) best predict area AUD, according to logistic regression.
 The bottom row shows the three genes which (individually) best match area AUD, according to gradient similarity. From
 left to right and top to bottom, the genes are Ssr1, Efcbp1, Aph1a, Ptk7, Aph1a again, and Lepr
+In Figure 3, we give an example of a cortical area which is not marked by any single gene, but which can be identified
+combinatorially.
+Underexpression of a gene can serve as a marker Underexpression of a gene can sometimes serve as a marker.
+See, for example, Figure 4.
+Feature selection integrated with prediction As noted earlier, in general, any predictive method can be used for
+feature selection by running it inside a stepwise wrapper. Also, some predictive methods integrate soft constraints on number
+of features used. Examples of both of these will be seen in the section “Locating areas with gene expression”.
+Locating areas with gene expression
+Forward stepwise logistic regression As a pilot run, for five cortical areas (SS, AUD, RSP, VIS, and MO), we performed
+forward stepwise logistic regression to find single genes, pairs of genes, and triplets of genes which predict areal identify.
+Some of the single genes found were shown in previous figures, and Figure 3 shows a combination of genes which was found.
+SVM on all genes at once
+In order to see how well one can do when looking at all genes at once, we ran a support vector machine to classify cortical
+surface pixels based on their gene expression profiles. We achieved classification accuracy of about 81%19. As noted above,
+however, a classifier that looks at all the genes at once isn’t as practically useful as a classifier that uses only a few genes.
+Decision trees
+todo
+Areas which can be identified by single genes
+Using all of the methods we have tried to far, we have already found single genes which roughly identify some areas and
+groupings of areas. For each of these areas, an example of a gene which roughly identifies it is shown in Figure 5. We have
+not yet cross-verified these genes in other atlases.
+In addition, there are a number of areas which are almost identified by single genes: COAa+NLOT (anterior part of
+cortical amygdalar area, nucleus of the lateral olfactory tract), ENT (entorhinal), ACAv (ventral anterior cingulate), VIS
+(visual), AUD (auditory).
+Data-driven redrawing of the cortical map
+Raw dimensionality reduction results
+todo
+(might want to incld nnMF since mentioned above)
+Dimensionality reduction plus K-means or spectral clustering
+_________________________________________
+ 195-fold cross-validation.
+
 
 
 Figure 3: Upper left: wwc1. Upper right: mtif2. Lower left: wwc1 + mtif2 (each pixel’s value on the lower left is the
@@ -429,7 +464,6 @@
 Gene mtif2 is shown in the upper-right. Mtif2 captures MO’s upper-left boundary, but not its lower-right boundary. Mtif2
 does not express very much on the medial surface. By adding together the values at each pixel in these two figures, we get
 the lower-left image. This combination captures area MO much better than any single gene.
-
 
 Figure 4: Gene Pitx2 is selectively underexpressed in area SS (somatosensory).
 
@@ -441,33 +475,6 @@
 and lateral visual (VISpm, VISpl, VISI, VISp; posteromedial, posterolateral, lateral, and primary visual; the posterior and
 lateral visual area is distinguished from its neighbors, but not from the entire rest of the cortex). The genes are Pitx2,
 Aldh1a2, Ppfibp1, Slco1a5, Tshz2, Trhr, Col12a1, Ets1.
-Underexpression of a gene can serve as a marker Underexpression of a gene can sometimes serve as a marker.
-See, for example, Figure 4.
-Specific to Aim 1 (and Aim 3)
-Forward stepwise logistic regression todo
-SVM on all genes at once
-In order to see how well one can do when looking at all genes at once, we ran a support vector machine to classify cortical
-surface pixels based on their gene expression profiles. We achieved classification accuracy of about 81%19. As noted above,
-however, a classifier that looks at all the genes at once isn’t practically useful.
-The requirement to find combinations of only a small number of genes limits us from straightforwardly applying many
-of the most simple techniques from the field of supervised machine learning. In the parlance of machine learning, our task
-combines feature selection with supervised learning.
-Decision trees
-todo
-Areas which can be identified by single genes
-Using all of the methods we have tried to far, we have already found single genes which roughly identify some areas and
-groupings of areas. For each of these areas, an example of a gene which roughly identifies it is shown in Figure 5. We have
-not yet cross-verified these genes in other atlases.
-In addition, there are a number of areas which are almost identified by single genes: COAa+NLOT (anterior part of
-cortical amygdalar area, nucleus of the lateral olfactory tract), ENT (entorhinal), ACAv (ventral anterior cingulate), VIS
-(visual), AUD (auditory).
-____________________
- 195-fold cross-validation.
-Specific to Aim 2 (and Aim 3)
-Raw dimensionality reduction results
-todo
-(might want to incld nnMF since mentioned above)
-Dimensionality reduction plus K-means or spectral clustering
 Many areas are captured by clusters of genes
 todo
 todo
@@ -483,6 +490,7 @@
 If possible, we would like the method we develop to include a statistical test that warns the user if the assumption of 2-D
 structure seems to be wrong.
 todo amongst other things:
+layerfinding
 Develop algorithms that find genetic markers for anatomical regions
 1.Develop scoring measures for evaluating how good individual genes are at marking areas: we will compare pointwise,
 geometric, and information-theoretic measures.
Binary file grant.odt has changed
Binary file grant.pdf has changed
--- a/grant.txt Sun Apr 19 14:50:20 2009 -0700
+++ b/grant.txt Sun Apr 19 15:23:53 2009 -0700
@@ -50,6 +50,7 @@
 \vspace{0.3cm}**Principle 2: Only look at combinations of small numbers of genes**
 When the classifier classifies a voxel, it is only allowed to look at the expression of the genes which have been selected as features. The more data that is available to a classifier, the better that it can do. For example, perhaps there are weak correlations over many genes that add up to a strong signal. So, why not include every gene as a feature? The reason is that we wish to employ the classifier in situations in which it is not feasible to gather data about every gene. For example, if we want to use the expression of marker genes as a trigger for some regionally-targeted intervention, then our intervention must contain a molecular mechanism to check the expression level of each marker gene before it triggers. It is currently infeasible to design a molecular trigger that checks the level of more than a handful of genes. Similarly, if the goal is to develop a procedure to do ISH on tissue samples in order to label their anatomy, then it is infeasible to label more than a few genes. Therefore, we must select only a few genes as features.
 
+The requirement to find combinations of only a small number of genes limits us from straightforwardly applying many of the most simple techniques from the field of supervised machine learning. In the parlance of machine learning, our task combines feature selection with supervised learning.
 
 
 \vspace{0.3cm}**Principle 3: Use geometry in feature selection**
@@ -289,7 +290,7 @@
 
 Now, for each region, we created and ran a forward stepwise procedure which attempted to find pairs of gene expression boolean masks such that the conditional entropy of the target area's boolean mask, conditioned upon the pair of gene expression boolean masks, is minimized.
 
-This finds pairs of genes which are most informative (at least at these discretization thresholds) relative to the question, "Is this surface pixel a member of the target area?".
+This finds pairs of genes which are most informative (at least at these discretization thresholds) relative to the question, "Is this surface pixel a member of the target area?". Its advantage over linear methods such as logistic regression is that it takes account of arbitrarily nonlinear relationships; for example, if the XOR of two variables predicts the target, conditional entropy would notice, whereas linear methods would not.
 
 
 \vspace{0.3cm}**Gradient similarity**
@@ -356,18 +357,18 @@
 \label{hole}\end{figure}
 
 
-
-=== Specific to Aim 1 (and Aim 3) ===
+\vspace{0.3cm}**Feature selection integrated with prediction**
+As noted earlier, in general, any predictive method can be used for feature selection by running it inside a stepwise wrapper. Also, some predictive methods integrate soft constraints on number of features used. Examples of both of these will be seen in the section "Locating areas with gene expression".
+
+
+=== Locating areas with gene expression ===
 \vspace{0.3cm}**Forward stepwise logistic regression**
-todo
+As a pilot run, for five cortical areas (SS, AUD, RSP, VIS, and MO), we performed forward stepwise logistic regression to find single genes, pairs of genes, and triplets of genes which predict areal identify. This is an example of feature selection integrated with prediction using a stepwise wrapper. Some of the single genes found were shown in previous figures, and Figure \ref{MOcombo} shows a combination of genes which was found.
 
 
 \vspace{0.3cm}**SVM on all genes at once**
 
-In order to see how well one can do when looking at all genes at once, we ran a support vector machine to classify cortical surface pixels based on their gene expression profiles. We achieved classification accuracy of about 81%\footnote{5-fold cross-validation.}. As noted above, however, a classifier that looks at all the genes at once isn't practically useful.
-
-The requirement to find combinations of only a small number of genes limits us from straightforwardly applying many of the most simple techniques from the field of supervised machine learning. In the parlance of machine learning, our task combines feature selection with supervised learning.
-
+In order to see how well one can do when looking at all genes at once, we ran a support vector machine to classify cortical surface pixels based on their gene expression profiles. We achieved classification accuracy of about 81%\footnote{5-fold cross-validation.}. As noted above, however, a classifier that looks at all the genes at once isn't as practically useful as a classifier that uses only a few genes.
 
 
 \vspace{0.3cm}**Decision trees**
@@ -396,7 +397,7 @@
 
 
 
-=== Specific to Aim 2 (and Aim 3) ===
+=== Data-driven redrawing of the cortical map ===
 
 \vspace{0.3cm}**Raw dimensionality reduction results**
 
@@ -443,6 +444,10 @@
 todo amongst other things:
 
 
+layerfinding
+
+
+
 
 \vspace{0.3cm}**Develop algorithms that find genetic markers for anatomical regions**
 
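
Note on the sentence this changeset adds to the conditional-entropy paragraph: the XOR claim can be checked on a toy example. The sketch below is not the grant's code. The function name `conditional_entropy`, the mask names `gene_a` and `gene_b`, and the four-pixel data are all hypothetical, chosen only to illustrate the idea of scoring a pair of boolean gene masks by the conditional entropy of the target mask.

```python
from collections import Counter
from math import log2

def conditional_entropy(target, *masks):
    """Estimate H(target | masks) in bits from co-occurrence counts.

    `target` and each entry of `masks` are equal-length sequences of 0/1
    values, one per pixel (hypothetical stand-ins for boolean masks).
    """
    n = len(target)
    joint = Counter(zip(target, *masks))  # counts of (t, m1, m2, ...)
    cond = Counter(zip(*masks))           # counts of (m1, m2, ...)
    h = 0.0
    for key, count in joint.items():
        # p(t, m) * log2 p(t | m), with p(t | m) = count / cond[m]
        h -= (count / n) * log2(count / cond[key[1:]])
    return h

# Toy data (an assumption): four pixels covering every on/off combination
# of two thresholded genes; the target region is their XOR.
gene_a = [0, 0, 1, 1]
gene_b = [0, 1, 0, 1]
target = [a ^ b for a, b in zip(gene_a, gene_b)]

h_a = conditional_entropy(target, gene_a)           # gene A alone
h_b = conditional_entropy(target, gene_b)           # gene B alone
h_ab = conditional_entropy(target, gene_a, gene_b)  # the pair together
print(h_a, h_b, h_ab)
```

On this toy data each gene alone leaves a full bit of uncertainty about the target, while the pair leaves none, which is the behaviour the added sentence describes. Real masks would come from thresholded expression images, and the discretization thresholds would affect the scores.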