cg

changeset 18:5d6dfc57654a

.
author bshanks@bshanks.dyndns.org
date Sun Apr 12 15:34:12 2009 -0700
parents ff9b47f2c7d3
children 717d4025b861
files grant.html grant.odt grant.pdf grant.txt
diff
--- a/grant.html Sun Apr 12 04:01:58 2009 -0700
+++ b/grant.html Sun Apr 12 15:34:12 2009 -0700
@@ -145,7 +145,10 @@
 outcome of clustering may be a hierarchial tree of clusters, rather than a single
 set of clusters which partition the voxels. This is called hierarchial clustering.
 Similarity scores
- todo
+ A crucial choice when designing a clustering method is how to measure
+ similarity, across either pairs of instances, or clusters, or both. There is much
+ overlap between scoring methods for feature selection (discussed above under
+ Aim 1) and scoring methods for similarity.
 Spatially contiguous clusters; image segmentation
 We have shown that aim 2 is a type of clustering task. In fact, it is a
 special type of clustering task because we have an additional constraint on
@@ -173,11 +176,11 @@
 algorithms perform better on small numbers of features. There are techniques
 which “summarize” a larger number of features using a smaller number of fea-
 tures; these techniques go by the name of feature extraction or dimensionality
+ 4
+
 reduction. The small set of features that such a technique yields is called the
 reduced feature set. After the reduced feature set is created, the instances may
 be replaced by reduced instances, which have as their features the reduced fea-
- 4
-
 ture set rather than the original feature set of all gene expression levels. Note
 that the features in the reduced feature set do not necessarily correspond to
 genes; each feature in the reduced set may be any function of the set of gene
@@ -213,11 +216,7 @@
 this fashion.
 Aim 3
 Background
- The cortex is divided into areas and layers. To a first approximation, the
- parcellation of the cortex into areas can be drawn as a 2-D map on the surface of
- the cortex. In the third dimension, the boundaries between the areas continue
- downwards into the cortical depth, perpendicular to the surface. The layer
-__________________________
+_______________
 1This would seem to contradict our finding in aim 1 that some cortical areas are combina-
 torially coded by multiple genes. However, it is possible that the currently accepted cortical
 maps divide the cortex into subregions which are unnatural from the point of view of gene
@@ -225,6 +224,10 @@
 be identified by single genes.
 5
 
+ The cortex is divided into areas and layers. To a first approximation, the
+ parcellation of the cortex into areas can be drawn as a 2-D map on the surface of
+ the cortex. In the third dimension, the boundaries between the areas continue
+ downwards into the cortical depth, perpendicular to the surface. The layer
 boundaries run parallel to the surface. One can picture an area of the cortex as
 a slice of many-layered cake.
 Although it is known that different cortical areas have distinct roles in both
@@ -266,14 +269,50 @@
 While we do not here propose to analyze human gene expression data, it is
 conceivable that the methods we propose to develop could be used to suggest
 modifications to the human cortical map as well.
+ 6
+
 Related work
- todo
+ There does not appear to be much work on the automated analysis of spatial
+ gene expression data.
+ There is a substantial body of work on the analysis of gene expression data,
+ however, most of this concerns gene expression data which is not fundamentally
+ spatial, for example, microarray datasets. In some cases, a few locations have
+ been sampled, but such a dataset is still of a fundamentally different character
+ than a dataset containing a large grid of sampling points distributed over space.
+ In relating gene expression to anatomy, it is the spatial aspects of the problem
+ which are the most important.
+ As noted above, there has been much work on both supervised learning and
+ clustering, and there are many available algorithms for each. Many of these
+ algorithms are flexible enough to accomodate new scoring measures; and the
+ performance of most of the algorithms is greatly affected by preprocessing and
+ by the choice of which representation to use for feature values. We think it likely
+ that for this application, the development of domain-specific scoring measures
+ (such as gradient similarity, which is discussed in Preliminary Work) will be
+ necessary in order to achieve the best results. In essence, the machine learning
+ community has provided algorithms, but the scientist must provide a framework
+ for representing the problem domain, and the way that this framework is set
+ up has a large impact on performance. Creating a good framework can require
+ creatively reconceptualizing the problem domain, and is not merely a mechanical
+ “fine-tuning” of numerical parameters. Therefore, the completion of Aims 1
+ and 2 involves more than just reimplementing an existing algorithm, and more
+ than just choosing between a set of existing algorithms, and will constitute a
+ substantial contribution to biology.
+ We are aware of one other effort to computationally analyze spatial gene
+ expression data.
+ In the Preliminary Work, we show that
+ The creation of a domain-specific scoring measure may be required in order
+ to achieve good performance, and it is not impossible that the algorithms them-
+ selves will have to be extended. We plan to test out existing algorithms and
+ scoring measures,
+ Therefore, we anticipate
+ Therefore, it is unclear which of the
+ todo
 vs. AGEA – i wrote something on this but i’m going to rewrite it
- 6
-
 Preliminary work
 Format conversion between SEV, MATLAB, NIFTI
 todo
+ 7
+
 Flatmap of cortex
 todo
 Using combinations of multiple genes is necessary and sufficient to
@@ -305,6 +344,12 @@
 genes which express more strongly in AUD than outside of it; its weakness is that
 this includes many areas which don’t have a salient border matching the areal
 border. The geometric method identifies genes whose salient expression border
+ seems to partially line up with the border of AUD; its weakness is that this
+ includes genes which don’t express over the entire area. Genes which have high
+ rankings using both pointwise and border criteria, such as Aph1a in the example,
+ may be particularly good markers. None of these genes are, individually, a
+ perfect marker for AUD; we deliberately chose a “difficult” area in order to
+ better contrast pointwise with geometric methods.
 __________________________
 2“WW, C2 and coiled-coil domain containing 1”; EntrezGene ID 211652
 3“mitochondrial translational initiation factor 2”; EntrezGene ID 76784
@@ -315,7 +360,7 @@
 5For each gene the gradient similarity (see section ??) between (a) a map of the expression
 of each gene on the cortical surface and (b) the shape of area AUD, was calculated, and this
 was used to rank the genes.
- 7
+ 8
 
 
 
@@ -327,8 +372,6 @@
 the boundary of region MO. Pixels are colored approximately according to the
 density of expressing cells underneath each pixel, with red meaning a lot of
 expression and blue meaning little.
- 8
-
 
 
 Figure 2: The top row shows the three genes which (individually) best predict
@@ -336,15 +379,11 @@
 genes which (individually) best match area AUD, according to gradient similar-
 ity. From left to right and top to bottom, the genes are Ssr1, Efcbp1, Aph1a,
 Ptk7, Aph1a again, and Lepr
- seems to partially line up with the border of AUD; its weakness is that this
- includes genes which don’t express over the entire area. Genes which have high
- rankings using both pointwise and border criteria, such as Aph1a in the example,
- may be particularly good markers. None of these genes are, individually, a
- perfect marker for AUD; we deliberately chose a “difficult” area in order to
- better contrast pointwise with geometric methods.
+ 9
+
 Areas which can be identified by single genes
 todo
- Aim 1 (and Aim 3)
+ Specific to Aim 1 (and Aim 3)
 Forward stepwise logistic regression todo
 SVM on all genes at once
 In order to see how well one can do when looking at all genes at once, we
@@ -354,27 +393,18 @@
 practically useful.
 The requirement to find combinations of only a small number of genes limits
 us from straightforwardly applying many of the most simple techniques from
-__________________________
- 6Using the Shogun SVM package (todo:cite), with parameters type=GMNPSVM (multi-
-class b-SVM), kernal = gaussian with sigma = 0.1, c = 10, epsilon = 1e-1 – these are the
-first parameters we tried, so presumably performance would improve with different choices of
-parameters. 5-fold cross-validation.
- 9
-
 the field of supervised machine learning. In the parlance of machine learning,
 our task combines feature selection with supervised learning.
 Decision trees
 todo
- Aim 2 (and Aim 3)
- Raw dimensionality reduction results
- Dimensionality reduction plus K-means or spectral clus-
- tering
- Many areas are captured by clusters of genes
+ Specific to Aim 2 (and Aim 3)
+ Raw dimensionality reduction results
+ Dimensionality reduction plus K-means or spectral clustering
+ Many areas are captured by clusters of genes
 todo
 todo
 Research plan
- todo
- amongst other thigns:
+ todo amongst other things:
 Develop algorithms that find genetic markers for anatomical re-
 gions
 1. Develop scoring measures for evaluating how good individual genes are at
@@ -387,6 +417,10 @@
 ing: for areas that cannot be identified by any single gene, identify them
 with a handful of genes. We will consider both (a) algorithms that incre-
 mentally/greedily combine single gene markers into sets, such as forward
+__________________________
+ 65-fold cross-validation.
+ 10
+
 stepwise regression and decision trees, and also (b) supervised learning
 techniques which use soft constraints to minimize the number of features,
 such as sparse support vector machines.
@@ -397,8 +431,6 @@
 which (a) detect when a difficult area could be fit if its boundary were
 redrawn slightly, and (b) detect when a difficult area could be combined
 with adjacent areas to create a larger area which can be fit.
- 10
-
 Apply these algorithms to the cortex
 1. Create open source format conversion tools: we will create tools to bulk
 download the ABA dataset and to convert between SEV, NIFTI and MAT-
@@ -424,8 +456,9 @@
 clustering to create anatomical maps
 6. Run this algorithm on the cortex: present a hierarchial, genoarchitectonic
 map of the cortex
-______________________________________________
- stuff i dunno where to put yet (there is more scattered through grant-
+ 11
+
+ _______________________________________________________________________________________________________ stuff i dunno where to put yet (there is more scattered through grant-
 oldtext):
 Principle 4: Work in 2-D whenever possible
 In anatomy, the manifold of interest is usually either defined by a combina-
@@ -436,16 +469,14 @@
 The method that we will develop will begin by mapping the data into a
 2-D plane. Although the manifold that characterized cortical areas is known
 to be the cortical surface, it remains to be seen which method of mapping the
- 11
-
- manifold into a plane is optimal for this application. We will compare mappings
- which attempt to preserve size (such as the one used by Caret??) with mappings
- which preserve angle (conformal maps).
- Although there is much 2-D organization in anatomy, there are also struc-
- tures whose shape is fundamentally 3-dimensional. If possible, we would like
- the method we develop to include a statistical test that warns the user if the
- assumption of 2-D structure seems to be wrong.
- todo: replace aim # bullet pts with #s
+manifold into a plane is optimal for this application. We will compare mappings
+which attempt to preserve size (such as the one used by Caret??) with mappings
+which preserve angle (conformal maps).
+ Although there is much 2-D organization in anatomy, there are also struc-
+tures whose shape is fundamentally 3-dimensional. If possible, we would like
+the method we develop to include a statistical test that warns the user if the
+assumption of 2-D structure seems to be wrong.
+ todo: replace aim # bullet pts with #s
 12
 
 
Binary file grant.odt has changed
Binary file grant.pdf has changed
--- a/grant.txt Sun Apr 12 04:01:58 2009 -0700
+++ b/grant.txt Sun Apr 12 15:34:12 2009 -0700
@@ -79,8 +79,7 @@
 
 **Similarity scores**
 
-
-todo
+A crucial choice when designing a clustering method is how to measure similarity, across either pairs of instances, or clusters, or both. There is much overlap between scoring methods for feature selection (discussed above under Aim 1) and scoring methods for similarity.
 
 
 **Spatially contiguous clusters; image segmentation**
@@ -137,6 +136,23 @@
 
 
 === Related work ===
+There does not appear to be much work on the automated analysis of spatial gene expression data.
+
+There is a substantial body of work on the analysis of gene expression data, however, most of this concerns gene expression data which is not fundamentally spatial, for example, microarray datasets. In some cases, a few locations have been sampled, but such a dataset is still of a fundamentally different character than a dataset containing a large grid of sampling points distributed over space. In relating gene expression to anatomy, it is the spatial aspects of the problem which are the most important.
+
+As noted above, there has been much work on both supervised learning and clustering, and there are many available algorithms for each. Many of these algorithms are flexible enough to accomodate new scoring measures; and the performance of most of the algorithms is greatly affected by preprocessing and by the choice of which representation to use for feature values. We think it likely that for this application, the development of domain-specific scoring measures (such as gradient similarity, which is discussed in Preliminary Work) will be necessary in order to achieve the best results. In essence, the machine learning community has provided algorithms, but the scientist must provide a framework for representing the problem domain, and the way that this framework is set up has a large impact on performance. Creating a good framework can require creatively reconceptualizing the problem domain, and is not merely a mechanical "fine-tuning" of numerical parameters. Therefore, the completion of Aims 1 and 2 involves more than just reimplementing an existing algorithm, and more than just choosing between a set of existing algorithms, and will constitute a substantial contribution to biology.
+
+We are aware of one other effort to computationally analyze spatial gene expression data.
+
+
+In the Preliminary Work, we show that
+
+The creation of a domain-specific scoring measure may be required in order to achieve good performance, and it is not impossible that the algorithms themselves will have to be extended. We plan to test out existing algorithms and scoring measures,
+
+Therefore, we anticipate
+
+Therefore, it is unclear which of the
+
 todo
 
 vs. AGEA -- i wrote something on this but i'm going to rewrite it
@@ -199,14 +215,14 @@
 todo
 
 
-=== Aim 1 (and Aim 3) ===
+=== Specific to Aim 1 (and Aim 3) ===
 **Forward stepwise logistic regression**
 todo
 
 
 **SVM on all genes at once**
 
-In order to see how well one can do when looking at all genes at once, we ran a support vector machine to classify cortical surface pixels based on their gene expression profiles. We achieved classification accuracy of about 81%\footnote{Using the Shogun SVM package (todo:cite), with parameters type=GMNPSVM (multiclass b-SVM), kernal = gaussian with sigma = 0.1, c = 10, epsilon = 1e-1 -- these are the first parameters we tried, so presumably performance would improve with different choices of parameters. 5-fold cross-validation.}. As noted above, however, a classifier that looks at all the genes at once isn't practically useful.
+In order to see how well one can do when looking at all genes at once, we ran a support vector machine to classify cortical surface pixels based on their gene expression profiles. We achieved classification accuracy of about 81%\footnote{5-fold cross-validation.}. As noted above, however, a classifier that looks at all the genes at once isn't practically useful.
 
 The requirement to find combinations of only a small number of genes limits us from straightforwardly applying many of the most simple techniques from the field of supervised machine learning. In the parlance of machine learning, our task combines feature selection with supervised learning.
 
@@ -217,12 +233,12 @@
 todo
 
 
-=== Aim 2 (and Aim 3) ===
-
-=== Raw dimensionality reduction results ===
-
-
-=== Dimensionality reduction plus K-means or spectral clustering ===
+=== Specific to Aim 2 (and Aim 3) ===
+
+**Raw dimensionality reduction results**
+
+
+**Dimensionality reduction plus K-means or spectral clustering**
 
 
 
@@ -244,9 +260,7 @@
 
 == Research plan ==
 
-todo
-
-amongst other thigns:
+todo amongst other things:
 
 
 **Develop algorithms that find genetic markers for anatomical regions**