cg

changeset 22:69aa7c47c0e5

.
author bshanks@bshanks.dyndns.org
date Mon Apr 13 03:13:10 2009 -0700 (16 years ago)
parents b9643c30e352
children 161319ea4991
files grant.html grant.odt grant.pdf grant.txt
line diff
--- a/grant.html Mon Apr 13 03:10:37 2009 -0700
+++ b/grant.html Mon Apr 13 03:13:10 2009 -0700
@@ -278,21 +278,16 @@
 however, most of this concerns gene expression data which is not fundamentally
 spatial, for example, microarray datasets.
 As noted above, there has been much work on both supervised learning and
- clustering, and there are many available algorithms for each. Many of these
- algorithms are flexible enough to accomodate new scoring measures; and the
- performance of most of the algorithms is greatly affected by preprocessing and
- by the choice of which representation to use for feature values. We think it likely
- that for this application, the development of domain-specific scoring measures
- (such as gradient similarity, which is discussed in Preliminary Work) will be
- necessary in order to achieve the best results. In essence, the machine learning
- community has provided algorithms, but the scientist must provide a framework
- for representing the problem domain, and the way that this framework is set
- up has a large impact on performance. Creating a good framework can require
- creatively reconceptualizing the problem domain, and is not merely a mechanical
- “fine-tuning” of numerical parameters. Therefore, the completion of Aims 1
- and 2 involves more than just reimplementing an existing algorithm, and more
- than just choosing between a set of existing algorithms, and will constitute a
- substantial contribution to biology.
+ clustering, and there are many available algorithms for each. However, the
+ completion of Aims 1 and 2 involves more than just choosing between a set of
+ existing algorithms, and will constitute a substantial contribution to biology.
+ The algorithms require the scientist to provide a framework for representing the
+ problem domain, and the way that this framework is set up has a large impact
+ on performance. Creating a good framework can require creatively reconcep-
+ tualizing the problem domain, and is not merely a mechanical “fine-tuning”
+ of numerical parameters. For example, we believe that domain-specific scoring
+ measures (such as gradient similarity, which is discussed in Preliminary Work)
+ may be necessary in order to achieve the best results in this application.
 We are aware of two existing efforts to relate spatial gene expression data to
 anatomy through computational methods.
 [?] describes an analysis of the anatomy of the hippocampus using the ABA
@@ -311,7 +306,10 @@
 selves will have to be extended. We plan to test out existing algorithms and
 scoring measures,
 Therefore, we anticipate
-___
+ Therefore, it is unclear which of the
+ todo
+ vs. AGEA – i wrote something on this but i’m going to rewrite it
+__________________________
 2We ran “vanilla” NNMF, whereas the paper under discussion used a modified method.
 Their main modification consisted of adding a soft spatial contiguity constraint. However,
 on our dataset, NNMF naturally produced spatially contiguous clusters, so no additional
@@ -320,9 +318,6 @@
 any more impressive than the non-hierarchial variant.
 7
 
- Therefore, it is unclear which of the
- todo
- vs. AGEA – i wrote something on this but i’m going to rewrite it
 Preliminary work
 Format conversion between SEV, MATLAB, NIFTI
 todo
@@ -353,6 +348,10 @@
 detected via pointwise analyses, consider Fig. . The top row of Fig. displays the
 3 genes which most match area AUD, according to a pointwise method5. The
 bottom row displays the 3 genes which most match AUD according to a method
+ which considers local geometry6 The pointwise method in the top row identifies
+ genes which express more strongly in AUD than outside of it; its weakness is that
+ this includes many areas which don’t have a salient border matching the areal
+ border. The geometric method identifies genes whose salient expression border
 __________________________
 3“WW, C2 and coiled-coil domain containing 1”; EntrezGene ID 211652
 4“mitochondrial translational initiation factor 2”; EntrezGene ID 76784
@@ -360,6 +359,9 @@
 surface pixel was within area AUD, and the predictor variable was the value of the expression
 of the gene underneath that pixel. The resulting scores were used to rank the genes in terms
 of how well they predict area AUD.
+ 6For each gene the gradient similarity (see section ??) between (a) a map of the expression
+of each gene on the cortical surface and (b) the shape of area AUD, was calculated, and this
+was used to rank the genes.
 8
 
 
@@ -381,10 +383,6 @@
 genes which (individually) best match area AUD, according to gradient similar-
 ity. From left to right and top to bottom, the genes are Ssr1, Efcbp1, Aph1a,
 Ptk7, Aph1a again, and Lepr
- which considers local geometry6 The pointwise method in the top row identifies
- genes which express more strongly in AUD than outside of it; its weakness is that
- this includes many areas which don’t have a salient border matching the areal
- border. The geometric method identifies genes whose salient expression border
 seems to partially line up with the border of AUD; its weakness is that this
 includes genes which don’t express over the entire area. Genes which have high
 rankings using both pointwise and border criteria, such as Aph1a in the example,
@@ -399,13 +397,6 @@
 In order to see how well one can do when looking at all genes at once, we
 ran a support vector machine to classify cortical surface pixels based on their
 gene expression profiles. We achieved classification accuracy of about 81%7.
-__________________________
- 6For each gene the gradient similarity (see section ??) between (a) a map of the expression
-of each gene on the cortical surface and (b) the shape of area AUD, was calculated, and this
-was used to rank the genes.
- 75-fold cross-validation.
- 10
-
 As noted above, however, a classifier that looks at all the genes at once isn’t
 practically useful.
 The requirement to find combinations of only a small number of genes limits
@@ -414,6 +405,10 @@
 our task combines feature selection with supervised learning.
 Decision trees
 todo
+____________________
+ 75-fold cross-validation.
+ 10
+
 Specific to Aim 2 (and Aim 3)
 Raw dimensionality reduction results
 todo
@@ -444,8 +439,6 @@
 aries are misdrawn, or because it does not “really” exist as a single area,
 at least on the genetic level. We will develop extensions to our procedure
 which (a) detect when a difficult area could be fit if its boundary were
- 11
-
 redrawn slightly, and (b) detect when a difficult area could be combined
 with adjacent areas to create a larger area which can be fit.
 Apply these algorithms to the cortex
@@ -454,6 +447,8 @@
 LAB formats.
 2. Flatmap the ABA cortex data: map the ABA data onto a plane and draw
 the cortical area boundaries onto it.
+ 11
+
 3. Find layer boundaries: cluster similar voxels together in order to auto-
 matically find the cortical layer boundaries.
 4. Run the procedures that we developed on the cortex: we will present, for
@@ -482,20 +477,20 @@
 (as is the case with the cortex). In the former case, the manifold of interest is
 a plane, but in the latter case it is curved. If the manifold is curved, there are
 various methods for mapping the manifold into a plane.
+ The method that we will develop will begin by mapping the data into a
+2-D plane. Although the manifold that characterized cortical areas is known
+to be the cortical surface, it remains to be seen which method of mapping the
+manifold into a plane is optimal for this application. We will compare mappings
+which attempt to preserve size (such as the one used by Caret??) with mappings
+which preserve angle (conformal maps).
+ Although there is much 2-D organization in anatomy, there are also struc-
+tures whose shape is fundamentally 3-dimensional. If possible, we would like
+the method we develop to include a statistical test that warns the user if the
+assumption of 2-D structure seems to be wrong.
+ if we need citations for aim 3 significance, http://www.sciencedirect.
+com/science?_ob=ArticleURL&_udi=B6WSS-4V70FHY-9&_user=4429&_coverDate=
 12
 
- The method that we will develop will begin by mapping the data into a
- 2-D plane. Although the manifold that characterized cortical areas is known
- to be the cortical surface, it remains to be seen which method of mapping the
- manifold into a plane is optimal for this application. We will compare mappings
- which attempt to preserve size (such as the one used by Caret??) with mappings
- which preserve angle (conformal maps).
- Although there is much 2-D organization in anatomy, there are also struc-
- tures whose shape is fundamentally 3-dimensional. If possible, we would like
- the method we develop to include a statistical test that warns the user if the
- assumption of 2-D structure seems to be wrong.
- if we need citations for aim 3 significance, http://www.sciencedirect.
- com/science?_ob=ArticleURL&_udi=B6WSS-4V70FHY-9&_user=4429&_coverDate=
 12%2F26%2F2008&_rdoc=1&_fmt=full&_orig=na&_cdi=7054&_docanchor=&_acct=
 C000059602&_version=1&_urlVersion=0&_userid=4429&md5=551eccc743a2bfe6e992eee0c3194203#
 app2 has examples of genetic targeting to specific anatomical regions
Binary file grant.odt has changed
Binary file grant.pdf has changed
--- a/grant.txt Mon Apr 13 03:10:37 2009 -0700
+++ b/grant.txt Mon Apr 13 03:13:10 2009 -0700
@@ -140,7 +140,7 @@
 
 There is a substantial body of work on the analysis of gene expression data, however, most of this concerns gene expression data which is not fundamentally spatial, for example, microarray datasets.
 
-As noted above, there has been much work on both supervised learning and clustering, and there are many available algorithms for each. Many of these algorithms are flexible enough to accomodate new scoring measures; and the performance of most of the algorithms is greatly affected by preprocessing and by the choice of which representation to use for feature values. We think it likely that for this application, the development of domain-specific scoring measures (such as gradient similarity, which is discussed in Preliminary Work) will be necessary in order to achieve the best results. In essence, the machine learning community has provided algorithms, but the scientist must provide a framework for representing the problem domain, and the way that this framework is set up has a large impact on performance. Creating a good framework can require creatively reconceptualizing the problem domain, and is not merely a mechanical "fine-tuning" of numerical parameters. Therefore, the completion of Aims 1 and 2 involves more than just reimplementing an existing algorithm, and more than just choosing between a set of existing algorithms, and will constitute a substantial contribution to biology.
+As noted above, there has been much work on both supervised learning and clustering, and there are many available algorithms for each. However, the completion of Aims 1 and 2 involves more than just choosing between a set of existing algorithms, and will constitute a substantial contribution to biology. The algorithms require the scientist to provide a framework for representing the problem domain, and the way that this framework is set up has a large impact on performance. Creating a good framework can require creatively reconceptualizing the problem domain, and is not merely a mechanical "fine-tuning" of numerical parameters. For example, we believe that domain-specific scoring measures (such as gradient similarity, which is discussed in Preliminary Work) may be necessary in order to achieve the best results in this application.
 
 We are aware of two existing efforts to relate spatial gene expression data to anatomy through computational methods.
 
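Note on the changed text: the grant text in this changeset contrasts a pointwise ranking of genes with one that considers local geometry. The sketch below is only an illustration of that contrast on toy data. Everything in it is an assumption: the function names are hypothetical, pointwise_score is a plain mean difference rather than the per-pixel regression described in the footnote, and border_score is a normalized inner product of gradient fields used as a stand-in, since the definition of gradient similarity is not given in this changeset.

```python
import numpy as np

def pointwise_score(expr, mask):
    """Mean expression inside the area minus mean expression outside it."""
    return float(expr[mask].mean() - expr[~mask].mean())

def border_score(expr, mask):
    """Normalized inner product between the gradient field of the expression
    map and the gradient field of the area mask (hypothetical stand-in)."""
    ge = np.stack(np.gradient(expr.astype(float)))
    gm = np.stack(np.gradient(mask.astype(float)))
    denom = np.linalg.norm(ge) * np.linalg.norm(gm)
    return float((ge * gm).sum() / denom) if denom > 0 else 0.0

def rank_genes(genes, mask, score):
    """Return gene names sorted from best to worst under the given score."""
    return sorted(genes, key=lambda name: score(genes[name], mask), reverse=True)

# Toy data: a 20x20 "surface" with a square area in the middle.
mask = np.zeros((20, 20), dtype=bool)
mask[5:15, 5:15] = True

half = np.zeros((20, 20))
half[5:15, 5:10] = 1.0  # expressed only in the left half of the area

genes = {
    "fills_area": mask.astype(float),            # expressed exactly in the area
    "half_area": half,                           # border partly lines up
    "ramp": np.tile(np.arange(20.0), (20, 1)),   # smooth left-to-right ramp
}

print("pointwise:", rank_genes(genes, mask, pointwise_score))
print("border:   ", rank_genes(genes, mask, border_score))
```

In this toy setup, the half_area gene shows the weakness the text attributes to the geometric method: part of its expression border lines up with the area border even though it does not express over the entire area.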