cg
changeset 20:c2609c6e7736
.
author | bshanks@bshanks.dyndns.org |
---|---|
date | Mon Apr 13 03:07:26 2009 -0700 |
parents | 717d4025b861 |
children | b9643c30e352 |
files | grant.html grant.odt grant.pdf grant.txt |
line diff
1.1 --- a/grant.html Sun Apr 12 15:35:00 2009 -0700
1.2 +++ b/grant.html Mon Apr 13 03:07:26 2009 -0700
1.3 @@ -297,9 +297,28 @@
1.4 and 2 involves more than just reimplementing an existing algorithm, and more
1.5 than just choosing between a set of existing algorithms, and will constitute a
1.6 substantial contribution to biology.
1.7 - We are aware of one other effort to computationally analyze spatial gene
1.8 - expression data.
1.9 + We are aware of two existing efforts to relate spatial gene expression data to
1.10 + anatomy through computational methods.
1.11 + [?] describes an analysis of the anatomy of the hippocampus using the ABA
1.12 + dataset. In addition to manual analysis, two clustering methods were employed,
1.13 + a modified Non-negative Matrix Factorization (NNMF), and a hierarchical
1.14 + bifurcation clustering scheme based on correlation as the similarity score. The paper
1.15 + yielded impressive results, proving the usefulness of such research. We have run
1.16 + NNMF on the cortical dataset and while the results are promising (see Prelim-
1.17 + inary Data), we think that it will be possible to find a better method2 (we also
1.18 + think that more automation of the parts that this paper’s authors did manually
1.19 + will be possible).
1.20 + and [?] describes AGEA. todo
1.21 In the Preliminary Work, we show that
1.22 +__________________________
1.23 + 2We ran “vanilla” NNMF, whereas the paper under discussion used a modified method.
1.24 +Their main modification consisted of adding a soft spatial contiguity constraint. However,
1.25 +on our dataset, NNMF naturally produced spatially contiguous clusters, so no additional
1.26 +constraint was needed. The paper under discussion mentions that they also tried a hierarchical
1.27 +variant of NNMF, but since they didn’t report its results, we assume that the results were not
1.28 +any more impressive than the non-hierarchical variant.
1.29 + 7
1.30 +
1.31 The creation of a domain-specific scoring measure may be required in order
1.32 to achieve good performance, and it is not impossible that the algorithms them-
1.33 selves will have to be extended. We plan to test out existing algorithms and
1.34 @@ -311,22 +330,20 @@
1.35 Preliminary work
1.36 Format conversion between SEV, MATLAB, NIFTI
1.37 todo
1.38 - 7
1.39 -
1.40 Flatmap of cortex
1.41 todo
1.42 Using combinations of multiple genes is necessary and sufficient to
1.43 delineate some cortical areas
1.44 Here we give an example of a cortical area which is not marked by any
1.45 single gene, but which can be identified combinatorially. according to logistic
1.46 - regression, gene wwc12 is the best fit single gene for predicting whether or not a
1.47 + regression, gene wwc13 is the best fit single gene for predicting whether or not a
1.48 pixel on the cortical surface belongs to the motor area (area MO). The upper-left
1.49 picture in Figure shows wwc1’s spatial expression pattern over the cortex. The
1.50 lower-right boundary of MO is represented reasonably well by this gene, however
1.51 the gene overshoots the upper-left boundary. This flattened 2-D representation
1.52 does not show it, but the area corresponding to the overshoot is the medial
1.53 surface of the cortex. MO is only found on the lateral surface (todo).
1.54 - Gnee mtif23 is shown in figure the upper-right of Fig. . Mtif2 captures MO’s
1.55 + Gene mtif24 is shown in the upper-right of Fig. . Mtif2 captures MO’s
1.56 upper-left boundary, but not its lower-right boundary. Mtif2 does not express
1.57 very much on the medial surface. By adding together the values at each pixel
1.58 in these two figures, we get the lower-left of Figure . This combination captures
1.59 @@ -338,28 +355,9 @@
1.60 information
1.61 To show that local geometry can provide useful information that cannot be
1.62 detected via pointwise analyses, consider Fig. . The top row of Fig. displays the
1.63 - 3 genes which most match area AUD, according to a pointwise method4. The
1.64 - bottom row displays the 3 genes which most match AUD according to a method
1.65 - which considers local geometry5 The pointwise method in the top row identifies
1.66 - genes which express more strongly in AUD than outside of it; its weakness is that
1.67 - this includes many areas which don’t have a salient border matching the areal
1.68 - border. The geometric method identifies genes whose salient expression border
1.69 - seems to partially line up with the border of AUD; its weakness is that this
1.70 - includes genes which don’t express over the entire area. Genes which have high
1.71 - rankings using both pointwise and border criteria, such as Aph1a in the example,
1.72 - may be particularly good markers. None of these genes are, individually, a
1.73 - perfect marker for AUD; we deliberately chose a “difficult” area in order to
1.74 - better contrast pointwise with geometric methods.
1.75 __________________________
1.76 - 2“WW, C2 and coiled-coil domain containing 1”; EntrezGene ID 211652
1.77 - 3“mitochondrial translational initiation factor 2”; EntrezGene ID 76784
1.78 - 4For each gene, a logistic regression in which the response variable was whether or not a
1.79 -surface pixel was within area AUD, and the predictor variable was the value of the expression
1.80 -of the gene underneath that pixel. The resulting scores were used to rank the genes in terms
1.81 -of how well they predict area AUD.
1.82 - 5For each gene the gradient similarity (see section ??) between (a) a map of the expression
1.83 -of each gene on the cortical surface and (b) the shape of area AUD, was calculated, and this
1.84 -was used to rank the genes.
1.85 + 3“WW, C2 and coiled-coil domain containing 1”; EntrezGene ID 211652
1.86 + 4“mitochondrial translational initiation factor 2”; EntrezGene ID 76784
1.87 8
1.88
1.89
1.90 @@ -372,6 +370,8 @@
1.91 the boundary of region MO. Pixels are colored approximately according to the
1.92 density of expressing cells underneath each pixel, with red meaning a lot of
1.93 expression and blue meaning little.
1.94 + 9
1.95 +
1.96
1.97
1.98 Figure 2: The top row shows the three genes which (individually) best predict
1.99 @@ -379,16 +379,36 @@
1.100 genes which (individually) best match area AUD, according to gradient similar-
1.101 ity. From left to right and top to bottom, the genes are Ssr1, Efcbp1, Aph1a,
1.102 Ptk7, Aph1a again, and Lepr
1.103 - 9
1.104 -
1.105 + 3 genes which most match area AUD, according to a pointwise method5. The
1.106 + bottom row displays the 3 genes which most match AUD according to a method
1.107 + which considers local geometry6 The pointwise method in the top row identifies
1.108 + genes which express more strongly in AUD than outside of it; its weakness is that
1.109 + this includes many areas which don’t have a salient border matching the areal
1.110 + border. The geometric method identifies genes whose salient expression border
1.111 + seems to partially line up with the border of AUD; its weakness is that this
1.112 + includes genes which don’t express over the entire area. Genes which have high
1.113 + rankings using both pointwise and border criteria, such as Aph1a in the example,
1.114 + may be particularly good markers. None of these genes are, individually, a
1.115 + perfect marker for AUD; we deliberately chose a “difficult” area in order to
1.116 + better contrast pointwise with geometric methods.
1.117 Areas which can be identified by single genes
1.118 todo
1.119 Specific to Aim 1 (and Aim 3)
1.120 Forward stepwise logistic regression todo
1.121 SVM on all genes at once
1.122 +__________________________
1.123 + 5For each gene, a logistic regression in which the response variable was whether or not a
1.124 +surface pixel was within area AUD, and the predictor variable was the value of the expression
1.125 +of the gene underneath that pixel. The resulting scores were used to rank the genes in terms
1.126 +of how well they predict area AUD.
1.127 + 6For each gene the gradient similarity (see section ??) between (a) a map of the expression
1.128 +of each gene on the cortical surface and (b) the shape of area AUD, was calculated, and this
1.129 +was used to rank the genes.
1.130 + 10
1.131 +
1.132 In order to see how well one can do when looking at all genes at once, we
1.133 ran a support vector machine to classify cortical surface pixels based on their
1.134 - gene expression profiles. We achieved classification accuracy of about 81%6.
1.135 + gene expression profiles. We achieved classification accuracy of about 81%7.
1.136 As noted above, however, a classifier that looks at all the genes at once isn’t
1.137 practically useful.
1.138 The requirement to find combinations of only a small number of genes limits
1.139 @@ -399,6 +419,8 @@
1.140 todo
1.141 Specific to Aim 2 (and Aim 3)
1.142 Raw dimensionality reduction results
1.143 + todo
1.144 + (might want to include NNMF since it is mentioned above)
1.145 Dimensionality reduction plus K-means or spectral clustering
1.146 Many areas are captured by clusters of genes
1.147 todo
1.148 @@ -417,13 +439,13 @@
1.149 ing: for areas that cannot be identified by any single gene, identify them
1.150 with a handful of genes. We will consider both (a) algorithms that incre-
1.151 mentally/greedily combine single gene markers into sets, such as forward
1.152 -__________________________
1.153 - 65-fold cross-validation.
1.154 - 10
1.155 -
1.156 stepwise regression and decision trees, and also (b) supervised learning
1.157 techniques which use soft constraints to minimize the number of features,
1.158 such as sparse support vector machines.
1.159 +__________________________
1.160 + 75-fold cross-validation.
1.161 + 11
1.162 +
1.163 4. Extend the procedure to handle difficult areas by combining or redrawing
1.164 the boundaries: An area may be difficult to identify because the bound-
1.165 aries are misdrawn, or because it does not “really” exist as a single area,
1.166 @@ -456,27 +478,34 @@
1.167 clustering to create anatomical maps
1.168 6. Run this algorithm on the cortex: present a hierarchial, genoarchitectonic
1.169 map of the cortex
1.170 - 11
1.171 -
1.172 - _______________________________________________________________________________________________________ stuff i dunno where to put yet (there is more scattered through grant-
1.173 +______________________________________________
1.174 + stuff i dunno where to put yet (there is more scattered through grant-
1.175 oldtext):
1.176 Principle 4: Work in 2-D whenever possible
1.177 - In anatomy, the manifold of interest is usually either defined by a combina-
1.178 -tion of two relevant anatomical axes (todo), or by the surface of the structure
1.179 -(as is the case with the cortex). In the former case, the manifold of interest is
1.180 -a plane, but in the latter case it is curved. If the manifold is curved, there are
1.181 -various methods for mapping the manifold into a plane.
1.182 - The method that we will develop will begin by mapping the data into a
1.183 -2-D plane. Although the manifold that characterized cortical areas is known
1.184 -to be the cortical surface, it remains to be seen which method of mapping the
1.185 -manifold into a plane is optimal for this application. We will compare mappings
1.186 -which attempt to preserve size (such as the one used by Caret??) with mappings
1.187 -which preserve angle (conformal maps).
1.188 - Although there is much 2-D organization in anatomy, there are also struc-
1.189 -tures whose shape is fundamentally 3-dimensional. If possible, we would like
1.190 -the method we develop to include a statistical test that warns the user if the
1.191 -assumption of 2-D structure seems to be wrong.
1.192 - todo: replace aim # bullet pts with #s
1.193 12
1.194
1.195 -
1.196 + In anatomy, the manifold of interest is usually either defined by a combina-
1.197 + tion of two relevant anatomical axes (todo), or by the surface of the structure
1.198 + (as is the case with the cortex). In the former case, the manifold of interest is
1.199 + a plane, but in the latter case it is curved. If the manifold is curved, there are
1.200 + various methods for mapping the manifold into a plane.
1.201 + The method that we will develop will begin by mapping the data into a
1.202 + 2-D plane. Although the manifold that characterized cortical areas is known
1.203 + to be the cortical surface, it remains to be seen which method of mapping the
1.204 + manifold into a plane is optimal for this application. We will compare mappings
1.205 + which attempt to preserve size (such as the one used by Caret??) with mappings
1.206 + which preserve angle (conformal maps).
1.207 + Although there is much 2-D organization in anatomy, there are also struc-
1.208 + tures whose shape is fundamentally 3-dimensional. If possible, we would like
1.209 + the method we develop to include a statistical test that warns the user if the
1.210 + assumption of 2-D structure seems to be wrong.
1.211 + if we need citations for aim 3 significance, http://www.sciencedirect.
1.212 + com/science?_ob=ArticleURL&_udi=B6WSS-4V70FHY-9&_user=4429&_coverDate=
1.213 + 12%2F26%2F2008&_rdoc=1&_fmt=full&_orig=na&_cdi=7054&_docanchor=&_acct=
1.214 + C000059602&_version=1&_urlVersion=0&_userid=4429&md5=551eccc743a2bfe6e992eee0c3194203#
1.215 + app2 has examples of genetic targeting to specific anatomical regions
1.216 + —
1.217 + note:
1.218 + 13
1.219 +
1.220 +
2.1 Binary file grant.odt has changed
3.1 Binary file grant.pdf has changed
4.1 --- a/grant.txt Sun Apr 12 15:35:00 2009 -0700
4.2 +++ b/grant.txt Mon Apr 13 03:07:26 2009 -0700
4.3 @@ -142,7 +142,12 @@
4.4
4.5 As noted above, there has been much work on both supervised learning and clustering, and there are many available algorithms for each. Many of these algorithms are flexible enough to accomodate new scoring measures; and the performance of most of the algorithms is greatly affected by preprocessing and by the choice of which representation to use for feature values. We think it likely that for this application, the development of domain-specific scoring measures (such as gradient similarity, which is discussed in Preliminary Work) will be necessary in order to achieve the best results. In essence, the machine learning community has provided algorithms, but the scientist must provide a framework for representing the problem domain, and the way that this framework is set up has a large impact on performance. Creating a good framework can require creatively reconceptualizing the problem domain, and is not merely a mechanical "fine-tuning" of numerical parameters. Therefore, the completion of Aims 1 and 2 involves more than just reimplementing an existing algorithm, and more than just choosing between a set of existing algorithms, and will constitute a substantial contribution to biology.
4.6
4.7 -We are aware of one other effort to computationally analyze spatial gene expression data.
4.8 +We are aware of two existing efforts to relate spatial gene expression data to anatomy through computational methods.
4.9 +
4.10 +\cite{thompson_genomic_2008} describes an analysis of the anatomy of the hippocampus using the ABA dataset. In addition to manual analysis, two clustering methods were employed, a modified Non-negative Matrix Factorization (NNMF), and a hierarchical bifurcation clustering scheme based on correlation as the similarity score. The paper yielded impressive results, proving the usefulness of such research. We have run NNMF on the cortical dataset and while the results are promising (see Preliminary Data), we think that it will be possible to find a better method\footnote{We ran "vanilla" NNMF, whereas the paper under discussion used a modified method. Their main modification consisted of adding a soft spatial contiguity constraint. However, on our dataset, NNMF naturally produced spatially contiguous clusters, so no additional constraint was needed. The paper under discussion mentions that they also tried a hierarchical variant of NNMF, but since they didn't report its results, we assume that the results were not any more impressive than the non-hierarchical variant.} (we also think that more automation of the parts that this paper's authors did manually will be possible).
4.11 +
4.12 +
4.13 + and \cite{ng_anatomic_2009} describes AGEA. todo
4.14
4.15
4.16 In the Preliminary Work, we show that
4.17 @@ -237,6 +242,9 @@
4.18
4.19 **Raw dimensionality reduction results**
4.20
4.21 +todo
4.22 +
4.23 +(might want to include NNMF since it is mentioned above)
4.24
4.25 **Dimensionality reduction plus K-means or spectral clustering**
4.26
4.27 @@ -309,4 +317,8 @@
4.28
4.29
4.30
4.31 -todo: replace aim # bullet pts with #s
4.32 +if we need citations for aim 3 significance, http://www.sciencedirect.com/science?_ob=ArticleURL&_udi=B6WSS-4V70FHY-9&_user=4429&_coverDate=12%2F26%2F2008&_rdoc=1&_fmt=full&_orig=na&_cdi=7054&_docanchor=&_acct=C000059602&_version=1&_urlVersion=0&_userid=4429&md5=551eccc743a2bfe6e992eee0c3194203#app2 has examples of genetic targeting to specific anatomical regions
4.33 +
4.34 +---
4.35 +
4.36 +note:
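
The pointwise method described in the grant text of this changeset (one logistic regression per gene, with pixel membership in an area as the response and that gene's expression as the single predictor, followed by ranking the genes on the resulting scores) could be sketched roughly as follows. This is only an illustration under stated assumptions: the gene names `gene_a`, `gene_b`, `gene_c` and their values are hypothetical toy data, the fit uses plain gradient ascent, and the mean log-likelihood is an assumed ranking score, since the text does not say which score was used.

```python
import math


def _sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic_1d(x, y, lr=0.5, steps=2000):
    """Fit p(y=1 | x) = sigmoid(w*x + b) by batch gradient ascent
    on the mean log-likelihood (assumed fitting procedure)."""
    w, b = 0.0, 0.0
    n = len(x)
    for _ in range(steps):
        grad_w = grad_b = 0.0
        for xi, yi in zip(x, y):
            err = yi - _sigmoid(w * xi + b)
            grad_w += err * xi
            grad_b += err
        w += lr * grad_w / n
        b += lr * grad_b / n
    return w, b


def mean_log_likelihood(x, y, w, b):
    # Assumed score: higher means the gene predicts the area better.
    eps = 1e-12
    total = 0.0
    for xi, yi in zip(x, y):
        p = min(max(_sigmoid(w * xi + b), eps), 1.0 - eps)
        total += math.log(p) if yi else math.log(1.0 - p)
    return total / len(x)


def rank_genes(expression, in_area):
    """expression: dict of gene name -> per-pixel expression values.
    in_area: per-pixel 0/1 labels (1 = pixel lies inside the area).
    Returns gene names sorted from best to worst single-gene predictor."""
    scores = {}
    for gene, values in expression.items():
        w, b = fit_logistic_1d(values, in_area)
        scores[gene] = mean_log_likelihood(values, in_area, w, b)
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical toy data: 8 surface pixels, the first 4 inside the area.
in_area = [1, 1, 1, 1, 0, 0, 0, 0]
expression = {
    "gene_a": [0.9, 0.8, 0.85, 0.95, 0.1, 0.2, 0.15, 0.05],  # tracks the area
    "gene_b": [0.9, 0.2, 0.8, 0.1, 0.7, 0.3, 0.9, 0.2],      # roughly unrelated
    "gene_c": [0.7, 0.6, 0.4, 0.8, 0.5, 0.3, 0.6, 0.2],      # partial match
}
ranking = rank_genes(expression, in_area)
print(ranking)
```

On real data the per-pixel values would come from the flattened cortical expression maps, and a library routine would likely replace the hand-written fit.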