cg
changeset 21:b9643c30e352
.
field | value |
---|---|
author | bshanks@bshanks.dyndns.org |
date | Mon Apr 13 03:10:37 2009 -0700 |
parents | c2609c6e7736 |
children | 69aa7c47c0e5 |
files | grant.doc grant.html grant.odt grant.pdf grant.txt |
line diff
1.1 Binary file grant.doc has changed
2.1 --- a/grant.html Mon Apr 13 03:07:26 2009 -0700
2.2 +++ b/grant.html Mon Apr 13 03:10:37 2009 -0700
2.3 @@ -276,11 +276,7 @@
2.4 gene expression data.
2.5 There is a substantial body of work on the analysis of gene expression data,
2.6 however, most of this concerns gene expression data which is not fundamentally
2.7 - spatial, for example, microarray datasets. In some cases, a few locations have
2.8 - been sampled, but such a dataset is still of a fundamentally different character
2.9 - than a dataset containing a large grid of sampling points distributed over space.
2.10 - In relating gene expression to anatomy, it is the spatial aspects of the problem
2.11 - which are the most important.
2.12 + spatial, for example, microarray datasets.
2.13 As noted above, there has been much work on both supervised learning and
2.14 clustering, and there are many available algorithms for each. Many of these
2.15 algorithms are flexible enough to accommodate new scoring measures; and the
2.16 @@ -310,7 +306,12 @@
2.17 will be possible).
2.18 and [?] describes AGEA. todo
2.19 In the Preliminary Work, we show that
2.20 -__________________________
2.21 + The creation of a domain-specific scoring measure may be required in order
2.22 + to achieve good performance, and it is not impossible that the algorithms them-
2.23 + selves will have to be extended. We plan to test out existing algorithms and
2.24 + scoring measures,
2.25 + Therefore, we anticipate
2.26 +___
2.27 2We ran “vanilla” NNMF, whereas the paper under discussion used a modified method.
2.28 Their main modification consisted of adding a soft spatial contiguity constraint. However,
2.29 on our dataset, NNMF naturally produced spatially contiguous clusters, so no additional
2.30 @@ -319,11 +320,6 @@
2.31 any more impressive than the non-hierarchical variant.
2.32 7
2.33
2.34 - The creation of a domain-specific scoring measure may be required in order
2.35 - to achieve good performance, and it is not impossible that the algorithms them-
2.36 - selves will have to be extended. We plan to test out existing algorithms and
2.37 - scoring measures,
2.38 - Therefore, we anticipate
2.39 Therefore, it is unclear which of the
2.40 todo
2.41 vs. AGEA – i wrote something on this but i’m going to rewrite it
2.42 @@ -355,9 +351,15 @@
2.43 information
2.44 To show that local geometry can provide useful information that cannot be
2.45 detected via pointwise analyses, consider Fig. . The top row of Fig. displays the
2.46 + 3 genes which most match area AUD, according to a pointwise method5. The
2.47 + bottom row displays the 3 genes which most match AUD according to a method
2.48 __________________________
2.49 3“WW, C2 and coiled-coil domain containing 1”; EntrezGene ID 211652
2.50 4“mitochondrial translational initiation factor 2”; EntrezGene ID 76784
2.51 + 5For each gene, a logistic regression in which the response variable was whether or not a
2.52 +surface pixel was within area AUD, and the predictor variable was the value of the expression
2.53 +of the gene underneath that pixel. The resulting scores were used to rank the genes in terms
2.54 +of how well they predict area AUD.
2.55 8
2.56
2.57
2.58 @@ -379,8 +381,6 @@
2.59 genes which (individually) best match area AUD, according to gradient similar-
2.60 ity. From left to right and top to bottom, the genes are Ssr1, Efcbp1, Aph1a,
2.61 Ptk7, Aph1a again, and Lepr
2.62 - 3 genes which most match area AUD, according to a pointwise method5. The
2.63 - bottom row displays the 3 genes which most match AUD according to a method
2.64 which considers local geometry6 The pointwise method in the top row identifies
2.65 genes which express more strongly in AUD than outside of it; its weakness is that
2.66 this includes many areas which don’t have a salient border matching the areal
2.67 @@ -396,19 +396,16 @@
2.68 Specific to Aim 1 (and Aim 3)
2.69 Forward stepwise logistic regression todo
2.70 SVM on all genes at once
2.71 -__________________________
2.72 - 5For each gene, a logistic regression in which the response variable was whether or not a
2.73 -surface pixel was within area AUD, and the predictor variable was the value of the expression
2.74 -of the gene underneath that pixel. The resulting scores were used to rank the genes in terms
2.75 -of how well they predict area AUD.
2.76 - 6For each gene the gradient similarity (see section ??) between (a) a map of the expression
2.77 -of each gene on the cortical surface and (b) the shape of area AUD, was calculated, and this
2.78 -was used to rank the genes.
2.79 - 10
2.80 -
2.81 In order to see how well one can do when looking at all genes at once, we
2.82 ran a support vector machine to classify cortical surface pixels based on their
2.83 gene expression profiles. We achieved classification accuracy of about 81%7.
2.84 +__________________________
2.85 + 6For each gene the gradient similarity (see section ??) between (a) a map of the expression
2.86 +of each gene on the cortical surface and (b) the shape of area AUD, was calculated, and this
2.87 +was used to rank the genes.
2.88 + 7 5-fold cross-validation.
2.89 + 10
2.90 +
2.91 As noted above, however, a classifier that looks at all the genes at once isn’t
2.92 practically useful.
2.93 The requirement to find combinations of only a small number of genes limits
2.94 @@ -442,15 +439,13 @@
2.95 stepwise regression and decision trees, and also (b) supervised learning
2.96 techniques which use soft constraints to minimize the number of features,
2.97 such as sparse support vector machines.
2.98 -__________________________
2.99 - 7 5-fold cross-validation.
2.100 - 11
2.101 -
2.102 4. Extend the procedure to handle difficult areas by combining or redrawing
2.103 the boundaries: An area may be difficult to identify because the bound-
2.104 aries are misdrawn, or because it does not “really” exist as a single area,
2.105 at least on the genetic level. We will develop extensions to our procedure
2.106 which (a) detect when a difficult area could be fit if its boundary were
2.107 + 11
2.108 +
2.109 redrawn slightly, and (b) detect when a difficult area could be combined
2.110 with adjacent areas to create a larger area which can be fit.
2.111 Apply these algorithms to the cortex
2.112 @@ -482,13 +477,13 @@
2.113 stuff i dunno where to put yet (there is more scattered through grant-
2.114 oldtext):
2.115 Principle 4: Work in 2-D whenever possible
2.116 + In anatomy, the manifold of interest is usually either defined by a combina-
2.117 +tion of two relevant anatomical axes (todo), or by the surface of the structure
2.118 +(as is the case with the cortex). In the former case, the manifold of interest is
2.119 +a plane, but in the latter case it is curved. If the manifold is curved, there are
2.120 +various methods for mapping the manifold into a plane.
2.121 12
2.122
2.123 - In anatomy, the manifold of interest is usually either defined by a combina-
2.124 - tion of two relevant anatomical axes (todo), or by the surface of the structure
2.125 - (as is the case with the cortex). In the former case, the manifold of interest is
2.126 - a plane, but in the latter case it is curved. If the manifold is curved, there are
2.127 - various methods for mapping the manifold into a plane.
2.128 The method that we will develop will begin by mapping the data into a
2.129 2-D plane. Although the manifold that characterized cortical areas is known
2.130 to be the cortical surface, it remains to be seen which method of mapping the
3.1 Binary file grant.odt has changed
4.1 Binary file grant.pdf has changed
5.1 --- a/grant.txt Mon Apr 13 03:07:26 2009 -0700
5.2 +++ b/grant.txt Mon Apr 13 03:10:37 2009 -0700
5.3 @@ -138,7 +138,7 @@
5.4 === Related work ===
5.5 There does not appear to be much work on the automated analysis of spatial gene expression data.
5.6
5.7 -There is a substantial body of work on the analysis of gene expression data, however, most of this concerns gene expression data which is not fundamentally spatial, for example, microarray datasets. In some cases, a few locations have been sampled, but such a dataset is still of a fundamentally different character than a dataset containing a large grid of sampling points distributed over space. In relating gene expression to anatomy, it is the spatial aspects of the problem which are the most important.
5.8 +There is a substantial body of work on the analysis of gene expression data, however, most of this concerns gene expression data which is not fundamentally spatial, for example, microarray datasets.
5.9
5.10 As noted above, there has been much work on both supervised learning and clustering, and there are many available algorithms for each. Many of these algorithms are flexible enough to accommodate new scoring measures; and the performance of most of the algorithms is greatly affected by preprocessing and by the choice of which representation to use for feature values. We think it likely that for this application, the development of domain-specific scoring measures (such as gradient similarity, which is discussed in Preliminary Work) will be necessary in order to achieve the best results. In essence, the machine learning community has provided algorithms, but the scientist must provide a framework for representing the problem domain, and the way that this framework is set up has a large impact on performance. Creating a good framework can require creatively reconceptualizing the problem domain, and is not merely a mechanical "fine-tuning" of numerical parameters. Therefore, the completion of Aims 1 and 2 involves more than just reimplementing an existing algorithm, and more than just choosing between a set of existing algorithms, and will constitute a substantial contribution to biology.
5.11
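
Note on the pointwise method: footnote 5 in the grant.html hunks above describes ranking genes by fitting, for each gene, a logistic regression whose response is whether a surface pixel lies within area AUD and whose predictor is that gene's expression under the pixel. A minimal sketch of that idea follows. It is not the author's code: the function names, the toy data, the gradient-ascent fit, and the choice of mean log-likelihood as the ranking score are all assumptions, since the changeset does not say how the regressions were fit or scored.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(xs, ys, steps=2000, lr=0.5):
    """Fit P(y=1 | x) = sigmoid(w*x + b) by gradient ascent on the mean log-likelihood."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            err = y - sigmoid(w * x + b)
            grad_w += err * x
            grad_b += err
        w += lr * grad_w / n
        b += lr * grad_b / n
    return w, b


def mean_log_likelihood(xs, ys, w, b):
    """Score a fitted model; higher (closer to 0) means a better pointwise fit."""
    eps = 1e-12
    total = 0.0
    for x, y in zip(xs, ys):
        p = min(max(sigmoid(w * x + b), eps), 1.0 - eps)
        total += math.log(p) if y == 1 else math.log(1.0 - p)
    return total / len(xs)


def rank_genes_pointwise(expression, in_area):
    """Return gene names sorted from best to worst single-gene predictor of the area."""
    scores = {}
    for gene, xs in expression.items():
        w, b = fit_logistic(xs, in_area)
        scores[gene] = mean_log_likelihood(xs, in_area, w, b)
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical toy data: 8 surface pixels, the first 3 inside the target area.
in_area = [1, 1, 1, 0, 0, 0, 0, 0]
expression = {
    "gene_a": [0.90, 0.80, 0.85, 0.10, 0.20, 0.15, 0.10, 0.05],  # high inside the area
    "gene_b": [0.50, 0.10, 0.90, 0.40, 0.80, 0.20, 0.60, 0.30],  # unrelated to the area
    "gene_c": [0.60, 0.50, 0.70, 0.40, 0.50, 0.30, 0.45, 0.35],  # weakly related
}
ranking = rank_genes_pointwise(expression, in_area)
print(ranking)
```

The fit is written out by hand only to keep the sketch free of dependencies; in practice a statistics library would replace `fit_logistic`. As the grant text itself notes, a score of this kind rewards genes that express more strongly inside the area than outside it, without checking whether the expression boundary matches the areal boundary.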