diff grant.txt @ 19:717d4025b861
author | bshanks@bshanks.dyndns.org |
---|---|
date | Sun Apr 12 15:35:00 2009 -0700 |
parents | ff9b47f2c7d3 |
children | c2609c6e7736 |
line diff
1.1 --- a/grant.txt Sun Apr 12 04:01:58 2009 -0700
1.2 +++ b/grant.txt Sun Apr 12 15:35:00 2009 -0700
1.3 @@ -79,8 +79,7 @@
1.4
1.5 **Similarity scores**
1.6
1.7 -
1.8 -todo
1.9 +A crucial choice when designing a clustering method is how to measure similarity, across either pairs of instances, or clusters, or both. There is much overlap between scoring methods for feature selection (discussed above under Aim 1) and scoring methods for similarity.
1.10
1.11
1.12 **Spatially contiguous clusters; image segmentation**
1.13 @@ -137,6 +136,23 @@
1.14
1.15
1.16 === Related work ===
1.17 +There does not appear to be much work on the automated analysis of spatial gene expression data.
1.18 +
1.19 +There is a substantial body of work on the analysis of gene expression data; however, most of this concerns gene expression data which is not fundamentally spatial, for example, microarray datasets. In some cases, a few locations have been sampled, but such a dataset is still of a fundamentally different character than a dataset containing a large grid of sampling points distributed over space. In relating gene expression to anatomy, it is the spatial aspects of the problem which are the most important.
1.20 +
1.21 +As noted above, there has been much work on both supervised learning and clustering, and there are many available algorithms for each. Many of these algorithms are flexible enough to accommodate new scoring measures; and the performance of most of the algorithms is greatly affected by preprocessing and by the choice of which representation to use for feature values. We think it likely that for this application, the development of domain-specific scoring measures (such as gradient similarity, which is discussed in Preliminary Work) will be necessary in order to achieve the best results. In essence, the machine learning community has provided algorithms, but the scientist must provide a framework for representing the problem domain, and the way that this framework is set up has a large impact on performance. Creating a good framework can require creatively reconceptualizing the problem domain, and is not merely a mechanical "fine-tuning" of numerical parameters. Therefore, the completion of Aims 1 and 2 involves more than just reimplementing an existing algorithm, and more than just choosing between a set of existing algorithms, and will constitute a substantial contribution to biology.
1.22 +
1.23 +We are aware of one other effort to computationally analyze spatial gene expression data.
1.24 +
1.25 +
1.26 +In the Preliminary Work, we show that
1.27 +
1.28 +The creation of a domain-specific scoring measure may be required in order to achieve good performance, and it is not impossible that the algorithms themselves will have to be extended. We plan to test out existing algorithms and scoring measures,
1.29 +
1.30 +Therefore, we anticipate
1.31 +
1.32 +Therefore, it is unclear which of the
1.33 +
1.34 todo
1.35
1.36 vs. AGEA -- i wrote something on this but i'm going to rewrite it
1.37 @@ -199,14 +215,14 @@
1.38 todo
1.39
1.40
1.41 -=== Aim 1 (and Aim 3) ===
1.42 +=== Specific to Aim 1 (and Aim 3) ===
1.43 **Forward stepwise logistic regression**
1.44 todo
1.45
1.46
1.47 **SVM on all genes at once**
1.48
1.49 -In order to see how well one can do when looking at all genes at once, we ran a support vector machine to classify cortical surface pixels based on their gene expression profiles. We achieved classification accuracy of about 81%\footnote{Using the Shogun SVM package (todo:cite), with parameters type=GMNPSVM (multiclass b-SVM), kernal = gaussian with sigma = 0.1, c = 10, epsilon = 1e-1 -- these are the first parameters we tried, so presumably performance would improve with different choices of parameters. 5-fold cross-validation.}. As noted above, however, a classifier that looks at all the genes at once isn't practically useful.
1.50 +In order to see how well one can do when looking at all genes at once, we ran a support vector machine to classify cortical surface pixels based on their gene expression profiles. We achieved classification accuracy of about 81%\footnote{5-fold cross-validation.}. As noted above, however, a classifier that looks at all the genes at once isn't practically useful.
1.51
1.52 The requirement to find combinations of only a small number of genes limits us from straightforwardly applying many of the most simple techniques from the field of supervised machine learning. In the parlance of machine learning, our task combines feature selection with supervised learning.
1.53
1.54 @@ -217,12 +233,12 @@
1.55 todo
1.56
1.57
1.58 -=== Aim 2 (and Aim 3) ===
1.59 -
1.60 -=== Raw dimensionality reduction results ===
1.61 -
1.62 -
1.63 -=== Dimensionality reduction plus K-means or spectral clustering ===
1.64 +=== Specific to Aim 2 (and Aim 3) ===
1.65 +
1.66 +**Raw dimensionality reduction results**
1.67 +
1.68 +
1.69 +**Dimensionality reduction plus K-means or spectral clustering**
1.70
1.71
1.72
1.73 @@ -244,9 +260,7 @@
1.74
1.75 == Research plan ==
1.76
1.77 -todo
1.78 -
1.79 -amongst other thigns:
1.80 +todo amongst other things:
1.81
1.82
1.83 **Develop algorithms that find genetic markers for anatomical regions**