# HG changeset patch
# User bshanks@bshanks.dyndns.org
# Date 1239575652 25200
# Node ID 5d6dfc57654a2440ab9babca070d729621fc218b
# Parent ff9b47f2c7d3b7cb345ad4b19cdb03f49e9cfa61
.
--- a/grant.html Sun Apr 12 04:01:58 2009 -0700
+++ b/grant.html Sun Apr 12 15:34:12 2009 -0700
@@ -145,7 +145,10 @@
 outcome of clustering may be a hierarchial tree of clusters, rather than a single
 set of clusters which partition the voxels. This is called hierarchial clustering.
 Similarity scores
- todo
+ A crucial choice when designing a clustering method is how to measure
+ similarity, across either pairs of instances, or clusters, or both. There is much
+ overlap between scoring methods for feature selection (discussed above under
+ Aim 1) and scoring methods for similarity.
 Spatially contiguous clusters; image segmentation
 We have shown that aim 2 is a type of clustering task. In fact, it is a special
 type of clustering task because we have an additional constraint on
@@ -173,11 +176,11 @@
 algorithms perform better on small numbers of features. There are techniques
 which “summarize” a larger number of features using a smaller number of fea-
 tures; these techniques go by the name of feature extraction or dimensionality
+ 4
+
 reduction. The small set of features that such a technique yields is called the
 reduced feature set. After the reduced feature set is created, the instances may
 be replaced by reduced instances, which have as their features the reduced fea-
- 4
-
 ture set rather than the original feature set of all gene expression levels. Note
 that the features in the reduced feature set do not necessarily correspond to
 genes; each feature in the reduced set may be any function of the set of gene
@@ -213,11 +216,7 @@
 this fashion.
 Aim 3
 Background
- The cortex is divided into areas and layers. To a first approximation, the
- parcellation of the cortex into areas can be drawn as a 2-D map on the surface of
- the cortex. In the third dimension, the boundaries between the areas continue
- downwards into the cortical depth, perpendicular to the surface. The layer
-__________________________
+_______________
 1This would seem to contradict our finding in aim 1 that some cortical areas are combina-
 torially coded by multiple genes. However, it is possible that the currently accepted cortical
 maps divide the cortex into subregions which are unnatural from the point of view of gene
@@ -225,6 +224,10 @@
 be identified by single genes.
 5
+ The cortex is divided into areas and layers. To a first approximation, the
+ parcellation of the cortex into areas can be drawn as a 2-D map on the surface of
+ the cortex. In the third dimension, the boundaries between the areas continue
+ downwards into the cortical depth, perpendicular to the surface. The layer
 boundaries run parallel to the surface. One can picture an area of the cortex
 as a slice of many-layered cake.
 Although it is known that different cortical areas have distinct roles in both
@@ -266,14 +269,50 @@
 While we do not here propose to analyze human gene expression data, it is
 conceivable that the methods we propose to develop could be used to suggest
 modifications to the human cortical map as well.
+ 6
+
 Related work
- todo
+ There does not appear to be much work on the automated analysis of spatial
+ gene expression data.
+ There is a substantial body of work on the analysis of gene expression data,
+ however, most of this concerns gene expression data which is not fundamentally
+ spatial, for example, microarray datasets. In some cases, a few locations have
+ been sampled, but such a dataset is still of a fundamentally different character
+ than a dataset containing a large grid of sampling points distributed over space.
+ In relating gene expression to anatomy, it is the spatial aspects of the problem
+ which are the most important.
+ As noted above, there has been much work on both supervised learning and
+ clustering, and there are many available algorithms for each. Many of these
+ algorithms are flexible enough to accomodate new scoring measures; and the
+ performance of most of the algorithms is greatly affected by preprocessing and
+ by the choice of which representation to use for feature values. We think it likely
+ that for this application, the development of domain-specific scoring measures
+ (such as gradient similarity, which is discussed in Preliminary Work) will be
+ necessary in order to achieve the best results. In essence, the machine learning
+ community has provided algorithms, but the scientist must provide a framework
+ for representing the problem domain, and the way that this framework is set
+ up has a large impact on performance. Creating a good framework can require
+ creatively reconceptualizing the problem domain, and is not merely a mechanical
+ “fine-tuning” of numerical parameters. Therefore, the completion of Aims 1
+ and 2 involves more than just reimplementing an existing algorithm, and more
+ than just choosing between a set of existing algorithms, and will constitute a
+ substantial contribution to biology.
+ We are aware of one other effort to computationally analyze spatial gene
+ expression data.
+ In the Preliminary Work, we show that
+ The creation of a domain-specific scoring measure may be required in order
+ to achieve good performance, and it is not impossible that the algorithms them-
+ selves will have to be extended. We plan to test out existing algorithms and
+ scoring measures,
+ Therefore, we anticipate
+ Therefore, it is unclear which of the
+
 todo vs. AGEA – i wrote something on this but i’m going to rewrite it
- 6
-
 Preliminary work
 Format conversion between SEV, MATLAB, NIFTI
 todo
+ 7
+
 Flatmap of cortex
 todo
 Using combinations of multiple genes is necessary and sufficient to
@@ -305,6 +344,12 @@
 genes which express more strongly in AUD than outside of it; its weakness is that
 this includes many areas which don’t have a salient border matching the areal
 border. The geometric method identifies genes whose salient expression border
+ seems to partially line up with the border of AUD; its weakness is that this
+ includes genes which don’t express over the entire area. Genes which have high
+ rankings using both pointwise and border criteria, such as Aph1a in the example,
+ may be particularly good markers. None of these genes are, individually, a
+ perfect marker for AUD; we deliberately chose a “difficult” area in order to
+ better contrast pointwise with geometric methods.
 __________________________
 2“WW, C2 and coiled-coil domain containing 1”; EntrezGene ID 211652
 3“mitochondrial translational initiation factor 2”; EntrezGene ID 76784
@@ -315,7 +360,7 @@
 5For each gene the gradient similarity (see section ??) between (a) a map of the expression of
 each gene on the cortical surface and (b) the shape of area AUD, was calculated, and this was
 used to rank the genes.
- 7
+ 8
@@ -327,8 +372,6 @@
 the boundary of region MO. Pixels are colored approximately according to the
 density of expressing cells underneath each pixel, with red meaning a lot of
 expression and blue meaning little.
- 8
-
 Figure 2: The top row shows the three genes which (individually) best predict
@@ -336,15 +379,11 @@
 genes which (individually) best match area AUD, according to gradient similar-
 ity. From left to right and top to bottom, the genes are Ssr1, Efcbp1, Aph1a,
 Ptk7, Aph1a again, and Lepr
- seems to partially line up with the border of AUD; its weakness is that this
- includes genes which don’t express over the entire area. Genes which have high
- rankings using both pointwise and border criteria, such as Aph1a in the example,
- may be particularly good markers. None of these genes are, individually, a
- perfect marker for AUD; we deliberately chose a “difficult” area in order to
- better contrast pointwise with geometric methods.
+ 9
+
 Areas which can be identified by single genes
 todo
- Aim 1 (and Aim 3)
+ Specific to Aim 1 (and Aim 3)
 Forward stepwise logistic regression
 todo
 SVM on all genes at once
 In order to see how well one can do when looking at all genes at once, we
@@ -354,27 +393,18 @@
 practically useful.
 The requirement to find combinations of only a small number of genes limits
 us from straightforwardly applying many of the most simple techniques from
-__________________________
- 6Using the Shogun SVM package (todo:cite), with parameters type=GMNPSVM (multi-
-class b-SVM), kernal = gaussian with sigma = 0.1, c = 10, epsilon = 1e-1 – these are the
-first parameters we tried, so presumably performance would improve with different choices of
-parameters. 5-fold cross-validation.
- 9
-
 the field of supervised machine learning. In the parlance of machine learning,
 our task combines feature selection with supervised learning.
 Decision trees
 todo
- Aim 2 (and Aim 3)
- Raw dimensionality reduction results
- Dimensionality reduction plus K-means or spectral clus-
- tering
- Many areas are captured by clusters of genes
+ Specific to Aim 2 (and Aim 3)
+ Raw dimensionality reduction results
+ Dimensionality reduction plus K-means or spectral clustering
+ Many areas are captured by clusters of genes
 todo
 todo
 Research plan
- todo
- amongst other thigns:
+ todo amongst other things:
 Develop algorithms that find genetic markers for anatomical re-
 gions
 1. Develop scoring measures for evaluating how good individual genes are at
@@ -387,6 +417,10 @@
 ing: for areas that cannot be identified by any single gene, identify them
 with a handful of genes. We will consider both (a) algorithms that incre-
 mentally/greedily combine single gene markers into sets, such as forward
+__________________________
+ 65-fold cross-validation.
+ 10
+
 stepwise regression and decision trees, and also (b) supervised learning
 techniques which use soft constraints to minimize the number of features,
 such as sparse support vector machines.
@@ -397,8 +431,6 @@
 which (a) detect when a difficult area could be fit if its boundary were
 redrawn slightly, and (b) detect when a difficult area could be combined with
 adjacent areas to create a larger area which can be fit.
- 10
-
 Apply these algorithms to the cortex
 1. Create open source format conversion tools: we will create tools to bulk
 download the ABA dataset and to convert between SEV, NIFTI and MAT-
@@ -424,8 +456,9 @@
 clustering to create anatomical maps
 6. Run this algorithm on the cortex: present a hierarchial, genoarchitectonic
 map of the cortex
-______________________________________________
- stuff i dunno where to put yet (there is more scattered through grant-
+ 11
+
+ _______________________________________________________________________________________________________ stuff i dunno where to put yet (there is more scattered through grant-
 oldtext):
 Principle 4: Work in 2-D whenever possible
 In anatomy, the manifold of interest is usually either defined by a combina-
@@ -436,16 +469,14 @@
 The method that we will develop will begin by mapping the data into a 2-D
 plane. Although the manifold that characterized cortical areas is known to be
 the cortical surface, it remains to be seen which method of mapping the
- 11
-
- manifold into a plane is optimal for this application. We will compare mappings
- which attempt to preserve size (such as the one used by Caret??) with mappings
- which preserve angle (conformal maps).
- Although there is much 2-D organization in anatomy, there are also struc-
- tures whose shape is fundamentally 3-dimensional. If possible, we would like
- the method we develop to include a statistical test that warns the user if the
- assumption of 2-D structure seems to be wrong.
- todo: replace aim # bullet pts with #s
+manifold into a plane is optimal for this application. We will compare mappings
+which attempt to preserve size (such as the one used by Caret??) with mappings
+which preserve angle (conformal maps).
+ Although there is much 2-D organization in anatomy, there are also struc-
+tures whose shape is fundamentally 3-dimensional. If possible, we would like
+the method we develop to include a statistical test that warns the user if the
+assumption of 2-D structure seems to be wrong.
+ todo: replace aim # bullet pts with #s
 12
Binary file grant.odt has changed
Binary file grant.pdf has changed
--- a/grant.txt Sun Apr 12 04:01:58 2009 -0700
+++ b/grant.txt Sun Apr 12 15:34:12 2009 -0700
@@ -79,8 +79,7 @@
 **Similarity scores**
-
-todo
+A crucial choice when designing a clustering method is how to measure similarity, across either pairs of instances, or clusters, or both. There is much overlap between scoring methods for feature selection (discussed above under Aim 1) and scoring methods for similarity.
 **Spatially contiguous clusters; image segmentation**
@@ -137,6 +136,23 @@
 === Related work ===
+There does not appear to be much work on the automated analysis of spatial gene expression data.
+
+There is a substantial body of work on the analysis of gene expression data, however, most of this concerns gene expression data which is not fundamentally spatial, for example, microarray datasets. In some cases, a few locations have been sampled, but such a dataset is still of a fundamentally different character than a dataset containing a large grid of sampling points distributed over space. In relating gene expression to anatomy, it is the spatial aspects of the problem which are the most important.
+
+As noted above, there has been much work on both supervised learning and clustering, and there are many available algorithms for each. Many of these algorithms are flexible enough to accomodate new scoring measures; and the performance of most of the algorithms is greatly affected by preprocessing and by the choice of which representation to use for feature values. We think it likely that for this application, the development of domain-specific scoring measures (such as gradient similarity, which is discussed in Preliminary Work) will be necessary in order to achieve the best results. In essence, the machine learning community has provided algorithms, but the scientist must provide a framework for representing the problem domain, and the way that this framework is set up has a large impact on performance. Creating a good framework can require creatively reconceptualizing the problem domain, and is not merely a mechanical "fine-tuning" of numerical parameters. Therefore, the completion of Aims 1 and 2 involves more than just reimplementing an existing algorithm, and more than just choosing between a set of existing algorithms, and will constitute a substantial contribution to biology.
+
+We are aware of one other effort to computationally analyze spatial gene expression data.
+
+
+In the Preliminary Work, we show that
+
+The creation of a domain-specific scoring measure may be required in order to achieve good performance, and it is not impossible that the algorithms themselves will have to be extended. We plan to test out existing algorithms and scoring measures,
+
+Therefore, we anticipate
+
+Therefore, it is unclear which of the
+
 todo vs. AGEA -- i wrote something on this but i'm going to rewrite it
@@ -199,14 +215,14 @@
 todo
-=== Aim 1 (and Aim 3) ===
+=== Specific to Aim 1 (and Aim 3) ===
 **Forward stepwise logistic regression**
 todo
 **SVM on all genes at once**
-In order to see how well one can do when looking at all genes at once, we ran a support vector machine to classify cortical surface pixels based on their gene expression profiles. We achieved classification accuracy of about 81%\footnote{Using the Shogun SVM package (todo:cite), with parameters type=GMNPSVM (multiclass b-SVM), kernal = gaussian with sigma = 0.1, c = 10, epsilon = 1e-1 -- these are the first parameters we tried, so presumably performance would improve with different choices of parameters. 5-fold cross-validation.}. As noted above, however, a classifier that looks at all the genes at once isn't practically useful.
+In order to see how well one can do when looking at all genes at once, we ran a support vector machine to classify cortical surface pixels based on their gene expression profiles. We achieved classification accuracy of about 81%\footnote{5-fold cross-validation.}. As noted above, however, a classifier that looks at all the genes at once isn't practically useful.
 The requirement to find combinations of only a small number of genes limits us from straightforwardly applying many of the most simple techniques from the field of supervised machine learning. In the parlance of machine learning, our task combines feature selection with supervised learning.
@@ -217,12 +233,12 @@
 todo
-=== Aim 2 (and Aim 3) ===
-
-=== Raw dimensionality reduction results ===
-
-
-=== Dimensionality reduction plus K-means or spectral clustering ===
+=== Specific to Aim 2 (and Aim 3) ===
+
+**Raw dimensionality reduction results**
+
+
+**Dimensionality reduction plus K-means or spectral clustering**
@@ -244,9 +260,7 @@
 == Research plan ==
-todo
-
-amongst other thigns:
+todo amongst other things:
 **Develop algorithms that find genetic markers for anatomical regions**