# HG changeset patch # User bshanks@bshanks.dyndns.org # Date 1239617246 25200 # Node ID c2609c6e77363e48990770afac851894460bdc6b # Parent 717d4025b8613deaaac62982571c654d9de982a2 . --- a/grant.html Sun Apr 12 15:35:00 2009 -0700 +++ b/grant.html Mon Apr 13 03:07:26 2009 -0700 @@ -297,9 +297,28 @@ and 2 involves more than just reimplementing an existing algorithm, and more than just choosing between a set of existing algorithms, and will constitute a substantial contribution to biology. - We are aware of one other effort to computationally analyze spatial gene - expression data. + We are aware of two existing efforts to relate spatial gene expression data to + anatomy through computational methods. + [?] describes an analysis of the anatomy of the hippocampus using the ABA + dataset. In addition to manual analysis, two clustering methods were employed, + a modified Non-negative Matrix Factorization (NNMF), and a hierarchial bifur- + cation clustering scheme based on correlation as the similarity score. The paper + yielded impressive results, proving the usefulness of such research. We have run + NNMF on the cortical dataset and while the results are promising (see Prelim- + inary Data), we think that it will be possible to find a better method2 (we also + think that more automation of the parts that this paper’s authors did manually + will be possible). + and [?] describes AGEA. todo In the Preliminary Work, we show that +__________________________ + 2We ran “vanilla” NNMF, whereas the paper under discussion used a modified method. +Their main modification consisted of adding a soft spatial contiguity constraint. However, +on our dataset, NNMF naturally produced spatially contiguous clusters, so no additional +constraint was needed. The paper under discussion mentions that they also tried a hierarchial +variant of NNMF, but since they didn’t report its results, we assume that the result were not +any more impressive than the non-hierarchial variant. 
+ 7 + The creation of a domain-specific scoring measure may be required in order to achieve good performance, and it is not impossible that the algorithms them- selves will have to be extended. We plan to test out existing algorithms and @@ -311,22 +330,20 @@ Preliminary work Format conversion between SEV, MATLAB, NIFTI todo - 7 - Flatmap of cortex todo Using combinations of multiple genes is necessary and sufficient to delineate some cortical areas Here we give an example of a cortical area which is not marked by any single gene, but which can be identified combinatorially. according to logistic - regression, gene wwc12 is the best fit single gene for predicting whether or not a + regression, gene wwc13 is the best fit single gene for predicting whether or not a pixel on the cortical surface belongs to the motor area (area MO). The upper-left picture in Figure shows wwc1’s spatial expression pattern over the cortex. The lower-right boundary of MO is represented reasonably well by this gene, however the gene overshoots the upper-left boundary. This flattened 2-D representation does not show it, but the area corresponding to the overshoot is the medial surface of the cortex. MO is only found on the lateral surface (todo). - Gnee mtif23 is shown in figure the upper-right of Fig. . Mtif2 captures MO’s + Gnee mtif24 is shown in figure the upper-right of Fig. . Mtif2 captures MO’s upper-left boundary, but not its lower-right boundary. Mtif2 does not express very much on the medial surface. By adding together the values at each pixel in these two figures, we get the lower-left of Figure . This combination captures @@ -338,28 +355,9 @@ information To show that local geometry can provide useful information that cannot be detected via pointwise analyses, consider Fig. . The top row of Fig. displays the - 3 genes which most match area AUD, according to a pointwise method4. 
The - bottom row displays the 3 genes which most match AUD according to a method - which considers local geometry5 The pointwise method in the top row identifies - genes which express more strongly in AUD than outside of it; its weakness is that - this includes many areas which don’t have a salient border matching the areal - border. The geometric method identifies genes whose salient expression border - seems to partially line up with the border of AUD; its weakness is that this - includes genes which don’t express over the entire area. Genes which have high - rankings using both pointwise and border criteria, such as Aph1a in the example, - may be particularly good markers. None of these genes are, individually, a - perfect marker for AUD; we deliberately chose a “difficult” area in order to - better contrast pointwise with geometric methods. __________________________ - 2“WW, C2 and coiled-coil domain containing 1”; EntrezGene ID 211652 - 3“mitochondrial translational initiation factor 2”; EntrezGene ID 76784 - 4For each gene, a logistic regression in which the response variable was whether or not a -surface pixel was within area AUD, and the predictor variable was the value of the expression -of the gene underneath that pixel. The resulting scores were used to rank the genes in terms -of how well they predict area AUD. - 5For each gene the gradient similarity (see section ??) between (a) a map of the expression -of each gene on the cortical surface and (b) the shape of area AUD, was calculated, and this -was used to rank the genes. + 3“WW, C2 and coiled-coil domain containing 1”; EntrezGene ID 211652 + 4“mitochondrial translational initiation factor 2”; EntrezGene ID 76784 8 @@ -372,6 +370,8 @@ the boundary of region MO. Pixels are colored approximately according to the density of expressing cells underneath each pixel, with red meaning a lot of expression and blue meaning little. 
+ 9 + Figure 2: The top row shows the three genes which (individually) best predict @@ -379,16 +379,36 @@ genes which (individually) best match area AUD, according to gradient similar- ity. From left to right and top to bottom, the genes are Ssr1, Efcbp1, Aph1a, Ptk7, Aph1a again, and Lepr - 9 - + 3 genes which most match area AUD, according to a pointwise method5. The + bottom row displays the 3 genes which most match AUD according to a method + which considers local geometry6 The pointwise method in the top row identifies + genes which express more strongly in AUD than outside of it; its weakness is that + this includes many areas which don’t have a salient border matching the areal + border. The geometric method identifies genes whose salient expression border + seems to partially line up with the border of AUD; its weakness is that this + includes genes which don’t express over the entire area. Genes which have high + rankings using both pointwise and border criteria, such as Aph1a in the example, + may be particularly good markers. None of these genes are, individually, a + perfect marker for AUD; we deliberately chose a “difficult” area in order to + better contrast pointwise with geometric methods. Areas which can be identified by single genes todo Specific to Aim 1 (and Aim 3) Forward stepwise logistic regression todo SVM on all genes at once +__________________________ + 5For each gene, a logistic regression in which the response variable was whether or not a +surface pixel was within area AUD, and the predictor variable was the value of the expression +of the gene underneath that pixel. The resulting scores were used to rank the genes in terms +of how well they predict area AUD. + 6For each gene the gradient similarity (see section ??) between (a) a map of the expression +of each gene on the cortical surface and (b) the shape of area AUD, was calculated, and this +was used to rank the genes. 
+ 10 + In order to see how well one can do when looking at all genes at once, we ran a support vector machine to classify cortical surface pixels based on their - gene expression profiles. We achieved classification accuracy of about 81%6. + gene expression profiles. We achieved classification accuracy of about 81%7. As noted above, however, a classifier that looks at all the genes at once isn’t practically useful. The requirement to find combinations of only a small number of genes limits @@ -399,6 +419,8 @@ todo Specific to Aim 2 (and Aim 3) Raw dimensionality reduction results + todo + (might want to incld nnMF since mentioned above) Dimensionality reduction plus K-means or spectral clustering Many areas are captured by clusters of genes todo @@ -417,13 +439,13 @@ ing: for areas that cannot be identified by any single gene, identify them with a handful of genes. We will consider both (a) algorithms that incre- mentally/greedily combine single gene markers into sets, such as forward -__________________________ - 65-fold cross-validation. - 10 - stepwise regression and decision trees, and also (b) supervised learning techniques which use soft constraints to minimize the number of features, such as sparse support vector machines. +__________________________ + 75-fold cross-validation. + 11 + 4. Extend the procedure to handle difficult areas by combining or redrawing the boundaries: An area may be difficult to identify because the bound- aries are misdrawn, or because it does not “really” exist as a single area, @@ -456,27 +478,34 @@ clustering to create anatomical maps 6. 
Run this algorithm on the cortex: present a hierarchial, genoarchitectonic map of the cortex - 11 - - _______________________________________________________________________________________________________ stuff i dunno where to put yet (there is more scattered through grant- +______________________________________________ + stuff i dunno where to put yet (there is more scattered through grant- oldtext): Principle 4: Work in 2-D whenever possible - In anatomy, the manifold of interest is usually either defined by a combina- -tion of two relevant anatomical axes (todo), or by the surface of the structure -(as is the case with the cortex). In the former case, the manifold of interest is -a plane, but in the latter case it is curved. If the manifold is curved, there are -various methods for mapping the manifold into a plane. - The method that we will develop will begin by mapping the data into a -2-D plane. Although the manifold that characterized cortical areas is known -to be the cortical surface, it remains to be seen which method of mapping the -manifold into a plane is optimal for this application. We will compare mappings -which attempt to preserve size (such as the one used by Caret??) with mappings -which preserve angle (conformal maps). - Although there is much 2-D organization in anatomy, there are also struc- -tures whose shape is fundamentally 3-dimensional. If possible, we would like -the method we develop to include a statistical test that warns the user if the -assumption of 2-D structure seems to be wrong. - todo: replace aim # bullet pts with #s 12 - + In anatomy, the manifold of interest is usually either defined by a combina- + tion of two relevant anatomical axes (todo), or by the surface of the structure + (as is the case with the cortex). In the former case, the manifold of interest is + a plane, but in the latter case it is curved. If the manifold is curved, there are + various methods for mapping the manifold into a plane. 
+ The method that we will develop will begin by mapping the data into a + 2-D plane. Although the manifold that characterized cortical areas is known + to be the cortical surface, it remains to be seen which method of mapping the + manifold into a plane is optimal for this application. We will compare mappings + which attempt to preserve size (such as the one used by Caret??) with mappings + which preserve angle (conformal maps). + Although there is much 2-D organization in anatomy, there are also struc- + tures whose shape is fundamentally 3-dimensional. If possible, we would like + the method we develop to include a statistical test that warns the user if the + assumption of 2-D structure seems to be wrong. + if we need citations for aim 3 significance, http://www.sciencedirect. + com/science?_ob=ArticleURL&_udi=B6WSS-4V70FHY-9&_user=4429&_coverDate= + 12%2F26%2F2008&_rdoc=1&_fmt=full&_orig=na&_cdi=7054&_docanchor=&_acct= + C000059602&_version=1&_urlVersion=0&_userid=4429&md5=551eccc743a2bfe6e992eee0c3194203# + app2 has examples of genetic targeting to specific anatomical regions + — + note: + 13 + + Binary file grant.odt has changed Binary file grant.pdf has changed --- a/grant.txt Sun Apr 12 15:35:00 2009 -0700 +++ b/grant.txt Mon Apr 13 03:07:26 2009 -0700 @@ -142,7 +142,12 @@ As noted above, there has been much work on both supervised learning and clustering, and there are many available algorithms for each. Many of these algorithms are flexible enough to accomodate new scoring measures; and the performance of most of the algorithms is greatly affected by preprocessing and by the choice of which representation to use for feature values. We think it likely that for this application, the development of domain-specific scoring measures (such as gradient similarity, which is discussed in Preliminary Work) will be necessary in order to achieve the best results. 
In essence, the machine learning community has provided algorithms, but the scientist must provide a framework for representing the problem domain, and the way that this framework is set up has a large impact on performance. Creating a good framework can require creatively reconceptualizing the problem domain, and is not merely a mechanical "fine-tuning" of numerical parameters. Therefore, the completion of Aims 1 and 2 involves more than just reimplementing an existing algorithm, and more than just choosing between a set of existing algorithms, and will constitute a substantial contribution to biology. -We are aware of one other effort to computationally analyze spatial gene expression data. +We are aware of two existing efforts to relate spatial gene expression data to anatomy through computational methods. + +\cite{thompson_genomic_2008} describes an analysis of the anatomy of the hippocampus using the ABA dataset. In addition to manual analysis, two clustering methods were employed: a modified Non-negative Matrix Factorization (NNMF) and a hierarchical bifurcation clustering scheme based on correlation as the similarity score. The paper yielded impressive results, proving the usefulness of such research. We have run NNMF on the cortical dataset, and while the results are promising (see Preliminary Data), we think that it will be possible to find a better method\footnote{We ran "vanilla" NNMF, whereas the paper under discussion used a modified method. Their main modification consisted of adding a soft spatial contiguity constraint. However, on our dataset, NNMF naturally produced spatially contiguous clusters, so no additional constraint was needed. 
The paper under discussion mentions that they also tried a hierarchical variant of NNMF, but since they didn't report its results, we assume that the results were not any more impressive than the non-hierarchical variant.} (we also think that more automation of the parts that this paper's authors did manually will be possible). + + + and \cite{ng_anatomic_2009} describes AGEA. todo In the Preliminary Work, we show that @@ -237,6 +242,9 @@ **Raw dimensionality reduction results** +todo + +(might want to include NNMF since mentioned above) **Dimensionality reduction plus K-means or spectral clustering** @@ -309,4 +317,8 @@ -todo: replace aim # bullet pts with #s +if we need citations for aim 3 significance, http://www.sciencedirect.com/science?_ob=ArticleURL&_udi=B6WSS-4V70FHY-9&_user=4429&_coverDate=12%2F26%2F2008&_rdoc=1&_fmt=full&_orig=na&_cdi=7054&_docanchor=&_acct=C000059602&_version=1&_urlVersion=0&_userid=4429&md5=551eccc743a2bfe6e992eee0c3194203#app2 has examples of genetic targeting to specific anatomical regions + +--- + +note: