# HG changeset patch
# User bshanks@bshanks.dyndns.org
# Date 1239676710 25200
# Node ID c435e5da52111fb085182c9a2a96d9a18827b42b
# Parent 6d023f15572e5cd691219629e49b5ce291c86433
.

Binary file grant.doc has changed
--- a/grant.html Mon Apr 13 14:53:12 2009 -0700
+++ b/grant.html Mon Apr 13 19:38:30 2009 -0700
@@ -195,8 +195,8 @@
 [3 ] describes an analysis of the anatomy of the hippocampus using the ABA dataset. In addition to manual analysis, two clustering methods were employed, a modified Non-negative Matrix Factorization (NNMF), and a hierarchical recursive bifurcation clustering scheme based on correlation as the similarity score. The paper yielded impressive results, proving the
-usefulness of such research. We have run NNMF on the cortical dataset and while the results are promising (see Preliminary
-Data), we think that it will be possible to find a better method3 (we also think that more automation of the parts that this
+usefulness of such research. We have run NNMF on the cortical dataset3 and while the results are promising (see Preliminary
+Data), we think that it will be possible to find a better method (we also think that more automation of the parts that this
 paper’s authors did manually will be possible).
 [2 ] describes AGEA, “Anatomic Gene Expression Atlas”. AGEA is an analysis tool for the ABA dataset. AGEA has three components:
@@ -206,19 +206,33 @@
 expression profile of the seed voxel and every other voxel.
 * Clusters: AGEA includes a precomputed hierarchical clustering of voxels based on a recursive bifurcation algorithm with correlation as the similarity metric.
-At first glance AGEA seems similar to this proposal, but in fact it is different.
-Gene Finder is different from our Aim 1 in at least four ways. First, although the user chooses a seed voxel, Gene Finder,
-not the user, chooses the cluster for which genes will be found, and in our experience it never chooses cortical areas, instead
-preferring cortical layers.
Therefore, Gene Finder cannot be used to find marker genes for cortical areas. Second, Gene Finder
-finds only single genes, whereas we will also look for combinations of genes. Third, Gene Finder can only use overexpression
-as a marker, whereas we will also look for underexpression. Fourth, Gene Finder uses a simple pointwise metric (“expression
-energy ratio”, which captures overexpression), whereas we will also use geometric metrics such as gradient similarity.
-The hierarchical clustering is different from our Aim 2 in at least two ways. todo
-_________________________________________
+Gene Finder is different from our Aim 1 in at least four ways. First, although the user chooses a seed voxel, Gene
+Finder, not the user, chooses the cluster for which genes will be found, and in our experience it never chooses cortical areas,
+instead preferring cortical layers4. Therefore, Gene Finder cannot be used to find marker genes for cortical areas. Second,
+Gene Finder finds only single genes, whereas we will also look for combinations of genes5. Third, Gene Finder can only use
+overexpression as a marker, whereas in the Preliminary Data we show that underexpression can also be used. Fourth, Gene
+Finder uses a simple pointwise score6, whereas we will also use geometric metrics such as gradient similarity.
+The hierarchical clustering is different from our Aim 2 in at least three ways. First, the clustering finds clusters cor-
+responding to layers, but no clusters corresponding to areas7 8. Our Aim 2 will not be accomplished until a clustering is
+produced which yields areas. Second, AGEA uses perhaps the simplest possible similarity score (correlation), and does no
+dimensionality reduction before calculating similarity. While it is possible that a more complex system will not do any better
+than this, we believe further exploration of alternative methods of scoring and dimensionality reduction is warranted.
Third,
+AGEA did not look at clusters of genes; in Preliminary Data we have shown that clusters of genes may identify interesting
+spatial subregions such as cortical areas.
+_______
 3We ran “vanilla” NNMF, whereas the paper under discussion used a modified method. Their main modification consisted of adding a soft spatial contiguity constraint. However, on our dataset, NNMF naturally produced spatially contiguous clusters, so no additional constraint was needed. The paper under discussion mentions that they also tried a hierarchical variant of NNMF, but since they didn’t report its results, we assume that those results were not any more impressive than the results of the non-hierarchical variant.
+ 4Because of the way in which Gene Finder chooses a cluster, layers will always be preferred to areas if pairwise correlations between the gene
+expression of voxels in different areas but the same layer are stronger than pairwise correlations between the gene expression of voxels in different
+layers but the same area. This appears to be the case.
+ 5See Preliminary Data for an example of an area which cannot be marked by any single gene in the dataset, but which can be marked by a
+combination.
+ 6“Expression energy ratio”, which captures overexpression.
+ 7This is for the same reason as in footnote 4.
+ 8There are clusters which presumably correspond to the intersection of a layer and an area, but since one area will have many layer-area
+intersection clusters, further work is needed to make sense of these.
@@ -234,12 +248,12 @@
 todo
 Using combinations of multiple genes is necessary and sufficient to delineate some cortical areas
 Here we give an example of a cortical area which is not marked by any single gene, but which can be identified combi-
-natorially. According to logistic regression, gene wwc14 is the best fit single gene for predicting whether or not a pixel on
+natorially.
According to logistic regression, gene wwc19 is the best fit single gene for predicting whether or not a pixel on
 the cortical surface belongs to the motor area (area MO). The upper-left picture in Figure shows wwc1’s spatial expression pattern over the cortex. The lower-right boundary of MO is represented reasonably well by this gene; however, the gene overshoots the upper-left boundary. This flattened 2-D representation does not show it, but the area corresponding to the overshoot is the medial surface of the cortex. MO is only found on the lateral surface (todo).
-Gene mtif25 is shown in the upper-right of Fig. . Mtif2 captures MO’s upper-left boundary, but not its lower-right
+Gene mtif210 is shown in the upper-right of Fig. . Mtif2 captures MO’s upper-left boundary, but not its lower-right
 boundary. Mtif2 does not express very much on the medial surface. By adding together the values at each pixel in these two figures, we get the lower-left of Figure . This combination captures area MO much better than any single gene.
 Correlation todo
@@ -247,16 +261,16 @@
 Gradient similarity todo
 Geometric and pointwise scoring methods provide complementary information
 To show that local geometry can provide useful information that cannot be detected via pointwise analyses, consider Fig.
-. The top row of Fig. displays the 3 genes which most match area AUD, according to a pointwise method6. The bottom
-row displays the 3 genes which most match AUD according to a method which considers local geometry7. The pointwise
+. The top row of Fig. displays the 3 genes which most match area AUD, according to a pointwise method11.
The bottom
+row displays the 3 genes which most match AUD according to a method which considers local geometry12. The pointwise
 method in the top row identifies genes which express more strongly in AUD than outside of it; its weakness is that this
 _________________________________________
- 4“WW, C2 and coiled-coil domain containing 1”; EntrezGene ID 211652
- 5“mitochondrial translational initiation factor 2”; EntrezGene ID 76784
- 6For each gene, we fit a logistic regression in which the response variable was whether or not a surface pixel was within area AUD, and the predictor
+ 9“WW, C2 and coiled-coil domain containing 1”; EntrezGene ID 211652
+ 10“mitochondrial translational initiation factor 2”; EntrezGene ID 76784
+ 11For each gene, we fit a logistic regression in which the response variable was whether or not a surface pixel was within area AUD, and the predictor
 variable was the value of the expression of the gene underneath that pixel. The resulting scores were used to rank the genes in terms of how well they predict area AUD.
- 7For each gene, the gradient similarity (see section ??) between (a) a map of the expression of each gene on the cortical surface and (b) the
+ 12For each gene, the gradient similarity (see section ??) between (a) a map of the expression of each gene on the cortical surface and (b) the
 shape of area AUD, was calculated, and this was used to rank the genes.
@@ -275,7 +289,7 @@
 Forward stepwise logistic regression todo
 SVM on all genes at once
 In order to see how well one can do when looking at all genes at once, we ran a support vector machine to classify cortical
-surface pixels based on their gene expression profiles. We achieved classification accuracy of about 81%8. As noted above,
+surface pixels based on their gene expression profiles. We achieved classification accuracy of about 81%13. As noted above,
 however, a classifier that looks at all the genes at once isn’t practically useful.
The requirement to find combinations of only a small number of genes limits us from straightforwardly applying many of the simplest techniques from the field of supervised machine learning. In the parlance of machine learning, our task
@@ -291,7 +305,7 @@
 todo
 todo
 _________________________________________
- 85-fold cross-validation.
+ 135-fold cross-validation.
 Research plan todo amongst other things: Develop algorithms that find genetic markers for anatomical regions
Binary file grant.odt has changed
Binary file grant.pdf has changed
--- a/grant.txt Mon Apr 13 14:53:12 2009 -0700
+++ b/grant.txt Mon Apr 13 19:38:30 2009 -0700
@@ -147,7 +147,7 @@
 \cite{thompson_genomic_2008} describes an analysis of the anatomy of the hippocampus using the ABA dataset. In addition to manual analysis, two clustering methods were employed, a modified Non-negative Matrix
-Factorization (NNMF), and a hierarchical recursive bifurcation clustering scheme based on correlation as the similarity score. The paper yielded impressive results, proving the usefulness of such research. We have run NNMF on the cortical dataset and while the results are promising (see Preliminary Data), we think that it will be possible to find a better method\footnote{We ran "vanilla" NNMF, whereas the paper under discussion used a modified method. Their main modification consisted of adding a soft spatial contiguity constraint. However, on our dataset, NNMF naturally produced spatially contiguous clusters, so no additional constraint was needed. The paper under discussion mentions that they also tried a hierarchical variant of NNMF, but since they didn't report its results, we assume that those results were not any more impressive than the results of the non-hierarchical variant.} (we also think that more automation of the parts that this paper's authors did manually will be possible).
+Factorization (NNMF), and a hierarchical recursive bifurcation clustering scheme based on correlation as the similarity score.
The paper yielded impressive results, proving the usefulness of such research. We have run NNMF on the cortical dataset\footnote{We ran "vanilla" NNMF, whereas the paper under discussion used a modified method. Their main modification consisted of adding a soft spatial contiguity constraint. However, on our dataset, NNMF naturally produced spatially contiguous clusters, so no additional constraint was needed. The paper under discussion mentions that they also tried a hierarchical variant of NNMF, but since they didn't report its results, we assume that those results were not any more impressive than the results of the non-hierarchical variant.} and while the results are promising (see Preliminary Data), we think that it will be possible to find a better method (we also think that more automation of the parts that this paper's authors did manually will be possible).
 \cite{ng_anatomic_2009} describes AGEA, "Anatomic Gene Expression
@@ -164,11 +164,9 @@
 * Clusters: AGEA includes a precomputed hierarchical clustering of voxels based on a recursive bifurcation algorithm with correlation as the similarity metric.
-At first glance AGEA seems similar to this proposal, but in fact it is different.
-
-Gene Finder is different from our Aim 1 in at least four ways. First, although the user chooses a seed voxel, Gene Finder, not the user, chooses the cluster for which genes will be found, and in our experience it never chooses cortical areas, instead preferring cortical layers. Therefore, Gene Finder cannot be used to find marker genes for cortical areas. Second, Gene Finder finds only single genes, whereas we will also look for combinations of genes. Third, Gene Finder can only use overexpression as a marker, whereas we will also look for underexpression. Fourth, Gene Finder uses a simple pointwise metric ("expression energy ratio", which captures overexpression), whereas we will also use geometric metrics such as gradient similarity.
-
-The hierarchical clustering is different from our Aim 2 in at least two ways. todo
+Gene Finder is different from our Aim 1 in at least four ways. First, although the user chooses a seed voxel, Gene Finder, not the user, chooses the cluster for which genes will be found, and in our experience it never chooses cortical areas, instead preferring cortical layers\footnote{\label{layersNotAreas}Because of the way in which Gene Finder chooses a cluster, layers will always be preferred to areas if pairwise correlations between the gene expression of voxels in different areas but the same layer are stronger than pairwise correlations between the gene expression of voxels in different layers but the same area. This appears to be the case.}. Therefore, Gene Finder cannot be used to find marker genes for cortical areas. Second, Gene Finder finds only single genes, whereas we will also look for combinations of genes\footnote{See Preliminary Data for an example of an area which cannot be marked by any single gene in the dataset, but which can be marked by a combination.}. Third, Gene Finder can only use overexpression as a marker, whereas in the Preliminary Data we show that underexpression can also be used. Fourth, Gene Finder uses a simple pointwise score\footnote{"Expression energy ratio", which captures overexpression.}, whereas we will also use geometric metrics such as gradient similarity.
+
+The hierarchical clustering is different from our Aim 2 in at least three ways. First, the clustering finds clusters corresponding to layers, but no clusters corresponding to areas\footnote{This is for the same reason as in footnote \ref{layersNotAreas}.} \footnote{There are clusters which presumably correspond to the intersection of a layer and an area, but since one area will have many layer-area intersection clusters, further work is needed to make sense of these.}. Our Aim 2 will not be accomplished until a clustering is produced which yields areas.
Second, AGEA uses perhaps the simplest possible similarity score (correlation), and does no dimensionality reduction before calculating similarity. While it is possible that a more complex system will not do any better than this, we believe further exploration of alternative methods of scoring and dimensionality reduction is warranted. Third, AGEA did not look at clusters of genes; in Preliminary Data we have shown that clusters of genes may identify interesting spatial subregions such as cortical areas.