# HG changeset patch # User bshanks@bshanks.dyndns.org # Date 1239618121 25200 # Node ID 9d0cc9c66ecd7c601d96b8943a7943b5576a4661 # Parent 8ff9b7b5c242e199c20553dcf430b309f77bdc44 . --- a/grant.html Mon Apr 13 03:21:04 2009 -0700 +++ b/grant.html Mon Apr 13 03:22:01 2009 -0700 @@ -22,6 +22,8 @@ All algorithms that we develop will be implemented in an open-source soft- ware toolkit. The toolkit, as well as the machine-readable datasets developed in aim (3), will be published and freely available for others to use. + 1 + Background and significance Aim 1 Machine learning terminology: supervised learning @@ -35,8 +37,6 @@ this a classification task, because each voxel is being assigned to a class (namely, its subregion). Therefore, an understanding of the relationship between the combination of - 1 - their expression levels and the locations of the subregions may be expressed as a function. The input to this function is a voxel, along with the gene expression levels within that voxel; the output is the subregional identity of the target @@ -68,6 +68,8 @@ procedures are called “stepwise” or “greedy”. Although the classifier itself may only look at the gene expression data within each voxel before classifying that voxel, the learning algorithm which constructs + 2 + the classifier may look over the entire dataset. We can categorize score-based feature selection methods depending on how the score of calculated. Often the score calculation consists of assigning a sub-score to each voxel, and then @@ -83,8 +85,6 @@ Above, we defined an “instance” as the combination of a voxel with the “associated gene expression data”. In our case this refers to the expression level of genes within the voxel, but should we include the expression levels of all - 2 - genes, or only a few of them? It is too much to hope that every anatomical region of interest will be iden- tified by a single gene. For example, in the cortex, there are some areas which @@ -116,6 +116,8 @@ evidence of the complementary nature of pointwise and local scoring methods. Principle 4: Work in 2-D whenever possible There are many anatomical structures which are commonly characterized in + 3 + terms of a two-dimensional manifold. When it is known that the structure that one is looking for is two-dimensional, the results may be improved by allowing the analysis algorithm to take advantage of this prior knowledge. In addition, @@ -128,8 +130,6 @@ of machine learning. One thing that you can do with such a dataset is to group instances together. A set of similar instances is called a cluster, and the activity of finding grouping the data into clusters is called clustering or cluster analysis. - 3 - The task of deciding how to carve up a structure into anatomical subregions can be put into these terms. The instances are once again voxels (or pixels) along with their associated gene expression profiles. We make the assumption @@ -162,6 +162,8 @@ image into clusters, usually contiguous clusters. Aim 2 is similar to an image segmentation task. There are two main differences; in our task, there are thou- sands of color channels (one for each gene), rather than just three. There are + 4 + imaging tasks which use more than three colors, however, for example multispec- tral imaging and hyperspectral imaging, which are often used to process satellite imagery. A more crucial difference is that there are various cues which are ap- @@ -176,8 +178,6 @@ algorithms perform better on small numbers of features. There are techniques which “summarize” a larger number of features using a smaller number of fea- tures; these techniques go by the name of feature extraction or dimensionality - 4 - reduction. The small set of features that such a technique yields is called the reduced feature set. After the reduced feature set is created, the instances may be replaced by reduced instances, which have as their features the reduced fea- @@ -208,6 +208,8 @@ This is because many genes have an expression pattern which seems to pick out a single, spatially continguous subregion. Therefore, it seems likely that an anatomically interesting subregion will have multiple genes which each individ- + 5 + ually pick it out1. This suggests the following procedure: cluster together genes which pick out similar subregions, and then to use the more popular common subregions as the final clusters. In the Preliminary Data we show that a num- @@ -216,14 +218,6 @@ this fashion. Aim 3 Background -_______________ - 1This would seem to contradict our finding in aim 1 that some cortical areas are combina- -torially coded by multiple genes. However, it is possible that the currently accepted cortical -maps divide the cortex into subregions which are unnatural from the point of view of gene -expression; perhaps there is some other way to map the cortex for which each subregion can -be identified by single genes. - 5 - The cortex is divided into areas and layers. To a first approximation, the parcellation of the cortex into areas can be drawn as a 2-D map on the surface of the cortex. In the third dimension, the boundaries between the areas continue @@ -254,6 +248,14 @@ finding markers for each individual cortical areas, we will find a small panel of genes that can find many of the areal boundaries at once. This panel of marker genes will allow the development of an ISH protocol that will allow +__________________________ + 1This would seem to contradict our finding in aim 1 that some cortical areas are combina- +torially coded by multiple genes. However, it is possible that the currently accepted cortical +maps divide the cortex into subregions which are unnatural from the point of view of gene +expression; perhaps there is some other way to map the cortex for which each subregion can +be identified by single genes. + 6 + experimenters to more easily identify which anatomical areas are present in small samples of cortex. The method developed in aim (3) will provide a genoarchitectonic viewpoint @@ -269,8 +271,6 @@ While we do not here propose to analyze human gene expression data, it is conceivable that the methods we propose to develop could be used to suggest modifications to the human cortical map as well. - 6 - Related work There does not appear to be much work on the automated analysis of spatial gene expression data. @@ -297,23 +297,26 @@ yielded impressive results, proving the usefulness of such research. We have run NNMF on the cortical dataset and while the results are promising (see Prelim- inary Data), we think that it will be possible to find a better method2 (we also +__________________________ + 2We ran “vanilla” NNMF, whereas the paper under discussion used a modified method. +Their main modification consisted of adding a soft spatial contiguity constraint. However, +on our dataset, NNMF naturally produced spatially contiguous clusters, so no additional + 7 + think that more automation of the parts that this paper’s authors did manually will be possible). and [?] describes AGEA. todo +__________________________ +constraint was needed. The paper under discussion mentions that they also tried a hierarchial +variant of NNMF, but since they didn’t report its results, we assume that those result were +not any more impressive than the results of the non-hierarchial variant. + 8 + Preliminary work Format conversion between SEV, MATLAB, NIFTI todo Flatmap of cortex todo -_______________________ - 2We ran “vanilla” NNMF, whereas the paper under discussion used a modified method. -Their main modification consisted of adding a soft spatial contiguity constraint. However, -on our dataset, NNMF naturally produced spatially contiguous clusters, so no additional -constraint was needed. The paper under discussion mentions that they also tried a hierarchial -variant of NNMF, but since they didn’t report its results, we assume that those result were -not any more impressive than the results of the non-hierarchial variant. - 7 - Using combinations of multiple genes is necessary and sufficient to delineate some cortical areas Here we give an example of a cortical area which is not marked by any @@ -343,15 +346,7 @@ genes which express more strongly in AUD than outside of it; its weakness is that this includes many areas which don’t have a salient border matching the areal border. The geometric method identifies genes whose salient expression border - seems to partially line up with the border of AUD; its weakness is that this - includes genes which don’t express over the entire area. Genes which have high - rankings using both pointwise and border criteria, such as Aph1a in the example, - may be particularly good markers. None of these genes are, individually, a - perfect marker for AUD; we deliberately chose a “difficult” area in order to - better contrast pointwise with geometric methods. - Areas which can be identified by single genes - todo -____________________ +__________________________ 3“WW, C2 and coiled-coil domain containing 1”; EntrezGene ID 211652 4“mitochondrial translational initiation factor 2”; EntrezGene ID 76784 5For each gene, a logistic regression in which the response variable was whether or not a @@ -361,7 +356,7 @@ 6For each gene the gradient similarity (see section ??) between (a) a map of the expression of each gene on the cortical surface and (b) the shape of area AUD, was calculated, and this was used to rank the genes. - 8 + 9 @@ -373,6 +368,8 @@ the boundary of region MO. Pixels are colored approximately according to the density of expressing cells underneath each pixel, with red meaning a lot of expression and blue meaning little. + 10 + Figure 2: The top row shows the three genes which (individually) best predict @@ -380,8 +377,14 @@ genes which (individually) best match area AUD, according to gradient similar- ity. From left to right and top to bottom, the genes are Ssr1, Efcbp1, Aph1a, Ptk7, Aph1a again, and Lepr - 9 - + seems to partially line up with the border of AUD; its weakness is that this + includes genes which don’t express over the entire area. Genes which have high + rankings using both pointwise and border criteria, such as Aph1a in the example, + may be particularly good markers. None of these genes are, individually, a + perfect marker for AUD; we deliberately chose a “difficult” area in order to + better contrast pointwise with geometric methods. + Areas which can be identified by single genes + todo Specific to Aim 1 (and Aim 3) Forward stepwise logistic regression todo SVM on all genes at once @@ -396,6 +399,10 @@ our task combines feature selection with supervised learning. Decision trees todo +____________________ + 75-fold cross-validation. + 11 + Specific to Aim 2 (and Aim 3) Raw dimensionality reduction results todo @@ -404,6 +411,8 @@ Many areas are captured by clusters of genes todo todo + 12 + Research plan todo amongst other things: Develop algorithms that find genetic markers for anatomical re- @@ -419,10 +428,6 @@ with a handful of genes. We will consider both (a) algorithms that incre- mentally/greedily combine single gene markers into sets, such as forward stepwise regression and decision trees, and also (b) supervised learning -__________________________ - 75-fold cross-validation. - 10 - techniques which use soft constraints to minimize the number of features, such as sparse support vector machines. 4. Extend the procedure to handle difficult areas by combining or redrawing @@ -446,6 +451,8 @@ at once. Develop algorithms to suggest a division of a structure into anatom- ical parts + 13 + 1. Explore dimensionality reduction algorithms applied to pixels: including TODO 2. Explore dimensionality reduction algorithms applied to genes: including @@ -457,9 +464,8 @@ clustering to create anatomical maps 6. Run this algorithm on the cortex: present a hierarchial, genoarchitectonic map of the cortex - 11 - - _______________________________________________________________________________________________________ stuff i dunno where to put yet (there is more scattered through grant- +______________________________________________ + stuff i dunno where to put yet (there is more scattered through grant- oldtext): Principle 4: Work in 2-D whenever possible In anatomy, the manifold of interest is usually either defined by a combina- @@ -484,6 +490,6 @@ app2 has examples of genetic targeting to specific anatomical regions — note: - 12 - - + 14 + + Binary file grant.odt has changed Binary file grant.pdf has changed --- a/grant.txt Mon Apr 13 03:21:04 2009 -0700 +++ b/grant.txt Mon Apr 13 03:22:01 2009 -0700 @@ -13,6 +13,7 @@ All algorithms that we develop will be implemented in an open-source software toolkit. The toolkit, as well as the machine-readable datasets developed in aim (3), will be published and freely available for others to use. +\newpage == Background and significance == @@ -151,6 +152,8 @@ +\newpage + == Preliminary work == === Format conversion between SEV, MATLAB, NIFTI === @@ -254,6 +257,9 @@ todo + + +\newpage == Research plan == todo amongst other things: