# HG changeset patch
# User bshanks@bshanks.dyndns.org
# Date 1239534118 25200
# Node ID ff9b47f2c7d3b7cb345ad4b19cdb03f49e9cfa61
# Parent 796116742ec59b5e5d4ce23e54399bcbb09c2304
.
Binary file grant.doc has changed
--- a/grant.html Sun Apr 12 03:39:30 2009 -0700
+++ b/grant.html Sun Apr 12 04:01:58 2009 -0700
@@ -5,14 +5,14 @@
 spatial variation in gene expression to anatomy. We want to find marker genes
 for specific anatomical regions, and also to draw new anatomical maps based on
 gene expression patterns. We have three specific aims:
- (1) develop an algorithm to screen spatial gene expression data for combina-
- tions of marker genes which selectively target anatomical regions
- (2) develop an algorithm to suggest new ways of carving up a structure into
- anatomical subregions, based on spatial patterns in gene expression
- (3) create a 2-D “flat map” dataset of the mouse cerebral cortex that contains
- a flattened version of the Allen Mouse Brain Atlas ISH data, as well as
- the boundaries of cortical anatomical areas. Use this dataset to validate
- the methods developed in (1) and (2).
+ (1) develop an algorithm to screen spatial gene expression data for combi-
+ nations of marker genes which selectively target anatomical regions
+ (2) develop an algorithm to suggest new ways of carving up a structure into
+ anatomical subregions, based on spatial patterns in gene expression
+ (3) create a 2-D “flat map” dataset of the mouse cerebral cortex that con-
+ tains a flattened version of the Allen Mouse Brain Atlas ISH data, as well as
+ the boundaries of cortical anatomical areas. Use this dataset to validate the
+ methods developed in (1) and (2).
 In addition to validating the usefulness of the algorithms, the application of
 these methods to cerebral cortex will produce immediate benefits, because there
 are currently no known genetic markers for many cortical areas. The results
@@ -35,10 +35,10 @@
 this a classification task, because each voxel is being assigned to a class
 (namely, its subregion).
 Therefore, an understanding of the relationship between the combination of
+ 1
+
 their expression levels and the locations of the subregions may be expressed as
 a function. The input to this function is a voxel, along with the gene expression
- 1
-
 levels within that voxel; the output is the subregional identity of the target
 voxel, that is, the subregion to which the target voxel belongs. We call this
 function a classifier. In general, the input to a classifier is called an instance,
@@ -83,10 +83,10 @@
 Above, we defined an “instance” as the combination of a voxel with the
 “associated gene expression data”. In our case this refers to the expression level
 of genes within the voxel, but should we include the expression levels of all
+ 2
+
 genes, or only a few of them?
 It is too much to hope that every anatomical region of interest will be iden-
- 2
-
 tified by a single gene. For example, in the cortex, there are some areas which
 are not clearly delineated by any gene included in the Allen Brain Atlas (ABA)
 dataset. However, at least some of these areas can be delineated by looking
@@ -128,11 +128,11 @@
 of machine learning. One thing that you can do with such a dataset is to group
 instances together. A set of similar instances is called a cluster, and the activity
 of finding grouping the data into clusters is called clustering or cluster analysis.
+ 3
+
 The task of deciding how to carve up a structure into anatomical subregions
 can be put into these terms. The instances are once again voxels (or pixels)
 along with their associated gene expression profiles. We make the assumption
- 3
-
 that voxels from the same subregion have similar gene expression profiles, at
 least compared to the other subregions. This means that clustering voxels is
 the same as finding potential subregions; we seek a partitioning of the voxels
@@ -176,11 +176,11 @@
 reduction. The small set of features that such a technique yields is called the
 reduced feature set. After the reduced feature set is created, the instances may
 be replaced by reduced instances, which have as their features the reduced fea-
+ 4
+
 ture set rather than the original feature set of all gene expression levels. Note
 that the features in the reduced feature set do not necessarily correspond to
 genes; each feature in the reduced set may be any function of the set of gene
- 4
-
 expression levels.
 Another use for dimensionality reduction is to visualize the relationships
 between subregions. For example, one might want to make a 2-D plot upon
@@ -217,9 +217,7 @@
 parcellation of the cortex into areas can be drawn as a 2-D map on the surface
 of the cortex. In the third dimension, the boundaries between the areas continue
 downwards into the cortical depth, perpendicular to the surface. The layer
- boundaries run parallel to the surface. One can picture an area of the cortex as
- a slice of many-layered cake.
-___
+__________________________
 1This would seem to contradict our finding in aim 1 that some cortical areas are combina-
 torially coded by multiple genes. However, it is possible that the currently accepted cortical
 maps divide the cortex into subregions which are unnatural from the point of view of gene
@@ -227,6 +225,8 @@
 be identified by single genes.
 5
+ boundaries run parallel to the surface. One can picture an area of the cortex as
+ a slice of many-layered cake.
 Although it is known that different cortical areas have distinct roles in both
 normal functioning and in disease processes, there are no known marker genes
 for many cortical areas. When it is necessary to divide a tissue sample into
@@ -292,6 +292,9 @@
 very much on the medial surface. By adding together the values at each pixel
 in these two figures, we get the lower-left of Figure . This combination captures
 area MO much better than any single gene.
+ Correlation todo
+ Conditional entropy todo
+ Gradient similarity todo
 Geometric and pointwise scoring methods provide complementary information
 To show that local geometry can provide useful information that cannot be
@@ -302,9 +305,6 @@
 genes which express more strongly in AUD than outside of it; its weakness is
 that this includes many areas which don’t have a salient border matching the
 areal border. The geometric method identifies genes whose salient expression border
- seems to partially line up with the border of AUD; its weakness is that this
- includes genes which don’t express over the entire area. Genes which have high
- rankings using both pointwise and border criteria, such as Aph1a in the example,
 __________________________
 2“WW, C2 and coiled-coil domain containing 1”; EntrezGene ID 211652
 3“mitochondrial translational initiation factor 2”; EntrezGene ID 76784
@@ -336,13 +336,17 @@
 genes which (individually) best match area AUD, according to gradient similar-
 ity. From left to right and top to bottom, the genes are Ssr1, Efcbp1, Aph1a,
 Ptk7, Aph1a again, and Lepr
+ seems to partially line up with the border of AUD; its weakness is that this
+ includes genes which don’t express over the entire area. Genes which have high
+ rankings using both pointwise and border criteria, such as Aph1a in the example,
 may be particularly good markers. None of these genes are, individually, a
 perfect marker for AUD; we deliberately chose a “difficult” area in order to
 better contrast pointwise with geometric methods.
 Areas which can be identified by single genes todo
 Aim 1 (and Aim 3)
- SVM on all genes at once
+ Forward stepwise logistic regression todo
+ SVM on all genes at once
 In order to see how well one can do when looking at all genes at once, we ran
 a support vector machine to classify cortical surface pixels based on their gene
 expression profiles. We achieved classification accuracy of about 81%6.
@@ -350,17 +354,17 @@
 practically useful.
 The requirement to find combinations of only a small number of genes limits us
 from straightforwardly applying many of the most simple techniques from
- the field of supervised machine learning. In the parlance of machine learning,
- our task combines feature selection with supervised learning.
- Decision trees
- todo
-____________________
+__________________________
 6Using the Shogun SVM package (todo:cite), with parameters type=GMNPSVM (multi-
 class b-SVM), kernal = gaussian with sigma = 0.1, c = 10, epsilon = 1e-1 – these are the
 first parameters we tried, so presumably performance would improve with different choices
 of parameters. 5-fold cross-validation.
 9
+ the field of supervised machine learning. In the parlance of machine learning,
+ our task combines feature selection with supervised learning.
+ Decision trees
+ todo
 Aim 2 (and Aim 3)
 Raw dimensionality reduction results
 Dimensionality reduction plus K-means or spectral clus-
@@ -393,12 +397,12 @@
 which (a) detect when a difficult area could be fit if its boundary were redrawn
 slightly, and (b) detect when a difficult area could be combined with adjacent
 areas to create a larger area which can be fit.
+ 10
+
 Apply these algorithms to the cortex
 1. Create open source format conversion tools: we will create tools to bulk
 download the ABA dataset and to convert between SEV, NIFTI and MAT-
 LAB formats.
- 10
-
 2. Flatmap the ABA cortex data: map the ABA data onto a plane and draw the
 cortical area boundaries onto it.
 3. Find layer boundaries: cluster similar voxels together in order to auto-
@@ -432,13 +436,16 @@
 The method that we will develop will begin by mapping the data into a 2-D
 plane. Although the manifold that characterized cortical areas is known to be
 the cortical surface, it remains to be seen which method of mapping the
-manifold into a plane is optimal for this application. We will compare mappings
-which attempt to preserve size (such as the one used by Caret??) with mappings
-which preserve angle (conformal maps).
- Although there is much 2-D organization in anatomy, there are also struc-
-tures whose shape is fundamentally 3-dimensional. If possible, we would like
-the method we develop to include a statistical test that warns the user if the
-assumption of 2-D structure seems to be wrong.
 11
-
+ manifold into a plane is optimal for this application. We will compare mappings
+ which attempt to preserve size (such as the one used by Caret??) with mappings
+ which preserve angle (conformal maps).
+ Although there is much 2-D organization in anatomy, there are also struc-
+ tures whose shape is fundamentally 3-dimensional. If possible, we would like
+ the method we develop to include a statistical test that warns the user if the
+ assumption of 2-D structure seems to be wrong.
+ todo: replace aim # bullet pts with #s
+ 12
+
+
Binary file grant.odt has changed
Binary file grant.pdf has changed
--- a/grant.txt Sun Apr 12 03:39:30 2009 -0700
+++ b/grant.txt Sun Apr 12 04:01:58 2009 -0700
@@ -1,10 +1,12 @@
 == Specific aims ==
-Massive new datasets obtained with techniques such as in situ hybridization (ISH) and BAC-transgenics allow the expression levels of many genes at many locations to be compared. Our goal is to develop automated methods to relate spatial variation in gene expression to anatomy. We want to find marker genes for specific anatomical regions, and also to draw new anatomical maps based on gene expression patterns. We have three specific aims:
-
-(1) develop an algorithm to screen spatial gene expression data for combinations of marker genes which selectively target anatomical regions
-(2) develop an algorithm to suggest new ways of carving up a structure into anatomical subregions, based on spatial patterns in gene expression
-(3) create a 2-D "flat map" dataset of the mouse cerebral cortex that contains a flattened version of the Allen Mouse Brain Atlas ISH data, as well as the boundaries of cortical anatomical areas. Use this dataset to validate the methods developed in (1) and (2).
+Massive new datasets obtained with techniques such as in situ hybridization (ISH) and BAC-transgenics allow the expression levels of many genes at many locations to be compared. Our goal is to develop automated methods to relate spatial variation in gene expression to anatomy. We want to find marker genes for specific anatomical regions, and also to draw new anatomical maps based on gene expression patterns. We have three specific aims:\\
+
+(1) develop an algorithm to screen spatial gene expression data for combinations of marker genes which selectively target anatomical regions\\
+
+(2) develop an algorithm to suggest new ways of carving up a structure into anatomical subregions, based on spatial patterns in gene expression\\
+
+(3) create a 2-D "flat map" dataset of the mouse cerebral cortex that contains a flattened version of the Allen Mouse Brain Atlas ISH data, as well as the boundaries of cortical anatomical areas. Use this dataset to validate the methods developed in (1) and (2).\\
 In addition to validating the usefulness of the algorithms, the application of these methods to cerebral cortex will produce immediate benefits, because there are currently no known genetic markers for many cortical areas. The results of the project will support the development of new ways to selectively target cortical areas, and it will support the development of a method for identifying the cortical areal boundaries present in small tissue samples.
@@ -164,7 +166,14 @@
 \caption{Upper left: $wwc1$. Upper right: $mtif2$. Lower left: wwc1 + mtif2 (each pixel's value on the lower left is the sum of the corresponding pixels in the upper row). Within each picture, the vertical axis roughly corresponds to anterior at the top and posterior at the bottom, and the horizontal axis roughly corresponds to medial at the left and lateral at the right. The red outline is the boundary of region MO. Pixels are colored approximately according to the density of expressing cells underneath each pixel, with red meaning a lot of expression and blue meaning little.}
 \end{figure}
-
+**Correlation**
+todo
+
+**Conditional entropy**
+todo
+
+**Gradient similarity**
+todo
 **Geometric and pointwise scoring methods provide complementary information**
@@ -191,7 +200,8 @@
 === Aim 1 (and Aim 3) ===
-
+**Forward stepwise logistic regression**
+todo
 **SVM on all genes at once**
@@ -283,3 +293,6 @@
 Although there is much 2-D organization in anatomy, there are also structures whose shape is fundamentally 3-dimensional. If possible, we would like the method we develop to include a statistical test that warns the user if the assumption of 2-D structure seems to be wrong.
+
+
+todo: replace aim # bullet pts with #s