cg

changeset 87:f04ea2784509

.
author bshanks@bshanks.dyndns.org
date Tue Apr 21 05:34:25 2009 -0700
parents aafe6f8c3593
children ae1e1da359d2
files grant.doc grant.html grant.odt grant.pdf grant.txt
line diff
Binary file grant.doc has changed
--- a/grant.html Tue Apr 21 04:05:54 2009 -0700
+++ b/grant.html Tue Apr 21 05:34:25 2009 -0700
@@ -22,7 +22,17 @@
 tissue samples.
 All algorithms that we develop will be implemented in a GPL open-source software toolkit. The toolkit, as well as the
 machine-readable datasets developed in aim (3), will be published and freely available for others to use.
-Background and significance
+The challenge topic
+This proposal addresses challenge topic 06-HG-101. Massive new datasets obtained with techniques such as in situ hybridiza-
+tion (ISH), immunohistochemistry, in situ transgenic reporter, microarray voxelation, and others, allow the expression levels
+of many genes at many locations to be compared. Our goal is to develop automated methods to relate spatial variation in
+gene expression to anatomy. We want to find marker genes for specific anatomical regions, and also to draw new anatomical
+maps based on gene expression patterns.
+The Challenge and Potential impact
+Now we will discuss each of our three aims in turn. For each aim, we will develop a conceptual framework for thinking
+about the task, and we will present our strategy for solving it. Next we will discuss related work. At the conclusion of each
+section, we will summarize why our strategy is different from what has been done before. At the end of this section, we will
+describe the potential impact.
 Aim 1: Given a map of regions, find genes that mark the regions
 Machine learning terminology The task of looking for marker genes for known anatomical regions means that one is
 looking for a set of genes such that, if the expression level of those genes is known, then the locations of the regions can be
@@ -62,6 +72,8 @@
 However, at least some of these areas can be delineated by looking at combinations of genes (an example of an area for
 which multiple genes are necessary and sufficient is provided in Preliminary Studies, Figure 4). Therefore, each instance
 should contain multiple features (genes).
+_______
+ 1Strictly speaking, the features are gene expression levels, but we’ll call them genes.
 Principle 2: Only look at combinations of small numbers of genes
 When the classifier classifies a voxel, it is only allowed to look at the expression of the genes which have been selected
 as features. The more data that are available to a classifier, the better that it can do. For example, perhaps there are weak
@@ -75,8 +87,6 @@
 The requirement to find combinations of only a small number of genes limits us from straightforwardly applying many
 of the most simple techniques from the field of supervised machine learning. In the parlance of machine learning, our task
 combines feature selection with supervised learning.
-_________________________________________
- 1Strictly speaking, the features are gene expression levels, but we’ll call them genes.
 Principle 3: Use geometry in feature selection
 When doing feature selection with score-based methods, the simplest thing to do would be to score the performance of
 each voxel by itself and then combine these scores (pointwise scoring). A more powerful approach is to also use information
@@ -115,6 +125,12 @@
 search for underexpression. Third, Gene Finder uses a simple pointwise score5, whereas we will also use geometric scores
 such as gradient similarity (described in Preliminary Studies). Figures 4, 2, and 3 in the Preliminary Studies section contains
 evidence that each of our three choices is the right one.
+_________________________________________
+ 2By “fundamentally spatial” we mean that there is information from a large number of spatial locations indexed by spatial coordinates; not
+just data which have only a few different locations or which is indexed by anatomical label.
+ 3Actually, many of these projects use quadrilaterals instead of square pixels; but we will refer to them as pixels for simplicity.
+ 4the number of true pixels in the intersection of the two images, divided by the number of pixels in their union.
+ 5“Expression energy ratio”, which captures overexpression.
 [6 ] looks at the mean expression level of genes within anatomical regions, and applies a Student’s t-test with Bonferroni
 correction to determine whether the mean expression level of a gene is significantly higher in the target region. Like AGEA,
 this is a pointwise measure (only the mean expression level per pixel is being analyzed), it is not being used to look for
@@ -127,12 +143,6 @@
 scoring methods.
 Aim 2: From gene expression data, discover a map of regions
 Machine learning terminology: clustering
-_
- 2By “fundamentally spatial” we mean that there is information from a large number of spatial locations indexed by spatial coordinates; not
-just data which have only a few different locations or which is indexed by anatomical label.
- 3Actually, many of these projects use quadrilaterals instead of square pixels; but we will refer to them as pixels for simplicity.
- 4the number of true pixels in the intersection of the two images, divided by the number of pixels in their union.
- 5“Expression energy ratio”, which captures overexpression.
 If one is given a dataset consisting merely of instances, with no class labels, then analysis of the dataset is referred to as
 unsupervised learning in the jargon of machine learning. One thing that you can do with such a dataset is to group instances
 together. A set of similar instances is called a cluster, and the activity of finding grouping the data into clusters is called
@@ -170,23 +180,19 @@
 strategy to achieve this goal is to first cluster the features (genes). There are two ways that clusters of genes could be used.
 Gene clusters could be used as part of dimensionality reduction: rather than have one feature for each gene, we could
 have one reduced feature for each gene cluster.
+__
+ 6There are imaging tasks which use more than three colors, for example multispectral imaging and hyperspectral imaging, which are often
+used to process satellite imagery.
+ 7First, because the number of features in the reduced dataset is less than in the original dataset, the running time of clustering algorithms
+may be much less. Second, it is thought that some clustering algorithms may give better results on reduced data.
 Gene clusters could also be used to directly yield a clustering on instances. This is because many genes have an expression
-pattern which seems to pick out a single, spatially continguous region. Therefore, it seems likely that an anatomically
+patternwhich seems to pick out a single, spatially continguous region. Therefore, it seems likely that an anatomically
 interesting region will have multiple genes which each individually pick it out8. This suggests the following procedure:
 cluster together genes which pick out similar regions, and then to use the more popular common regions as the final clusters.
 In Preliminary Studies, Figure 7, we show that a number of anatomically recognized cortical regions, as well as some
 “superregions” formed by lumping together a few regions, are associated with gene clusters in this fashion.
 The task of clustering both the instances and the features is called co-clustering, and there are a number of co-clustering
 algorithms.
-________________________________
- 6There are imaging tasks which use more than three colors, for example multispectral imaging and hyperspectral imaging, which are often
-used to process satellite imagery.
- 7First, because the number of features in the reduced dataset is less than in the original dataset, the running time of clustering algorithms
-may be much less. Second, it is thought that some clustering algorithms may give better results on reduced data.
- 8This would seem to contradict our finding in aim 1 that some cortical areas are combinatorially coded by multiple genes. However, it is
-possible that the currently accepted cortical maps divide the cortex into regions which are unnatural from the point of view of gene expression;
-perhaps there is some other way to map the cortex for which each region can be identified by single genes. Another possibility is that, although
-the cluster prototype fits an anatomical region, the individual genes are each somewhat different from the prototype.
 Related work
 Some researchers have attempted to parcellate cortex on the basis of non-gene expression data. For example, [15], [2], [16],
 and [1 ] associate spots on the cortex with the radial profile9 of response to some stain ([10] uses MRI), extract features from
@@ -226,8 +232,17 @@
 Background
 The cortex is divided into areas and layers. Because of the cortical columnar organization, the parcellation of the cortex
 into areas can be drawn as a 2-D map on the surface of the cortex. In the third dimension, the boundaries between the
+_________________________________________
+ 8This would seem to contradict our finding in aim 1 that some cortical areas are combinatorially coded by multiple genes. However, it is
+possible that the currently accepted cortical maps divide the cortex into regions which are unnatural from the point of view of gene expression;
+perhaps there is some other way to map the cortex for which each region can be identified by single genes. Another possibility is that, although
+the cluster prototype fits an anatomical region, the individual genes are each somewhat different from the prototype.
+ 9A radial profile is a profile along a line perpendicular to the cortical surface.
+ 10We ran “vanilla” NNMF, whereas the paper under discussion used a modified method. Their main modification consisted of adding a soft
+spatial contiguity constraint. However, on our dataset, NNMF naturally produced spatially contiguous clusters, so no additional constraint was
+needed. The paper under discussion also mentions that they tried a hierarchial variant of NNMF, which we have not yet tried.
 areas continue downwards into the cortical depth, perpendicular to the surface. The layer boundaries run parallel to the
-surface. One can picture an area of the cortex as a slice of a six-layered cake11.
+surface.One can picture an area of the cortex as a slice of a six-layered cake11.
 It is known that different cortical areas have distinct roles in both normal functioning and in disease processes, yet there
 are no known marker genes for most cortical areas. When it is necessary to divide a tissue sample into cortical areas, this is
 a manual process that requires a skilled human to combine multiple visual cues and interpret them in the context of their
@@ -238,12 +253,6 @@
 Franklin[14] on the other. While the maps are certainly very similar in their general arrangement, significant differences
 remain.
 The Allen Mouse Brain Atlas dataset
-__
- 9A radial profile is a profile along a line perpendicular to the cortical surface.
- 10We ran “vanilla” NNMF, whereas the paper under discussion used a modified method. Their main modification consisted of adding a soft
-spatial contiguity constraint. However, on our dataset, NNMF naturally produced spatially contiguous clusters, so no additional constraint was
-needed. The paper under discussion also mentions that they tried a hierarchial variant of NNMF, which we have not yet tried.
- 11Outside of isocortex, the number of layers varies.
 The Allen Mouse Brain Atlas (ABA) data were produced by doing in-situ hybridization on slices of male, 56-day-old
 C57BL/6J mouse brains. Pictures were taken of the processed slice, and these pictures were semi-automatically analyzed
 to create a digital measurement of gene expression levels at each location in each slice. Per slice, cellular spatial resolution
@@ -259,23 +268,6 @@
 EMAGE, most of the other resources have not (yet) extracted the expression intensity from the ISH images and registered
 the results into a single 3-D space, and to our knowledge only ABA and EMAGE make this form of data available for public
 download from the website14. Many of these resources focus on developmental gene expression.
-Significance
-The method developed in aim (1) will be applied to each cortical area to find a set of marker genes such that the
-combinatorial expression pattern of those genes uniquely picks out the target area. Finding marker genes will be useful for
-drug discovery as well as for experimentation because marker genes can be used to design interventions which selectively
-target individual cortical areas.
-The application of the marker gene finding algorithm to the cortex will also support the development of new neuroanatom-
-ical methods. In addition to finding markers for each individual cortical areas, we will find a small panel of genes that can
-find many of the areal boundaries at once. This panel of marker genes will allow the development of an ISH protocol that
-will allow experimenters to more easily identify which anatomical areas are present in small samples of cortex.
-The method developed in aim (2) will provide a genoarchitectonic viewpoint that will contribute to the creation of a
-better map. The development of present-day cortical maps was driven by the application of histological stains. If a different
-set of stains had been available which identified a different set of features, then today’s cortical maps may have come out
-differently. It is likely that there are many repeated, salient spatial patterns in the gene expression which have not yet been
-captured by any stain. Therefore, cortical anatomy needs to incorporate what we can learn from looking at the patterns of
-gene expression.
-While we do not here propose to analyze human gene expression data, it is conceivable that the methods we propose to
-develop could be used to suggest modifications to the human cortical map as well.
 Related work
 [13 ] describes the application of AGEA to the cortex. The paper describes interesting results on the structure of correlations
 between voxel gene expression profiles within a handful of cortical areas. However, this sort of analysis is not related to either
@@ -289,7 +281,8 @@
 Our project is guided by a concrete application with a well-specified criterion of success (how well we can find marker
 genes for / reproduce the layout of cortical areas), which will provide a solid basis for comparing different methods.
 _________________________________________
- 12The sagittal data do not cover the entire cortex, and also have greater registration error[13]. Genes were selected by the Allen Institute for
+ 11Outside of isocortex, the number of layers varies.
+ 12The sagittal data do not cover the entire cortex, and also have greater registration error[13]. Genes were selected by the Allen Institute for
 coronal sectioning based on, “classes of known neuroscientific interest... or through post hoc identification of a marked non-ubiquitous expression
 pattern”[13].
  13Other such resources include GENSAT[8], GenePaint[24], its sister project GeneAtlas[5], BGEM[12], EMAGE[23], EurExpress (http://www.
@@ -305,7 +298,30 @@
 intersection of a layer and an area, but since one area will have many layer-area intersection clusters, further work is needed to make sense of
 these). The reason that Gene Finder cannot the find marker genes for cortical areas is that, although the user chooses a seed voxel, Gene Finder
 chooses the ROI for which genes will be found, and it creates that ROI by (pairwise voxel correlation) clustering around the seed.
-Preliminary Studies
+Significance
+The method developed in aim (1) will be applied to each cortical area to find a set of marker genes such that the combinatorial
+expression pattern of those genes uniquely picks out the target area. Finding marker genes will be useful for drug discovery
+as well as for experimentation because marker genes can be used to design interventions which selectively target individual
+cortical areas.
+The application of the marker gene finding algorithm to the cortex will also support the development of new neuroanatom-
+ical methods. In addition to finding markers for each individual cortical areas, we will find a small panel of genes that can
+find many of the areal boundaries at once. This panel of marker genes will allow the development of an ISH protocol that
+will allow experimenters to more easily identify which anatomical areas are present in small samples of cortex.
+The method developed in aim (2) will provide a genoarchitectonic viewpoint that will contribute to the creation of a
+better map. The development of present-day cortical maps was driven by the application of histological stains. If a different
+set of stains had been available which identified a different set of features, then today’s cortical maps may have come out
+differently. It is likely that there are many repeated, salient spatial patterns in the gene expression which have not yet been
+captured by any stain. Therefore, cortical anatomy needs to incorporate what we can learn from looking at the patterns of
+gene expression.
+While we do not here propose to analyze human gene expression data, it is conceivable that the methods we propose
+to develop could be used to suggest modifications to the human cortical map as well. In fact, the methods we will develop
+will be applicable to other datasets beyond the brain. We will provide an open-source toolbox to allow other researchers
+to easily use our methods. With these methods, researchers with gene expression for any area of the body will be able to
+efficiently find marker genes for anatomical regions, or to use gene expression to discover new anatomical patterning. As
+described above, marker genes have a variety of uses in the development of drugs and experimental manipulations, and in
+the anatomical characterization of tissue samples. The discovery of new ways to carve up anatomical structures into regions
+will widely impact all areas of biology.
+The approach: Preliminary Studies
 
 
 Figure 1: Top row: Genes Nfic and
@@ -591,8 +607,8 @@
 cluster voxels.
 _____________________________
  195-fold cross-validation.
-Research Design and Methods
-Flatmapping and segmentation of cortical layers**
+The approach: what we plan to do
+Flatmap and segment cortical layers
 There are multiple ways to flatten 3-D data into 2-D. We will compare mappings from manifolds to planes which attempt
 to preserve size (such as the one used by Caret[7]) with mappings which preserve angle (conformal maps). Our method will
 include a statistical test that warns the user if the assumption of 2-D structure seems to be wrong.
@@ -633,15 +649,48 @@
 # compare using clustering scores
 # multivariate gradient similarity
 # deep belief nets
-Apply these algorithms to the cortex Using the methods developed in Aim 1, we will present, for each cortical area,
-a short list of markers to identify that area; and we will also present lists of “panels” of genes that can be used to delineate
-_________________________________________
+Apply these algorithms to the cortex
+___
  20Already, for each cortical area, we have used the C4.5 algorithm to find a decision tree for that area. We achieved good classification accuracy
 on our training set, but the number of genes that appeared in each tree was too large. We plan to implement a pruning procedure to generate
 trees that use fewer genes
-many areas at once. Using the methods developed in Aim 2, we will present one or more hierarchial cortical maps. We will
-identifyand explain how the statistical structure in the gene expression data led to any unexpected or interesting features
-of thesemaps.
+Using the methods developed in Aim 1, we will present, for each cortical area, a short list of markers to identify that
+area; and we will also present lists of “panels” of genes that can be used to delineate many areas at once. Using the methods
+developed in Aim 2, we will present one or more hierarchial cortical maps. We will identify and explain how the statistical
+structure in the gene expression data led to any unexpected or interesting features of these maps.
+Timeline and milestones
+Aim 1
+∙Oct-Nov 2009: develop an automated mechanism for segmenting the cortical voxels into layers
+∙Nov 2009 (milestone): a preliminary automated mechanism for segmenting the cortical voxels into layers
+∙Oct 2009-Feb 2010: develop scoring methods and to test them in various supervised learning frameworks. Also test
+out various dimensionality reduction schemes in combination with supervised learning.
+∙Dec 2009-April 2010: create or extend supervised learning frameworks which use multivariate versions of the best
+scoring methods
+∙January 2010 (milestone): submit a publication on single marker genes for cortical areas
+∙February-June 2010: explore the best way to integrate radial profiles with supervised learning. Explore the best way
+to make supervised learning techniques robust against incorrect labels (i.e. when the areas drawn on the input cortical
+map are slightly off). Quantitatively compare the performance of different supervised learning techniques.
+∙May-July 2010: Validate marker genes found in the ABA dataset by checking against other gene expression datasets
+∙June 2010: submit a paper describing a method fulfilling Aim 1
+∙July 2010: submit a paper describing combinations of marker genes for each cortical area, and a small number of
+marker genes that can, in combination, define most of the areas at once
+∙April-July 2010: create documentation and unit tests for software toolbox for Aim 1.
+∙August 2010-: respond to user bug reports for Aim 1 software toolbox.
+Aim 2
+∙April-September 2010: explore dimensionality reduction algorithms for Aim 2
+∙June-November 2010: explore standard hierarchial clustering algorithms, used in combination with dimensionality
+reduction, for Aim 2
+∙July-December 2010: explore co-clustering algorithms. Think about how radial profile information can be used for
+Aim 2. Adapt clustering algorithms to use radial profile information.
+∙January-March 2011: Quantitatively compare the performance of different dimensionality reduction and clustering
+techniques. Quantitatively compare the value of different flatmapping methods and ways of representing radial profiles.
+∙January-June 2011: using the methods developed for Aim 2, explore the genomic anatomy of the cortex. Read the
+literature and talk to people to learn about research related to unexpected and interesting discoveries.
+∙February-May 2011: create documentation and unit tests for software toolbox for Aim 2.
+∙June 2011-: respond to user bug reports for Aim 1 software toolbox.
+∙March 2011: submit a paper describing a method fulfilling Aim 2
+∙May 2011: submit a paper on the genomic anatomy of the cortex, using the methods developed in Aim 2
+∙May-August 2011: revisit Aim 1 to see if what was learned during Aim 2 can improve the methods for Aim 1.
 Bibliography & References Cited
 [1]Chris Adamson, Leigh Johnston, Terrie Inder, Sandra Rees, Iven Mareels, and Gary Egan. A Tracking Approach to
 Parcellation of the Cerebral Cortex, volume Volume 3749/2005 of Lecture Notes in Computer Science, pages 294–301.
3.1 Binary file grant.odt has changed
4.1 Binary file grant.pdf has changed
5.1 --- a/grant.txt Tue Apr 21 04:05:54 2009 -0700 5.2 +++ b/grant.txt Tue Apr 21 05:34:25 2009 -0700 5.3 @@ -23,7 +23,13 @@ 5.4 5.5 \newpage 5.6 5.7 -== Background and significance == 5.8 +== The challenge topic == 5.9 + 5.10 +This proposal addresses challenge topic 06-HG-101. Massive new datasets obtained with techniques such as in situ hybridization (ISH), immunohistochemistry, in situ transgenic reporter, microarray voxelation, and others, allow the expression levels of many genes at many locations to be compared. Our goal is to develop automated methods to relate spatial variation in gene expression to anatomy. We want to find marker genes for specific anatomical regions, and also to draw new anatomical maps based on gene expression patterns. 5.11 + 5.12 +== The Challenge and Potential impact == 5.13 + 5.14 +Now we will discuss each of our three aims in turn. For each aim, we will develop a conceptual framework for thinking about the task, and we will present our strategy for solving it. Next we will discuss related work. At the conclusion of each section, we will summarize why our strategy is different from what has been done before. At the end of this section, we will describe the potential impact. 5.15 5.16 === Aim 1: Given a map of regions, find genes that mark the regions === 5.17 5.18 @@ -201,7 +207,20 @@ 5.19 5.20 5.21 5.22 -\vspace{0.3cm}**Significance** 5.23 + 5.24 +=== Related work === 5.25 + 5.26 +\cite{ng_anatomic_2009} describes the application of AGEA to the cortex. The paper describes interesting results on the structure of correlations between voxel gene expression profiles within a handful of cortical areas. However, this sort of analysis is not related to either of our aims, as it neither finds marker genes, nor does it suggest a cortical map based on gene expression data. 
Neither of the other components of AGEA can be applied to cortical areas; AGEA's Gene Finder cannot be used to find marker genes for the cortical areas; and AGEA's hierarchial clustering does not produce clusters corresponding to the cortical areas\footnote{In both cases, the cause is that pairwise correlations between the gene expression of voxels in different areas but the same layer are often stronger than pairwise correlations between the gene expression of voxels in different layers but the same area. Therefore, a pairwise voxel correlation clustering algorithm will tend to create clusters representing cortical layers, not areas (there may be clusters which presumably correspond to the intersection of a layer and an area, but since one area will have many layer-area intersection clusters, further work is needed to make sense of these). The reason that Gene Finder cannot the find marker genes for cortical areas is that, although the user chooses a seed voxel, Gene Finder chooses the ROI for which genes will be found, and it creates that ROI by (pairwise voxel correlation) clustering around the seed.}. 5.27 + 5.28 + 5.29 +%% Most of the projects which have been discussed have been done by the same groups that develop the public datasets. Although these projects make their algorithms available for use on their own website, none of them have released an open-source software toolkit; instead, users are restricted to using the provided algorithms only on their own dataset. 5.30 + 5.31 +In summary, for all three aims, (a) only one of the previous projects explores combinations of marker genes, (b) there has been almost no comparison of different algorithms or scoring methods, and (c) there has been no work on computationally finding marker genes for cortical areas, or on finding a hierarchial clustering that will yield a map of cortical areas de novo from gene expression data. 
5.32 + 5.33 +Our project is guided by a concrete application with a well-specified criterion of success (how well we can find marker genes for \begin{latex}/\end{latex} reproduce the layout of cortical areas), which will provide a solid basis for comparing different methods. 5.34 + 5.35 + 5.36 +== Significance == 5.37 5.38 The method developed in aim (1) will be applied to each cortical area to find a set of marker genes such that the combinatorial expression pattern of those genes uniquely picks out the target area. Finding marker genes will be useful for drug discovery as well as for experimentation because marker genes can be used to design interventions which selectively target individual cortical areas. 5.39 5.40 @@ -209,28 +228,16 @@ 5.41 5.42 5.43 %% Since the number of classes of stains is small compared to the number of genes, 5.44 + 5.45 The method developed in aim (2) will provide a genoarchitectonic viewpoint that will contribute to the creation of a better map. The development of present-day cortical maps was driven by the application of histological stains. If a different set of stains had been available which identified a different set of features, then today's cortical maps may have come out differently. It is likely that there are many repeated, salient spatial patterns in the gene expression which have not yet been captured by any stain. Therefore, cortical anatomy needs to incorporate what we can learn from looking at the patterns of gene expression. 5.46 5.47 - 5.48 -While we do not here propose to analyze human gene expression data, it is conceivable that the methods we propose to develop could be used to suggest modifications to the human cortical map as well. 5.49 - 5.50 - 5.51 -=== Related work === 5.52 - 5.53 -\cite{ng_anatomic_2009} describes the application of AGEA to the cortex. The paper describes interesting results on the structure of correlations between voxel gene expression profiles within a handful of cortical areas. 
However, this sort of analysis is not related to either of our aims, as it neither finds marker genes, nor does it suggest a cortical map based on gene expression data. Neither of the other components of AGEA can be applied to cortical areas; AGEA's Gene Finder cannot be used to find marker genes for the cortical areas; and AGEA's hierarchial clustering does not produce clusters corresponding to the cortical areas\footnote{In both cases, the cause is that pairwise correlations between the gene expression of voxels in different areas but the same layer are often stronger than pairwise correlations between the gene expression of voxels in different layers but the same area. Therefore, a pairwise voxel correlation clustering algorithm will tend to create clusters representing cortical layers, not areas (there may be clusters which presumably correspond to the intersection of a layer and an area, but since one area will have many layer-area intersection clusters, further work is needed to make sense of these). The reason that Gene Finder cannot the find marker genes for cortical areas is that, although the user chooses a seed voxel, Gene Finder chooses the ROI for which genes will be found, and it creates that ROI by (pairwise voxel correlation) clustering around the seed.}. 5.54 - 5.55 - 5.56 -%% Most of the projects which have been discussed have been done by the same groups that develop the public datasets. Although these projects make their algorithms available for use on their own website, none of them have released an open-source software toolkit; instead, users are restricted to using the provided algorithms only on their own dataset. 
5.57 - 5.58 -In summary, for all three aims, (a) only one of the previous projects explores combinations of marker genes, (b) there has been almost no comparison of different algorithms or scoring methods, and (c) there has been no work on computationally finding marker genes for cortical areas, or on finding a hierarchical clustering that will yield a map of cortical areas de novo from gene expression data. 5.59 - 5.60 -Our project is guided by a concrete application with a well-specified criterion of success (how well we can find marker genes for \begin{latex}/\end{latex} reproduce the layout of cortical areas), which will provide a solid basis for comparing different methods. 5.61 +While we do not here propose to analyze human gene expression data, it is conceivable that the methods we propose to develop could be used to suggest modifications to the human cortical map as well. In fact, the methods we will develop will be applicable to other datasets beyond the brain. We will provide an open-source toolbox to allow other researchers to easily use our methods. With these methods, researchers with gene expression for any area of the body will be able to efficiently find marker genes for anatomical regions, or to use gene expression to discover new anatomical patterning. As described above, marker genes have a variety of uses in the development of drugs and experimental manipulations, and in the anatomical characterization of tissue samples. The discovery of new ways to carve up anatomical structures into regions will widely impact all areas of biology.
5.62 + 5.63 5.64 5.65 5.66 \newpage 5.67 - 5.68 -== Preliminary Studies == 5.69 +== The approach: Preliminary Studies == 5.70 \begin{wrapfigure}{L}{0.35\textwidth}\centering 5.71 %%\includegraphics[scale=.27]{singlegene_SS_corr_top_1_2365_jet.eps}\includegraphics[scale=.27]{singlegene_SS_corr_top_2_242_jet.eps}\includegraphics[scale=.27]{singlegene_SS_corr_top_3_654_jet.eps} 5.72 %%\\ 5.73 @@ -445,10 +452,10 @@ 5.74 5.75 5.76 \newpage 5.77 -== Research Design and Methods == 5.78 - 5.79 - 5.80 -\vspace{0.3cm}**Flatmapping and segmentation of cortical layers** 5.81 +== The approach: what we plan to do == 5.82 + 5.83 + 5.84 +\vspace{0.3cm}**Flatmap and segment cortical layers** 5.85 5.86 %%In anatomy, the manifold of interest is usually either defined by a combination of two relevant anatomical axes (todo), or by the surface of the structure (as is the case with the cortex). In the former case, the manifold of interest is a plane, but in the latter case it is curved. If the manifold is curved, there are various methods for mapping the manifold into a plane. 5.87 5.88 @@ -514,6 +521,7 @@ 5.89 5.90 5.91 \vspace{0.3cm}**Apply these algorithms to the cortex** 5.92 + 5.93 Using the methods developed in Aim 1, we will present, for each cortical area, a short list of markers to identify that area; and we will also present lists of "panels" of genes that can be used to delineate many areas at once. Using the methods developed in Aim 2, we will present one or more hierarchical cortical maps. We will identify and explain how the statistical structure in the gene expression data led to any unexpected or interesting features of these maps. 5.94 5.95 5.96 @@ -523,6 +531,34 @@ 5.97 %%Presently, we do not have a probabilistic atlas which is registered to the ABA space. However, in anticipation of the availability of such maps, we would like to explore extensions to our Aim 1 techniques which can handle probabilistic maps.
5.98 5.99 5.100 +== Timeline and milestones == 5.101 + 5.102 +=== Aim 1 === 5.103 + 5.104 +* Oct-Nov 2009: develop an automated mechanism for segmenting the cortical voxels into layers 5.105 +* Nov 2009 (milestone): a preliminary automated mechanism for segmenting the cortical voxels into layers 5.106 +* Oct 2009-Feb 2010: develop scoring methods and test them in various supervised learning frameworks. Also test out various dimensionality reduction schemes in combination with supervised learning. 5.107 +* Dec 2009-April 2010: create or extend supervised learning frameworks which use multivariate versions of the best scoring methods 5.108 +* January 2010 (milestone): submit a publication on single marker genes for cortical areas 5.109 +* February-June 2010: explore the best way to integrate radial profiles with supervised learning. Explore the best way to make supervised learning techniques robust against incorrect labels (i.e. when the areas drawn on the input cortical map are slightly off). Quantitatively compare the performance of different supervised learning techniques. 5.110 +* May-July 2010: Validate marker genes found in the ABA dataset by checking against other gene expression datasets 5.111 +* June 2010: submit a paper describing a method fulfilling Aim 1 5.112 +* July 2010: submit a paper describing combinations of marker genes for each cortical area, and a small number of marker genes that can, in combination, define most of the areas at once 5.113 +* April-July 2010: create documentation and unit tests for software toolbox for Aim 1. 5.114 +* August 2010-: respond to user bug reports for Aim 1 software toolbox. 5.115 + 5.116 +=== Aim 2 === 5.117 +* April-September 2010: explore dimensionality reduction algorithms for Aim 2 5.118 +* June-November 2010: explore standard hierarchical clustering algorithms, used in combination with dimensionality reduction, for Aim 2 5.119 +* July-December 2010: explore co-clustering algorithms.
Think about how radial profile information can be used for Aim 2. Adapt clustering algorithms to use radial profile information. 5.120 +* January-March 2011: Quantitatively compare the performance of different dimensionality reduction and clustering techniques. Quantitatively compare the value of different flatmapping methods and ways of representing radial profiles. 5.121 +* January-June 2011: using the methods developed for Aim 2, explore the genomic anatomy of the cortex. Read the literature and talk to people to learn about research related to unexpected and interesting discoveries. 5.122 +* February-May 2011: create documentation and unit tests for software toolbox for Aim 2. 5.123 +* June 2011-: respond to user bug reports for Aim 1 software toolbox. 5.124 +* March 2011: submit a paper describing a method fulfilling Aim 2 5.125 +* May 2011: submit a paper on the genomic anatomy of the cortex, using the methods developed in Aim 2 5.126 +* May-August 2011: revisit Aim 1 to see if what was learned during Aim 2 can improve the methods for Aim 1. 5.127 + 5.128 \newpage 5.129 5.130 \bibliographystyle{plain}