cg

diff grant.html @ 97:1849a5bd1ce9

.
author bshanks@bshanks.dyndns.org
date Wed Apr 22 05:27:25 2009 -0700
parents a25a60a4bf43
children a75c226cbdd6
line diff
1.1 --- a/grant.html Tue Apr 21 18:53:40 2009 -0700
1.2 +++ b/grant.html Wed Apr 22 05:27:25 2009 -0700
1.3 @@ -1,834 +1,938 @@
1.4 Specific aims
1.5 -Massivenew datasets obtained with techniques such as in situ hybridization (ISH), immunohistochemistry, in situ transgenic
1.6 -reporter, microarray voxelation, and others, allow the expression levels of many genes at many locations to be compared.
1.7 -Our goal is to develop automated methods to relate spatial variation in gene expression to anatomy. We want to find marker
1.8 -genes for specific anatomical regions, and also to draw new anatomical maps based on gene expression patterns. We have
1.9 -three specific aims:
1.10 -(1) develop an algorithm to screen spatial gene expression data for combinations of marker genes which selectively target
1.11 -anatomical regions
1.12 -(2) develop an algorithm to suggest new ways of carving up a structure into anatomically distinct regions, based on
1.13 -spatial patterns in gene expression
1.14 -(3) create a 2-D &#8220;flat map&#8221; dataset of the mouse cerebral cortex that contains a flattened version of the Allen Mouse
1.15 -Brain Atlas ISH data, as well as the boundaries of cortical anatomical areas. This will involve extending the functionality of
1.16 -Caret, an existing open-source scientific imaging program. Use this dataset to validate the methods developed in (1) and (2).
1.17 -Although our particular application involves the 3D spatial distribution of gene expression, we anticipate that the methods
1.18 -developed in aims (1) and (2) will generalize to any sort of high-dimensional data over points located in a low-dimensional
1.19 -space. In particular, our method could be applied to genome-wide sequencing data derived from sets of tissues and disease
1.20 -states.
1.21 -In terms of the application of the methods to cerebral cortex, aim (1) is to go from cortical areas to marker genes,
1.22 -and aim (2) is to let the gene profile define the cortical areas. In addition to validating the usefulness of the algorithms,
1.23 -the application of these methods to cortex will produce immediate benefits, because there are currently no known genetic
1.24 -markers for most cortical areas. The results of the project will support the development of new ways to selectively target
1.25 -cortical areas, and it will support the development of a method for identifying the cortical areal boundaries present in small
1.26 -tissue samples.
1.27 -All algorithms that we develop will be implemented in a GPL open-source software toolkit. The toolkit, as well as the
1.28 -machine-readable datasets developed in aim (3), will be published and freely available for others to use.
1.29 +Massive new datasets obtained with techniques such as in situ hybridization (ISH), immunohistochemistry, in
1.30 +situ transgenic reporter, microarray voxelation, and others, allow the expression levels of many genes at many
1.31 +locations to be compared. Our goal is to develop automated methods to relate spatial variation in gene expres-
1.32 +sion to anatomy. We want to find marker genes for specific anatomical regions, and also to draw new anatomical
1.33 +maps based on gene expression patterns. We have three specific aims:
1.34 +(1) develop an algorithm to screen spatial gene expression data for combinations of marker genes which
1.35 +selectively target anatomical regions
1.36 +(2) develop an algorithm to suggest new ways of carving up a structure into anatomically distinct regions,
1.37 +based on spatial patterns in gene expression
1.38 +(3) create a 2-D &#8220;flat map&#8221; dataset of the mouse cerebral cortex that contains a flattened version of the Allen
1.39 +Mouse Brain Atlas ISH data, as well as the boundaries of cortical anatomical areas. This will involve extending
1.40 +the functionality of Caret, an existing open-source scientific imaging program. Use this dataset to validate the
1.41 +methods developed in (1) and (2).
1.42 +Although our particular application involves the 3D spatial distribution of gene expression, we anticipate that
1.43 +the methods developed in aims (1) and (2) will generalize to any sort of high-dimensional data over points located
1.44 +in a low-dimensional space. In particular, our method could be applied to genome-wide sequencing data derived
1.45 +from sets of tissues and disease states.
1.46 +In terms of the application of the methods to cerebral cortex, aim (1) is to go from cortical areas to marker
1.47 +genes, and aim (2) is to let the gene profile define the cortical areas. In addition to validating the usefulness
1.48 +of the algorithms, the application of these methods to cortex will produce immediate benefits, because there
1.49 +are currently no known genetic markers for most cortical areas. The results of the project will support the
1.50 +development of new ways to selectively target cortical areas, and it will support the development of a method for
1.51 +identifying the cortical areal boundaries present in small tissue samples.
1.52 +All algorithms that we develop will be implemented in a GPL open-source software toolkit. The toolkit, as well
1.53 +as the machine-readable datasets developed in aim (3), will be published and freely available for others to use.
1.54 The challenge topic
1.55 -This proposal addresses challenge topic 06-HG-101. Massive new datasets obtained with techniques such as in situ hybridiza-
1.56 -tion (ISH), immunohistochemistry, in situ transgenic reporter, microarray voxelation, and others, allow the expression levels
1.57 -of many genes at many locations to be compared. Our goal is to develop automated methods to relate spatial variation in
1.58 -gene expression to anatomy. We want to find marker genes for specific anatomical regions, and also to draw new anatomical
1.59 -maps based on gene expression patterns.
1.60 +This proposal addresses challenge topic 06-HG-101. Massive new datasets obtained with techniques such as
1.61 +in situ hybridization (ISH), immunohistochemistry, in situ transgenic reporter, microarray voxelation, and others,
1.62 +allow the expression levels of many genes at many locations to be compared. Our goal is to develop automated
1.63 +methods to relate spatial variation in gene expression to anatomy. We want to find marker genes for specific
1.64 +anatomical regions, and also to draw new anatomical maps based on gene expression patterns.
1.65 The Challenge and Potential impact
1.66 -Each of our three aims will be discussed in turn. For each aim, we will develop a conceptual framework for thinking about
1.67 -the task, and we will present our strategy for solving it. Next we will discuss related work. At the conclusion of each section,
1.68 -we will summarize why our strategy is different from what has been done before. At the end of this section, we will describe
1.69 -the potential impact.
1.70 +Each of our three aims will be discussed in turn. For each aim, we will develop a conceptual framework for
1.71 +thinking about the task, and we will present our strategy for solving it. Next we will discuss related work. At the
1.72 +conclusion of each section, we will summarize why our strategy is different from what has been done before. At
1.73 +the end of this section, we will describe the potential impact.
1.74 Aim 1: Given a map of regions, find genes that mark the regions
1.75 -Machine learning terminology: classifiers The task of looking for marker genes for known anatomical regions means
1.76 -that one is looking for a set of genes such that, if the expression level of those genes is known, then the locations of the
1.77 -regions can be inferred.
1.78 -If we define the regions so that they cover the entire anatomical structure to be subdivided, we may say that we are
1.79 -using gene expression in each voxel to assign that voxel to the proper area. We call this a classification task, because each
1.80 -voxel is being assigned to a class (namely, its region). An understanding of the relationship between the combination of
1.81 -their expression levels and the locations of the regions may be expressed as a function. The input to this function is a voxel,
1.82 -along with the gene expression levels within that voxel; the output is the regional identity of the target voxel, that is, the
1.83 -region to which the target voxel belongs. We call this function a classifier. In general, the input to a classifier is called an
1.84 -instance, and the output is called a label (or a class label).
1.85 -The object of aim 1 is not to produce a single classifier, but rather to develop an automated method for determining a
1.86 -classifier for any known anatomical structure. Therefore, we seek a procedure by which a gene expression dataset may be
1.87 -analyzed in concert with an anatomical atlas in order to produce a classifier. The initial gene expression dataset used in
1.88 -the construction of the classifier is called training data. In the machine learning literature, this sort of procedure may be
1.89 -thought of as a supervised learning task, defined as a task in which the goal is to learn a mapping from instances to labels,
1.90 -and the training data consists of a set of instances (voxels) for which the labels (regions) are known.
1.91 -Each gene expression level is called a feature, and the selection of which genes1 to include is called feature selection.
1.92 -Feature selection is one component of the task of learning a classifier. Some methods for learning classifiers start out with
1.93 -a separate feature selection phase, whereas other methods combine feature selection with other aspects of training.
1.94 -One class of feature selection methods assigns some sort of score to each candidate gene. The top-ranked genes are then
1.95 -chosen. Some scoring measures can assign a score to a set of selected genes, not just to a single gene; in this case, a dynamic
1.96 -procedure may be used in which features are added and subtracted from the selected set depending on how much they raise
1.97 -the score. Such procedures are called &#8220;stepwise&#8221; or &#8220;greedy&#8221;.
1.98 -Although the classifier itself may only look at the gene expression data within each voxel before classifying that voxel, the
1.99 -algorithm which constructs the classifier may look over the entire dataset. We can categorize score-based feature selection
1.100 -methods depending on how the score of calculated. Often the score calculation consists of assigning a sub-score to each voxel,
1.101 -and then aggregating these sub-scores into a final score (the aggregation is often a sum or a sum of squares or average). If
1.102 -only information from nearby voxels is used to calculate a voxel&#8217;s sub-score, then we say it is a local scoring method. If only
1.103 -information from the voxel itself is used to calculate a voxel&#8217;s sub-score, then we say it is a pointwise scoring method.
1.104 -Both gene expression data and anatomical atlases have errors, due to a variety of factors. Individual subjects have
1.105 -idiosyncratic anatomy. Subjects may be improperly registred to the atlas. The method used to measure gene expression
1.106 -may be noisy. The atlas may have errors. It is even possible that some areas in the anatomical atlas are &#8220;wrong&#8221; in that
1.107 -they do not have the same shape as the natural domains of gene expression to which they correspond. These sources of error
1.108 -can affect the displacement and the shape of both the gene expression data and the anatomical target areas. Therefore, it
1.109 -is important to use feature selection methods which are robust to these kinds of errors.
1.110 +Machine learning terminology: classifiers The task of looking for marker genes for known anatomical regions
1.111 +means that one is looking for a set of genes such that, if the expression level of those genes is known, then the
1.112 +locations of the regions can be inferred.
1.113 +If we define the regions so that they cover the entire anatomical structure to be subdivided, we may say that
1.114 +we are using gene expression in each voxel to assign that voxel to the proper area. We call this a classification
1.115 +task, because each voxel is being assigned to a class (namely, its region). An understanding of the relationship
1.116 +between the combination of their expression levels and the locations of the regions may be expressed as a
1.117 +function. The input to this function is a voxel, along with the gene expression levels within that voxel; the output is
1.118 +the regional identity of the target voxel, that is, the region to which the target voxel belongs. We call this function
1.119 +a classifier. In general, the input to a classifier is called an instance, and the output is called a label (or a class
1.120 +label).
1.121 +The object of aim 1 is not to produce a single classifier, but rather to develop an automated method for
1.122 +determining a classifier for any known anatomical structure. Therefore, we seek a procedure by which a gene
1.123 +expression dataset may be analyzed in concert with an anatomical atlas in order to produce a classifier. The
1.124 +initial gene expression dataset used in the construction of the classifier is called training data. In the machine
1.125 +learning literature, this sort of procedure may be thought of as a supervised learning task, defined as a task in
1.126 +which the goal is to learn a mapping from instances to labels, and the training data consists of a set of instances
1.127 +(voxels) for which the labels (regions) are known.
1.128 +Each gene expression level is called a feature, and the selection of which genes1 to include is called feature
1.129 +selection. Feature selection is one component of the task of learning a classifier. Some methods for learning
1.130 +classifiers start out with a separate feature selection phase, whereas other methods combine feature selection
1.131 +with other aspects of training.
1.132 +One class of feature selection methods assigns some sort of score to each candidate gene. The top-ranked
1.133 +genes are then chosen. Some scoring measures can assign a score to a set of selected genes, not just to a
1.134 +single gene; in this case, a dynamic procedure may be used in which features are added and subtracted from the
1.135 +selected set depending on how much they raise the score. Such procedures are called &#8220;stepwise&#8221; or &#8220;greedy&#8221;.
1.136 +Although the classifier itself may only look at the gene expression data within each voxel before classifying
1.137 +that voxel, the algorithm which constructs the classifier may look over the entire dataset. We can categorize
1.138 +score-based feature selection methods depending on how the score of calculated. Often the score calculation
1.139 +consists of assigning a sub-score to each voxel, and then aggregating these sub-scores into a final score (the
1.140 +aggregation is often a sum or a sum of squares or average). If only information from nearby voxels is used to
1.141 +calculate a voxel&#8217;s sub-score, then we say it is a local scoring method. If only information from the voxel itself is
1.142 +used to calculate a voxel&#8217;s sub-score, then we say it is a pointwise scoring method.
1.143 +_________________________________________
1.144 + 1Strictly speaking, the features are gene expression levels, but we&#8217;ll call them genes.
1.145 +Both gene expression data and anatomical atlases have errors, due to a variety of factors. Individual subjects
1.146 +have idiosyncratic anatomy. Subjects may be improperly registred to the atlas. The method used to measure
1.147 +gene expression may be noisy. The atlas may have errors. It is even possible that some areas in the anatomical
1.148 +atlas are &#8220;wrong&#8221; in that they do not have the same shape as the natural domains of gene expression to which
1.149 +they correspond. These sources of error can affect the displacement and the shape of both the gene expression
1.150 +data and the anatomical target areas. Therefore, it is important to use feature selection methods which are
1.151 +robust to these kinds of errors.
1.152 Our strategy for Aim 1
1.153 -Key questions when choosing a learning method are: What are the instances? What are the features? How are the features
1.154 -chosen? Here are four principles that outline our answers to these questions.
1.155 -_________________________________________
1.156 - 1Strictly speaking, the features are gene expression levels, but we&#8217;ll call them genes.
1.157 +Key questions when choosing a learning method are: What are the instances? What are the features? How are
1.158 +the features chosen? Here are four principles that outline our answers to these questions.
1.159 Principle 1: Combinatorial gene expression
1.160 -It istoo much to hope that every anatomical region of interest will be identified by a single gene. For example, in the
1.161 -cortex, there are some areas which are not clearly delineated by any gene included in the Allen Brain Atlas (ABA) dataset.
1.162 -However, at least some of these areas can be delineated by looking at combinations of genes (an example of an area for
1.163 -which multiple genes are necessary and sufficient is provided in Preliminary Studies, Figure 4). Therefore, each instance
1.164 -should contain multiple features (genes).
1.165 +It is too much to hope that every anatomical region of interest will be identified by a single gene. For example,
1.166 +in the cortex, there are some areas which are not clearly delineated by any gene included in the Allen Brain Atlas
1.167 +(ABA) dataset. However, at least some of these areas can be delineated by looking at combinations of genes
1.168 +(an example of an area for which multiple genes are necessary and sufficient is provided in Preliminary Studies,
1.169 +Figure 4). Therefore, each instance should contain multiple features (genes).
1.170 Principle 2: Only look at combinations of small numbers of genes
1.171 -When the classifier classifies a voxel, it is only allowed to look at the expression of the genes which have been selected
1.172 -as features. The more data that are available to a classifier, the better that it can do. For example, perhaps there are weak
1.173 -correlations over many genes that add up to a strong signal. So, why not include every gene as a feature? The reason is that
1.174 -we wish to employ the classifier in situations in which it is not feasible to gather data about every gene. For example, if we
1.175 -want to use the expression of marker genes as a trigger for some regionally-targeted intervention, then our intervention must
1.176 -contain a molecular mechanism to check the expression level of each marker gene before it triggers. It is currently infeasible
1.177 -to design a molecular trigger that checks the level of more than a handful of genes. Similarly, if the goal is to develop a
1.178 -procedure to do ISH on tissue samples in order to label their anatomy, then it is infeasible to label more than a few genes.
1.179 -Therefore, we must select only a few genes as features.
1.180 -The requirement to find combinations of only a small number of genes limits us from straightforwardly applying many
1.181 -of the most simple techniques from the field of supervised machine learning. In the parlance of machine learning, our task
1.182 -combines feature selection with supervised learning.
1.183 +When the classifier classifies a voxel, it is only allowed to look at the expression of the genes which have
1.184 +been selected as features. The more data that are available to a classifier, the better that it can do. For example,
1.185 +perhaps there are weak correlations over many genes that add up to a strong signal. So, why not include every
1.186 +gene as a feature? The reason is that we wish to employ the classifier in situations in which it is not feasible to
1.187 +gather data about every gene. For example, if we want to use the expression of marker genes as a trigger for
1.188 +some regionally-targeted intervention, then our intervention must contain a molecular mechanism to check the
1.189 +expression level of each marker gene before it triggers. It is currently infeasible to design a molecular trigger that
1.190 +checks the level of more than a handful of genes. Similarly, if the goal is to develop a procedure to do ISH on
1.191 +tissue samples in order to label their anatomy, then it is infeasible to label more than a few genes. Therefore, we
1.192 +must select only a few genes as features.
1.193 +The requirement to find combinations of only a small number of genes limits us from straightforwardly ap-
1.194 +plying many of the most simple techniques from the field of supervised machine learning. In the parlance of
1.195 +machine learning, our task combines feature selection with supervised learning.
1.196 Principle 3: Use geometry in feature selection
1.197 -When doing feature selection with score-based methods, the simplest thing to do would be to score the performance of
1.198 -each voxel by itself and then combine these scores (pointwise scoring). A more powerful approach is to also use information
1.199 -about the geometric relations between each voxel and its neighbors; this requires non-pointwise, local scoring methods. See
1.200 -Preliminary Studies, figure 3 for evidence of the complementary nature of pointwise and local scoring methods.
1.201 +When doing feature selection with score-based methods, the simplest thing to do would be to score the per-
1.202 +formance of each voxel by itself and then combine these scores (pointwise scoring). A more powerful approach
1.203 +is to also use information about the geometric relations between each voxel and its neighbors; this requires non-
1.204 +pointwise, local scoring methods. See Preliminary Studies, figure 3 for evidence of the complementary nature of
1.205 +pointwise and local scoring methods.
1.206 Principle 4: Work in 2-D whenever possible
1.207 -There are many anatomical structures which are commonly characterized in terms of a two-dimensional manifold. When
1.208 -it is known that the structure that one is looking for is two-dimensional, the results may be improved by allowing the analysis
1.209 -algorithm to take advantage of this prior knowledge. In addition, it is easier for humans to visualize and work with 2-D
1.210 -data. Therefore, when possible, the instances should represent pixels, not voxels.
1.211 +There are many anatomical structures which are commonly characterized in terms of a two-dimensional
1.212 +manifold. When it is known that the structure that one is looking for is two-dimensional, the results may be
1.213 +improved by allowing the analysis algorithm to take advantage of this prior knowledge. In addition, it is easier for
1.214 +humans to visualize and work with 2-D data. Therefore, when possible, the instances should represent pixels,
1.215 +not voxels.
1.216 Related work
1.217 -There is a substantial body of work on the analysis of gene expression data, most of this concerns gene expression data
1.218 -which are not fundamentally spatial2.
1.219 -As noted above, there has been much work on both supervised learning and there are many available algorithms for
1.220 -each. However, the algorithms require the scientist to provide a framework for representing the problem domain, and the
1.221 -way that this framework is set up has a large impact on performance. Creating a good framework can require creatively
1.222 -reconceptualizing the problem domain, and is not merely a mechanical &#8220;fine-tuning&#8221; of numerical parameters. For example,
1.223 -we believe that domain-specific scoring measures (such as gradient similarity, which is discussed in Preliminary Studies) may
1.224 -be necessary in order to achieve the best results in this application.
1.225 -We are aware of six existing efforts to find marker genes using spatial gene expression data using automated methods.
1.226 -[12 ] mentions the possibility of constructing a spatial region for each gene, and then, for each anatomical structure of
1.227 -interest, computing what proportion of this structure is covered by the gene&#8217;s spatial region.
1.228 -GeneAtlas[5] and EMAGE [25] allow the user to construct a search query by demarcating regions and then specifing
1.229 -either the strength of expression or the name of another gene or dataset whose expression pattern is to be matched. For the
1.230 -similiarity score (match score) between two images (in this case, the query and the gene expression images), GeneAtlas uses
1.231 -the sum of a weighted L1-norm distance between vectors whose components represent the number of cells within a pixel3
1.232 -whose expression is within four discretization levels. EMAGE uses Jaccard similarity4. Neither GeneAtlas nor EMAGE
1.233 -allow one to search for combinations of genes that define a region in concert but not separately.
1.234 -[14 ] describes AGEA, &#8221;Anatomic Gene Expression Atlas&#8221;. AGEA has three components. Gene Finder: The user
1.235 -selects a seed voxel and the system (1) chooses a cluster which includes the seed voxel, (2) yields a list of genes which are
1.236 -overexpressed in that cluster. (note: the ABA website also contains pre-prepared lists of overexpressed genes for selected
1.237 -structures). Correlation: The user selects a seed voxel and the system then shows the user how much correlation there is
1.238 -between the gene expression profile of the seed voxel and every other voxel. Clusters: will be described later
1.239 -_________________________________________
1.240 - 2By &#8220;fundamentally spatial&#8221; we mean that there is information from a large number of spatial locations indexed by spatial coordinates; not
1.241 -just data which have only a few different locations or which is indexed by anatomical label.
1.242 - 3Actually, many of these projects use quadrilaterals instead of square pixels; but we will refer to them as pixels for simplicity.
1.243 - 4the number of true pixels in the intersection of the two images, divided by the number of pixels in their union.
1.244 -Gene Finder is different from our Aim 1 in at least three ways. First, Gene Finder finds only single genes, whereas we
1.245 -will also look for combinations of genes. Second, gene finder can only use overexpression as a marker, whereas we will also
1.246 -search for underexpression. Third, Gene Finder uses a simple pointwise score5, whereas we will also use geometric scores
1.247 -such as gradient similarity (described in Preliminary Studies). Figures 4, 2, and 3 in the Preliminary Studies section contains
1.248 -evidence that each of our three choices is the right one.
1.249 -[6 ] looks at the mean expression level of genes within anatomical regions, and applies a Student&#8217;s t-test with Bonferroni
1.250 -correction to determine whether the mean expression level of a gene is significantly higher in the target region. Like AGEA,
1.251 -this is a pointwise measure (only the mean expression level per pixel is being analyzed), it is not being used to look for
1.252 -underexpression, and does not look for combinations of genes.
1.253 -[10 ] describes a technique to find combinations of marker genes to pick out an anatomical region. They use an evolutionary
1.254 -algorithm to evolve logical operators which combine boolean (thresholded) images in order to match a target image. Their
1.255 -match score is Jaccard similarity.
1.256 -In summary, there has been fruitful work on finding marker genes, but only one of the previous projects explores
1.257 -combinations of marker genes, and none of these publications compare the results obtained by using different algorithms or
1.258 -scoring methods.
1.259 +There is a substantial body of work on the analysis of gene expression data, most of this concerns gene expres-
1.260 +sion data which are not fundamentally spatial2.
1.261 +As noted above, there has been much work on both supervised learning and there are many available
1.262 +algorithms for each. However, the algorithms require the scientist to provide a framework for representing the
1.263 +problem domain, and the way that this framework is set up has a large impact on performance. Creating a
1.264 +good framework can require creatively reconceptualizing the problem domain, and is not merely a mechanical
1.265 +&#8220;fine-tuning&#8221; of numerical parameters. For example, we believe that domain-specific scoring measures (such
1.266 +as gradient similarity, which is discussed in Preliminary Studies) may be necessary in order to achieve the best
1.267 +results in this application.
1.268 +We are aware of six existing efforts to find marker genes using spatial gene expression data using automated
1.269 +methods.
1.270 +[13 ] mentions the possibility of constructing a spatial region for each gene, and then, for each anatomical
1.271 +structure of interest, computing what proportion of this structure is covered by the gene&#8217;s spatial region.
1.272 +GeneAtlas[5] and EMAGE [26] allow the user to construct a search query by demarcating regions and then
1.273 +specifing either the strength of expression or the name of another gene or dataset whose expression pattern
1.274 +is to be matched. For the similiarity score (match score) between two images (in this case, the query and the
1.275 +gene expression images), GeneAtlas uses the sum of a weighted L1-norm distance between vectors whose
1.276 +components represent the number of cells within a pixel3 whose expression is within four discretization levels.
1.277 +EMAGE uses Jaccard similarity4. Neither GeneAtlas nor EMAGE allow one to search for combinations of genes
1.278 +that define a region in concert but not separately.
1.279 +[15 ] describes AGEA, &#8221;Anatomic Gene Expression Atlas&#8221;. AGEA has three components. Gene Finder: The
1.280 +user selects a seed voxel and the system (1) chooses a cluster which includes the seed voxel, (2) yields a list
1.281 +of genes which are overexpressed in that cluster. (note: the ABA website also contains pre-prepared lists of
1.282 +overexpressed genes for selected structures). Correlation: The user selects a seed voxel and the system then
1.283 +shows the user how much correlation there is between the gene expression profile of the seed voxel and every
1.284 +other voxel. Clusters: will be described later
1.285 +Gene Finder is different from our Aim 1 in at least three ways. First, Gene Finder finds only single genes,
1.286 +whereas we will also look for combinations of genes. Second, gene finder can only use overexpression as a
1.287 +marker, whereas we will also search for underexpression. Third, Gene Finder uses a simple pointwise score5,
1.288 +whereas we will also use geometric scores such as gradient similarity (described in Preliminary Studies). Figures
1.289 +4, 2, and 3 in the Preliminary Studies section contains evidence that each of our three choices is the right one.
1.290 +[6 ] looks at the mean expression level of genes within anatomical regions, and applies a Student&#8217;s t-test
1.291 +with Bonferroni correction to determine whether the mean expression level of a gene is significantly higher in
1.292 +the target region. Like AGEA, this is a pointwise measure (only the mean expression level per pixel is being
1.293 +analyzed), it is not being used to look for underexpression, and does not look for combinations of genes.
1.294 +[10 ] describes a technique to find combinations of marker genes to pick out an anatomical region. They use
1.295 +an evolutionary algorithm to evolve logical operators which combine boolean (thresholded) images in order to
1.296 +match a target image. Their match score is Jaccard similarity.
1.297 +In summary, there has been fruitful work on finding marker genes, but only one of the previous projects
1.298 +explores combinations of marker genes, and none of these publications compare the results obtained by using
1.299 +different algorithms or scoring methods.
1.300 Aim 2: From gene expression data, discover a map of regions
1.301 Machine learning terminology: clustering
1.302 -If one is given a dataset consisting merely of instances, with no class labels, then analysis of the dataset is referred to as
1.303 -unsupervised learning in the jargon of machine learning. One thing that you can do with such a dataset is to group instances
1.304 -together. A set of similar instances is called a cluster, and the activity of finding grouping the data into clusters is called
1.305 -clustering or cluster analysis.
1.306 -The task of deciding how to carve up a structure into anatomical regions can be put into these terms. The instances
1.307 -are once again voxels (or pixels) along with their associated gene expression profiles. We make the assumption that voxels
1.308 -from the same anatomical region have similar gene expression profiles, at least compared to the other regions. This means
1.309 -that clustering voxels is the same as finding potential regions; we seek a partitioning of the voxels into regions, that is, into
1.310 -clusters of voxels with similar gene expression.
1.311 -It is desirable to determine not just one set of regions, but also how these regions relate to each other. The outcome of
1.312 -clustering may be a hierarchial tree of clusters, rather than a single set of clusters which partition the voxels. This is called
1.313 -hierarchial clustering.
1.314 -Similarity scores A crucial choice when designing a clustering method is how to measure similarity, across either pairs
1.315 -of instances, or clusters, or both. There is much overlap between scoring methods for feature selection (discussed above
1.316 -under Aim 1) and scoring methods for similarity.
1.317 -Spatially contiguous clusters; image segmentation We have shown that aim 2 is a type of clustering task. In fact,
1.318 -it is a special type of clustering task because we have an additional constraint on clusters; voxels grouped together into a
1.319 -cluster must be spatially contiguous. In Preliminary Studies, we show that one can get reasonable results without enforcing
1.320 -this constraint; however, we plan to compare these results against other methods which guarantee contiguous clusters.
1.321 -Image segmentation is the task of partitioning the pixels in a digital image into clusters, usually contiguous clusters. Aim
1.322 -2 is similar to an image segmentation task. There are two main differences; in our task, there are thousands of color channels
1.323 -(one for each gene), rather than just three6. A more crucial difference is that there are various cues which are appropriate
1.324 -for detecting sharp object boundaries in a visual scene but which are not appropriate for segmenting abstract spatial data
1.325 -such as gene expression. Although many image segmentation algorithms can be expected to work well for segmenting other
1.326 -sorts of spatially arranged data, some of these algorithms are specialized for visual images.
1.327 -Dimensionality reduction In this section, we discuss reducing the length of the per-pixel gene expression feature
1.328 -vector. By &#8220;dimension&#8221;, we mean the dimension of this vector, not the spatial dimension of the underlying data.
1.329 -Unlike aim 1, there is no externally-imposed need to select only a handful of informative genes for inclusion in the
1.330 -instances. However, some clustering algorithms perform better on small numbers of features7. There are techniques which
1.331 -&#8220;summarize&#8221; a larger number of features using a smaller number of features; these techniques go by the name of feature
1.332 -extraction or dimensionality reduction. The small set of features that such a technique yields is called the reduced feature
1.333 -set. Note that the features in the reduced feature set do not necessarily correspond to genes; each feature in the reduced set
1.334 -may be any function of the set of gene expression levels.
1.335 +2By &#8220;fundamentally spatial&#8221; we mean that there is information from a large number of spatial locations indexed by spatial coordinates;
1.336 +not just data which have only a few different locations or which is indexed by anatomical label.
1.337 +3Actually, many of these projects use quadrilaterals instead of square pixels; but we will refer to them as pixels for simplicity.
1.338 +4the number of true pixels in the intersection of the two images, divided by the number of pixels in their union.
1.339 +5&#8220;Expression energy ratio&#8221;, which captures overexpression.
1.340 +If one is given a dataset consisting merely of instances, with no class labels, then analysis of the dataset is
1.341 +referred to as unsupervised learning in the jargon of machine learning.
One thing that you can do with such a 1.342 +dataset is to group instances together. A set of similar instances is called a cluster, and the activity of 1.343 +grouping the data into clusters is called clustering or cluster analysis. 1.344 +The task of deciding how to carve up a structure into anatomical regions can be put into these terms. The 1.345 +instances are once again voxels (or pixels) along with their associated gene expression profiles. We make 1.346 +the assumption that voxels from the same anatomical region have similar gene expression profiles, at least 1.347 +compared to the other regions. This means that clustering voxels is the same as finding potential regions; we 1.348 +seek a partitioning of the voxels into regions, that is, into clusters of voxels with similar gene expression. 1.349 +It is desirable to determine not just one set of regions, but also how these regions relate to each other. The 1.350 +outcome of clustering may be a hierarchical tree of clusters, rather than a single set of clusters which partition the 1.351 +voxels. This is called hierarchical clustering. 1.352 +Similarity scores A crucial choice when designing a clustering method is how to measure similarity, across 1.353 +either pairs of instances, or clusters, or both. There is much overlap between scoring methods for feature 1.354 +selection (discussed above under Aim 1) and scoring methods for similarity. 1.355 +Spatially contiguous clusters; image segmentation We have shown that aim 2 is a type of clustering 1.356 +task. In fact, it is a special type of clustering task because we have an additional constraint on clusters: voxels 1.357 +grouped together into a cluster must be spatially contiguous. In Preliminary Studies, we show that one can get 1.358 +reasonable results without enforcing this constraint; however, we plan to compare these results against other 1.359 +methods which guarantee contiguous clusters.
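When a clustering method does not enforce the contiguity constraint, one way to see whether its output happens to satisfy it is to check each cluster afterwards. The following is a minimal sketch, assuming a 2-D grid of cluster labels and 4-connectivity (pixels sharing an edge); the function name `is_contiguous` and the toy label grid are hypothetical, and a 3-D voxel grid would use 6-connectivity instead.

```python
from collections import deque


def is_contiguous(labels, cluster):
    """Return True if all pixels carrying `cluster` form a single
    4-connected component in the 2-D label grid `labels`."""
    cells = {(r, c)
             for r, row in enumerate(labels)
             for c, value in enumerate(row)
             if value == cluster}
    if not cells:
        return True  # an empty cluster is trivially contiguous
    start = next(iter(cells))
    seen = {start}
    queue = deque([start])
    # Breadth-first flood fill from one pixel of the cluster.
    while queue:
        r, c = queue.popleft()
        for nb in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if nb in cells and nb not in seen:
                seen.add(nb)
                queue.append(nb)
    # Contiguous exactly when the fill reached every pixel of the cluster.
    return seen == cells


# Hypothetical clustering output on a 3x3 image.
labels = [[0, 0, 1],
          [0, 1, 1],
          [2, 1, 0]]

for k in (0, 1, 2):
    print(k, is_contiguous(labels, k))
```

Here cluster 0 fails the check because its bottom-right pixel is detached from the rest, while clusters 1 and 2 pass.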
1.360 +Image segmentation is the task of partitioning the pixels in a digital image into clusters, usually contiguous 1.361 +clusters. Aim 2 is similar to an image segmentation task. There are two main differences; in our task, there are 1.362 +thousands of color channels (one for each gene), rather than just three6. A more crucial difference is that there 1.363 +are various cues which are appropriate for detecting sharp object boundaries in a visual scene but which are not 1.364 +appropriate for segmenting abstract spatial data such as gene expression. Although many image segmentation 1.365 +algorithms can be expected to work well for segmenting other sorts of spatially arranged data, some of these 1.366 +algorithms are specialized for visual images. 1.367 +Dimensionality reduction In this section, we discuss reducing the length of the per-pixel gene expression 1.368 +feature vector. By &#8220;dimension&#8221;, we mean the dimension of this vector, not the spatial dimension of the underlying 1.369 +data. 1.370 +Unlike aim 1, there is no externally-imposed need to select only a handful of informative genes for inclusion 1.371 +in the instances. However, some clustering algorithms perform better on small numbers of features7. There are 1.372 +techniques which &#8220;summarize&#8221; a larger number of features using a smaller number of features; these techniques 1.373 +go by the name of feature extraction or dimensionality reduction. The small set of features that such a technique 1.374 +yields is called the reduced feature set. Note that the features in the reduced feature set do not necessarily 1.375 +correspond to genes; each feature in the reduced set may be any function of the set of gene expression levels. 1.376 +Clustering genes rather than voxels Although the ultimate goal is to cluster the instances (voxels or pixels), 1.377 +one strategy to achieve this goal is to first cluster the features (genes). 
There are two ways that clusters of genes 1.378 +could be used. 1.379 +Gene clusters could be used as part of dimensionality reduction: rather than have one feature for each gene, 1.380 +we could have one reduced feature for each gene cluster. 1.381 +Gene clusters could also be used to directly yield a clustering on instances. This is because many genes 1.382 +have an expression pattern which seems to pick out a single, spatially continguous region. Therefore, it seems 1.383 +likely that an anatomically interesting region will have multiple genes which each individually pick it out8. This 1.384 _________________________________________ 1.385 - 5&#8220;Expression energy ratio&#8221;, which captures overexpression. 1.386 - 6There are imaging tasks which use more than three colors, for example multispectral imaging and hyperspectral imaging, which are often 1.387 -used to process satellite imagery. 1.388 - 7First, because the number of features in the reduced dataset is less than in the original dataset, the running time of clustering algorithms 1.389 -may be much less. Second, it is thought that some clustering algorithms may give better results on reduced data. 1.390 -Clustering genes rather than voxels Although the ultimate goal is to cluster the instances (voxels or pixels), one 1.391 -strategy to achieve this goal is to first cluster the features (genes). There are two ways that clusters of genes could be used. 1.392 -Gene clusters could be used as part of dimensionality reduction: rather than have one feature for each gene, we could 1.393 -have one reduced feature for each gene cluster. 1.394 -Gene clusters could also be used to directly yield a clustering on instances. This is because many genes have an expression 1.395 -pattern which seems to pick out a single, spatially continguous region. Therefore, it seems likely that an anatomically 1.396 -interesting region will have multiple genes which each individually pick it out8. 
This suggests the following procedure: 1.397 -cluster together genes which pick out similar regions, and then to use the more popular common regions as the final clusters. 1.398 -In Preliminary Studies, Figure 7, we show that a number of anatomically recognized cortical regions, as well as some 1.399 -&#8220;superregions&#8221; formed by lumping together a few regions, are associated with gene clusters in this fashion. 1.400 -The task of clustering both the instances and the features is called co-clustering, and there are a number of co-clustering 1.401 -algorithms. 1.402 + 6There are imaging tasks which use more than three colors, for example multispectral imaging and hyperspectral imaging, which are 1.403 +often used to process satellite imagery. 1.404 + 7First, because the number of features in the reduced dataset is less than in the original dataset, the running time of clustering 1.405 +algorithms may be much less. Second, it is thought that some clustering algorithms may give better results on reduced data. 1.406 + 8This would seem to contradict our finding in aim 1 that some cortical areas are combinatorially coded by multiple genes. However, 1.407 +it is possible that the currently accepted cortical maps divide the cortex into regions which are unnatural from the point of view of gene 1.408 +expression; perhaps there is some other way to map the cortex for which each region can be identified by single genes. Another 1.409 +suggests the following procedure: cluster together genes which pick out similar regions, and then to use the 1.410 +more popular common regions as the final clusters. In Preliminary Studies, Figure 7, we show that a number 1.411 +of anatomically recognized cortical regions, as well as some &#8220;superregions&#8221; formed by lumping together a few 1.412 +regions, are associated with gene clusters in this fashion. 
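The procedure just described (cluster together genes which pick out similar regions, then keep the more popular regions) can be sketched as follows. This is an illustration under stated assumptions, not the method used in Preliminary Studies: each gene is assumed to have already been reduced to the set of pixels it picks out, genes are grouped greedily by Jaccard overlap with a group's founding gene, and the gene names, the function `group_genes`, and the 0.5 threshold are all hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets of pixel coordinates."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def group_genes(regions, threshold=0.5):
    """Greedy single-pass grouping: a gene joins the first group whose
    founding gene's region overlaps its own region by at least `threshold`;
    otherwise it founds a new group. Groups are returned largest first."""
    groups = []  # list of (founder_region, [gene names])
    for gene, region in regions.items():
        for founder_region, members in groups:
            if jaccard(founder_region, region) >= threshold:
                members.append(gene)
                break
        else:
            groups.append((region, [gene]))
    # "More popular" regions are those picked out by more genes.
    return sorted(groups, key=lambda g: len(g[1]), reverse=True)


# Hypothetical genes, each mapped to the pixels where it is expressed.
regions = {
    "geneA": {(0, 0), (0, 1), (1, 0), (1, 1)},
    "geneB": {(0, 0), (0, 1), (1, 0)},
    "geneC": {(3, 3), (3, 4)},
    "geneD": {(0, 0), (0, 1), (1, 0), (1, 1)},
}

ranked = group_genes(regions)
for region, members in ranked:
    print(sorted(region), members)
```

A real implementation would more likely use a standard hierarchical or co-clustering algorithm; the greedy pass is used here only to keep the example short.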
1.413 +The task of clustering both the instances and the features is called co-clustering, and there are a number of 1.414 +co-clustering algorithms. 1.415 Related work 1.416 -Some researchers have attempted to parcellate cortex on the basis of non-gene expression data. For example, [17], [2], [18], 1.417 -and [1 ] associate spots on the cortex with the radial profile9 of response to some stain ([11] uses MRI), extract features from 1.418 -this profile, and then use similarity between surface pixels to cluster. Features used include statistical moments, wavelets, 1.419 -and the excess mass functional. Some of these features are motivated by the presence of tangential lines of stain intensity 1.420 -which correspond to laminar structure. Some methods use standard clustering procedures, whereas others make use of the 1.421 -spatial nature of the data to look for sudden transitions, which are identified as areal borders. 1.422 -[22 ] describes an analysis of the anatomy of the hippocampus using the ABA dataset. In addition to manual analysis, 1.423 -two clustering methods were employed, a modified Non-negative Matrix Factorization (NNMF), and a hierarchial recursive 1.424 -bifurcation clustering scheme based on correlation as the similarity score. The paper yielded impressive results, proving 1.425 -the usefulness of computational genomic anatomy. We have run NNMF on the cortical dataset10 and while the results are 1.426 -promising, they also demonstrate that NNMF is not necessarily the best dimensionality reduction method for this application 1.427 -(see Preliminary Studies, Figure 6). 1.428 -AGEA[14] includes a preset hierarchial clustering of voxels based on a recursive bifurcation algorithm with correlation 1.429 -as the similarity metric. EMAGE[25] allows the user to select a dataset from among a large number of alternatives, or by 1.430 -running a search query, and then to cluster the genes within that dataset. 
EMAGE clusters via hierarchial complete linkage 1.431 -clustering with un-centred correlation as the similarity score. 1.432 -[6 ] clustered genes, starting out by selecting 135 genes out of 20,000 which had high variance over voxels and which were 1.433 -highly correlated with many other genes. They computed the matrix of (rank) correlations between pairs of these genes, and 1.434 -ordered the rows of this matrix as follows: &#8220;the first row of the matrix was chosen to show the strongest contrast between 1.435 -the highest and lowest correlation coefficient for that row. The remaining rows were then arranged in order of decreasing 1.436 -similarity using a least squares metric&#8221;. The resulting matrix showed four clusters. For each cluster, prototypical spatial 1.437 -expression patterns were created by averaging the genes in the cluster. The prototypes were analyzed manually, without 1.438 -clustering voxels. 1.439 -[10 ] applies their technique for finding combinations of marker genes for the purpose of clustering genes around a &#8220;seed 1.440 -gene&#8221;. They do this by using the pattern of expression of the seed gene as the target image, and then searching for other 1.441 -genes which can be combined to reproduce this pattern. Other genes which are found are considered to be related to the 1.442 -seed. The same team also describes a method[24] for finding &#8220;association rules&#8221; such as, &#8220;if this voxel is expressed in by 1.443 -any gene, then that voxel is probably also expressed in by the same gene&#8221;. This could be useful as part of a procedure for 1.444 -clustering voxels. 1.445 -In summary, although these projects obtained clusterings, there has not been much comparison between different algo- 1.446 -rithms or scoring methods, so it is likely that the best clustering method for this application has not yet been found. 
The 1.447 -projects using gene expression on cortex did not attempt to make use of the radial profile of gene expression. Also, none of 1.448 -these projects did a separate dimensionality reduction step before clustering pixels, none tried to cluster genes first in order 1.449 -to guide automated clustering of pixels into spatial regions, and none used co-clustering algorithms. 1.450 -_________________________________________ 1.451 - 8This would seem to contradict our finding in aim 1 that some cortical areas are combinatorially coded by multiple genes. However, it is 1.452 -possible that the currently accepted cortical maps divide the cortex into regions which are unnatural from the point of view of gene expression; 1.453 -perhaps there is some other way to map the cortex for which each region can be identified by single genes. Another possibility is that, although 1.454 -the cluster prototype fits an anatomical region, the individual genes are each somewhat different from the prototype. 1.455 - 9A radial profile is a profile along a line perpendicular to the cortical surface. 1.456 - 10We ran &#8220;vanilla&#8221; NNMF, whereas the paper under discussion used a modified method. Their main modification consisted of adding a soft 1.457 -spatial contiguity constraint. However, on our dataset, NNMF naturally produced spatially contiguous clusters, so no additional constraint was 1.458 -needed. The paper under discussion also mentions that they tried a hierarchial variant of NNMF, which we have not yet tried. 1.459 +Some researchers have attempted to parcellate cortex on the basis of non-gene expression data. For example, 1.460 +[18 ], [2 ], [19], and [1] associate spots on the cortex with the radial profile9 of response to some stain ([12] uses 1.461 +MRI), extract features from this profile, and then use similarity between surface pixels to cluster. Features used 1.462 +include statistical moments, wavelets, and the excess mass functional. 
Some of these features are motivated 1.463 +by the presence of tangential lines of stain intensity which correspond to laminar structure. Some methods use 1.464 +standard clustering procedures, whereas others make use of the spatial nature of the data to look for sudden 1.465 +transitions, which are identified as areal borders. 1.466 +[23] describes an analysis of the anatomy of the hippocampus using the ABA dataset. In addition to manual 1.467 +analysis, two clustering methods were employed: a modified Non-negative Matrix Factorization (NNMF) and 1.468 +a hierarchical recursive bifurcation clustering scheme based on correlation as the similarity score. The paper 1.469 +yielded impressive results, demonstrating the usefulness of computational genomic anatomy. We have run NNMF on 1.470 +the cortical dataset10, and while the results are promising, they also demonstrate that NNMF is not necessarily 1.471 +the best dimensionality reduction method for this application (see Preliminary Studies, Figure 6). 1.472 +AGEA[15] includes a preset hierarchical clustering of voxels based on a recursive bifurcation algorithm with 1.473 +correlation as the similarity metric. EMAGE[26] allows the user to select a dataset from among a large number 1.474 +of alternatives, or by running a search query, and then to cluster the genes within that dataset. EMAGE clusters 1.475 +via hierarchical complete linkage clustering with un-centred correlation as the similarity score. 1.476 +[6] clustered genes, starting out by selecting 135 genes out of 20,000 which had high variance over voxels and 1.477 +which were highly correlated with many other genes. They computed the matrix of (rank) correlations between 1.478 +pairs of these genes, and ordered the rows of this matrix as follows: &#8220;the first row of the matrix was chosen to 1.479 +show the strongest contrast between the highest and lowest correlation coefficient for that row.
The remaining 1.480 +rows were then arranged in order of decreasing similarity using a least squares metric&#8221;. The resulting matrix 1.481 +showed four clusters. For each cluster, prototypical spatial expression patterns were created by averaging the 1.482 +genes in the cluster. The prototypes were analyzed manually, without clustering voxels. 1.483 +[10 ] applies their technique for finding combinations of marker genes for the purpose of clustering genes 1.484 +around a &#8220;seed gene&#8221;. They do this by using the pattern of expression of the seed gene as the target image, and 1.485 +then searching for other genes which can be combined to reproduce this pattern. Other genes which are found 1.486 +are considered to be related to the seed. The same team also describes a method[25] for finding &#8220;association 1.487 +rules&#8221; such as, &#8220;if this voxel is expressed in by any gene, then that voxel is probably also expressed in by the 1.488 +same gene&#8221;. This could be useful as part of a procedure for clustering voxels. 1.489 +In summary, although these projects obtained clusterings, there has not been much comparison between 1.490 +different algorithms or scoring methods, so it is likely that the best clustering method for this application has not 1.491 +yet been found. The projects using gene expression on cortex did not attempt to make use of the radial profile 1.492 +of gene expression. Also, none of these projects did a separate dimensionality reduction step before clustering 1.493 +pixels, none tried to cluster genes first in order to guide automated clustering of pixels into spatial regions, and 1.494 +none used co-clustering algorithms. 1.495 +________ 1.496 +possibility is that, although the cluster prototype fits an anatomical region, the individual genes are each somewhat different from the 1.497 +prototype. 1.498 + 9A radial profile is a profile along a line perpendicular to the cortical surface. 
1.499 + 10We ran &#8220;vanilla&#8221; NNMF, whereas the paper under discussion used a modified method. Their main modification consisted of adding 1.500 +a soft spatial contiguity constraint. However, on our dataset, NNMF naturally produced spatially contiguous clusters, so no additional 1.501 +constraint was needed. The paper under discussion also mentions that they tried a hierarchial variant of NNMF, which we have not yet 1.502 +tried. 1.503 Aim 3: apply the methods developed to the cerebral cortex 1.504 Background 1.505 -The cortex is divided into areas and layers. Because of the cortical columnar organization, the parcellation of the cortex 1.506 -into areas can be drawn as a 2-D map on the surface of the cortex. In the third dimension, the boundaries between the 1.507 -areas continue downwards into the cortical depth, perpendicular to the surface. The layer boundaries run parallel to the 1.508 -surface. One can picture an area of the cortex as a slice of a six-layered cake11. 1.509 -It is known that different cortical areas have distinct roles in both normal functioning and in disease processes, yet there 1.510 -are no known marker genes for most cortical areas. When it is necessary to divide a tissue sample into cortical areas, this is 1.511 -a manual process that requires a skilled human to combine multiple visual cues and interpret them in the context of their 1.512 -approximate location upon the cortical surface. 1.513 -Even the questions of how many areas should be recognized in cortex, and what their arrangement is, are still not 1.514 -completely settled. A proposed division of the cortex into areas is called a cortical map. In the rodent, the lack of a single 1.515 -agreed-upon map can be seen by contrasting the recent maps given by Swanson[21] on the one hand, and Paxinos and 1.516 -Franklin[16] on the other. While the maps are certainly very similar in their general arrangement, significant differences 1.517 -remain. 
1.518 +The cortex is divided into areas and layers. Because of the cortical columnar organization, the parcellation 1.519 +of the cortex into areas can be drawn as a 2-D map on the surface of the cortex. In the third dimension, the 1.520 +boundaries between the areas continue downwards into the cortical depth, perpendicular to the surface. The 1.521 +layer boundaries run parallel to the surface. One can picture an area of the cortex as a slice of a six-layered 1.522 +cake11 . 1.523 +It is known that different cortical areas have distinct roles in both normal functioning and in disease processes, 1.524 +yet there are no known marker genes for most cortical areas. When it is necessary to divide a tissue sample 1.525 +into cortical areas, this is a manual process that requires a skilled human to combine multiple visual cues and 1.526 +interpret them in the context of their approximate location upon the cortical surface. 1.527 +Even the questions of how many areas should be recognized in cortex, and what their arrangement is, are 1.528 +still not completely settled. A proposed division of the cortex into areas is called a cortical map. In the rodent, 1.529 +the lack of a single agreed-upon map can be seen by contrasting the recent maps given by Swanson[22] on the 1.530 +one hand, and Paxinos and Franklin[17] on the other. While the maps are certainly very similar in their general 1.531 +arrangement, significant differences remain. 1.532 The Allen Mouse Brain Atlas dataset 1.533 -The Allen Mouse Brain Atlas (ABA) data were produced by doing in-situ hybridization on slices of male, 56-day-old 1.534 -C57BL/6J mouse brains. Pictures were taken of the processed slice, and these pictures were semi-automatically analyzed 1.535 -to create a digital measurement of gene expression levels at each location in each slice. Per slice, cellular spatial resolution 1.536 -is achieved. 
Using this method, a single physical slice can only be used to measure one single gene; many different mouse 1.537 -brains were needed in order to measure the expression of many genes. 1.538 -An automated nonlinear alignment procedure located the 2D data from the various slices in a single 3D coordinate 1.539 -system. In the final 3D coordinate system, voxels are cubes with 200 microns on a side. There are 67x41x58 = 159,326 1.540 -voxels in the 3D coordinate system, of which 51,533 are in the brain[14]. 1.541 -Mus musculus is thought to contain about 22,000 protein-coding genes[27]. The ABA contains data on about 20,000 1.542 -genes in sagittal sections, out of which over 4,000 genes are also measured in coronal sections. Our dataset is derived from 1.543 -only the coronal subset of the ABA12. 1.544 -The ABA is not the only large public spatial gene expression dataset13. With the exception of the ABA, GenePaint, and 1.545 -EMAGE, most of the other resources have not (yet) extracted the expression intensity from the ISH images and registered 1.546 -the results into a single 3-D space, and to our knowledge only ABA and EMAGE make this form of data available for public 1.547 -download from the website14. Many of these resources focus on developmental gene expression. 1.548 +The Allen Mouse Brain Atlas (ABA) data were produced by doing in-situ hybridization on slices of male, 1.549 +56-day-old C57BL/6J mouse brains. Pictures were taken of the processed slice, and these pictures were semi- 1.550 +automatically analyzed to create a digital measurement of gene expression levels at each location in each slice. 1.551 +Per slice, cellular spatial resolution is achieved. Using this method, a single physical slice can only be used 1.552 +to measure one single gene; many different mouse brains were needed in order to measure the expression of 1.553 +many genes. 
1.554 +An automated nonlinear alignment procedure located the 2D data from the various slices in a single 3D 1.555 +coordinate system. In the final 3D coordinate system, voxels are cubes with 200 microns on a side. There are 1.556 +67x41x58 = 159,326 voxels in the 3D coordinate system, of which 51,533 are in the brain[15]. 1.557 +Mus musculus is thought to contain about 22,000 protein-coding genes[28]. The ABA contains data on about 1.558 +20,000 genes in sagittal sections, out of which over 4,000 genes are also measured in coronal sections. Our 1.559 +dataset is derived from only the coronal subset of the ABA12. 1.560 +The ABA is not the only large public spatial gene expression dataset13. With the exception of the ABA, 1.561 +GenePaint, and EMAGE, most of the other resources have not (yet) extracted the expression intensity from the 1.562 +ISH images and registered the results into a single 3-D space, and to our knowledge only ABA and EMAGE 1.563 +make this form of data available for public download from the website14. Many of these resources focus on 1.564 +developmental gene expression. 1.565 Related work 1.566 -[14 ] describes the application of AGEA to the cortex. The paper describes interesting results on the structure of correlations 1.567 -between voxel gene expression profiles within a handful of cortical areas. However, this sort of analysis is not related to either 1.568 -of our aims, as it neither finds marker genes, nor does it suggest a cortical map based on gene expression data. Neither of 1.569 -the other components of AGEA can be applied to cortical areas; AGEA&#8217;s Gene Finder cannot be used to find marker genes 1.570 -for the cortical areas; and AGEA&#8217;s hierarchial clustering does not produce clusters corresponding to the cortical areas15. 
1.571 -In summary, for all three aims, (a) only one of the previous projects explores combinations of marker genes, (b) there has 1.572 -been almost no comparison of different algorithms or scoring methods, and (c) there has been no work on computationally 1.573 -finding marker genes for cortical areas, or on finding a hierarchial clustering that will yield a map of cortical areas de novo 1.574 -from gene expression data. 1.575 -___________________ 1.576 - 11Outside of isocortex, the number of layers varies. 1.577 - 12The sagittal data do not cover the entire cortex, and also have greater registration error[14]. Genes were selected by the Allen Institute for 1.578 -coronal sectioning based on, &#8220;classes of known neuroscientific interest... or through post hoc identification of a marked non-ubiquitous expression 1.579 -pattern&#8221;[14]. 1.580 - 13Other such resources include GENSAT[8], GenePaint[26], its sister project GeneAtlas[5], BGEM[13], EMAGE[25], EurExpress (http: 1.581 -//www.eurexpress.org/ee/; EurExpress data are also entered into EMAGE), EADHB (http://www.ncl.ac.uk/ihg/EADHB/database/$EADHB_ 1.582 -{database}$.html), MAMEP (http://mamep.molgen.mpg.de/index.php), Xenbase (http://xenbase.org/), ZFIN[20], Aniseed (http:// 1.583 -aniseed-ibdm.univ-mrs.fr/), VisiGene (http://genome.ucsc.edu/cgi-bin/hgVisiGene ; includes data from some of the other listed data 1.584 -sources), GEISHA[4], Fruitfly.org[23], COMPARE (http://compare.ibdml.univ-mrs.fr/), GXD[19], GEO[3] (GXD and GEO contain spatial 1.585 -data but also non-spatial data. All GXD spatial data are also in EMAGE.) 1.586 - 14without prior offline registration 1.587 - 15In both cases, the cause is that pairwise correlations between the gene expression of voxels in different areas but the same layer are often stronger 1.588 -than pairwise correlations between the gene expression of voxels in different layers but the same area. 
Therefore, a pairwise voxel correlation 1.589 -clustering algorithm will tend to create clusters representing cortical layers, not areas (there may be clusters which presumably correspond to the 1.590 -intersection of a layer and an area, but since one area will have many layer-area intersection clusters, further work is needed to make sense of 1.591 -these). The reason that Gene Finder cannot the find marker genes for cortical areas is that, although the user chooses a seed voxel, Gene Finder 1.592 -chooses the ROI for which genes will be found, and it creates that ROI by (pairwise voxel correlation) clustering around the seed. 1.593 -Our project is guided by a concrete application with a well-specified criterion of success (how well we can find marker 1.594 -genes for / reproduce the layout of cortical areas), which will provide a solid basis for comparing different methods. 1.595 +[15 ] describes the application of AGEA to the cortex. The paper describes interesting results on the structure 1.596 +of correlations between voxel gene expression profiles within a handful of cortical areas. However, this sort 1.597 +of analysis is not related to either of our aims, as it neither finds marker genes, nor does it suggest a cortical 1.598 +map based on gene expression data. Neither of the other components of AGEA can be applied to cortical 1.599 +_________________________________________ 1.600 + 11Outside of isocortex, the number of layers varies. 1.601 + 12The sagittal data do not cover the entire cortex, and also have greater registration error[15]. Genes were selected by the Allen 1.602 +Institute for coronal sectioning based on, &#8220;classes of known neuroscientific interest... or through post hoc identification of a marked 1.603 +non-ubiquitous expression pattern&#8221;[15]. 
1.604 + 13Other such resources include GENSAT[8], GenePaint[27], its sister project GeneAtlas[5], BGEM[14], EMAGE[26], EurExpress 1.605 +(http://www.eurexpress.org/ee/; EurExpress data are also entered into EMAGE), EADHB (http://www.ncl.ac.uk/ihg/EADHB/ 1.606 +database/EADHB_database.html), MAMEP (http://mamep.molgen.mpg.de/index.php), Xenbase (http://xenbase.org/), ZFIN[21], 1.607 +Aniseed (http://aniseed-ibdm.univ-mrs.fr/), VisiGene (http://genome.ucsc.edu/cgi-bin/hgVisiGene ; includes data from some 1.608 +of the other listed data sources), GEISHA[4], Fruitfly.org[24], COMPARE (http://compare.ibdml.univ-mrs.fr/), GXD[20], GEO[3] 1.609 +(GXD and GEO contain spatial data but also non-spatial data. All GXD spatial data are also in EMAGE.) 1.610 + 14without prior offline registration 1.611 +areas; AGEA&#8217;s Gene Finder cannot be used to find marker genes for the cortical areas; and AGEA&#8217;s hierarchical 1.612 +clustering does not produce clusters corresponding to the cortical areas15. 1.613 +In summary, for all three aims, (a) only one of the previous projects explores combinations of marker genes, 1.614 +(b) there has been almost no comparison of different algorithms or scoring methods, and (c) there has been no 1.615 +work on computationally finding marker genes for cortical areas, or on finding a hierarchical clustering that will 1.616 +yield a map of cortical areas de novo from gene expression data. 1.617 +Our project is guided by a concrete application with a well-specified criterion of success (how well we can 1.618 +find marker genes for / reproduce the layout of cortical areas), which will provide a solid basis for comparing 1.619 +different methods. 1.620 Significance 1.621 1.622 1.623 -Figure 1: Top row: Genes Nfic and 1.624 -A930001M12Rik are the most correlated 1.625 -with area SS (somatosensory cortex). Bot- 1.626 -tom row: Genes C130038G02Rik and 1.627 -Cacna1i are those with the best fit using 1.628 -logistic regression. 
Within each picture, the 1.629 -vertical axis roughly corresponds to anterior 1.630 -at the top and posterior at the bottom, and 1.631 -the horizontal axis roughly corresponds to 1.632 -medial at the left and lateral at the right. 1.633 -The red outline is the boundary of region 1.634 -SS. Pixels are colored according to correla- 1.635 -tion, with red meaning high correlation and 1.636 -blue meaning low. The method developed in aim (1) will be applied to each cortical area to find 1.637 - a set of marker genes such that the combinatorial expression pattern of those 1.638 - genes uniquely picks out the target area. Finding marker genes will be useful 1.639 - for drug discovery as well as for experimentation because marker genes can be 1.640 - used to design interventions which selectively target individual cortical areas. 1.641 - The application of the marker gene finding algorithm to the cortex will 1.642 - also support the development of new neuroanatomical methods. In addition 1.643 - to finding markers for each individual cortical areas, we will find a small panel 1.644 - of genes that can find many of the areal boundaries at once. This panel of 1.645 - marker genes will allow the development of an ISH protocol that will allow 1.646 - experimenters to more easily identify which anatomical areas are present in 1.647 - small samples of cortex. 1.648 - The method developed in aim (2) will provide a genoarchitectonic viewpoint 1.649 - that will contribute to the creation of a better map. The development of 1.650 - present-day cortical maps was driven by the application of histological stains. 1.651 - If a different set of stains had been available which identified a different set of 1.652 - features, then today&#8217;s cortical maps may have come out differently. It is likely 1.653 - that there are many repeated, salient spatial patterns in the gene expression 1.654 - which have not yet been captured by any stain. 
Therefore, cortical anatomy 1.655 - needs to incorporate what we can learn from looking at the patterns of gene 1.656 - expression. 1.657 - While we do not here propose to analyze human gene expression data, it is 1.658 - conceivable that the methods we propose to develop could be used to suggest 1.659 - modifications to the human cortical map as well. In fact, the methods we will 1.660 - develop will be applicable to other datasets beyond the brain. We will provide 1.661 - an open-source toolbox to allow other researchers to easily use our methods. 1.662 - With these methods, researchers with gene expression for any area of the body 1.663 - will be able to efficiently find marker genes for anatomical regions, or to use 1.664 - gene expression to discover new anatomical patterning. As described above, 1.665 -marker genes have a variety of uses in the development of drugs and experimental manipulations, and in the anatomical 1.666 -characterization of tissue samples. The discovery of new ways to carve up anatomical structures into regions may lead to 1.667 -the discovery of new anatomical subregions in various structures, which will widely impact all areas of biology. 1.668 +Figure 1: Top row: Genes Nfic 1.669 +and A930001M12Rik are the most 1.670 +correlated with area SS (somatosen- 1.671 +sory cortex). Bottom row: Genes 1.672 +C130038G02Rik and Cacna1i are 1.673 +those with the best fit using logistic 1.674 +regression. Within each picture, the 1.675 +vertical axis roughly corresponds to 1.676 +anterior at the top and posterior at the 1.677 +bottom, and the horizontal axis roughly 1.678 +corresponds to medial at the left and 1.679 +lateral at the right. The red outline is 1.680 +the boundary of region SS. Pixels are 1.681 +colored according to correlation, with 1.682 +red meaning high correlation and blue 1.683 +meaning low. 
The method developed in aim (1) will be applied to each cortical area to 1.684 + find a set of marker genes such that the combinatorial expression pat- 1.685 + tern of those genes uniquely picks out the target area. Finding marker 1.686 + genes will be useful for drug discovery as well as for experimentation 1.687 + because marker genes can be used to design interventions which se- 1.688 + lectively target individual cortical areas. 1.689 + The application of the marker gene finding algorithm to the cortex 1.690 + will also support the development of new neuroanatomical methods. In 1.691 + addition to finding markers for each individual cortical areas, we will 1.692 + find a small panel of genes that can find many of the areal boundaries 1.693 + at once. This panel of marker genes will allow the development of an 1.694 + ISH protocol that will allow experimenters to more easily identify which 1.695 + anatomical areas are present in small samples of cortex. 1.696 + The method developed in aim (2) will provide a genoarchitectonic 1.697 + viewpoint that will contribute to the creation of a better map. The de- 1.698 + velopment of present-day cortical maps was driven by the application 1.699 + of histological stains. If a different set of stains had been available 1.700 + which identified a different set of features, then today&#8217;s cortical maps 1.701 + may have come out differently. It is likely that there are many repeated, 1.702 + salient spatial patterns in the gene expression which have not yet been 1.703 + captured by any stain. Therefore, cortical anatomy needs to incorpo- 1.704 + rate what we can learn from looking at the patterns of gene expression. 1.705 + While we do not here propose to analyze human gene expression 1.706 + data, it is conceivable that the methods we propose to develop could 1.707 + be used to suggest modifications to the human cortical map as well. 
In 1.708 + fact, the methods we will develop will be applicable to other datasets 1.709 + beyond the brain. We will provide an open-source toolbox to allow 1.710 + other researchers to easily use our methods. With these methods, re- 1.711 + searchers with gene expression for any area of the body will be able to 1.712 +efficiently find marker genes for anatomical regions, or to use gene expression to discover new anatomical pat- 1.713 +terning. As described above, marker genes have a variety of uses in the development of drugs and experimental 1.714 +manipulations, and in the anatomical characterization of tissue samples. The discovery of new ways to carve up 1.715 +anatomical structures into regions may lead to the discovery of new anatomical subregions in various structures, 1.716 +_________________________________________ 1.717 + 15In both cases, the cause is that pairwise correlations between the gene expression of voxels in different areas but the same layer 1.718 +are often stronger than pairwise correlations between the gene expression of voxels in different layers but the same area. Therefore, a 1.719 +pairwise voxel correlation clustering algorithm will tend to create clusters representing cortical layers, not areas (there may be clusters 1.720 +which presumably correspond to the intersection of a layer and an area, but since one area will have many layer-area intersection 1.721 +clusters, further work is needed to make sense of these). The reason that Gene Finder cannot find marker genes for cortical areas 1.722 +is that, although the user chooses a seed voxel, Gene Finder chooses the ROI for which genes will be found, and it creates that ROI by 1.723 +(pairwise voxel correlation) clustering around the seed. 1.724 +which will widely impact all areas of biology. 1.725 1.726 -Figure 2: Gene Pitx2 1.727 -is selectively underex- 1.728 -pressed in area SS. 
Although our particular application involves the 3D spatial distribution of gene expression, we 1.729 - anticipate that the methods developed in aims (1) and (2) will not be limited to gene expression 1.730 - data, but rather will generalize to any sort of high-dimensional data over points located in a 1.731 - low-dimensional space. 1.732 - The approach: Preliminary Studies 1.733 +Figure 2: Gene Pitx2 1.734 +is selectively underex- 1.735 +pressed in area SS. Although our particular application involves the 3D spatial distribution of gene ex- 1.736 + pression, we anticipate that the methods developed in aims (1) and (2) will not be limited 1.737 + to gene expression data, but rather will generalize to any sort of high-dimensional data 1.738 + over points located in a low-dimensional space. 1.739 + The approach: Preliminary Studies 1.740 Format conversion between SEV, MATLAB, NIFTI 1.741 - We have created software to (politely) download all of the SEV files16 from the Allen Institute 1.742 - website. We have also created software to convert between the SEV, MATLAB, and NIFTI file 1.743 - formats, as well as some of Caret&#8217;s file formats. 1.744 + We have created software to (politely) download all of the SEV files16 from the Allen 1.745 + Institute website. We have also created software to convert between the SEV, MATLAB, 1.746 + and NIFTI file formats, as well as some of Caret&#8217;s file formats. 1.747 Flatmap of cortex 1.748 - We downloaded the ABA data and applied a mask to select only those voxels which belong to 1.749 - cerebral cortex. We divided the cortex into hemispheres. 1.750 -Using Caret[7], we created a mesh representation of the surface of the selected voxels. For each gene, and for each node 1.751 -of the mesh, we calculated an average of the gene expression of the voxels &#8220;underneath&#8221; that mesh node. We then flattened 1.752 -the cortex, creating a two-dimensional mesh. 1.753 -____ 1.754 - 16SEV is a sparse format for spatial data. 
It is the format in which the ABA data is made available. 1.755 - 1.756 + We downloaded the ABA data and applied a mask to select only those voxels which 1.757 +belong to cerebral cortex. We divided the cortex into hemispheres. 1.758 +Using Caret[7], we created a mesh representation of the surface of the selected voxels. For each gene, and 1.759 +for each node of the mesh, we calculated an average of the gene expression of the voxels &#8220;underneath&#8221; that 1.760 +mesh node. We then flattened the cortex, creating a two-dimensional mesh. 1.761 1.762 1.763 -Figure 3: The top row shows the two genes 1.764 -which (individually) best predict area AUD, 1.765 -according to logistic regression. The bot- 1.766 -tom row shows the two genes which (indi- 1.767 -vidually) best match area AUD, according 1.768 -to gradient similarity. From left to right and 1.769 -top to bottom, the genes are Ssr1, Efcbp1, 1.770 -Ptk7, and Aph1a. We sampled the nodes of the irregular, flat mesh in order to create a regular 1.771 - grid of pixel values. We converted this grid into a MATLAB matrix. 1.772 - We manually traced the boundaries of each of 49 cortical areas from the 1.773 - ABA coronal reference atlas slides. We then converted these manual traces 1.774 - into Caret-format regional boundary data on the mesh surface. We projected 1.775 - the regions onto the 2-d mesh, and then onto the grid, and then we converted 1.776 - the region data into MATLAB format. 
1.777 - At this point, the data are in the form of a number of 2-D matrices, all in 1.778 - registration, with the matrix entries representing a grid of points (pixels) over 1.779 - the cortical surface: 1.780 - &#x2219; A 2-D matrix whose entries represent the regional label associated with 1.781 - each surface pixel 1.782 - &#x2219; For each gene, a 2-D matrix whose entries represent the average expres- 1.783 - sion level underneath each surface pixel 1.784 - We created a normalized version of the gene expression data by subtracting 1.785 - each gene&#8217;s mean expression level (over all surface pixels) and dividing the 1.786 - expression level of each gene by its standard deviation. 1.787 - The features and the target area are both functions on the surface pix- 1.788 - els. They can be referred to as scalar fields over the space of surface pixels; 1.789 - alternately, they can be thought of as images which can be displayed on the 1.790 - flatmapped surface. 1.791 - To move beyond a single average expression level for each surface pixel, we 1.792 -plan to create a separate matrix for each cortical layer to represent the average expression level within that layer. Cortical 1.793 -layers are found at different depths in different parts of the cortex. In preparation for extracting the layer-specific datasets, 1.794 -we have extended Caret with routines that allow the depth of the ROI for volume-to-surface projection to vary. 1.795 -In the Research Plan, we describe how we will automatically locate the layer depths. For validation, we have manually 1.796 -demarcated the depth of the outer boundary of cortical layer 5 throughout the cortex. 1.797 +Figure 3: The top row shows the two 1.798 +genes which (individually) best predict 1.799 +area AUD, according to logistic regres- 1.800 +sion. The bottom row shows the two 1.801 +genes which (individually) best match 1.802 +area AUD, according to gradient sim- 1.803 +ilarity. 
From left to right and top to 1.804 +bottom, the genes are Ssr1, Efcbp1, 1.805 +Ptk7, and Aph1a. We sampled the nodes of the irregular, flat mesh in order to create 1.806 + a regular grid of pixel values. We converted this grid into a MATLAB 1.807 + matrix. 1.808 + We manually traced the boundaries of each of 49 cortical areas 1.809 + from the ABA coronal reference atlas slides. We then converted these 1.810 + manual traces into Caret-format regional boundary data on the mesh 1.811 + surface. We projected the regions onto the 2-d mesh, and then onto 1.812 + the grid, and then we converted the region data into MATLAB format. 1.813 + At this point, the data are in the form of a number of 2-D matrices, 1.814 + all in registration, with the matrix entries representing a grid of points 1.815 + (pixels) over the cortical surface: 1.816 + &#x2219; A 2-D matrix whose entries represent the regional label associ- 1.817 + ated with each surface pixel 1.818 + &#x2219; For each gene, a 2-D matrix whose entries represent the average 1.819 + expression level underneath each surface pixel 1.820 + We created a normalized version of the gene expression data by 1.821 + subtracting each gene&#8217;s mean expression level (over all surface pixels) 1.822 + and dividing the expression level of each gene by its standard deviation. 1.823 + The features and the target area are both functions on the surface 1.824 + pixels. They can be referred to as scalar fields over the space of sur- 1.825 + face pixels; alternately, they can be thought of as images which can be 1.826 + displayed on the flatmapped surface. 1.827 +To move beyond a single average expression level for each surface pixel, we plan to create a separate matrix 1.828 +for each cortical layer to represent the average expression level within that layer. Cortical layers are found at 1.829 +different depths in different parts of the cortex. 
In preparation for extracting the layer-specific datasets, we have 1.830 +extended Caret with routines that allow the depth of the ROI for volume-to-surface projection to vary. 1.831 +In the Research Plan, we describe how we will automatically locate the layer depths. For validation, we have 1.832 +manually demarcated the depth of the outer boundary of cortical layer 5 throughout the cortex. 1.833 +_________________________________________ 1.834 + 16SEV is a sparse format for spatial data. It is the format in which the ABA data is made available. 1.835 Feature selection and scoring methods 1.836 -Underexpression of a gene can serve as a marker Underexpression of a gene can sometimes serve as a marker. See, 1.837 -for example, Figure 2. 1.838 +Underexpression of a gene can serve as a marker Underexpression of a gene can sometimes serve as a 1.839 +marker. See, for example, Figure 2. 1.840 1.841 1.842 -Figure 4: Upper left: wwc1. Upper right: 1.843 -mtif2. Lower left: wwc1 + mtif2 (each 1.844 -pixel&#8217;s value on the lower left is the sum of 1.845 -the corresponding pixels in the upper row). Correlation Recall that the instances are surface pixels, and consider the 1.846 - problem of attempting to classify each instance as either a member of a partic- 1.847 - ular anatomical area, or not. The target area can be represented as a boolean 1.848 - mask over the surface pixels. 1.849 - One class of feature selection scoring methods contains methods which cal- 1.850 - culate some sort of &#8220;match&#8221; between each gene image and the target image. 1.851 - Those genes which match the best are good candidates for features. 1.852 - One of the simplest methods in this class is to use correlation as the match 1.853 - score. We calculated the correlation between each gene and each cortical area. 1.854 - The top row of Figure 1 shows the three genes most correlated with area SS. 
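The correlation score described above can be sketched in a few lines of Python. This is only an illustration, not the authors' code: the function names, the `(n_genes, H, W)` array layout, and the choice to give constant images a score of 0 are all assumptions made for the example.

```python
import numpy as np


def correlation_scores(gene_images, target_mask):
    """Pearson correlation between each gene image and a boolean area mask.

    gene_images: float array of shape (n_genes, H, W), one image per gene
                 (hypothetical layout; the proposal stores one 2-D matrix per gene).
    target_mask: boolean array of shape (H, W), True inside the target area.
    """
    n_genes = gene_images.shape[0]
    x = gene_images.reshape(n_genes, -1).astype(float)
    y = target_mask.reshape(-1).astype(float)
    x_c = x - x.mean(axis=1, keepdims=True)   # center each gene image
    y_c = y - y.mean()                        # center the mask
    denom = np.sqrt((x_c ** 2).sum(axis=1) * (y_c ** 2).sum())
    with np.errstate(divide="ignore", invalid="ignore"):
        r = (x_c @ y_c) / denom
    # Assumption: a constant image (zero variance) is scored 0 rather than NaN.
    return np.nan_to_num(r)


def top_genes(gene_names, scores, k=2):
    """Return the k gene names with the highest scores."""
    order = np.argsort(scores)[::-1][:k]
    return [gene_names[i] for i in order]
```

Because Pearson correlation is invariant to shifting and scaling, the per-gene normalization described earlier does not change this ranking.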
1.855 - Conditional entropy An information-theoretic scoring method is to find 1.856 - features such that, if the features (gene expression levels) are known, uncer- 1.857 - tainty about the target (the regional identity) is reduced. Entropy measures 1.858 - uncertainty, so what we want is to find features such that the conditional dis- 1.859 - tribution of the target has minimal entropy. The distribution to which we are 1.860 - referring is the probability distribution over the population of surface pixels. 1.861 - The simplest way to use information theory is on discrete data, so we 1.862 - discretized our gene expression data by creating, for each gene, five thresholded 1.863 - boolean masks of the gene data. For each gene, we created a boolean mask 1.864 -of its expression levels using each of these thresholds: the mean of that gene, the mean minus one standard deviation, the 1.865 -mean minus two standard deviations, the mean plus one standard deviation, the mean plus two standard deviations. 1.866 -Now, for each region, we created and ran a forward stepwise procedure which attempted to find pairs of gene expression 1.867 -boolean masks such that the conditional entropy of the target area&#8217;s boolean mask, conditioned upon the pair of gene 1.868 -expression boolean masks, is minimized. 1.869 -This finds pairs of genes which are most informative (at least at these discretization thresholds) relative to the question, 1.870 -&#8220;Is this surface pixel a member of the target area?&#8221;. Its advantage over linear methods such as logistic regression is that it 1.871 -takes account of arbitrarily nonlinear relationships; for example, if the XOR of two variables predicts the target, conditional 1.872 -entropy would notice, whereas linear methods would not. 1.873 +Figure 4: Upper left: wwc1. Upper 1.874 +right: mtif2. 
Lower left: wwc1 + mtif2 1.875 +(each pixel&#8217;s value on the lower left is 1.876 +the sum of the corresponding pixels in 1.877 +the upper row). Correlation Recall that the instances are surface pixels, and con- 1.878 + sider the problem of attempting to classify each instance as either a 1.879 + member of a particular anatomical area, or not. The target area can be 1.880 + represented as a boolean mask over the surface pixels. 1.881 + One class of feature selection scoring methods contains methods 1.882 + which calculate some sort of &#8220;match&#8221; between each gene image and 1.883 + the target image. Those genes which match the best are good candi- 1.884 + dates for features. 1.885 + One of the simplest methods in this class is to use correlation as 1.886 + the match score. We calculated the correlation between each gene 1.887 + and each cortical area. The top row of Figure 1 shows the three genes 1.888 + most correlated with area SS. 1.889 + Conditional entropy An information-theoretic scoring method is 1.890 + to find features such that, if the features (gene expression levels) are 1.891 + known, uncertainty about the target (the regional identity) is reduced. 1.892 + Entropy measures uncertainty, so what we want is to find features such 1.893 + that the conditional distribution of the target has minimal entropy. The 1.894 + distribution to which we are referring is the probability distribution over 1.895 +the population of surface pixels. 1.896 +The simplest way to use information theory is on discrete data, so we discretized our gene expression data 1.897 +by creating, for each gene, five thresholded boolean masks of the gene data. For each gene, we created a 1.898 +boolean mask of its expression levels using each of these thresholds: the mean of that gene, the mean minus 1.899 +one standard deviation, the mean minus two standard deviations, the mean plus one standard deviation, the 1.900 +mean plus two standard deviations. 
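As a rough sketch of the discretization and of the quantity being minimized, the Python below builds the five thresholded masks for one gene and computes the conditional entropy of a target mask given any number of feature masks. The function names are hypothetical, and the use of a strict "greater than" comparison at each threshold is an assumption; the text does not say how ties are handled.

```python
import numpy as np


def threshold_masks(gene_image):
    """Five boolean masks of one gene image, thresholded at mean + k*std.

    Keys are k in (-2, -1, 0, 1, 2). Assumption: a pixel is True when its
    expression is strictly greater than the threshold.
    """
    mu, sigma = gene_image.mean(), gene_image.std()
    return {k: gene_image > mu + k * sigma for k in (-2, -1, 0, 1, 2)}


def conditional_entropy(target_mask, *feature_masks):
    """H(target | features) in bits, over the population of surface pixels."""
    t = target_mask.ravel().astype(int)
    # Encode each pixel's combination of feature values as a single integer.
    code = np.zeros(t.shape, dtype=int)
    for m in feature_masks:
        code = code * 2 + m.ravel().astype(int)
    h = 0.0
    for c in np.unique(code):
        sel = code == c
        p_c = sel.sum() / t.size          # P(features = c)
        p1 = t[sel].mean()                # P(target = 1 | features = c)
        for p in (p1, 1.0 - p1):
            if p > 0:
                h -= p_c * p * np.log2(p)
    return h
```

With two feature masks whose XOR equals the target, `conditional_entropy(target, a, b)` is zero even though either mask alone leaves a full bit of uncertainty, which is the nonlinear case mentioned in the text.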
1.901 +Now, for each region, we created and ran a forward stepwise procedure which attempted to find pairs of gene 1.902 +expression boolean masks such that the conditional entropy of the target area&#8217;s boolean mask, conditioned upon 1.903 +the pair of gene expression boolean masks, is minimized. 1.904 +This finds pairs of genes which are most informative (at least at these discretization thresholds) relative to the 1.905 +question, &#8220;Is this surface pixel a member of the target area?&#8221;. Its advantage over linear methods such as logistic 1.906 +regression is that it takes account of arbitrarily nonlinear relationships; for example, if the XOR of two variables 1.907 +predicts the target, conditional entropy would notice, whereas linear methods would not. 1.908 +Gradient similarity We noticed that the previous two scoring methods, which are pointwise, often found 1.909 +genes whose pattern of expression did not look similar in shape to the target region. For this reason we designed 1.910 +a non-pointwise local scoring method to detect when a gene had a pattern of expression which looked like it had 1.911 +a boundary whose shape is similar to the shape of the target region. We call this scoring method &#8220;gradient 1.912 +similarity&#8221;. 1.913 +One might say that gradient similarity attempts to measure how much the border of the area of gene expres- 1.914 +sion and the border of the target region overlap. However, since gene expression falls off continuously rather 1.915 +than jumping from its maximum value to zero, the spatial pattern of a gene&#8217;s expression often does not have a 1.916 +discrete border. Therefore, instead of looking for a discrete border, we look for large gradients. Gradient similarity 1.917 +is a symmetric function over two images (i.e. two scalar fields). It is high to the extent that matching pixels 1.918 +which have large values and large gradients also have gradients which are oriented in a similar direction. 
The 1.919 +formula is: 1.920 + $$\sum_{\mathrm{pixel} \in \mathrm{pixels}} \cos\left(\left|\angle\nabla_1 - \angle\nabla_2\right|\right) \cdot \frac{|\nabla_1| + |\nabla_2|}{2} \cdot \frac{\mathrm{pixel\_value}_1 + \mathrm{pixel\_value}_2}{2}$$ 1.925 1.926 1.927 1.928 1.929 -Figure 5: From left to right and top 1.930 -to bottom, single genes which roughly 1.931 -identify areas SS (somatosensory primary 1.932 -+ supplemental), SSs (supplemental so- 1.933 -matosensory), PIR (piriform), FRP (frontal 1.934 -pole), RSP (retrosplenial), COApm (Corti- 1.935 -cal amygdalar, posterior part, medial zone). 1.936 -Grouping some areas together, we have 1.937 -also found genes to identify the groups 1.938 +Figure 5: From left to right and top 1.939 +to bottom, single genes which roughly 1.940 +identify areas SS (somatosensory pri- 1.941 +mary + supplemental), SSs (supple- 1.942 +mental somatosensory), PIR (piriform), 1.943 +FRP (frontal pole), RSP (retrosple- 1.944 +nial), COApm (Cortical amygdalar, pos- 1.945 +terior part, medial zone). Grouping 1.946 +some areas together, we have also 1.947 +found genes to identify the groups 1.948 ACA+PL+ILA+DP+ORB+MO (anterior 1.949 -cingulate, prelimbic, infralimbic, dorsal pe- 1.950 -duncular, orbital, motor), posterior and lat- 1.951 -eral visual (VISpm, VISpl, VISI, VISp; pos- 1.952 -teromedial, posterolateral, lateral, and pri- 1.953 -mary visual; the posterior and lateral vi- 1.954 -sual area is distinguished from its neigh- 1.955 -bors, but not from the entire rest of the 1.956 -cortex). The genes are Pitx2, Aldh1a2, 1.957 -Ppfibp1, Slco1a5, Tshz2, Trhr, Col12a1, 1.958 -Ets1. Gradient similarity We noticed that the previous two scoring methods, 1.959 - which are pointwise, often found genes whose pattern of expression did not 1.960 - look similar in shape to the target region. 
For this reason we designed a 1.961 - non-pointwise local scoring method to detect when a gene had a pattern of 1.962 - expression which looked like it had a boundary whose shape is similar to the 1.963 - shape of the target region. We call this scoring method &#8220;gradient similarity&#8221;. 1.964 - One might say that gradient similarity attempts to measure how much the 1.965 - border of the area of gene expression and the border of the target region over- 1.966 - lap. However, since gene expression falls off continuously rather than jumping 1.967 - from its maximum value to zero, the spatial pattern of a gene&#8217;s expression often 1.968 - does not have a discrete border. Therefore, instead of looking for a discrete 1.969 - border, we look for large gradients. Gradient similarity is a symmetric function 1.970 - over two images (i.e. two scalar fields). It is high to the extent that matching 1.971 - pixels which have large values and large gradients also have gradients which 1.972 - are oriented in a similar direction. The formula is: 1.973 - $$\sum_{\mathrm{pixel} \in \mathrm{pixels}} \cos\left(\left|\angle\nabla_1 - \angle\nabla_2\right|\right) \cdot \frac{|\nabla_1| + |\nabla_2|}{2} \cdot \frac{\mathrm{pixel\_value}_1 + \mathrm{pixel\_value}_2}{2}$$ 1.977 - where $\nabla_1$ and $\nabla_2$ are the gradient vectors of the two images at the current 1.978 - pixel; $\angle\nabla_i$ is the angle of the gradient of image $i$ at the current pixel; $|\nabla_i|$ is 1.979 - the magnitude of the gradient of image $i$ at the current pixel; and $\mathrm{pixel\_value}_i$ 1.980 - is the value of the current pixel in image $i$. 1.981 - The intuition is that we want to see if the borders of the pattern in the 1.982 - two images are similar; if the borders are similar, then both images will have 1.983 - corresponding pixels with large gradients (because this is a border) which are 1.984 - oriented in a similar direction (because the borders are similar). 
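The gradient similarity score can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name is hypothetical, and estimating gradients by finite differences with `numpy.gradient` is an assumption, since the text does not say how the gradients are computed.

```python
import numpy as np


def gradient_similarity(img1, img2):
    """Sum over pixels of cos(angle difference) * mean gradient magnitude * mean value.

    img1, img2: 2-D arrays of the same shape (e.g. a gene image and a target mask).
    Assumption: gradients are finite differences as computed by numpy.gradient.
    """
    a = np.asarray(img1, dtype=float)
    b = np.asarray(img2, dtype=float)
    gy1, gx1 = np.gradient(a)              # derivatives along rows, then columns
    gy2, gx2 = np.gradient(b)
    mag1 = np.hypot(gx1, gy1)              # |grad_1|
    mag2 = np.hypot(gx2, gy2)              # |grad_2|
    ang1 = np.arctan2(gy1, gx1)            # angle of grad_1
    ang2 = np.arctan2(gy2, gx2)            # angle of grad_2
    cos_term = np.cos(np.abs(ang1 - ang2))
    per_pixel = cos_term * (mag1 + mag2) / 2.0 * (a + b) / 2.0
    return float(per_pixel.sum())
```

For two identical left-to-right ramps the cosine term is 1 at every pixel and the score is positive; mirroring one ramp makes the gradients point in opposite directions, so the cosine term is -1 and the score becomes negative.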
1.985 +cingulate, prelimbic, infralimbic, dor- 1.986 +sal peduncular, orbital, motor), poste- 1.987 +rior and lateral visual (VISpm, VISpl, 1.988 +VISI, VISp; posteromedial, posterolat- 1.989 +eral, lateral, and primary visual; the 1.990 +posterior and lateral visual area is dis- 1.991 +tinguished from its neighbors, but not 1.992 +from the entire rest of the cortex). The 1.993 +genes are Pitx2, Aldh1a2, Ppfibp1, 1.994 +Slco1a5, Tshz2, Trhr, Col12a1, Ets1. where $\nabla_1$ and $\nabla_2$ are the gradient vectors of the two images at the 1.995 + current pixel; $\angle\nabla_i$ is the angle of the gradient of image $i$ at the current 1.996 + pixel; $|\nabla_i|$ is the magnitude of the gradient of image $i$ at the current 1.997 + pixel; and $\mathrm{pixel\_value}_i$ is the value of the current pixel in image $i$. 1.998 + The intuition is that we want to see if the borders of the pattern in 1.999 + the two images are similar; if the borders are similar, then both images 1.1000 + will have corresponding pixels with large gradients (because this is a 1.1001 + border) which are oriented in a similar direction (because the borders 1.1002 + are similar). 1.1003 Most of the genes in Figure 5 were identified via gradient similarity. 1.1004 - Gradient similarity provides information complementary to cor- 1.1005 - relation 1.1006 - To show that gradient similarity can provide useful information that cannot 1.1007 - be detected via pointwise analyses, consider Fig. 3. The top row of Fig. 3 1.1008 - displays the two genes which most match area AUD, according to a pointwise 1.1009 - method17. The bottom row displays the two genes which most match AUD ac- 1.1010 - cording to a method which considers local geometry18. The pointwise method 1.1011 - in the top row identifies genes which express more strongly in AUD than out- 1.1012 - side of it; its weakness is that this includes many genes which don&#8217;t have a 1.1013 - salient border matching the areal border. 
The geometric method identifies 1.1014 - genes whose salient expression border seems to partially line up with the bor- 1.1015 - der of AUD; its weakness is that this includes genes which don&#8217;t express over 1.1016 - the entire area. Genes which have high rankings using both pointwise and bor- 1.1017 - der criteria, such as Aph1a in the example, may be particularly good markers. 1.1018 - None of these genes are, individually, a perfect marker for AUD; we deliberately 1.1019 - chose a &#8220;difficult&#8221; area in order to better contrast pointwise with geometric 1.1020 - methods. 1.1021 - Areas which can be identified by single genes Using gradient simi- 1.1022 - larity, we have already found single genes which roughly identify some areas 1.1023 -and groupings of areas. For each of these areas, an example of a gene which roughly identifies it is shown in Figure 5. We 1.1024 -have not yet cross-verified these genes in other atlases. 1.1025 + Gradient similarity provides information complementary to 1.1026 + correlation 1.1027 + To show that gradient similarity can provide useful information that 1.1028 + cannot be detected via pointwise analyses, consider Fig. 3. The top 1.1029 + row of Fig. 3 displays the 3 genes which most match area AUD, ac- 1.1030 + cording to a pointwise method17. The bottom row displays the 3 genes 1.1031 + which most match AUD according to a method which considers local 1.1032 + geometry18 The pointwise method in the top row identifies genes which 1.1033 + express more strongly in AUD than outside of it; its weakness is that 1.1034 + this includes many areas which don&#8217;t have a salient border matching 1.1035 + the areal border. The geometric method identifies genes whose salient 1.1036 + expression border seems to partially line up with the border of AUD; 1.1037 + its weakness is that this includes genes which don&#8217;t express over the 1.1038 + entire area. 
Genes which have high rankings using both pointwise and 1.1039 + border criteria, such as Aph1a in the example, may be particularly good 1.1040 + markers. None of these genes are, individually, a perfect marker for 1.1041 + AUD; we deliberately chose a &#8220;difficult&#8221; area in order to better contrast 1.1042 + pointwise with geometric methods. 1.1043 + Areas which can be identified by single genes Using gradient 1.1044 + similarity, we have already found single genes which roughly identify 1.1045 + some areas and groupings of areas. For each of these areas, an ex- 1.1046 + ample of a gene which roughly identifies it is shown in Figure 5. We 1.1047 + have not yet cross-verified these genes in other atlases. 1.1048 + In addition, there are a number of areas which are almost identified 1.1049 + by single genes: COAa+NLOT (anterior part of cortical amygdalar area, 1.1050 + nucleus of the lateral olfactory tract), ENT (entorhinal), ACAv (ventral 1.1051 + anterior cingulate), VIS (visual), AUD (auditory). 1.1052 + These results validate our expectation that the ABA dataset can 1.1053 + be exploited to find marker genes for many cortical areas, while also 1.1054 + validating the relevance of our new scoring method, gradient similarity. 1.1055 + Combinations of multiple genes are useful and necessary for 1.1056 + some areas 1.1057 + In Figure 4, we give an example of a cortical area which is not 1.1058 + marked by any single gene, but which can be identified combinatorially. 1.1059 +According to logistic regression, gene wwc1 is the best fit single gene for predicting whether or not a pixel on 1.1060 +the cortical surface belongs to the motor area (area MO). 
The upper-left picture in Figure 4 shows wwc1&#8217;s spatial 1.1061 _________________________________________ 1.1062 - 17For each gene, a logistic regression in which the response variable was whether or not a surface pixel was within area AUD, and the predictor 1.1063 -variable was the value of the expression of the gene underneath that pixel. The resulting scores were used to rank the genes in terms of how well 1.1064 -they predict area AUD. 1.1065 - 18For each gene the gradient similarity between (a) a map of the expression of each gene on the cortical surface and (b) the shape of area AUD, 1.1066 -was calculated, and this was used to rank the genes. 1.1067 -In addition, there are a number of areas which are almost identified by single genes: COAa+NLOT (anterior part of 1.1068 -cortical amygdalar area, nucleus of the lateral olfactory tract), ENT (entorhinal), ACAv (ventral anterior cingulate), VIS 1.1069 -(visual), AUD (auditory). 1.1070 -These results validate our expectation that the ABA dataset can be exploited to find marker genes for many cortical 1.1071 -areas, while also validating the relevancy of our new scoring method, gradient similarity. 1.1072 -Combinations of multiple genes are useful and necessary for some areas 1.1073 -In Figure 4, we give an example of a cortical area which is not marked by any single gene, but which can be identified 1.1074 -combinatorially. Acccording to logistic regression, gene wwc1 is the best fit single gene for predicting whether or not a 1.1075 -pixel on the cortical surface belongs to the motor area (area MO). The upper-left picture in Figure 4 shows wwc1&#8217;s spatial 1.1076 -expression pattern over the cortex. The lower-right boundary of MO is represented reasonably well by this gene, but the 1.1077 -gene overshoots the upper-left boundary. This flattened 2-D representation does not show it, but the area corresponding 1.1078 -to the overshoot is the medial surface of the cortex. 
MO is only found on the dorsal surface. Gene mtif2 is shown in the 1.1079 -upper-right. Mtif2 captures MO&#8217;s upper-left boundary, but not its lower-right boundary. Mtif2 does not express very much 1.1080 -on the medial surface. By adding together the values at each pixel in these two figures, we get the lower-left image. This 1.1081 -combination captures area MO much better than any single gene. 1.1082 -This shows that our proposal to develop a method to find combinations of marker genes is both possible and necessary. 1.1083 -Feature selection integrated with prediction As noted earlier, in general, any classifier can be used for feature 1.1084 -selection by running it inside a stepwise wrapper. Also, some learning algorithms integrate soft constraints on number of 1.1085 -features used. Examples of both of these will be seen in the section &#8220;Multivariate supervised learning&#8221;. 1.1086 + 17For each gene, a logistic regression in which the response variable was whether or not a surface pixel was within area AUD, and the 1.1087 +predictor variable was the value of the expression of the gene underneath that pixel. The resulting scores were used to rank the genes 1.1088 +in terms of how well they predict area AUD. 1.1089 + 18For each gene the gradient similarity between (a) a map of the expression of each gene on the cortical surface and (b) the shape of 1.1090 +area AUD, was calculated, and this was used to rank the genes. 1.1091 +expression pattern over the cortex. The lower-right boundary of MO is represented reasonably well by this gene, 1.1092 +but the gene overshoots the upper-left boundary. This flattened 2-D representation does not show it, but the 1.1093 +area corresponding to the overshoot is the medial surface of the cortex. MO is only found on the dorsal surface. 1.1094 +Gene mtif2 is shown in the upper-right. Mtif2 captures MO&#8217;s upper-left boundary, but not its lower-right boundary. 
1.1095 +Mtif2 does not express very much on the medial surface. By adding together the values at each pixel in these 1.1096 +two figures, we get the lower-left image. This combination captures area MO much better than any single gene. 1.1097 +This shows that our proposal to develop a method to find combinations of marker genes is both possible and 1.1098 +necessary. 1.1099 +Feature selection integrated with prediction As noted earlier, in general, any classifier can be used for fea- 1.1100 +ture selection by running it inside a stepwise wrapper. Also, some learning algorithms integrate soft constraints 1.1101 +on number of features used. Examples of both of these will be seen in the section &#8220;Multivariate supervised 1.1102 +learning&#8221;. 1.1103 Multivariate supervised learning 1.1104 1.1105 1.1106 1.1107 1.1108 -Figure 6: First row: the first 6 reduced dimensions, using PCA. Second 1.1109 -row: the first 6 reduced dimensions, using NNMF. Third row: the first 1.1110 -six reduced dimensions, using landmark Isomap. Bottom row: examples 1.1111 -of kmeans clustering applied to reduced datasets to find 7 clusters. Left: 1.1112 -19 of the major subdivisions of the cortex. Second from left: PCA. Third 1.1113 -from left: NNMF. Right: Landmark Isomap. Additional details: In the 1.1114 -third and fourth rows, 7 dimensions were found, but only 6 displayed. In 1.1115 -the last row: for PCA, 50 dimensions were used; for NNMF, 6 dimensions 1.1116 -were used; for landmark Isomap, 7 dimensions were used. Forward stepwise logistic regression Lo- 1.1117 - gistic regression is a popular method for pre- 1.1118 - dictive modeling of categorial data. As a pi- 1.1119 - lot run, for five cortical areas (SS, AUD, RSP, 1.1120 - VIS, and MO), we performed forward stepwise 1.1121 - logistic regression to find single genes, pairs of 1.1122 - genes, and triplets of genes which predict areal 1.1123 - identify. 
This is an example of feature selec- 1.1124 - tion integrated with prediction using a stepwise 1.1125 - wrapper. Some of the single genes found were 1.1126 - shown in various figures throughout this doc- 1.1127 - ument, and Figure 4 shows a combination of 1.1128 - genes which was found. 1.1129 - We felt that, for single genes, gradient simi- 1.1130 - larity did a better job than logistic regression at 1.1131 - capturing our subjective impression of a &#8220;good 1.1132 - gene&#8221;. 1.1133 - SVM on all genes at once 1.1134 - In order to see how well one can do when 1.1135 - looking at all genes at once, we ran a support 1.1136 - vector machine to classify cortical surface pix- 1.1137 - els based on their gene expression profiles. We 1.1138 - achieved classification accuracy of about 81%19. 1.1139 - This shows that the genes included in the ABA 1.1140 - dataset are sufficient to define much of cortical 1.1141 - anatomy. However, as noted above, a classifier 1.1142 - that looks at all the genes at once isn&#8217;t as prac- 1.1143 - tically useful as a classifier that uses only a few 1.1144 - genes. 1.1145 +Figure 6: First row: the first 6 reduced dimensions, using PCA. Sec- 1.1146 +ond row: the first 6 reduced dimensions, using NNMF. Third row: 1.1147 +the first six reduced dimensions, using landmark Isomap. Bottom 1.1148 +row: examples of kmeans clustering applied to reduced datasets 1.1149 +to find 7 clusters. Left: 19 of the major subdivisions of the cortex. 1.1150 +Second from left: PCA. Third from left: NNMF. Right: Landmark 1.1151 +Isomap. Additional details: In the third and fourth rows, 7 dimen- 1.1152 +sions were found, but only 6 displayed. In the last row: for PCA, 1.1153 +50 dimensions were used; for NNMF, 6 dimensions were used; for 1.1154 +landmark Isomap, 7 dimensions were used. Forward stepwise logistic regression 1.1155 + Logistic regression is a popular method 1.1156 + for predictive modeling of categorical data. 
1.1157 + As a pilot run, for five cortical areas (SS, 1.1158 + AUD, RSP, VIS, and MO), we performed 1.1159 + forward stepwise logistic regression to find 1.1160 + single genes, pairs of genes, and triplets 1.1161 + of genes which predict areal identity. This 1.1162 + is an example of feature selection inte- 1.1163 + grated with prediction using a stepwise 1.1164 + wrapper. Some of the single genes found 1.1165 + were shown in various figures throughout 1.1166 + this document, and Figure 4 shows a com- 1.1167 + bination of genes which was found. 1.1168 + We felt that, for single genes, gradi- 1.1169 + ent similarity did a better job than logistic 1.1170 + regression at capturing our subjective im- 1.1171 + pression of a &#8220;good gene&#8221;. 1.1172 + SVM on all genes at once 1.1173 + In order to see how well one can do 1.1174 + when looking at all genes at once, we ran 1.1175 + a support vector machine to classify corti- 1.1176 + cal surface pixels based on their gene ex- 1.1177 + pression profiles. We achieved classifica- 1.1178 + tion accuracy of about 81%19. This shows 1.1179 + that the genes included in the ABA dataset 1.1180 + are sufficient to define much of cortical 1.1181 + anatomy. However, as noted above, a clas- 1.1182 + sifier that looks at all the genes at once isn&#8217;t 1.1183 +as practically useful as a classifier that uses only a few genes. 1.1184 _________________________________________ 1.1185 - 195-fold cross-validation. 
1.1186 - Data-driven redrawing of the cor- 1.1187 - tical map 1.1188 -We have applied the following dimensionality reduction algorithms to reduce the dimensionality of the gene expression 1.1189 -profile associated with each voxel: Principal Components Analysis (PCA), Simple PCA (SPCA), Multi-Dimensional Scaling 1.1190 -(MDS), Isomap, Landmark Isomap, Laplacian eigenmaps, Local Tangent Space Alignment (LTSA), Hessian locally linear 1.1191 -embedding, Diffusion maps, Stochastic Neighbor Embedding (SNE), Stochastic Proximity Embedding (SPE), Fast Maximum 1.1192 -Variance Unfolding (FastMVU), Non-negative Matrix Factorization (NNMF). Space constraints prevent us from showing 1.1193 -many of the results, but as a sample, PCA, NNMF, and landmark Isomap are shown in the first, second, and third rows of 1.1194 -Figure 6. 1.1195 -After applying the dimensionality reduction, we ran clustering algorithms on the reduced data. To date we have tried 1.1196 -k-means and spectral clustering. The results of k-means after PCA, NNMF, and landmark Isomap are shown in the last 1.1197 -row of Figure 6. To compare, the leftmost picture on the bottom row of Figure 6 shows some of the major subdivisions of 1.1198 -cortex. These results clearly show that different dimensionality reduction techniques capture different aspects of the data 1.1199 -and lead to different clusterings, indicating the utility of our proposal to produce a detailed comparion of these techniques 1.1200 -as applied to the domain of genomic anatomy. 1.1201 + 195-fold cross-validation. 
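The 5-fold cross-validation named in footnote 19 can be illustrated with a minimal Python sketch. This is not the code behind the reported 81% figure: a nearest-centroid classifier stands in for the support vector machine so that the example needs only the standard library, and the function names, the toy two-gene data, and the seed values are all assumptions made for illustration.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Split sample indices into k shuffled folds of near-equal size."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def nearest_centroid_fit(X, y):
    """Return {label: centroid}. Stand-in for the SVM (assumption)."""
    sums, counts = {}, {}
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for j, value in enumerate(row):
            acc[j] += value
        counts[label] = counts.get(label, 0) + 1
    return {lab: [s / counts[lab] for s in acc] for lab, acc in sums.items()}


def nearest_centroid_predict(centroids, row):
    """Predict the label whose centroid is closest in squared distance."""
    def sq_dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(row, centroid))
    return min(centroids, key=lambda lab: sq_dist(centroids[lab]))


def cross_validated_accuracy(X, y, k=5, seed=0):
    """Fraction of pixels classified correctly while held out of training."""
    correct = 0
    for held_out in k_fold_indices(len(X), k, seed):
        held = set(held_out)
        train = [i for i in range(len(X)) if i not in held]
        model = nearest_centroid_fit([X[i] for i in train],
                                     [y[i] for i in train])
        correct += sum(nearest_centroid_predict(model, X[i]) == y[i]
                       for i in held_out)
    return correct / len(X)


# Toy data (hypothetical): 100 "pixels", 2 "genes", two well-separated areas.
rng = random.Random(1)
X = ([[rng.gauss(0, 0.3), rng.gauss(0, 0.3)] for _ in range(50)]
     + [[rng.gauss(3, 0.3), rng.gauss(3, 0.3)] for _ in range(50)])
y = ["A"] * 50 + ["B"] * 50
accuracy = cross_validated_accuracy(X, y)
```

Each pixel is scored only by a model that never saw it during fitting, which is why the cross-validated figure is a fairer estimate than training accuracy.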
1.1202 +Data-driven redrawing of the cortical map 1.1203 +We have applied the following dimensionality reduction algorithms to reduce the dimensionality of the gene 1.1204 +expression profile associated with each pixel: Principal Components Analysis (PCA), Simple PCA (SPCA), Multi- 1.1205 +Dimensional Scaling (MDS), Isomap, Landmark Isomap, Laplacian eigenmaps, Local Tangent Space Alignment 1.1206 +(LTSA), Stochastic Proximity Embedding (SPE), Fast Maximum Variance Unfolding (FastMVU), Non-negative 1.1207 +Matrix Factorization (NNMF). Space constraints prevent us from showing many of the results, but as a sample, 1.1208 +PCA, NNMF, and landmark Isomap are shown in the first, second, and third rows of Figure 6. 1.1209 +After applying the dimensionality reduction, we ran clustering algorithms on the reduced data. To date we 1.1210 +have tried k-means and spectral clustering. The results of k-means after PCA, NNMF, and landmark Isomap are 1.1211 +shown in the last row of Figure 6. To compare, the leftmost picture on the bottom row of Figure 6 shows some 1.1212 +of the major subdivisions of cortex. These results clearly show that different dimensionality reduction techniques 1.1213 +capture different aspects of the data and lead to different clusterings, indicating the utility of our proposal to 1.1214 +produce a detailed comparison of these techniques as applied to the domain of genomic anatomy. 1.1215 1.1216 -Figure 7: Prototypes corresponding to sample gene clusters, 1.1217 -clustered by gradient similarity. Region boundaries for the 1.1218 -region that most matches each prototype are overlayed. Many areas are captured by clusters of genes We 1.1219 - also clustered the genes using gradient similarity to see if 1.1220 - the spatial regions defined by any clusters matched known 1.1221 - anatomical regions. Figure 7 shows, for ten sample gene 1.1222 - clusters, each cluster&#8217;s average expression pattern, compared 1.1223 - to a known anatomical boundary. 
This suggests that it is 1.1224 - worth attempting to cluster genes, and then to use the re- 1.1225 - sults to cluster voxels. 1.1226 +Figure 7: Prototypes corresponding to sample gene 1.1227 +clusters, clustered by gradient similarity. Region bound- 1.1228 +aries for the region that most matches each prototype 1.1229 +are overlayed. Many areas are captured by clusters of genes 1.1230 + We also clustered the genes using gradient similarity 1.1231 + to see if the spatial regions defined by any clusters 1.1232 + matched known anatomical regions. Figure 7 shows, 1.1233 + for ten sample gene clusters, each cluster&#8217;s average 1.1234 + expression pattern, compared to a known anatomical 1.1235 + boundary. This suggests that it is worth attempting to 1.1236 + cluster genes, and then to use the results to cluster 1.1237 + pixels. 1.1238 The approach: what we plan to do 1.1239 Flatmap cortex and segment cortical layers 1.1240 - There are multiple ways to flatten 3-D data into 2-D. We 1.1241 - will compare mappings from manifolds to planes which at- 1.1242 - tempt to preserve size (such as the one used by Caret[7]) 1.1243 - with mappings which preserve angle (conformal maps). Our 1.1244 - method will include a statistical test that warns the user if 1.1245 -the assumption of 2-D structure seems to be wrong. 1.1246 -We have not yet made use of radial profiles. While the radial profiles may be used &#8220;raw&#8221;, for laminar structures like the 1.1247 -cortex another strategy is to group together voxels in the same cortical layer; each surface pixel would then be associated 1.1248 -with one expression level per gene per layer. We will develop a segmentation algorithm to automatically identify the layer 1.1249 -boundaries. 1.1250 + There are multiple ways to flatten 3-D data into 2-D. 
1.1251 + We will compare mappings from manifolds to planes 1.1252 + which attempt to preserve size (such as the one used 1.1253 +by Caret[7]) with mappings which preserve angle (conformal maps). Our method will include a statistical test 1.1254 +that warns the user if the assumption of 2-D structure seems to be wrong. 1.1255 +We have not yet made use of radial profiles. While the radial profiles may be used &#8220;raw&#8221;, for laminar structures 1.1256 +like the cortex another strategy is to group together voxels in the same cortical layer; each surface pixel would 1.1257 +then be associated with one expression level per gene per layer. We will develop a segmentation algorithm to 1.1258 +automatically identify the layer boundaries. 1.1259 Develop algorithms that find genetic markers for anatomical regions 1.1260 -We will develop scoring methods for evaluating how good individual genes are at marking areas. We will compare pointwise, 1.1261 -geometric, and information-theoretic measures. We already developed one entirely new scoring method (gradient similarity), 1.1262 -but we may develop more. Scoring measures that we will explore will include the L1 norm, correlation, expression energy 1.1263 -ratio, conditional entropy, gradient similarity, Jaccard similarity, Dice similarity, Hough transform, and statistical tests such 1.1264 -as Student&#8217;s t-test, and the Mann-Whitney U test (a non-parametric test). In addition, any classifier induces a scoring 1.1265 -measure on genes by taking the prediction error when using that gene to predict the target. 1.1266 -Using some combination of these measures, we will develop a procedure to find single marker genes for anatomical regions: 1.1267 -for each cortical area, we will rank the genes by their ability to delineate each area. We will quantitatively compare the list 1.1268 -of single genes generated by our method to the lists generated by previous methods which are mentioned in Aim 1 Related 1.1269 -Work. 
1.1270 -Some cortical areas have no single marker genes but can be identified by combinatorial coding. This requires multivariate 1.1271 -scoring measures and feature selection procedures. Many of the measures, such as expression energy, gradient similarity, 1.1272 -Jaccard, Dice, Hough, Student&#8217;s t, and Mann-Whitney U are univariate. We will extend these scoring measures for use 1.1273 -in multivariate feature selection, that is, for scoring how well combinations of genes, rather than individual genes, can 1.1274 -distinguish a target area. There are existing multivariate forms of some of the univariate scoring measures, for example, 1.1275 -Hotelling&#8217;s T-square is a multivariate analog of Student&#8217;s t. 1.1276 -We will develop a feature selection procedure for choosing the best small set of marker genes for a given anatomical 1.1277 -area. In addition to using the scoring measures that we develop, we will also explore (a) feature selection using a stepwise 1.1278 -wrapper over &#8220;vanilla&#8221; classifiers such as logistic regression, (b) supervised learning methods such as decision trees which 1.1279 -incrementally/greedily combine single gene markers into sets, and (c) supervised learning methods which use soft constraints 1.1280 -to minimize number of features used, such as sparse support vector machines. 1.1281 -Since errors of displacement and of shape may cause genes and target areas to match less than they should, we will 1.1282 -consider the robustness of feature selection methods in the presence of error. Some of these methods, such as the Hough 1.1283 -transform, are designed to be resistant in the presence of error, but many are not. We will consider extensions to scoring 1.1284 -measures that may improve their robustness; for example, a wrapper that runs a scoring method on small displacements 1.1285 -and distortions of the data adds robustness to registration error at the expense of computation time. 
1.1286 -An area may be difficult to identify because the boundaries are misdrawn in the atlas, or because the shape of the natural 1.1287 -domain of gene expression corresponding to the area is different from the shape of the area as recognized by anatomists. 1.1288 -We will extend our procedure to handle difficult areas by combining areas or redrawing their boundaries. We will develop 1.1289 -extensions to our procedure which (a) detect when a difficult area could be fit if its boundary were redrawn slightly, and (b) 1.1290 -detect when a difficult area could be combined with adjacent areas to create a larger area which can be fit. 1.1291 -A future publication on the method that we develop in Aim 1 will review the scoring measures and quantitatively compare 1.1292 -their performance in order to provide a foundation for future research of methods of marker gene finding. We will measure 1.1293 -the robustness of the scoring measures as well as their absolute performance on our dataset. 1.1294 -Classifiers 1.1295 -We will explore and compare different classifiers. As noted above, this activity is not separate from the previous one, 1.1296 -because some supervised learning algorithms include feature selection, and any classifier can be combined with a stepwise 1.1297 -wrapper for use as a feature selection method. We will explore logistic regression (including spatial models[15]), decision 1.1298 -trees20 , sparse SVMs, generative mixture models (including naive bayes), kernel density estimation, instance-based learning 1.1299 -methods (such as k-nearest neighbor), genetic algorithms, and artificial neural networks. 1.1300 -Application to cortical areas 1.1301 -# confirm with EMAGE, GeneAtlas, GENSAT, etc, to fight overfitting, two hemis 1.1302 +Scoring measures and feature selection We will develop scoring methods for evaluating how good individual 1.1303 +genes are at marking areas. We will compare pointwise, geometric, and information-theoretic measures. 
We 1.1304 +already developed one entirely new scoring method (gradient similarity), but we may develop more. Scoring 1.1305 +measures that we will explore will include the L1 norm, correlation, expression energy ratio, conditional entropy, 1.1306 +gradient similarity, Jaccard similarity, Dice similarity, Hough transform, and statistical tests such as Student&#8217;s t- 1.1307 +test, and the Mann-Whitney U test (a non-parametric test). In addition, any classifier induces a scoring measure 1.1308 +on genes by taking the prediction error when using that gene to predict the target. 1.1309 +Using some combination of these measures, we will develop a procedure to find single marker genes for 1.1310 +anatomical regions: for each cortical area, we will rank the genes by their ability to delineate each area. We 1.1311 +will quantitatively compare the list of single genes generated by our method to the lists generated by previous 1.1312 +methods which are mentioned in Aim 1 Related Work. 1.1313 +Some cortical areas have no single marker genes but can be identified by combinatorial coding. This requires 1.1314 +multivariate scoring measures and feature selection procedures. Many of the measures, such as expression 1.1315 +energy, gradient similarity, Jaccard, Dice, Hough, Student&#8217;s t, and Mann-Whitney U are univariate. We will extend 1.1316 +these scoring measures for use in multivariate feature selection, that is, for scoring how well combinations of 1.1317 +genes, rather than individual genes, can distinguish a target area. There are existing multivariate forms of some 1.1318 +of the univariate scoring measures, for example, Hotelling&#8217;s T-square is a multivariate analog of Student&#8217;s t. 1.1319 +We will develop a feature selection procedure for choosing the best small set of marker genes for a given 1.1320 +anatomical area. 
In addition to using the scoring measures that we develop, we will also explore (a) feature 1.1321 +selection using a stepwise wrapper over &#8220;vanilla&#8221; classifiers such as logistic regression, (b) supervised learning 1.1322 +methods such as decision trees which incrementally/greedily combine single gene markers into sets, and (c) 1.1323 +supervised learning methods which use soft constraints to minimize number of features used, such as sparse 1.1324 +support vector machines. 1.1325 +Since errors of displacement and of shape may cause genes and target areas to match less than they should, 1.1326 +we will consider the robustness of feature selection methods in the presence of error. Some of these methods, 1.1327 +such as the Hough transform, are designed to be resistant in the presence of error, but many are not. We will 1.1328 +consider extensions to scoring measures that may improve their robustness; for example, a wrapper that runs a 1.1329 +scoring method on small displacements and distortions of the data adds robustness to registration error at the 1.1330 +expense of computation time. 1.1331 +An area may be difficult to identify because the boundaries are misdrawn in the atlas, or because the shape 1.1332 +of the natural domain of gene expression corresponding to the area is different from the shape of the area as 1.1333 +recognized by anatomists. We will extend our procedure to handle difficult areas by combining areas or redrawing 1.1334 +their boundaries. We will develop extensions to our procedure which (a) detect when a difficult area could be 1.1335 +fit if its boundary were redrawn slightly20, and (b) detect when a difficult area could be combined with adjacent 1.1336 +areas to create a larger area which can be fit. 
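The robustness wrapper mentioned above, which reruns a scoring method on small displacements of the data, might be sketched in Python as follows. This is only an illustration under stated assumptions: masks are binary lists of lists, Jaccard similarity is used as the wrapped score, only translations (not shape distortions) are tried, and the names `jaccard`, `shift`, and `displacement_tolerant_score` are hypothetical.

```python
def jaccard(mask_a, mask_b):
    """Jaccard similarity of two same-sized binary 2-D masks."""
    inter = union = 0
    for row_a, row_b in zip(mask_a, mask_b):
        for a, b in zip(row_a, row_b):
            inter += 1 if (a and b) else 0
            union += 1 if (a or b) else 0
    return inter / union if union else 0.0


def shift(mask, dy, dx):
    """Translate a mask by (dy, dx); vacated pixels are filled with 0."""
    height, width = len(mask), len(mask[0])
    out = [[0] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            rr, cc = r + dy, c + dx
            if 0 <= rr < height and 0 <= cc < width:
                out[rr][cc] = mask[r][c]
    return out


def displacement_tolerant_score(score_fn, gene_mask, area_mask, max_shift=1):
    """Best score over all translations of gene_mask within max_shift pixels.

    Costs (2 * max_shift + 1) ** 2 evaluations of score_fn, which is the
    computation-time price of the added tolerance to registration error.
    """
    return max(
        score_fn(shift(gene_mask, dy, dx), area_mask)
        for dy in range(-max_shift, max_shift + 1)
        for dx in range(-max_shift, max_shift + 1)
    )


# Toy example: the gene pattern has the right shape but is misregistered
# by one pixel to the right of the target area.
area = [[0, 0, 0, 0, 0],
        [0, 1, 1, 0, 0],
        [0, 1, 1, 0, 0],
        [0, 0, 0, 0, 0]]
gene = shift(area, 0, 1)
plain = jaccard(gene, area)
tolerant = displacement_tolerant_score(jaccard, gene, area)
```

In this toy case the plain score is penalized by the one-pixel misregistration, while the wrapped score recovers the perfect match. Taking the maximum over shifts will also raise the scores of poor genes somewhat, so rankings rather than raw values are the more meaningful comparison.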
1.1337 +A future publication on the method that we develop in Aim 1 will review the scoring measures and quantita- 1.1338 +tively compare their performance in order to provide a foundation for future research of methods of marker gene 1.1339 +finding. We will measure the robustness of the scoring measures as well as their absolute performance on our 1.1340 +dataset. 1.1341 +Classifiers We will explore and compare different classifiers. As noted above, this activity is not separate 1.1342 +from the previous one, because some supervised learning algorithms include feature selection, and any clas- 1.1343 +sifier can be combined with a stepwise wrapper for use as a feature selection method. We will explore logistic 1.1344 +regression (including spatial models[16]), decision trees21, sparse SVMs, generative mixture models (including 1.1345 +naive bayes), kernel density estimation, instance-based learning methods (such as k-nearest neighbor), genetic 1.1346 +algorithms, and artificial neural networks. 
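As a rough illustration of using a classifier inside a forward stepwise wrapper for feature selection, consider the Python sketch below. It is not the pilot implementation: a nearest-centroid classifier stands in for logistic regression, training accuracy is used as the score (a cross-validated score would be preferable in practice), and the function names and toy data are assumptions. The toy data mimic the situation described for area MO, where each of two genes overshoots the area on a different side and only the pair identifies it.

```python
def nearest_centroid_accuracy(X, y, features):
    """Training accuracy of a nearest-centroid classifier using `features`."""
    sums, counts = {}, {}
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(features))
        for k, f in enumerate(features):
            acc[k] += row[f]
        counts[label] = counts.get(label, 0) + 1
    centroids = {lab: [s / counts[lab] for s in acc]
                 for lab, acc in sums.items()}

    def predict(row):
        return min(
            centroids,
            key=lambda lab: sum((row[f] - c) ** 2
                                for f, c in zip(features, centroids[lab])),
        )

    return sum(predict(row) == label for row, label in zip(X, y)) / len(X)


def forward_stepwise(X, y, max_features, score_fn=nearest_centroid_accuracy):
    """Greedily add the gene that most improves the score; stop when none does."""
    selected, best_score = [], float("-inf")
    remaining = list(range(len(X[0])))
    while remaining and len(selected) < max_features:
        # Ties are broken toward the lower gene index.
        score, feature = max(
            ((score_fn(X, y, selected + [f]), f) for f in remaining),
            key=lambda t: (t[0], -t[1]),
        )
        if score <= best_score:
            break
        selected.append(feature)
        remaining.remove(feature)
        best_score = score
    return selected, best_score


# Toy data (hypothetical): columns are genes 0, 1, 2; rows are pixels.
X = [
    [1, 1, 0], [1, 1, 1], [1, 1, 0], [1, 1, 1],   # inside the target area
    [1, 0, 0], [1, 0, 1],                         # gene 0 overshoots here
    [0, 1, 0], [0, 1, 1],                         # gene 1 overshoots here
    [0, 0, 0], [0, 0, 1], [0, 0, 0], [0, 0, 1],   # background
]
y = ["in"] * 4 + ["out"] * 8
selected, best = forward_stepwise(X, y, max_features=3)
```

Here the wrapper picks gene 0 first, adds gene 1 because the pair separates the area perfectly, and then stops because gene 2 brings no further improvement.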
1.1347 Develop algorithms to suggest a division of a structure into anatomical parts 1.1348 -1.Explore dimensionality reduction algorithms applied to pixels: including TODO 1.1349 -2.Explore dimensionality reduction algorithms applied to genes: including TODO 1.1350 -3.Explore clustering algorithms applied to pixels: including TODO 1.1351 -4.Explore clustering algorithms applied to genes: including gene shaving[9], TODO 1.1352 -5.Develop an algorithm to use dimensionality reduction and/or hierarchial clustering to create anatomical maps 1.1353 -6.Run this algorithm on the cortex: present a hierarchial, genoarchitectonic map of the cortex 1.1354 -# Linear discriminant analysis 1.1355 -# jbt, coclustering 1.1356 -# self-organizing map 1.1357 -# Linear discriminant analysis 1.1358 -# compare using clustering scores 1.1359 -# multivariate gradient similarity 1.1360 -# deep belief nets 1.1361 -Apply these algorithms to the cortex 1.1362 -Using the methods developed in Aim 1, we will present, for each cortical area, a short list of markers to identify that 1.1363 -area; and we will also present lists of &#8220;panels&#8221; of genes that can be used to delineate many areas at once. Using the methods 1.1364 -developed in Aim 2, we will present one or more hierarchial cortical maps. We will identify and explain how the statistical 1.1365 -structure in the gene expression data led to any unexpected or interesting features of these maps, and we will provide 1.1366 -biological hypotheses to interpret any new cortical areas, or groupings of areas, which are discovered. 1.1367 +Explore dimensionality reduction on gene expression profiles We have already described the application 1.1368 +of ten dimensionality reduction algorithms for the purpose of replacing the gene expression profiles, which are 1.1369 +vectors of about 4000 gene expression levels, with a smaller number of features. 
We plan to further explore 1.1370 +and interpret these results, as well as to apply other unsupervised learning algorithms, including independent 1.1371 +components analysis, self-organizing maps, and generative models such as deep Boltzmann machines. We 1.1372 +will explore ways to quantitatively compare the relevance of the different dimensionality reduction methods for 1.1373 +identifying cortical areal boundaries. 1.1374 +Explore dimensionality reduction on pixels Instead of applying dimensionality reduction to the gene ex- 1.1375 _________________________________________ 1.1376 - 20Actually, we have already begun to explore decision trees. For each cortical area, we have used the C4.5 algorithm to find a decision tree for 1.1377 -that area. We achieved good classification accuracy on our training set, but the number of genes that appeared in each tree was too large. We 1.1378 -plan to implement a pruning procedure to generate trees that use fewer genes. 1.1379 + 20Not just any redrawing is acceptable, only those which appear to be justified as a natural spatial domain of gene expression by 1.1380 +multiple sources of evidence. Interestingly, the need to detect &#8220;natural spatial domains of gene expression&#8221; in a data-driven fashion 1.1381 +means that the methods of Aim 2 might be useful in achieving Aim 1, as well &#8211; particularly discriminative dimensionality reduction. 1.1382 + 21Actually, we have already begun to explore decision trees. For each cortical area, we have used the C4.5 algorithm to find a decision 1.1383 +tree for that area. We achieved good classification accuracy on our training set, but the number of genes that appeared in each tree was 1.1384 +too large. We plan to implement a pruning procedure to generate trees that use fewer genes. 1.1385 +pression profiles, the same techniques can be applied instead to the pixels22. 
It is possible that the features 1.1386 +generated in this way by some dimensionality reduction techniques will directly correspond to interesting spatial 1.1387 +regions. 1.1388 +Explore clustering and segmentation algorithms on pixels We will explore clustering and segmenta- 1.1389 +tion algorithms in order to segment the pixels into regions. We will explore k-means, spectral clustering, gene 1.1390 +shaving[9], recursive division clustering, multivariate generalizations of edge detectors, multivariate generaliza- 1.1391 +tions of watershed transformations, region growing, active contours, graph partitioning methods, and recursive 1.1392 +agglomerative clustering with various linkage functions. These methods can be combined with dimensionality 1.1393 +reduction. 1.1394 +Explore clustering on genes We have already shown that the procedure of clustering genes according to 1.1395 +gradient similarity, and then creating an averaged prototype of each cluster&#8217;s expression pattern, yields some 1.1396 +spatial patterns which match cortical areas. We will further explore the clustering of genes. 1.1397 +In addition to using the cluster expression prototypes directly to identify spatial regions, this might be useful 1.1398 +as a component of dimensionality reduction. For example, one could imagine clustering similar genes and then 1.1399 +replacing their expression levels with a single average expression level, thereby removing some redundancy from 1.1400 +the gene expression profiles. One could then perform clustering on pixels (possibly after a second dimensionality 1.1401 +reduction step) in order to identify spatial regions. It remains to be seen whether removal of redundancy would 1.1402 +help or hurt the ultimate goal of identifying interesting spatial regions. 1.1403 +Explore co-clustering There are some algorithms which simultaneously incorporate clustering on instances 1.1404 +and on features (in our case, genes and pixels), for example, IRM[11]. 
These are called co-clustering or biclus- 1.1405 +tering algorithms. 1.1406 +Quantitatively compare different methods In order to tell which method is best for genomic anatomy, for 1.1407 +each experimental method we will compare the cortical map found by unsupervised learning to a cortical map 1.1408 +derived from the Allen Reference Atlas. In order to compare the experimental clustering with the reference 1.1409 +clustering, we will explore various quantitative metrics that purport to measure how similar two clusterings are, 1.1410 +such as Jaccard, Rand index, Fowlkes-Mallows, variation of information, Larsen, Van Dongen, and others. 1.1411 +Discriminative dimensionality reduction In addition to using a purely data-driven approach to identify 1.1412 +spatial regions, it might be useful to see how well the known regions can be reconstructed from a small number 1.1413 +of features, even if those features are chosen by using knowledge of the regions. For example, linear discriminant 1.1414 +analysis could be used as a dimensionality reduction technique in order to identify a few features which are the 1.1415 +best linear summary of gene expression profiles for the purpose of discriminating between regions. This reduced 1.1416 +feature set could then be used to cluster pixels into regions. Perhaps the resulting clusters will be similar to the 1.1417 +reference atlas, yet more faithful to natural spatial domains of gene expression than the reference atlas is. 1.1418 +Apply the new methods to the cortex 1.1419 +Using the methods developed in Aim 1, we will present, for each cortical area, a short list of markers to identify 1.1420 +that area; and we will also present lists of &#8220;panels&#8221; of genes that can be used to delineate many areas at once. 1.1421 +Because in most cases the ABA coronal dataset only contains one ISH per gene, it is possible for an unrelated 1.1422 +combination of genes to seem to identify an area when in fact it is only coincidence. 
There are two ways we will 1.1423 +validate our marker genes to guard against this. First, we will confirm that putative combinations of marker genes 1.1424 +express the same pattern in both hemispheres. Second, we will manually validate our final results on other gene 1.1425 +expression datasets such as EMAGE, GeneAtlas, and GENSAT. 1.1426 +Using the methods developed in Aim 2, we will present one or more hierarchical cortical maps. We will identify 1.1427 +and explain how the statistical structure in the gene expression data led to any unexpected or interesting features 1.1428 +_________________________________________ 1.1429 + 22Consider a matrix whose rows represent pixel locations, and whose columns represent genes. An entry in this matrix represents the 1.1430 +gene expression level at a given pixel. One can look at this matrix as a collection of pixels, each corresponding to a vector of many gene 1.1431 +expression levels; or one can look at it as a collection of genes, each corresponding to a vector giving that gene&#8217;s expression at each 1.1432 +pixel. Similarly, dimensionality reduction can be used to replace a large number of genes with a small number of features, or it can be 1.1433 +used to replace a large number of pixels with a small number of features. 1.1434 +of these maps, and we will provide biological hypotheses to interpret any new cortical areas, or groupings of 1.1435 +areas, which are discovered. 1.1436 Timeline and milestones 1.1437 Finding marker genes 1.1438 -&#x2219;September-November 2009: Develop an automated mechanism for segmenting the cortical voxels into layers 1.1439 -&#x2219;November 2009 (milestone): Have completed construction of a flatmapped, cortical dataset with information for each 1.1440 -layer 1.1441 -&#x2219;October 2009-April 2010: Develop scoring methods and test them in various supervised learning frameworks. Also 1.1442 -test out various dimensionality reduction schemes in combination with supervised learning. 
Create or extend supervised 1.1443 -learning frameworks which use multivariate versions of the best scoring methods. 1.1444 -&#x2219;January 2010 (milestone): Submit a publication on single marker genes for cortical areas 1.1445 -&#x2219;February-July 2010: Continue to develop scoring methods and supervised learning frameworks. Explore the best way 1.1446 -to integrate radial profiles with supervised learning. Explore the best way to make supervised learning techniques 1.1447 -robust against incorrect labels (i.e. when the areas drawn on the input cortical map are slightly off). Quantitatively 1.1448 -compare the performance of different supervised learning techniques. Validate marker genes found in the ABA dataset 1.1449 -by checking against other gene expression datasets. Create documentation and unit tests for software toolbox for Aim 1.1450 -1. Respond to user bug reports for Aim 1 software toolbox. 1.1451 -&#x2219;June 2010 (milestone): Submit a paper describing a method fulfilling Aim 1. Release toolbox. 1.1452 -&#x2219;July 2010 (milestone): Submit a paper describing combinations of marker genes for each cortical area, and a small 1.1453 -number of marker genes that can, in combination, define most of the areas at once 1.1454 +September-November 2009: Develop an automated mechanism for segmenting the cortical voxels into layers 1.1455 +November 2009 (milestone): Have completed construction of a flatmapped, cortical dataset with information 1.1456 +for each layer 1.1457 +October 2009-April 2010: Develop scoring methods and test them in various supervised learning frameworks. 1.1458 +Also test out various dimensionality reduction schemes in combination with supervised learning. Create or extend 1.1459 +supervised learning frameworks which use multivariate versions of the best scoring methods. 
1.1460 +January 2010 (milestone): Submit a publication on single marker genes for cortical areas 1.1461 +February-July 2010: Continue to develop scoring methods and supervised learning frameworks. Explore the 1.1462 +best way to integrate radial profiles with supervised learning. Explore the best way to make supervised learning 1.1463 +techniques robust against incorrect labels (i.e. when the areas drawn on the input cortical map are slightly 1.1464 +off). Quantitatively compare the performance of different supervised learning techniques. Validate marker genes 1.1465 +found in the ABA dataset by checking against other gene expression datasets. Create documentation and unit 1.1466 +tests for software toolbox for Aim 1. Respond to user bug reports for Aim 1 software toolbox. 1.1467 +June 2010 (milestone): Submit a paper describing a method fulfilling Aim 1. Release toolbox. 1.1468 +July 2010 (milestone): Submit a paper describing combinations of marker genes for each cortical area, and a 1.1469 +small number of marker genes that can, in combination, define most of the areas at once 1.1470 Revealing new ways to parcellate a structure into regions 1.1471 -&#x2219;June 2010-March 2011: Explore dimensionality reduction algorithms for Aim 2. Explore standard hierarchical clus- 1.1472 -tering algorithms, used in combination with dimensionality reduction, for Aim 2. Explore co-clustering algorithms. 1.1473 -Think about how radial profile information can be used for Aim 2. Adapt clustering algorithms to use radial profile 1.1474 -information. Quantitatively compare the performance of different dimensionality reduction and clustering techniques. 1.1475 -Quantitatively compare the value of different flatmapping methods and ways of representing radial profiles. 1.1476 -&#x2219;March 2011 (milestone): Submit a paper describing a method fulfilling Aim 2. Release toolbox. 
1.1477 -&#x2219;February-May 2011: Using the methods developed for Aim 2, explore the genomic anatomy of the cortex. If new ways 1.1478 -of organizing the cortex into areas are discovered, read the literature and talk to people to learn about research related 1.1479 -to interpreting our results. Create documentation and unit tests for software toolbox for Aim 2. Respond to user bug 1.1480 -reports for Aim 2 software toolbox. 1.1481 -&#x2219;May 2011 (milestone): Submit a paper on the genomic anatomy of the cortex, using the methods developed in Aim 2 1.1482 -&#x2219;May-August 2011: Revisit Aim 1 to see if what was learned during Aim 2 can improve the methods for Aim 1. Follow 1.1483 -up on responses to our papers. Possibly submit another paper. 1.1484 +June 2010-March 2011: Explore dimensionality reduction algorithms for Aim 2. Explore standard hierarchical 1.1485 +clustering algorithms, used in combination with dimensionality reduction, for Aim 2. Explore co-clustering algo- 1.1486 +rithms. Think about how radial profile information can be used for Aim 2. Adapt clustering algorithms to use radial 1.1487 +profile information. Quantitatively compare the performance of different dimensionality reduction and clustering 1.1488 +techniques. Quantitatively compare the value of different flatmapping methods and ways of representing radial 1.1489 +profiles. 1.1490 +March 2011 (milestone): Submit a paper describing a method fulfilling Aim 2. Release toolbox. 1.1491 +February-May 2011: Using the methods developed for Aim 2, explore the genomic anatomy of the cortex. If 1.1492 +new ways of organizing the cortex into areas are discovered, read the literature and talk to people to learn about 1.1493 +research related to interpreting our results. Create documentation and unit tests for software toolbox for Aim 2. 1.1494 +Respond to user bug reports for Aim 2 software toolbox. 
1.1495 +May 2011 (milestone): Submit a paper on the genomic anatomy of the cortex, using the methods developed in 1.1496 +Aim 2 1.1497 +May-August 2011: Revisit Aim 1 to see if what was learned during Aim 2 can improve the methods for Aim 1. 1.1498 +Follow up on responses to our papers. Possibly submit another paper. 1.1499 Bibliography &amp; References Cited 1.1500 -[1]Chris Adamson, Leigh Johnston, Terrie Inder, Sandra Rees, Iven Mareels, and Gary Egan. A Tracking Approach to 1.1501 -Parcellation of the Cerebral Cortex, volume Volume 3749/2005 of Lecture Notes in Computer Science, pages 294&#8211;301. 1.1502 -Springer Berlin / Heidelberg, 2005. 1.1503 -[2]J. Annese, A. Pitiot, I. D. Dinov, and A. W. Toga. A myelo-architectonic method for the structural classification of 1.1504 -cortical areas. NeuroImage, 21(1):15&#8211;26, 2004. 1.1505 -[3]Tanya Barrett, Dennis B. Troup, Stephen E. Wilhite, Pierre Ledoux, Dmitry Rudnev, Carlos Evangelista, Irene F. 1.1506 -Kim, Alexandra Soboleva, Maxim Tomashevsky, and Ron Edgar. NCBI GEO: mining tens of millions of expression 1.1507 -profiles&#8211;database and tools update. Nucl. Acids Res., 35(suppl_1):D760&#8211;765, 2007. 1.1508 -[4]George W. Bell, Tatiana A. Yatskievych, and Parker B. Antin. GEISHA, a whole-mount in situ hybridization gene 1.1509 -expression screen in chicken embryos. Developmental Dynamics, 229(3):677&#8211;687, 2004. 1.1510 -[5]James P Carson, Tao Ju, Hui-Chen Lu, Christina Thaller, Mei Xu, Sarah L Pallas, Michael C Crair, Joe Warren, Wah 1.1511 -Chiu, and Gregor Eichele. A digital atlas to characterize the mouse brain transcriptome. PLoS Comput Biol, 1(4):e41, 1.1512 -2005. 1.1513 -[6]Mark H. Chin, Alex B. Geng, Arshad H. Khan, Wei-Jun Qian, Vladislav A. Petyuk, Jyl Boline, Shawn Levy, Arthur W. 1.1514 -Toga, Richard D. Smith, Richard M. Leahy, and Desmond J. Smith. A genome-scale map of expression for a mouse 1.1515 -brain section obtained using voxelation. Physiol. 
Genomics, 30(3):313&#8211;321, August 2007. 1.1516 -[7]D C Van Essen, H A Drury, J Dickson, J Harwell, D Hanlon, and C H Anderson. An integrated software suite for surface- 1.1517 -based analyses of cerebral cortex. Journal of the American Medical Informatics Association: JAMIA, 8(5):443&#8211;59, 2001. 1.1518 -PMID: 11522765. 1.1519 -[8]Shiaoching Gong, Chen Zheng, Martin L. Doughty, Kasia Losos, Nicholas Didkovsky, Uta B. Schambra, Norma J. 1.1520 -Nowak, Alexandra Joyner, Gabrielle Leblanc, Mary E. Hatten, and Nathaniel Heintz. A gene expression atlas of the 1.1521 -central nervous system based on bacterial artificial chromosomes. Nature, 425(6961):917&#8211;925, October 2003. 1.1522 -[9]Trevor Hastie, Robert Tibshirani, Michael Eisen, Ash Alizadeh, Ronald Levy, Louis Staudt, Wing Chan, David Botstein, 1.1523 -and Patrick Brown. &#8217;Gene shaving&#8217; as a method for identifying distinct sets of genes with similar expression patterns. 1.1524 -Genome Biology, 1(2):research0003.1&#8211;research0003.21, 2000. 1.1525 -[10]Jano Hemert and Richard Baldock. Matching Spatial Regions with Combinations of Interacting Gene Expression Pat- 1.1526 -terns, volume 13 of Communications in Computer and Information Science, pages 347&#8211;361. Springer Berlin Heidelberg, 1.1527 -2008. 1.1528 -[11]F. Kruggel, M. K. Brückner, Th. Arendt, C. J. Wiggins, and D. Y. von Cramon. Analyzing the neocortical fine-structure. 1.1529 -Medical Image Analysis, 7(3):251&#8211;264, September 2003. 1.1530 -[12]Erh-Fang Lee, Jyl Boline, and Arthur W. Toga. A High-Resolution anatomical framework of the neonatal mouse brain 1.1531 -for managing gene expression data. Frontiers in Neuroinformatics, 1:6, 2007. PMC2525996. 1.1532 -[13]Susan Magdaleno, Patricia Jensen, Craig L. Brumwell, Anna Seal, Karen Lehman, Andrew Asbury, Tony Cheung, 1.1533 -Tommie Cornelius, Diana M. Batten, Christopher Eden, Shannon M. Norland, Dennis S. 
Rice, Nilesh Dosooye, Sundeep 1.1534 -Shakya, Perdeep Mehta, and Tom Curran. BGEM: an in situ hybridization database of gene expression in the embryonic 1.1535 -and adult mouse nervous system. PLoS Biology, 4(4):e86, April 2006. 1.1536 -[14]Lydia Ng, Amy Bernard, Chris Lau, Caroline C Overly, Hong-Wei Dong, Chihchau Kuan, Sayan Pathak, Susan M 1.1537 -Sunkin, Chinh Dang, Jason W Bohland, Hemant Bokil, Partha P Mitra, Luis Puelles, John Hohmann, David J Anderson, 1.1538 -Ed S Lein, Allan R Jones, and Michael Hawrylycz. An anatomic gene expression atlas of the adult mouse brain. Nat 1.1539 -Neurosci, 12(3):356&#8211;362, March 2009. 1.1540 -[15]Christopher J. Paciorek. Computational techniques for spatial logistic regression with large data sets. Computational 1.1541 -Statistics &amp; Data Analysis, 51(8):3631&#8211;3653, May 2007. 1.1542 -[16]George Paxinos and Keith B.J. Franklin. The Mouse Brain in Stereotaxic Coordinates. Academic Press, 2 edition, July 1.1543 -2001. 1.1544 -[17]A. Schleicher, N. Palomero-Gallagher, P. Morosan, S. Eickhoff, T. Kowalski, K. Vos, K. Amunts, and K. Zilles. Quanti- 1.1545 -tative architectural analysis: a new approach to cortical mapping. Anatomy and Embryology, 210(5):373&#8211;386, December 1.1546 -2005. 1.1547 -[18]Oliver Schmitt, Lars Hömke, and Lutz Dümbgen. Detection of cortical transition regions utilizing statistical analyses of 1.1548 -excess masses. NeuroImage, 19(1):42&#8211;63, May 2003. 1.1549 -[19]Constance M. Smith, Jacqueline H. Finger, Terry F. Hayamizu, Ingeborg J. McCright, Janan T. Eppig, James A. 1.1550 -Kadin, Joel E. Richardson, and Martin Ringwald. The mouse gene expression database (GXD): 2007 update. Nucl. 1.1551 -Acids Res., 35(suppl_1):D618&#8211;623, 2007. 
1.1552 -[20]Judy Sprague, Leyla Bayraktaroglu, Dave Clements, Tom Conlin, David Fashena, Ken Frazer, Melissa Haendel, Dou- 1.1553 -glas G Howe, Prita Mani, Sridhar Ramachandran, Kevin Schaper, Erik Segerdell, Peiran Song, Brock Sprunger, Sierra 1.1554 -Taylor, Ceri E Van Slyke, and Monte Westerfield. The zebrafish information network: the zebrafish model organism 1.1555 -database. Nucleic Acids Research, 34(Database issue):D581&#8211;5, 2006. PMID: 16381936. 1.1556 -[21]Larry Swanson. Brain Maps: Structure of the Rat Brain. Academic Press, 3 edition, November 2003. 1.1557 -[22]Carol L. Thompson, Sayan D. Pathak, Andreas Jeromin, Lydia L. Ng, Cameron R. MacPherson, Marty T. Mortrud, 1.1558 -Allison Cusick, Zackery L. Riley, Susan M. Sunkin, Amy Bernard, Ralph B. Puchalski, Fred H. Gage, Allan R. Jones, 1.1559 -Vladimir B. Bajic, Michael J. Hawrylycz, and Ed S. Lein. Genomic anatomy of the hippocampus. Neuron, 60(6):1010&#8211; 1.1560 -1021, December 2008. 1.1561 -[23]Pavel Tomancak, Amy Beaton, Richard Weiszmann, Elaine Kwan, ShengQiang Shu, Suzanna E Lewis, Stephen 1.1562 -Richards, Michael Ashburner, Volker Hartenstein, Susan E Celniker, and Gerald M Rubin. Systematic determina- 1.1563 -tion of patterns of gene expression during drosophila embryogenesis. Genome Biology, 3(12):research008818814, 2002. 1.1564 -PMC151190. 1.1565 -[24]Jano van Hemert and Richard Baldock. Mining Spatial Gene Expression Data for Association Rules, volume 4414/2007 1.1566 -of Lecture Notes in Computer Science, pages 66&#8211;76. Springer Berlin / Heidelberg, 2007. 1.1567 -[25]Shanmugasundaram Venkataraman, Peter Stevenson, Yiya Yang, Lorna Richardson, Nicholas Burton, Thomas P. Perry, 1.1568 -Paul Smith, Richard A. Baldock, Duncan R. Davidson, and Jeffrey H. Christiansen. EMAGE edinburgh mouse atlas 1.1569 -of gene expression: 2008 update. Nucl. Acids Res., 36(suppl_1):D860&#8211;865, 2008. 1.1570 -[26]Axel Visel, Christina Thaller, and Gregor Eichele. 
GenePaint.org: an atlas of gene expression patterns in the mouse 1.1571 -embryo. Nucl. Acids Res., 32(suppl_1):D552&#8211;556, 2004. 1.1572 -[27]Robert H Waterston, Kerstin Lindblad-Toh, Ewan Birney, Jane Rogers, Josep F Abril, Pankaj Agarwal, Richa Agar- 1.1573 -wala, Rachel Ainscough, Marina Alexandersson, Peter An, Stylianos E Antonarakis, John Attwood, Robert Baertsch, 1.1574 -Jonathon Bailey, Karen Barlow, Stephan Beck, Eric Berry, Bruce Birren, Toby Bloom, Peer Bork, Marc Botcherby, 1.1575 -Nicolas Bray, Michael R Brent, Daniel G Brown, Stephen D Brown, Carol Bult, John Burton, Jonathan Butler, 1.1576 -Robert D Campbell, Piero Carninci, Simon Cawley, Francesca Chiaromonte, Asif T Chinwalla, Deanna M Church, 1.1577 -Michele Clamp, Christopher Clee, Francis S Collins, Lisa L Cook, Richard R Copley, Alan Coulson, Olivier Couronne, 1.1578 -James Cuff, Val Curwen, Tim Cutts, Mark Daly, Robert David, Joy Davies, Kimberly D Delehaunty, Justin Deri, 1.1579 -Emmanouil T Dermitzakis, Colin Dewey, Nicholas J Dickens, Mark Diekhans, Sheila Dodge, Inna Dubchak, Diane M 1.1580 -Dunn, Sean R Eddy, Laura Elnitski, Richard D Emes, Pallavi Eswara, Eduardo Eyras, Adam Felsenfeld, Ginger A 1.1581 -Fewell, Paul Flicek, Karen Foley, Wayne N Frankel, Lucinda A Fulton, Robert S Fulton, Terrence S Furey, Diane Gage, 1.1582 -Richard A Gibbs, Gustavo Glusman, Sante Gnerre, Nick Goldman, Leo Goodstadt, Darren Grafham, Tina A Graves, 1.1583 -Eric D Green, Simon Gregory, Roderic Guig, Mark Guyer, Ross C Hardison, David Haussler, Yoshihide Hayashizaki, 1.1584 -LaDeana W Hillier, Angela Hinrichs, Wratko Hlavina, Timothy Holzer, Fan Hsu, Axin Hua, Tim Hubbard, Adrienne 1.1585 -Hunt, Ian Jackson, David B Jaffe, L Steven Johnson, Matthew Jones, Thomas A Jones, Ann Joy, Michael Kamal, 1.1586 -Elinor K Karlsson, Donna Karolchik, Arkadiusz Kasprzyk, Jun Kawai, Evan Keibler, Cristyn Kells, W James Kent, 1.1587 -Andrew Kirby, Diana L Kolbe, Ian Korf, Raju S Kucherlapati, Edward J Kulbokas, David 
Kulp, Tom Landers, J P 1.1588 -Leger, Steven Leonard, Ivica Letunic, Rosie Levine, Jia Li, Ming Li, Christine Lloyd, Susan Lucas, Bin Ma, Donna R 1.1589 -Maglott, Elaine R Mardis, Lucy Matthews, Evan Mauceli, John H Mayer, Megan McCarthy, W Richard McCombie, 1.1590 -Stuart McLaren, Kirsten McLay, John D McPherson, Jim Meldrim, Beverley Meredith, Jill P Mesirov, Webb Miller, 1.1591 -Tracie L Miner, Emmanuel Mongin, Kate T Montgomery, Michael Morgan, Richard Mott, James C Mullikin, Donna M 1.1592 -Muzny, William E Nash, Joanne O Nelson, Michael N Nhan, Robert Nicol, Zemin Ning, Chad Nusbaum, Michael J 1.1593 -O&#8217;Connor, Yasushi Okazaki, Karen Oliver, Emma Overton-Larty, Lior Pachter, Gens Parra, Kymberlie H Pepin, Jane 1.1594 -Peterson, Pavel Pevzner, Robert Plumb, Craig S Pohl, Alex Poliakov, Tracy C Ponce, Chris P Ponting, Simon Potter, 1.1595 -Michael Quail, Alexandre Reymond, Bruce A Roe, Krishna M Roskin, Edward M Rubin, Alistair G Rust, Ralph San- 1.1596 -tos, Victor Sapojnikov, Brian Schultz, Jrg Schultz, Matthias S Schwartz, Scott Schwartz, Carol Scott, Steven Seaman, 1.1597 -Steve Searle, Ted Sharpe, Andrew Sheridan, Ratna Shownkeen, Sarah Sims, Jonathan B Singer, Guy Slater, Arian 1.1598 -Smit, Douglas R Smith, Brian Spencer, Arne Stabenau, Nicole Stange-Thomann, Charles Sugnet, Mikita Suyama, 1.1599 -Glenn Tesler, Johanna Thompson, David Torrents, Evanne Trevaskis, John Tromp, Catherine Ucla, Abel Ureta-Vidal, 1.1600 -Jade P Vinson, Andrew C Von Niederhausern, Claire M Wade, Melanie Wall, Ryan J Weber, Robert B Weiss, Michael C 1.1601 -Wendl, Anthony P West, Kris Wetterstrand, Raymond Wheeler, Simon Whelan, Jamey Wierzbowski, David Willey, 1.1602 -Sophie Williams, Richard K Wilson, Eitan Winter, Kim C Worley, Dudley Wyman, Shan Yang, Shiaw-Pyng Yang, 1.1603 -Evgeny M Zdobnov, Michael C Zody, and Eric S Lander. Initial sequencing and comparative analysis of the mouse 1.1604 -genome. Nature, 420(6915):520&#8211;62, December 2002. PMID: 12466850. 
1.1605 +[1]Chris Adamson, Leigh Johnston, Terrie Inder, Sandra Rees, Iven Mareels, and Gary Egan. A Tracking 1.1606 +Approach to Parcellation of the Cerebral Cortex, volume Volume 3749/2005 of Lecture Notes in Computer 1.1607 +Science, pages 294&#8211;301. Springer Berlin / Heidelberg, 2005. 1.1608 +[2]J. Annese, A. Pitiot, I. D. Dinov, and A. W. Toga. A myelo-architectonic method for the structural classification 1.1609 +of cortical areas. NeuroImage, 21(1):15&#8211;26, 2004. 1.1610 +[3]Tanya Barrett, Dennis B. Troup, Stephen E. Wilhite, Pierre Ledoux, Dmitry Rudnev, Carlos Evangelista, 1.1611 +Irene F. Kim, Alexandra Soboleva, Maxim Tomashevsky, and Ron Edgar. NCBI GEO: mining tens of millions 1.1612 +of expression profiles&#8211;database and tools update. Nucl. Acids Res., 35(suppl_1):D760&#8211;765, 2007. 1.1613 +[4]George W. Bell, Tatiana A. Yatskievych, and Parker B. Antin. GEISHA, a whole-mount in situ hybridization 1.1614 +gene expression screen in chicken embryos. Developmental Dynamics, 229(3):677&#8211;687, 2004. 1.1615 +[5]James P Carson, Tao Ju, Hui-Chen Lu, Christina Thaller, Mei Xu, Sarah L Pallas, Michael C Crair, Joe 1.1616 +Warren, Wah Chiu, and Gregor Eichele. A digital atlas to characterize the mouse brain transcriptome. 1.1617 +PLoS Comput Biol, 1(4):e41, 2005. 1.1618 +[6]Mark H. Chin, Alex B. Geng, Arshad H. Khan, Wei-Jun Qian, Vladislav A. Petyuk, Jyl Boline, Shawn Levy, 1.1619 +Arthur W. Toga, Richard D. Smith, Richard M. Leahy, and Desmond J. Smith. A genome-scale map of 1.1620 +expression for a mouse brain section obtained using voxelation. Physiol. Genomics, 30(3):313&#8211;321, August 1.1621 +2007. 1.1622 +[7]D C Van Essen, H A Drury, J Dickson, J Harwell, D Hanlon, and C H Anderson. An integrated software suite 1.1623 +for surface-based analyses of cerebral cortex. Journal of the American Medical Informatics Association: 1.1624 +JAMIA, 8(5):443&#8211;59, 2001. PMID: 11522765. 1.1625 +[8]Shiaoching Gong, Chen Zheng, Martin L. 
Doughty, Kasia Losos, Nicholas Didkovsky, Uta B. Scham- 1.1626 +bra, Norma J. Nowak, Alexandra Joyner, Gabrielle Leblanc, Mary E. Hatten, and Nathaniel Heintz. A 1.1627 +gene expression atlas of the central nervous system based on bacterial artificial chromosomes. Nature, 1.1628 +425(6961):917&#8211;925, October 2003. 1.1629 +[9]Trevor Hastie, Robert Tibshirani, Michael Eisen, Ash Alizadeh, Ronald Levy, Louis Staudt, Wing Chan, 1.1630 +David Botstein, and Patrick Brown. &#8217;Gene shaving&#8217; as a method for identifying distinct sets of genes with 1.1631 +similar expression patterns. Genome Biology, 1(2):research0003.1&#8211;research0003.21, 2000. 1.1632 +[10]Jano Hemert and Richard Baldock. Matching Spatial Regions with Combinations of Interacting Gene Ex- 1.1633 +pression Patterns, volume 13 of Communications in Computer and Information Science, pages 347&#8211;361. 1.1634 +Springer Berlin Heidelberg, 2008. 1.1635 +[11]C Kemp, JB Tenenbaum, TL Griffiths, T Yamada, and N Ueda. Learning systems of concepts with an infinite 1.1636 +relational model. In AAAI, 2006. 1.1637 +[12]F. Kruggel, M. K. Brückner, Th. Arendt, C. J. Wiggins, and D. Y. von Cramon. Analyzing the neocortical 1.1638 +fine-structure. Medical Image Analysis, 7(3):251&#8211;264, September 2003. 1.1639 +[13]Erh-Fang Lee, Jyl Boline, and Arthur W. Toga. A High-Resolution anatomical framework of the neonatal 1.1640 +mouse brain for managing gene expression data. Frontiers in Neuroinformatics, 1:6, 2007. PMC2525996. 1.1641 +[14]Susan Magdaleno, Patricia Jensen, Craig L. Brumwell, Anna Seal, Karen Lehman, Andrew Asbury, Tony 1.1642 +Cheung, Tommie Cornelius, Diana M. Batten, Christopher Eden, Shannon M. Norland, Dennis S. Rice, 1.1643 +Nilesh Dosooye, Sundeep Shakya, Perdeep Mehta, and Tom Curran. BGEM: an in situ hybridization 1.1644 +database of gene expression in the embryonic and adult mouse nervous system. PLoS Biology, 4(4):e86, 1.1645 +April 2006. 
1.1646 +[15]Lydia Ng, Amy Bernard, Chris Lau, Caroline C Overly, Hong-Wei Dong, Chihchau Kuan, Sayan Pathak, Su- 1.1647 +san M Sunkin, Chinh Dang, Jason W Bohland, Hemant Bokil, Partha P Mitra, Luis Puelles, John Hohmann, 1.1648 +David J Anderson, Ed S Lein, Allan R Jones, and Michael Hawrylycz. An anatomic gene expression atlas 1.1649 +of the adult mouse brain. Nat Neurosci, 12(3):356&#8211;362, March 2009. 1.1650 +[16]Christopher J. Paciorek. Computational techniques for spatial logistic regression with large data sets. Com- 1.1651 +putational Statistics &amp; Data Analysis, 51(8):3631&#8211;3653, May 2007. 1.1652 +[17]George Paxinos and Keith B.J. Franklin. The Mouse Brain in Stereotaxic Coordinates. Academic Press, 2 1.1653 +edition, July 2001. 1.1654 +[18]A. Schleicher, N. Palomero-Gallagher, P. Morosan, S. Eickhoff, T. Kowalski, K. Vos, K. Amunts, and 1.1655 +K. Zilles. Quantitative architectural analysis: a new approach to cortical mapping. Anatomy and Em- 1.1656 +bryology, 210(5):373&#8211;386, December 2005. 1.1657 +[19]Oliver Schmitt, Lars Hömke, and Lutz Dümbgen. Detection of cortical transition regions utilizing statistical 1.1658 +analyses of excess masses. NeuroImage, 19(1):42&#8211;63, May 2003. 1.1659 +[20]Constance M. Smith, Jacqueline H. Finger, Terry F. Hayamizu, Ingeborg J. McCright, Janan T. Eppig, 1.1660 +James A. Kadin, Joel E. Richardson, and Martin Ringwald. The mouse gene expression database (GXD): 1.1661 +2007 update. Nucl. Acids Res., 35(suppl_1):D618&#8211;623, 2007. 1.1662 +[21]Judy Sprague, Leyla Bayraktaroglu, Dave Clements, Tom Conlin, David Fashena, Ken Frazer, Melissa 1.1663 +Haendel, Douglas G Howe, Prita Mani, Sridhar Ramachandran, Kevin Schaper, Erik Segerdell, Peiran 1.1664 +Song, Brock Sprunger, Sierra Taylor, Ceri E Van Slyke, and Monte Westerfield. The zebrafish information 1.1665 +network: the zebrafish model organism database. Nucleic Acids Research, 34(Database issue):D581&#8211;5, 1.1666 +2006. PMID: 16381936. 
1.1667 +[22]Larry Swanson. Brain Maps: Structure of the Rat Brain. Academic Press, 3 edition, November 2003. 1.1668 +[23]Carol L. Thompson, Sayan D. Pathak, Andreas Jeromin, Lydia L. Ng, Cameron R. MacPherson, Marty T. 1.1669 +Mortrud, Allison Cusick, Zackery L. Riley, Susan M. Sunkin, Amy Bernard, Ralph B. Puchalski, Fred H. 1.1670 +Gage, Allan R. Jones, Vladimir B. Bajic, Michael J. Hawrylycz, and Ed S. Lein. Genomic anatomy of the 1.1671 +hippocampus. Neuron, 60(6):1010&#8211;1021, December 2008. 1.1672 +[24]Pavel Tomancak, Amy Beaton, Richard Weiszmann, Elaine Kwan, ShengQiang Shu, Suzanna E Lewis, 1.1673 +Stephen Richards, Michael Ashburner, Volker Hartenstein, Susan E Celniker, and Gerald M Rubin. Sys- 1.1674 +tematic determination of patterns of gene expression during drosophila embryogenesis. Genome Biology, 1.1675 +3(12):research0088.1&#8211;88.14, 2002. PMC151190. 1.1676 +[25]Jano van Hemert and Richard Baldock. Mining Spatial Gene Expression Data for Association Rules, volume 1.1677 +4414/2007 of Lecture Notes in Computer Science, pages 66&#8211;76. Springer Berlin / Heidelberg, 2007. 1.1678 +[26]Shanmugasundaram Venkataraman, Peter Stevenson, Yiya Yang, Lorna Richardson, Nicholas Burton, 1.1679 +Thomas P. Perry, Paul Smith, Richard A. Baldock, Duncan R. Davidson, and Jeffrey H. Christiansen. 1.1680 +EMAGE edinburgh mouse atlas of gene expression: 2008 update. Nucl. Acids Res., 36(suppl_1):D860&#8211; 1.1681 +865, 2008. 1.1682 +[27]Axel Visel, Christina Thaller, and Gregor Eichele. GenePaint.org: an atlas of gene expression patterns in 1.1683 +the mouse embryo. Nucl. Acids Res., 32(suppl_1):D552&#8211;556, 2004. 
1.1684 +[28]Robert H Waterston, Kerstin Lindblad-Toh, Ewan Birney, Jane Rogers, Josep F Abril, Pankaj Agarwal, Richa 1.1685 +Agarwala, Rachel Ainscough, Marina Alexandersson, Peter An, Stylianos E Antonarakis, John Attwood, 1.1686 +Robert Baertsch, Jonathon Bailey, Karen Barlow, Stephan Beck, Eric Berry, Bruce Birren, Toby Bloom, Peer 1.1687 +Bork, Marc Botcherby, Nicolas Bray, Michael R Brent, Daniel G Brown, Stephen D Brown, Carol Bult, John 1.1688 +Burton, Jonathan Butler, Robert D Campbell, Piero Carninci, Simon Cawley, Francesca Chiaromonte, Asif T 1.1689 +Chinwalla, Deanna M Church, Michele Clamp, Christopher Clee, Francis S Collins, Lisa L Cook, Richard R 1.1690 +Copley, Alan Coulson, Olivier Couronne, James Cuff, Val Curwen, Tim Cutts, Mark Daly, Robert David, Joy 1.1691 +Davies, Kimberly D Delehaunty, Justin Deri, Emmanouil T Dermitzakis, Colin Dewey, Nicholas J Dickens, 1.1692 +Mark Diekhans, Sheila Dodge, Inna Dubchak, Diane M Dunn, Sean R Eddy, Laura Elnitski, Richard D Emes, 1.1693 +Pallavi Eswara, Eduardo Eyras, Adam Felsenfeld, Ginger A Fewell, Paul Flicek, Karen Foley, Wayne N 1.1694 +Frankel, Lucinda A Fulton, Robert S Fulton, Terrence S Furey, Diane Gage, Richard A Gibbs, Gustavo 1.1695 +Glusman, Sante Gnerre, Nick Goldman, Leo Goodstadt, Darren Grafham, Tina A Graves, Eric D Green, 1.1696 +Simon Gregory, Roderic Guig, Mark Guyer, Ross C Hardison, David Haussler, Yoshihide Hayashizaki, 1.1697 +LaDeana W Hillier, Angela Hinrichs, Wratko Hlavina, Timothy Holzer, Fan Hsu, Axin Hua, Tim Hubbard, 1.1698 +Adrienne Hunt, Ian Jackson, David B Jaffe, L Steven Johnson, Matthew Jones, Thomas A Jones, Ann Joy, 1.1699 +Michael Kamal, Elinor K Karlsson, Donna Karolchik, Arkadiusz Kasprzyk, Jun Kawai, Evan Keibler, Cristyn 1.1700 +Kells, W James Kent, Andrew Kirby, Diana L Kolbe, Ian Korf, Raju S Kucherlapati, Edward J Kulbokas, David 1.1701 +Kulp, Tom Landers, J P Leger, Steven Leonard, Ivica Letunic, Rosie Levine, Jia Li, Ming Li, Christine Lloyd, 1.1702 
+Susan Lucas, Bin Ma, Donna R Maglott, Elaine R Mardis, Lucy Matthews, Evan Mauceli, John H Mayer, 1.1703 +Megan McCarthy, W Richard McCombie, Stuart McLaren, Kirsten McLay, John D McPherson, Jim Meldrim, 1.1704 +Beverley Meredith, Jill P Mesirov, Webb Miller, Tracie L Miner, Emmanuel Mongin, Kate T Montgomery, 1.1705 +Michael Morgan, Richard Mott, James C Mullikin, Donna M Muzny, William E Nash, Joanne O Nelson, 1.1706 +Michael N Nhan, Robert Nicol, Zemin Ning, Chad Nusbaum, Michael J O&#8217;Connor, Yasushi Okazaki, Karen 1.1707 +Oliver, Emma Overton-Larty, Lior Pachter, Gens Parra, Kymberlie H Pepin, Jane Peterson, Pavel Pevzner, 1.1708 +Robert Plumb, Craig S Pohl, Alex Poliakov, Tracy C Ponce, Chris P Ponting, Simon Potter, Michael Quail, 1.1709 +Alexandre Reymond, Bruce A Roe, Krishna M Roskin, Edward M Rubin, Alistair G Rust, Ralph Santos, 1.1710 +Victor Sapojnikov, Brian Schultz, Jrg Schultz, Matthias S Schwartz, Scott Schwartz, Carol Scott, Steven 1.1711 +Seaman, Steve Searle, Ted Sharpe, Andrew Sheridan, Ratna Shownkeen, Sarah Sims, Jonathan B Singer, 1.1712 +Guy Slater, Arian Smit, Douglas R Smith, Brian Spencer, Arne Stabenau, Nicole Stange-Thomann, Charles 1.1713 +Sugnet, Mikita Suyama, Glenn Tesler, Johanna Thompson, David Torrents, Evanne Trevaskis, John Tromp, 1.1714 +Catherine Ucla, Abel Ureta-Vidal, Jade P Vinson, Andrew C Von Niederhausern, Claire M Wade, Melanie 1.1715 +Wall, Ryan J Weber, Robert B Weiss, Michael C Wendl, Anthony P West, Kris Wetterstrand, Raymond 1.1716 +Wheeler, Simon Whelan, Jamey Wierzbowski, David Willey, Sophie Williams, Richard K Wilson, Eitan Win- 1.1717 +ter, Kim C Worley, Dudley Wyman, Shan Yang, Shiaw-Pyng Yang, Evgeny M Zdobnov, Michael C Zody, and 1.1718 +Eric S Lander. Initial sequencing and comparative analysis of the mouse genome. Nature, 420(6915):520&#8211; 1.1719 +62, December 2002. PMID: 12466850. 1.1720 1.1721