cg
diff grant.html @ 96:3dd9a1a81c23
author | bshanks@bshanks.dyndns.org |
---|---|
date | Wed Apr 22 05:26:06 2009 -0700 (16 years ago) |
parents | a25a60a4bf43 |
children | a75c226cbdd6 |
line diff
1.1 --- a/grant.html Tue Apr 21 18:53:40 2009 -0700
1.2 +++ b/grant.html Wed Apr 22 05:26:06 2009 -0700
1.3 @@ -1,834 +1,938 @@
1.4 Specific aims
1.5 -Massivenew datasets obtained with techniques such as in situ hybridization (ISH), immunohistochemistry, in situ transgenic
1.6 -reporter, microarray voxelation, and others, allow the expression levels of many genes at many locations to be compared.
1.7 -Our goal is to develop automated methods to relate spatial variation in gene expression to anatomy. We want to find marker
1.8 -genes for specific anatomical regions, and also to draw new anatomical maps based on gene expression patterns. We have
1.9 -three specific aims:
1.10 -(1) develop an algorithm to screen spatial gene expression data for combinations of marker genes which selectively target
1.11 -anatomical regions
1.12 -(2) develop an algorithm to suggest new ways of carving up a structure into anatomically distinct regions, based on
1.13 -spatial patterns in gene expression
1.14 -(3) create a 2-D “flat map” dataset of the mouse cerebral cortex that contains a flattened version of the Allen Mouse
1.15 -Brain Atlas ISH data, as well as the boundaries of cortical anatomical areas. This will involve extending the functionality of
1.16 -Caret, an existing open-source scientific imaging program. Use this dataset to validate the methods developed in (1) and (2).
1.17 -Although our particular application involves the 3D spatial distribution of gene expression, we anticipate that the methods
1.18 -developed in aims (1) and (2) will generalize to any sort of high-dimensional data over points located in a low-dimensional
1.19 -space. In particular, our method could be applied to genome-wide sequencing data derived from sets of tissues and disease
1.20 -states.
1.21 -In terms of the application of the methods to cerebral cortex, aim (1) is to go from cortical areas to marker genes,
1.22 -and aim (2) is to let the gene profile define the cortical areas. In addition to validating the usefulness of the algorithms,
1.23 -the application of these methods to cortex will produce immediate benefits, because there are currently no known genetic
1.24 -markers for most cortical areas. The results of the project will support the development of new ways to selectively target
1.25 -cortical areas, and it will support the development of a method for identifying the cortical areal boundaries present in small
1.26 -tissue samples.
1.27 -All algorithms that we develop will be implemented in a GPL open-source software toolkit. The toolkit, as well as the
1.28 -machine-readable datasets developed in aim (3), will be published and freely available for others to use.
1.29 +Massive new datasets obtained with techniques such as in situ hybridization (ISH), immunohistochemistry, in
1.30 +situ transgenic reporter, microarray voxelation, and others, allow the expression levels of many genes at many
1.31 +locations to be compared. Our goal is to develop automated methods to relate spatial variation in gene expres-
1.32 +sion to anatomy. We want to find marker genes for specific anatomical regions, and also to draw new anatomical
1.33 +maps based on gene expression patterns. We have three specific aims:
1.34 +(1) develop an algorithm to screen spatial gene expression data for combinations of marker genes which
1.35 +selectively target anatomical regions
1.36 +(2) develop an algorithm to suggest new ways of carving up a structure into anatomically distinct regions,
1.37 +based on spatial patterns in gene expression
1.38 +(3) create a 2-D “flat map” dataset of the mouse cerebral cortex that contains a flattened version of the Allen
1.39 +Mouse Brain Atlas ISH data, as well as the boundaries of cortical anatomical areas. This will involve extending
1.40 +the functionality of Caret, an existing open-source scientific imaging program. Use this dataset to validate the
1.41 +methods developed in (1) and (2).
1.42 +Although our particular application involves the 3D spatial distribution of gene expression, we anticipate that
1.43 +the methods developed in aims (1) and (2) will generalize to any sort of high-dimensional data over points located
1.44 +in a low-dimensional space. In particular, our method could be applied to genome-wide sequencing data derived
1.45 +from sets of tissues and disease states.
1.46 +In terms of the application of the methods to cerebral cortex, aim (1) is to go from cortical areas to marker
1.47 +genes, and aim (2) is to let the gene profile define the cortical areas. In addition to validating the usefulness
1.48 +of the algorithms, the application of these methods to cortex will produce immediate benefits, because there
1.49 +are currently no known genetic markers for most cortical areas. The results of the project will support the
1.50 +development of new ways to selectively target cortical areas, and it will support the development of a method for
1.51 +identifying the cortical areal boundaries present in small tissue samples.
1.52 +All algorithms that we develop will be implemented in a GPL open-source software toolkit. The toolkit, as well
1.53 +as the machine-readable datasets developed in aim (3), will be published and freely available for others to use.
1.54 The challenge topic
1.55 -This proposal addresses challenge topic 06-HG-101. Massive new datasets obtained with techniques such as in situ hybridiza-
1.56 -tion (ISH), immunohistochemistry, in situ transgenic reporter, microarray voxelation, and others, allow the expression levels
1.57 -of many genes at many locations to be compared. Our goal is to develop automated methods to relate spatial variation in
1.58 -gene expression to anatomy. We want to find marker genes for specific anatomical regions, and also to draw new anatomical
1.59 -maps based on gene expression patterns.
1.60 +This proposal addresses challenge topic 06-HG-101. Massive new datasets obtained with techniques such as
1.61 +in situ hybridization (ISH), immunohistochemistry, in situ transgenic reporter, microarray voxelation, and others,
1.62 +allow the expression levels of many genes at many locations to be compared. Our goal is to develop automated
1.63 +methods to relate spatial variation in gene expression to anatomy. We want to find marker genes for specific
1.64 +anatomical regions, and also to draw new anatomical maps based on gene expression patterns.
1.65 The Challenge and Potential impact
1.66 -Each of our three aims will be discussed in turn. For each aim, we will develop a conceptual framework for thinking about
1.67 -the task, and we will present our strategy for solving it. Next we will discuss related work. At the conclusion of each section,
1.68 -we will summarize why our strategy is different from what has been done before. At the end of this section, we will describe
1.69 -the potential impact.
1.70 +Each of our three aims will be discussed in turn. For each aim, we will develop a conceptual framework for
1.71 +thinking about the task, and we will present our strategy for solving it. Next we will discuss related work. At the
1.72 +conclusion of each section, we will summarize why our strategy is different from what has been done before. At
1.73 +the end of this section, we will describe the potential impact.
1.74 Aim 1: Given a map of regions, find genes that mark the regions
1.75 -Machine learning terminology: classifiers The task of looking for marker genes for known anatomical regions means
1.76 -that one is looking for a set of genes such that, if the expression level of those genes is known, then the locations of the
1.77 -regions can be inferred.
1.78 -If we define the regions so that they cover the entire anatomical structure to be subdivided, we may say that we are
1.79 -using gene expression in each voxel to assign that voxel to the proper area. We call this a classification task, because each
1.80 -voxel is being assigned to a class (namely, its region). An understanding of the relationship between the combination of
1.81 -their expression levels and the locations of the regions may be expressed as a function. The input to this function is a voxel,
1.82 -along with the gene expression levels within that voxel; the output is the regional identity of the target voxel, that is, the
1.83 -region to which the target voxel belongs. We call this function a classifier. In general, the input to a classifier is called an
1.84 -instance, and the output is called a label (or a class label).
1.85 -The object of aim 1 is not to produce a single classifier, but rather to develop an automated method for determining a
1.86 -classifier for any known anatomical structure. Therefore, we seek a procedure by which a gene expression dataset may be
1.87 -analyzed in concert with an anatomical atlas in order to produce a classifier. The initial gene expression dataset used in
1.88 -the construction of the classifier is called training data. In the machine learning literature, this sort of procedure may be
1.89 -thought of as a supervised learning task, defined as a task in which the goal is to learn a mapping from instances to labels,
1.90 -and the training data consists of a set of instances (voxels) for which the labels (regions) are known.
1.91 -Each gene expression level is called a feature, and the selection of which genes1 to include is called feature selection.
1.92 -Feature selection is one component of the task of learning a classifier. Some methods for learning classifiers start out with
1.93 -a separate feature selection phase, whereas other methods combine feature selection with other aspects of training.
1.94 -One class of feature selection methods assigns some sort of score to each candidate gene. The top-ranked genes are then
1.95 -chosen. Some scoring measures can assign a score to a set of selected genes, not just to a single gene; in this case, a dynamic
1.96 -procedure may be used in which features are added and subtracted from the selected set depending on how much they raise
1.97 -the score. Such procedures are called “stepwise” or “greedy”.
1.98 -Although the classifier itself may only look at the gene expression data within each voxel before classifying that voxel, the
1.99 -algorithm which constructs the classifier may look over the entire dataset. We can categorize score-based feature selection
1.100 -methods depending on how the score of calculated. Often the score calculation consists of assigning a sub-score to each voxel,
1.101 -and then aggregating these sub-scores into a final score (the aggregation is often a sum or a sum of squares or average). If
1.102 -only information from nearby voxels is used to calculate a voxel’s sub-score, then we say it is a local scoring method. If only
1.103 -information from the voxel itself is used to calculate a voxel’s sub-score, then we say it is a pointwise scoring method.
1.104 -Both gene expression data and anatomical atlases have errors, due to a variety of factors. Individual subjects have
1.105 -idiosyncratic anatomy. Subjects may be improperly registred to the atlas. The method used to measure gene expression
1.106 -may be noisy. The atlas may have errors. It is even possible that some areas in the anatomical atlas are “wrong” in that
1.107 -they do not have the same shape as the natural domains of gene expression to which they correspond. These sources of error
1.108 -can affect the displacement and the shape of both the gene expression data and the anatomical target areas. Therefore, it
1.109 -is important to use feature selection methods which are robust to these kinds of errors.
1.110 +Machine learning terminology: classifiers The task of looking for marker genes for known anatomical regions
1.111 +means that one is looking for a set of genes such that, if the expression level of those genes is known, then the
1.112 +locations of the regions can be inferred.
1.113 +If we define the regions so that they cover the entire anatomical structure to be subdivided, we may say that
1.114 +we are using gene expression in each voxel to assign that voxel to the proper area. We call this a classification
1.115 +task, because each voxel is being assigned to a class (namely, its region). An understanding of the relationship
1.116 +between the combination of these genes’ expression levels and the locations of the regions may be expressed as a
1.117 +function. The input to this function is a voxel, along with the gene expression levels within that voxel; the output is
1.118 +the regional identity of the target voxel, that is, the region to which the target voxel belongs. We call this function
1.119 +a classifier. In general, the input to a classifier is called an instance, and the output is called a label (or a class
1.120 +label).
1.121 +The object of aim 1 is not to produce a single classifier, but rather to develop an automated method for
1.122 +determining a classifier for any known anatomical structure. Therefore, we seek a procedure by which a gene
1.123 +expression dataset may be analyzed in concert with an anatomical atlas in order to produce a classifier. The
1.124 +initial gene expression dataset used in the construction of the classifier is called training data. In the machine
1.125 +learning literature, this sort of procedure may be thought of as a supervised learning task, defined as a task in
1.126 +which the goal is to learn a mapping from instances to labels, and the training data consists of a set of instances
1.127 +(voxels) for which the labels (regions) are known.
1.128 +Each gene expression level is called a feature, and the selection of which genes1 to include is called feature
1.129 +selection. Feature selection is one component of the task of learning a classifier. Some methods for learning
1.130 +classifiers start out with a separate feature selection phase, whereas other methods combine feature selection
1.131 +with other aspects of training.
1.132 +One class of feature selection methods assigns some sort of score to each candidate gene. The top-ranked
1.133 +genes are then chosen. Some scoring measures can assign a score to a set of selected genes, not just to a
1.134 +single gene; in this case, a dynamic procedure may be used in which features are added and subtracted from the
1.135 +selected set depending on how much they raise the score. Such procedures are called “stepwise” or “greedy”.
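The stepwise procedure described above can be sketched as follows. This is a minimal, forward-only illustration (genes are only added, never removed), not the method proposed in the grant; the function name `greedy_select`, the toy `coverage` table, and the toy set score with its 0.5-per-gene penalty are all hypothetical.

```python
from typing import Callable, List, Sequence, Tuple


def greedy_select(
    candidates: Sequence[str],
    score: Callable[[Tuple[str, ...]], float],
    max_genes: int,
) -> List[str]:
    """Forward stepwise selection: repeatedly add the gene that raises
    the set score the most, and stop when no gene raises it."""
    selected: List[str] = []
    current = float("-inf")
    while len(selected) < max_genes:
        best_gene, best_score = None, current
        for gene in candidates:
            if gene in selected:
                continue
            s = score(tuple(selected + [gene]))
            if s > best_score:
                best_gene, best_score = gene, s
        if best_gene is None:  # no remaining gene improves the score
            break
        selected.append(best_gene)
        current = best_score
    return selected


# Toy set score (an assumption for illustration): each hypothetical gene
# "covers" some pixels of a region; reward coverage, penalize set size.
coverage = {"A": {1, 2, 3}, "B": {3, 4}, "C": {1, 2}}


def toy_score(genes: Tuple[str, ...]) -> float:
    covered = set().union(*(coverage[g] for g in genes))
    return len(covered) - 0.5 * len(genes)


chosen = greedy_select(list(coverage), toy_score, max_genes=3)
```

A full stepwise method would also try removing already-selected genes at each step; the score function is passed in so that any set-level scoring measure can be substituted.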
1.136 +Although the classifier itself may only look at the gene expression data within each voxel before classifying
1.137 +that voxel, the algorithm which constructs the classifier may look over the entire dataset. We can categorize
1.138 +score-based feature selection methods depending on how the score is calculated. Often the score calculation
1.139 +consists of assigning a sub-score to each voxel, and then aggregating these sub-scores into a final score (the
1.140 +aggregation is often a sum or a sum of squares or average). If only information from nearby voxels is used to
1.141 +calculate a voxel’s sub-score, then we say it is a local scoring method. If only information from the voxel itself is
1.142 +used to calculate a voxel’s sub-score, then we say it is a pointwise scoring method.
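To make the distinction concrete, here is a small sketch of one pointwise score and one local score over a 2-D grid. Both scoring functions are assumptions chosen for illustration: the pointwise score is a negated sum of squared differences, and the local score compares finite-difference gradients. The local score is not claimed to be the gradient similarity measure referred to elsewhere in this proposal.

```python
from typing import List

Grid = List[List[float]]


def pointwise_score(expr: Grid, mask: Grid) -> float:
    """Each pixel's sub-score uses only that pixel's own values;
    sub-scores are aggregated by summation."""
    return -sum(
        (e - m) ** 2
        for expr_row, mask_row in zip(expr, mask)
        for e, m in zip(expr_row, mask_row)
    )


def local_score(expr: Grid, mask: Grid) -> float:
    """Each pixel's sub-score also uses its right and lower neighbors:
    it compares finite-difference gradients of the two images."""
    total = 0.0
    rows, cols = len(expr), len(expr[0])
    for r in range(rows - 1):
        for c in range(cols - 1):
            dx_e = expr[r][c + 1] - expr[r][c]
            dy_e = expr[r + 1][c] - expr[r][c]
            dx_m = mask[r][c + 1] - mask[r][c]
            dy_m = mask[r + 1][c] - mask[r][c]
            total -= (dx_e - dx_m) ** 2 + (dy_e - dy_m) ** 2
    return total


# A region mask, and an expression image equal to the mask plus a
# uniform background offset of 0.5.
mask = [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]
expr = [[v + 0.5 for v in row] for row in mask]

p = pointwise_score(expr, mask)  # penalized by the offset at every pixel
q = local_score(expr, mask)      # unaffected: the gradients are identical
```

In this toy case the pointwise score penalizes the uniform offset while the local score does not, which is one way the two kinds of score can carry different information.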
1.143 +_________________________________________
1.144 + 1Strictly speaking, the features are gene expression levels, but we’ll call them genes.
1.145 +Both gene expression data and anatomical atlases have errors, due to a variety of factors. Individual subjects
1.146 +have idiosyncratic anatomy. Subjects may be improperly registered to the atlas. The method used to measure
1.147 +gene expression may be noisy. The atlas may have errors. It is even possible that some areas in the anatomical
1.148 +atlas are “wrong” in that they do not have the same shape as the natural domains of gene expression to which
1.149 +they correspond. These sources of error can affect the displacement and the shape of both the gene expression
1.150 +data and the anatomical target areas. Therefore, it is important to use feature selection methods which are
1.151 +robust to these kinds of errors.
1.152 Our strategy for Aim 1
1.153 -Key questions when choosing a learning method are: What are the instances? What are the features? How are the features
1.154 -chosen? Here are four principles that outline our answers to these questions.
1.155 -_________________________________________
1.156 - 1Strictly speaking, the features are gene expression levels, but we’ll call them genes.
1.157 +Key questions when choosing a learning method are: What are the instances? What are the features? How are
1.158 +the features chosen? Here are four principles that outline our answers to these questions.
1.159 Principle 1: Combinatorial gene expression
1.160 -It istoo much to hope that every anatomical region of interest will be identified by a single gene. For example, in the
1.161 -cortex, there are some areas which are not clearly delineated by any gene included in the Allen Brain Atlas (ABA) dataset.
1.162 -However, at least some of these areas can be delineated by looking at combinations of genes (an example of an area for
1.163 -which multiple genes are necessary and sufficient is provided in Preliminary Studies, Figure 4). Therefore, each instance
1.164 -should contain multiple features (genes).
1.165 +It is too much to hope that every anatomical region of interest will be identified by a single gene. For example,
1.166 +in the cortex, there are some areas which are not clearly delineated by any gene included in the Allen Brain Atlas
1.167 +(ABA) dataset. However, at least some of these areas can be delineated by looking at combinations of genes
1.168 +(an example of an area for which multiple genes are necessary and sufficient is provided in Preliminary Studies,
1.169 +Figure 4). Therefore, each instance should contain multiple features (genes).
1.170 Principle 2: Only look at combinations of small numbers of genes
1.171 -When the classifier classifies a voxel, it is only allowed to look at the expression of the genes which have been selected
1.172 -as features. The more data that are available to a classifier, the better that it can do. For example, perhaps there are weak
1.173 -correlations over many genes that add up to a strong signal. So, why not include every gene as a feature? The reason is that
1.174 -we wish to employ the classifier in situations in which it is not feasible to gather data about every gene. For example, if we
1.175 -want to use the expression of marker genes as a trigger for some regionally-targeted intervention, then our intervention must
1.176 -contain a molecular mechanism to check the expression level of each marker gene before it triggers. It is currently infeasible
1.177 -to design a molecular trigger that checks the level of more than a handful of genes. Similarly, if the goal is to develop a
1.178 -procedure to do ISH on tissue samples in order to label their anatomy, then it is infeasible to label more than a few genes.
1.179 -Therefore, we must select only a few genes as features.
1.180 -The requirement to find combinations of only a small number of genes limits us from straightforwardly applying many
1.181 -of the most simple techniques from the field of supervised machine learning. In the parlance of machine learning, our task
1.182 -combines feature selection with supervised learning.
1.183 +When the classifier classifies a voxel, it is only allowed to look at the expression of the genes which have
1.184 +been selected as features. The more data that are available to a classifier, the better it can do. For example,
1.185 +perhaps there are weak correlations over many genes that add up to a strong signal. So, why not include every
1.186 +gene as a feature? The reason is that we wish to employ the classifier in situations in which it is not feasible to
1.187 +gather data about every gene. For example, if we want to use the expression of marker genes as a trigger for
1.188 +some regionally-targeted intervention, then our intervention must contain a molecular mechanism to check the
1.189 +expression level of each marker gene before it triggers. It is currently infeasible to design a molecular trigger that
1.190 +checks the level of more than a handful of genes. Similarly, if the goal is to develop a procedure to do ISH on
1.191 +tissue samples in order to label their anatomy, then it is infeasible to label more than a few genes. Therefore, we
1.192 +must select only a few genes as features.
1.193 +The requirement to find combinations of only a small number of genes prevents us from straightforwardly
1.194 +applying many of the simplest techniques from the field of supervised machine learning. In the parlance of
1.195 +machine learning, our task combines feature selection with supervised learning.
1.196 Principle 3: Use geometry in feature selection
1.197 -When doing feature selection with score-based methods, the simplest thing to do would be to score the performance of
1.198 -each voxel by itself and then combine these scores (pointwise scoring). A more powerful approach is to also use information
1.199 -about the geometric relations between each voxel and its neighbors; this requires non-pointwise, local scoring methods. See
1.200 -Preliminary Studies, figure 3 for evidence of the complementary nature of pointwise and local scoring methods.
1.201 +When doing feature selection with score-based methods, the simplest thing to do would be to score the per-
1.202 +formance of each voxel by itself and then combine these scores (pointwise scoring). A more powerful approach
1.203 +is to also use information about the geometric relations between each voxel and its neighbors; this requires
1.204 +non-pointwise, local scoring methods. See Preliminary Studies, Figure 3 for evidence of the complementary nature of
1.205 +pointwise and local scoring methods.
1.206 Principle 4: Work in 2-D whenever possible
1.207 -There are many anatomical structures which are commonly characterized in terms of a two-dimensional manifold. When
1.208 -it is known that the structure that one is looking for is two-dimensional, the results may be improved by allowing the analysis
1.209 -algorithm to take advantage of this prior knowledge. In addition, it is easier for humans to visualize and work with 2-D
1.210 -data. Therefore, when possible, the instances should represent pixels, not voxels.
1.211 +There are many anatomical structures which are commonly characterized in terms of a two-dimensional
1.212 +manifold. When it is known that the structure that one is looking for is two-dimensional, the results may be
1.213 +improved by allowing the analysis algorithm to take advantage of this prior knowledge. In addition, it is easier for
1.214 +humans to visualize and work with 2-D data. Therefore, when possible, the instances should represent pixels,
1.215 +not voxels.
1.216 Related work
1.217 -There is a substantial body of work on the analysis of gene expression data, most of this concerns gene expression data
1.218 -which are not fundamentally spatial2.
1.219 -As noted above, there has been much work on both supervised learning and there are many available algorithms for
1.220 -each. However, the algorithms require the scientist to provide a framework for representing the problem domain, and the
1.221 -way that this framework is set up has a large impact on performance. Creating a good framework can require creatively
1.222 -reconceptualizing the problem domain, and is not merely a mechanical “fine-tuning” of numerical parameters. For example,
1.223 -we believe that domain-specific scoring measures (such as gradient similarity, which is discussed in Preliminary Studies) may
1.224 -be necessary in order to achieve the best results in this application.
1.225 -We are aware of six existing efforts to find marker genes using spatial gene expression data using automated methods.
1.226 -[12 ] mentions the possibility of constructing a spatial region for each gene, and then, for each anatomical structure of
1.227 -interest, computing what proportion of this structure is covered by the gene’s spatial region.
1.228 -GeneAtlas[5] and EMAGE [25] allow the user to construct a search query by demarcating regions and then specifing
1.229 -either the strength of expression or the name of another gene or dataset whose expression pattern is to be matched. For the
1.230 -similiarity score (match score) between two images (in this case, the query and the gene expression images), GeneAtlas uses
1.231 -the sum of a weighted L1-norm distance between vectors whose components represent the number of cells within a pixel3
1.232 -whose expression is within four discretization levels. EMAGE uses Jaccard similarity4. Neither GeneAtlas nor EMAGE
1.233 -allow one to search for combinations of genes that define a region in concert but not separately.
1.234 -[14 ] describes AGEA, ”Anatomic Gene Expression Atlas”. AGEA has three components. Gene Finder: The user
1.235 -selects a seed voxel and the system (1) chooses a cluster which includes the seed voxel, (2) yields a list of genes which are
1.236 -overexpressed in that cluster. (note: the ABA website also contains pre-prepared lists of overexpressed genes for selected
1.237 -structures). Correlation: The user selects a seed voxel and the system then shows the user how much correlation there is
1.238 -between the gene expression profile of the seed voxel and every other voxel. Clusters: will be described later
1.239 -_________________________________________
1.240 - 2By “fundamentally spatial” we mean that there is information from a large number of spatial locations indexed by spatial coordinates; not
1.241 -just data which have only a few different locations or which is indexed by anatomical label.
1.242 - 3Actually, many of these projects use quadrilaterals instead of square pixels; but we will refer to them as pixels for simplicity.
1.243 - 4the number of true pixels in the intersection of the two images, divided by the number of pixels in their union.
1.244 -Gene Finder is different from our Aim 1 in at least three ways. First, Gene Finder finds only single genes, whereas we
1.245 -will also look for combinations of genes. Second, gene finder can only use overexpression as a marker, whereas we will also
1.246 -search for underexpression. Third, Gene Finder uses a simple pointwise score5, whereas we will also use geometric scores
1.247 -such as gradient similarity (described in Preliminary Studies). Figures 4, 2, and 3 in the Preliminary Studies section contains
1.248 -evidence that each of our three choices is the right one.
1.249 -[6 ] looks at the mean expression level of genes within anatomical regions, and applies a Student’s t-test with Bonferroni
1.250 -correction to determine whether the mean expression level of a gene is significantly higher in the target region. Like AGEA,
1.251 -this is a pointwise measure (only the mean expression level per pixel is being analyzed), it is not being used to look for
1.252 -underexpression, and does not look for combinations of genes.
1.253 -[10 ] describes a technique to find combinations of marker genes to pick out an anatomical region. They use an evolutionary
1.254 -algorithm to evolve logical operators which combine boolean (thresholded) images in order to match a target image. Their
1.255 -match score is Jaccard similarity.
1.256 -In summary, there has been fruitful work on finding marker genes, but only one of the previous projects explores
1.257 -combinations of marker genes, and none of these publications compare the results obtained by using different algorithms or
1.258 -scoring methods.
1.259 +There is a substantial body of work on the analysis of gene expression data; most of it concerns gene
1.260 +expression data which are not fundamentally spatial2.
1.261 +As noted above, there has been much work on supervised learning, and there are many available
1.262 +algorithms for it. However, the algorithms require the scientist to provide a framework for representing the
1.263 +problem domain, and the way that this framework is set up has a large impact on performance. Creating a
1.264 +good framework can require creatively reconceptualizing the problem domain, and is not merely a mechanical
1.265 +“fine-tuning” of numerical parameters. For example, we believe that domain-specific scoring measures (such
1.266 +as gradient similarity, which is discussed in Preliminary Studies) may be necessary in order to achieve the best
1.267 +results in this application.
1.268 +We are aware of six existing efforts to find marker genes in spatial gene expression data using automated
1.269 +methods.
1.270 +[13 ] mentions the possibility of constructing a spatial region for each gene, and then, for each anatomical
1.271 +structure of interest, computing what proportion of this structure is covered by the gene’s spatial region.
1.272 +GeneAtlas [5] and EMAGE [26] allow the user to construct a search query by demarcating regions and then
1.273 +specifying either the strength of expression or the name of another gene or dataset whose expression pattern
1.274 +is to be matched. For the similarity score (match score) between two images (in this case, the query and the
1.275 +gene expression images), GeneAtlas uses the sum of a weighted L1-norm distance between vectors whose
1.276 +components represent the number of cells within a pixel3 whose expression is within four discretization levels.
1.277 +EMAGE uses Jaccard similarity4. Neither GeneAtlas nor EMAGE allows one to search for combinations of genes
1.278 +that define a region in concert but not separately.
1.279 +[15 ] describes AGEA, the “Anatomic Gene Expression Atlas”. AGEA has three components. Gene Finder: The
1.280 +user selects a seed voxel and the system (1) chooses a cluster which includes the seed voxel, and (2) yields a list
1.281 +of genes which are overexpressed in that cluster. (Note: the ABA website also contains pre-prepared lists of
1.282 +overexpressed genes for selected structures.) Correlation: The user selects a seed voxel and the system then
1.283 +shows the user how much correlation there is between the gene expression profile of the seed voxel and every
1.284 +other voxel. Clusters: will be described later.
1.285 +Gene Finder is different from our Aim 1 in at least three ways. First, Gene Finder finds only single genes,
1.286 +whereas we will also look for combinations of genes. Second, Gene Finder can only use overexpression as a
1.287 +marker, whereas we will also search for underexpression. Third, Gene Finder uses a simple pointwise score5,
1.288 +whereas we will also use geometric scores such as gradient similarity (described in Preliminary Studies). Figures
1.289 +4, 2, and 3 in the Preliminary Studies section contain evidence that each of our three choices is the right one.
1.290 +[6 ] looks at the mean expression level of genes within anatomical regions, and applies a Student’s t-test
1.291 +with Bonferroni correction to determine whether the mean expression level of a gene is significantly higher in
1.292 +the target region. Like AGEA, this is a pointwise measure (only the mean expression level per pixel is being
1.293 +analyzed); it is not being used to look for underexpression, and it does not look for combinations of genes.
1.294 +[10 ] describes a technique to find combinations of marker genes to pick out an anatomical region. They use
1.295 +an evolutionary algorithm to evolve logical operators which combine boolean (thresholded) images in order to
1.296 +match a target image. Their match score is Jaccard similarity.
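As an illustration of the match score mentioned above, the following sketch computes Jaccard similarity between boolean images (true pixels in the intersection divided by true pixels in the union) and shows a case where a logical AND of two thresholded images matches a target better than either image alone. The helper names, the threshold value, and the toy images are hypothetical; this is not the evolutionary algorithm of [10], only the scoring step.

```python
from typing import List

BoolImage = List[List[bool]]


def threshold(image: List[List[float]], cutoff: float) -> BoolImage:
    """Turn an expression image into a boolean image."""
    return [[value >= cutoff for value in row] for row in image]


def logical_and(a: BoolImage, b: BoolImage) -> BoolImage:
    """Pixelwise AND of two boolean images of the same shape."""
    return [[x and y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def jaccard(a: BoolImage, b: BoolImage) -> float:
    """True pixels in the intersection divided by true pixels in the union."""
    inter = sum(x and y for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    union = sum(x or y for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    return inter / union if union else 1.0  # two empty images match trivially


# Two hypothetical genes whose expression domains overlap only in the
# middle column, which is the target region.
gene1 = threshold([[0.9, 0.8, 0.1], [0.7, 0.9, 0.0]], cutoff=0.5)
gene2 = threshold([[0.1, 0.9, 0.8], [0.2, 0.6, 0.9]], cutoff=0.5)
target = [[False, True, False], [False, True, False]]

alone = jaccard(gene1, target)
combined = jaccard(logical_and(gene1, gene2), target)
```

Here each gene alone covers the target plus an extra column, so its Jaccard similarity is 0.5, while the AND of the two matches the target exactly.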
1.297 +In summary, there has been fruitful work on finding marker genes, but only one of the previous projects
1.298 +explores combinations of marker genes, and none of these publications compare the results obtained by using
1.299 +different algorithms or scoring methods.
1.300 Aim 2: From gene expression data, discover a map of regions
1.301 Machine learning terminology: clustering
1.302 -If one is given a dataset consisting merely of instances, with no class labels, then analysis of the dataset is referred to as
1.303 -unsupervised learning in the jargon of machine learning. One thing that you can do with such a dataset is to group instances
1.304 -together. A set of similar instances is called a cluster, and the activity of finding grouping the data into clusters is called
1.305 -clustering or cluster analysis.
1.306 -The task of deciding how to carve up a structure into anatomical regions can be put into these terms. The instances
1.307 -are once again voxels (or pixels) along with their associated gene expression profiles. We make the assumption that voxels
1.308 -from the same anatomical region have similar gene expression profiles, at least compared to the other regions. This means
1.309 -that clustering voxels is the same as finding potential regions; we seek a partitioning of the voxels into regions, that is, into
1.310 -clusters of voxels with similar gene expression.
1.311 -It is desirable to determine not just one set of regions, but also how these regions relate to each other. The outcome of
1.312 -clustering may be a hierarchial tree of clusters, rather than a single set of clusters which partition the voxels. This is called
1.313 -hierarchial clustering.
1.314 -Similarity scores A crucial choice when designing a clustering method is how to measure similarity, across either pairs
1.315 -of instances, or clusters, or both. There is much overlap between scoring methods for feature selection (discussed above
1.316 -under Aim 1) and scoring methods for similarity.
1.317 -Spatially contiguous clusters; image segmentation We have shown that aim 2 is a type of clustering task. In fact,
1.318 -it is a special type of clustering task because we have an additional constraint on clusters; voxels grouped together into a
1.319 -cluster must be spatially contiguous. In Preliminary Studies, we show that one can get reasonable results without enforcing
1.320 -this constraint; however, we plan to compare these results against other methods which guarantee contiguous clusters.
1.321 -Image segmentation is the task of partitioning the pixels in a digital image into clusters, usually contiguous clusters. Aim
1.322 -2 is similar to an image segmentation task. There are two main differences; in our task, there are thousands of color channels
1.323 -(one for each gene), rather than just three6. A more crucial difference is that there are various cues which are appropriate
1.324 -for detecting sharp object boundaries in a visual scene but which are not appropriate for segmenting abstract spatial data
1.325 -such as gene expression. Although many image segmentation algorithms can be expected to work well for segmenting other
1.326 -sorts of spatially arranged data, some of these algorithms are specialized for visual images.
1.327 -Dimensionality reduction In this section, we discuss reducing the length of the per-pixel gene expression feature
1.328 -vector. By “dimension”, we mean the dimension of this vector, not the spatial dimension of the underlying data.
1.329 -Unlike aim 1, there is no externally-imposed need to select only a handful of informative genes for inclusion in the
1.330 -instances. However, some clustering algorithms perform better on small numbers of features7. There are techniques which
1.331 -“summarize” a larger number of features using a smaller number of features; these techniques go by the name of feature
1.332 -extraction or dimensionality reduction. The small set of features that such a technique yields is called the reduced feature
1.333 -set. Note that the features in the reduced feature set do not necessarily correspond to genes; each feature in the reduced set
1.334 -may be any function of the set of gene expression levels.
1.335 +2By “fundamentally spatial” we mean that there is information from a large number of spatial locations indexed by spatial coordinates;
1.336 +not just data which have only a few different locations or which is indexed by anatomical label.
1.337 +3Actually, many of these projects use quadrilaterals instead of square pixels; but we will refer to them as pixels for simplicity.
1.338 +4the number of true pixels in the intersection of the two images, divided by the number of pixels in their union.
1.339 +5“Expression energy ratio”, which captures overexpression.
1.340 +If one is given a dataset consisting merely of instances, with no class labels, then analysis of the dataset is
1.341 +referred to as unsupervised learning in the jargon of machine learning. One thing that you can do with such a
1.342 +dataset is to group instances together. A set of similar instances is called a cluster, and the activity of
1.343 +grouping the data into clusters is called clustering or cluster analysis.
1.344 +The task of deciding how to carve up a structure into anatomical regions can be put into these terms. The
1.345 +instances are once again voxels (or pixels) along with their associated gene expression profiles. We make
1.346 +the assumption that voxels from the same anatomical region have similar gene expression profiles, at least
1.347 +compared to the other regions. This means that clustering voxels is the same as finding potential regions; we
1.348 +seek a partitioning of the voxels into regions, that is, into clusters of voxels with similar gene expression.
1.349 +It is desirable to determine not just one set of regions, but also how these regions relate to each other. The
1.350 +outcome of clustering may be a hierarchical tree of clusters, rather than a single set of clusters which partition the
1.351 +voxels. This is called hierarchical clustering.
1.352 +Similarity scores A crucial choice when designing a clustering method is how to measure similarity, across
1.353 +either pairs of instances, or clusters, or both. There is much overlap between scoring methods for feature
1.354 +selection (discussed above under Aim 1) and scoring methods for similarity.
1.355 +Spatially contiguous clusters; image segmentation We have shown that aim 2 is a type of clustering
1.356 +task. In fact, it is a special type of clustering task because we have an additional constraint on clusters: voxels
1.357 +grouped together into a cluster must be spatially contiguous. In Preliminary Studies, we show that one can get
1.358 +reasonable results without enforcing this constraint; however, we plan to compare these results against other
1.359 +methods which guarantee contiguous clusters.
1.360 +Image segmentation is the task of partitioning the pixels in a digital image into clusters, usually contiguous
1.361 +clusters. Aim 2 is similar to an image segmentation task. There are two main differences. First, in our task there are
1.362 +thousands of color channels (one for each gene), rather than just three6. A more crucial difference is that there
1.363 +are various cues which are appropriate for detecting sharp object boundaries in a visual scene but which are not
1.364 +appropriate for segmenting abstract spatial data such as gene expression. Although many image segmentation
1.365 +algorithms can be expected to work well for segmenting other sorts of spatially arranged data, some of these
1.366 +algorithms are specialized for visual images.
1.367 +Dimensionality reduction In this section, we discuss reducing the length of the per-pixel gene expression
1.368 +feature vector. By “dimension”, we mean the dimension of this vector, not the spatial dimension of the underlying
1.369 +data.
1.370 +Unlike aim 1, there is no externally-imposed need to select only a handful of informative genes for inclusion
1.371 +in the instances. However, some clustering algorithms perform better on small numbers of features7. There are
1.372 +techniques which “summarize” a larger number of features using a smaller number of features; these techniques
1.373 +go by the name of feature extraction or dimensionality reduction. The small set of features that such a technique
1.374 +yields is called the reduced feature set. Note that the features in the reduced feature set do not necessarily
1.375 +correspond to genes; each feature in the reduced set may be any function of the set of gene expression levels.
1.376 +Clustering genes rather than voxels Although the ultimate goal is to cluster the instances (voxels or pixels),
1.377 +one strategy to achieve this goal is to first cluster the features (genes). There are two ways that clusters of genes
1.378 +could be used.
1.379 +Gene clusters could be used as part of dimensionality reduction: rather than have one feature for each gene,
1.380 +we could have one reduced feature for each gene cluster.
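As a minimal sketch of this idea, the snippet below collapses a per-voxel gene expression profile into one value per gene cluster. Taking the mean over the genes in each cluster is an assumption made for illustration (the text does not specify how a cluster would be summarized), and the gene names and cluster assignment are hypothetical.

```python
def reduce_features(profile, gene_clusters):
    """Map a {gene: expression} profile to one mean value per gene cluster."""
    return [
        sum(profile[gene] for gene in cluster) / len(cluster)
        for cluster in gene_clusters
    ]

# Hypothetical voxel profile over three genes, grouped into two gene clusters.
voxel_profile = {"geneA": 1.0, "geneB": 3.0, "geneC": 10.0}
gene_clusters = [["geneA", "geneB"], ["geneC"]]

reduced = reduce_features(voxel_profile, gene_clusters)
```

The reduced feature vector has one entry per gene cluster rather than one per gene, so its length is set by the number of clusters chosen.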
1.381 +Gene clusters could also be used to directly yield a clustering on instances. This is because many genes
1.382 +have an expression pattern which seems to pick out a single, spatially contiguous region. Therefore, it seems
1.383 +likely that an anatomically interesting region will have multiple genes which each individually pick it out8. This
1.384 _________________________________________
1.385 - 5“Expression energy ratio”, which captures overexpression.
1.386 - 6There are imaging tasks which use more than three colors, for example multispectral imaging and hyperspectral imaging, which are often
1.387 -used to process satellite imagery.
1.388 - 7First, because the number of features in the reduced dataset is less than in the original dataset, the running time of clustering algorithms
1.389 -may be much less. Second, it is thought that some clustering algorithms may give better results on reduced data.
1.390 -Clustering genes rather than voxels Although the ultimate goal is to cluster the instances (voxels or pixels), one
1.391 -strategy to achieve this goal is to first cluster the features (genes). There are two ways that clusters of genes could be used.
1.392 -Gene clusters could be used as part of dimensionality reduction: rather than have one feature for each gene, we could
1.393 -have one reduced feature for each gene cluster.
1.394 -Gene clusters could also be used to directly yield a clustering on instances. This is because many genes have an expression
1.395 -pattern which seems to pick out a single, spatially continguous region. Therefore, it seems likely that an anatomically
1.396 -interesting region will have multiple genes which each individually pick it out8. This suggests the following procedure:
1.397 -cluster together genes which pick out similar regions, and then to use the more popular common regions as the final clusters.
1.398 -In Preliminary Studies, Figure 7, we show that a number of anatomically recognized cortical regions, as well as some
1.399 -“superregions” formed by lumping together a few regions, are associated with gene clusters in this fashion.
1.400 -The task of clustering both the instances and the features is called co-clustering, and there are a number of co-clustering
1.401 -algorithms.
1.402 + 6There are imaging tasks which use more than three colors, for example multispectral imaging and hyperspectral imaging, which are
1.403 +often used to process satellite imagery.
1.404 + 7First, because the number of features in the reduced dataset is less than in the original dataset, the running time of clustering
1.405 +algorithms may be much less. Second, it is thought that some clustering algorithms may give better results on reduced data.
1.406 + 8This would seem to contradict our finding in aim 1 that some cortical areas are combinatorially coded by multiple genes. However,
1.407 +it is possible that the currently accepted cortical maps divide the cortex into regions which are unnatural from the point of view of gene
1.408 +expression; perhaps there is some other way to map the cortex for which each region can be identified by single genes. Another
1.409 +suggests the following procedure: cluster together genes which pick out similar regions, and then use the
1.410 +more popular common regions as the final clusters. In Preliminary Studies, Figure 7, we show that a number
1.411 +of anatomically recognized cortical regions, as well as some “superregions” formed by lumping together a few
1.412 +regions, are associated with gene clusters in this fashion.
1.413 +The task of clustering both the instances and the features is called co-clustering, and there are a number of
1.414 +co-clustering algorithms.
1.415 Related work
1.416 -Some researchers have attempted to parcellate cortex on the basis of non-gene expression data. For example, [17], [2], [18],
1.417 -and [1 ] associate spots on the cortex with the radial profile9 of response to some stain ([11] uses MRI), extract features from
1.418 -this profile, and then use similarity between surface pixels to cluster. Features used include statistical moments, wavelets,
1.419 -and the excess mass functional. Some of these features are motivated by the presence of tangential lines of stain intensity
1.420 -which correspond to laminar structure. Some methods use standard clustering procedures, whereas others make use of the
1.421 -spatial nature of the data to look for sudden transitions, which are identified as areal borders.
1.422 -[22 ] describes an analysis of the anatomy of the hippocampus using the ABA dataset. In addition to manual analysis,
1.423 -two clustering methods were employed, a modified Non-negative Matrix Factorization (NNMF), and a hierarchial recursive
1.424 -bifurcation clustering scheme based on correlation as the similarity score. The paper yielded impressive results, proving
1.425 -the usefulness of computational genomic anatomy. We have run NNMF on the cortical dataset10 and while the results are
1.426 -promising, they also demonstrate that NNMF is not necessarily the best dimensionality reduction method for this application
1.427 -(see Preliminary Studies, Figure 6).
1.428 -AGEA[14] includes a preset hierarchial clustering of voxels based on a recursive bifurcation algorithm with correlation
1.429 -as the similarity metric. EMAGE[25] allows the user to select a dataset from among a large number of alternatives, or by
1.430 -running a search query, and then to cluster the genes within that dataset. EMAGE clusters via hierarchial complete linkage
1.431 -clustering with un-centred correlation as the similarity score.
1.432 -[6 ] clustered genes, starting out by selecting 135 genes out of 20,000 which had high variance over voxels and which were
1.433 -highly correlated with many other genes. They computed the matrix of (rank) correlations between pairs of these genes, and
1.434 -ordered the rows of this matrix as follows: “the first row of the matrix was chosen to show the strongest contrast between
1.435 -the highest and lowest correlation coefficient for that row. The remaining rows were then arranged in order of decreasing
1.436 -similarity using a least squares metric”. The resulting matrix showed four clusters. For each cluster, prototypical spatial
1.437 -expression patterns were created by averaging the genes in the cluster. The prototypes were analyzed manually, without
1.438 -clustering voxels.
1.439 -[10 ] applies their technique for finding combinations of marker genes for the purpose of clustering genes around a “seed
1.440 -gene”. They do this by using the pattern of expression of the seed gene as the target image, and then searching for other
1.441 -genes which can be combined to reproduce this pattern. Other genes which are found are considered to be related to the
1.442 -seed. The same team also describes a method[24] for finding “association rules” such as, “if this voxel is expressed in by
1.443 -any gene, then that voxel is probably also expressed in by the same gene”. This could be useful as part of a procedure for
1.444 -clustering voxels.
1.445 -In summary, although these projects obtained clusterings, there has not been much comparison between different algo-
1.446 -rithms or scoring methods, so it is likely that the best clustering method for this application has not yet been found. The
1.447 -projects using gene expression on cortex did not attempt to make use of the radial profile of gene expression. Also, none of
1.448 -these projects did a separate dimensionality reduction step before clustering pixels, none tried to cluster genes first in order
1.449 -to guide automated clustering of pixels into spatial regions, and none used co-clustering algorithms.
1.450 -_________________________________________
1.451 - 8This would seem to contradict our finding in aim 1 that some cortical areas are combinatorially coded by multiple genes. However, it is
1.452 -possible that the currently accepted cortical maps divide the cortex into regions which are unnatural from the point of view of gene expression;
1.453 -perhaps there is some other way to map the cortex for which each region can be identified by single genes. Another possibility is that, although
1.454 -the cluster prototype fits an anatomical region, the individual genes are each somewhat different from the prototype.
1.455 - 9A radial profile is a profile along a line perpendicular to the cortical surface.
1.456 - 10We ran “vanilla” NNMF, whereas the paper under discussion used a modified method. Their main modification consisted of adding a soft
1.457 -spatial contiguity constraint. However, on our dataset, NNMF naturally produced spatially contiguous clusters, so no additional constraint was
1.458 -needed. The paper under discussion also mentions that they tried a hierarchial variant of NNMF, which we have not yet tried.
1.459 +Some researchers have attempted to parcellate cortex on the basis of non-gene expression data. For example,
1.460 +[18], [2], [19], and [1] associate spots on the cortex with the radial profile9 of response to some stain ([12] uses
1.461 +MRI), extract features from this profile, and then use similarity between surface pixels to cluster. Features used
1.462 +include statistical moments, wavelets, and the excess mass functional. Some of these features are motivated
1.463 +by the presence of tangential lines of stain intensity which correspond to laminar structure. Some methods use
1.464 +standard clustering procedures, whereas others make use of the spatial nature of the data to look for sudden
1.465 +transitions, which are identified as areal borders.
1.466 +[23] describes an analysis of the anatomy of the hippocampus using the ABA dataset. In addition to manual
1.467 +analysis, two clustering methods were employed: a modified Non-negative Matrix Factorization (NNMF), and
1.468 +a hierarchical recursive bifurcation clustering scheme based on correlation as the similarity score. The paper
1.469 +yielded impressive results, demonstrating the usefulness of computational genomic anatomy. We have run NNMF on
1.470 +the cortical dataset10 and while the results are promising, they also demonstrate that NNMF is not necessarily
1.471 +the best dimensionality reduction method for this application (see Preliminary Studies, Figure 6).
1.472 +AGEA[15] includes a preset hierarchical clustering of voxels based on a recursive bifurcation algorithm with
1.473 +correlation as the similarity metric. EMAGE[26] allows the user to select a dataset from among a large number
1.474 +of alternatives, or by running a search query, and then to cluster the genes within that dataset. EMAGE clusters
1.475 +via hierarchical complete linkage clustering with un-centred correlation as the similarity score.
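To make the terms concrete, here is a minimal sketch of complete linkage clustering with un-centred correlation as the similarity score. It is not EMAGE's implementation. It assumes that dissimilarity is taken as one minus the un-centred correlation, and it stops at a requested number of clusters instead of returning the full tree; the function names and toy data are hypothetical.

```python
from math import sqrt

def uncentred_correlation(u, v):
    """Correlation without mean subtraction (the cosine of the angle between u and v)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def complete_linkage(items, n_clusters):
    """Agglomerative clustering; returns lists of item indices."""
    clusters = [[i] for i in range(len(items))]

    def distance(c1, c2):
        # Complete linkage: the distance between two clusters is the
        # largest dissimilarity between any pair of their members.
        return max(
            1.0 - uncentred_correlation(items[i], items[j])
            for i in c1 for j in c2
        )

    while len(clusters) > n_clusters:
        # Find and merge the closest pair of clusters.
        a, b = min(
            ((p, q) for p in range(len(clusters)) for q in range(p + 1, len(clusters))),
            key=lambda pq: distance(clusters[pq[0]], clusters[pq[1]]),
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

# Hypothetical expression patterns of four genes over two voxels.
genes = [[1.0, 0.0], [2.0, 0.1], [0.0, 1.0], [0.1, 3.0]]
gene_groups = complete_linkage(genes, n_clusters=2)
```

Because the un-centred score ignores overall magnitude, genes 0 and 1 group together even though their expression levels differ by a factor of two.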
1.476 +[6] clustered genes, starting out by selecting 135 genes out of 20,000 which had high variance over voxels and
1.477 +which were highly correlated with many other genes. They computed the matrix of (rank) correlations between
1.478 +pairs of these genes, and ordered the rows of this matrix as follows: “the first row of the matrix was chosen to
1.479 +show the strongest contrast between the highest and lowest correlation coefficient for that row. The remaining
1.480 +rows were then arranged in order of decreasing similarity using a least squares metric”. The resulting matrix
1.481 +showed four clusters. For each cluster, prototypical spatial expression patterns were created by averaging the
1.482 +genes in the cluster. The prototypes were analyzed manually, without clustering voxels.
1.483 +[10] applies their technique for finding combinations of marker genes for the purpose of clustering genes
1.484 +around a “seed gene”. They do this by using the pattern of expression of the seed gene as the target image, and
1.485 +then searching for other genes which can be combined to reproduce this pattern. Other genes which are found
1.486 +are considered to be related to the seed. The same team also describes a method[25] for finding “association
1.487 +rules” such as, “if a gene is expressed in this voxel, then the same gene is probably also expressed in that
1.488 +voxel”. This could be useful as part of a procedure for clustering voxels.
1.489 +In summary, although these projects obtained clusterings, there has not been much comparison between
1.490 +different algorithms or scoring methods, so it is likely that the best clustering method for this application has not
1.491 +yet been found. The projects using gene expression on cortex did not attempt to make use of the radial profile
1.492 +of gene expression. Also, none of these projects did a separate dimensionality reduction step before clustering
1.493 +pixels, none tried to cluster genes first in order to guide automated clustering of pixels into spatial regions, and
1.494 +none used co-clustering algorithms.
1.495 +________
1.496 +possibility is that, although the cluster prototype fits an anatomical region, the individual genes are each somewhat different from the
1.497 +prototype.
1.498 + 9A radial profile is a profile along a line perpendicular to the cortical surface.
1.499 + 10We ran “vanilla” NNMF, whereas the paper under discussion used a modified method. Their main modification consisted of adding
1.500 +a soft spatial contiguity constraint. However, on our dataset, NNMF naturally produced spatially contiguous clusters, so no additional
1.501 +constraint was needed. The paper under discussion also mentions that they tried a hierarchical variant of NNMF, which we have not yet
1.502 +tried.
1.503 Aim 3: apply the methods developed to the cerebral cortex
1.504 Background
1.505 -The cortex is divided into areas and layers. Because of the cortical columnar organization, the parcellation of the cortex
1.506 -into areas can be drawn as a 2-D map on the surface of the cortex. In the third dimension, the boundaries between the
1.507 -areas continue downwards into the cortical depth, perpendicular to the surface. The layer boundaries run parallel to the
1.508 -surface. One can picture an area of the cortex as a slice of a six-layered cake11.
1.509 -It is known that different cortical areas have distinct roles in both normal functioning and in disease processes, yet there
1.510 -are no known marker genes for most cortical areas. When it is necessary to divide a tissue sample into cortical areas, this is
1.511 -a manual process that requires a skilled human to combine multiple visual cues and interpret them in the context of their
1.512 -approximate location upon the cortical surface.
1.513 -Even the questions of how many areas should be recognized in cortex, and what their arrangement is, are still not
1.514 -completely settled. A proposed division of the cortex into areas is called a cortical map. In the rodent, the lack of a single
1.515 -agreed-upon map can be seen by contrasting the recent maps given by Swanson[21] on the one hand, and Paxinos and
1.516 -Franklin[16] on the other. While the maps are certainly very similar in their general arrangement, significant differences
1.517 -remain.
1.518 +The cortex is divided into areas and layers. Because of the cortical columnar organization, the parcellation
1.519 +of the cortex into areas can be drawn as a 2-D map on the surface of the cortex. In the third dimension, the
1.520 +boundaries between the areas continue downwards into the cortical depth, perpendicular to the surface. The
1.521 +layer boundaries run parallel to the surface. One can picture an area of the cortex as a slice of a six-layered
1.522 +cake11.
1.523 +It is known that different cortical areas have distinct roles in both normal functioning and in disease processes,
1.524 +yet there are no known marker genes for most cortical areas. When it is necessary to divide a tissue sample
1.525 +into cortical areas, this is a manual process that requires a skilled human to combine multiple visual cues and
1.526 +interpret them in the context of their approximate location upon the cortical surface.
1.527 +Even the questions of how many areas should be recognized in cortex, and what their arrangement is, are
1.528 +still not completely settled. A proposed division of the cortex into areas is called a cortical map. In the rodent,
1.529 +the lack of a single agreed-upon map can be seen by contrasting the recent maps given by Swanson[22] on the
1.530 +one hand, and Paxinos and Franklin[17] on the other. While the maps are certainly very similar in their general
1.531 +arrangement, significant differences remain.
1.532 The Allen Mouse Brain Atlas dataset
1.533 -The Allen Mouse Brain Atlas (ABA) data were produced by doing in-situ hybridization on slices of male, 56-day-old
1.534 -C57BL/6J mouse brains. Pictures were taken of the processed slice, and these pictures were semi-automatically analyzed
1.535 -to create a digital measurement of gene expression levels at each location in each slice. Per slice, cellular spatial resolution
1.536 -is achieved. Using this method, a single physical slice can only be used to measure one single gene; many different mouse
1.537 -brains were needed in order to measure the expression of many genes.
1.538 -An automated nonlinear alignment procedure located the 2D data from the various slices in a single 3D coordinate
1.539 -system. In the final 3D coordinate system, voxels are cubes with 200 microns on a side. There are 67x41x58 = 159,326
1.540 -voxels in the 3D coordinate system, of which 51,533 are in the brain[14].
1.541 -Mus musculus is thought to contain about 22,000 protein-coding genes[27]. The ABA contains data on about 20,000
1.542 -genes in sagittal sections, out of which over 4,000 genes are also measured in coronal sections. Our dataset is derived from
1.543 -only the coronal subset of the ABA12.
1.544 -The ABA is not the only large public spatial gene expression dataset13. With the exception of the ABA, GenePaint, and
1.545 -EMAGE, most of the other resources have not (yet) extracted the expression intensity from the ISH images and registered
1.546 -the results into a single 3-D space, and to our knowledge only ABA and EMAGE make this form of data available for public
1.547 -download from the website14. Many of these resources focus on developmental gene expression.
1.548 +The Allen Mouse Brain Atlas (ABA) data were produced by doing in-situ hybridization on slices of male,
1.549 +56-day-old C57BL/6J mouse brains. Pictures were taken of the processed slice, and these pictures were semi-
1.550 +automatically analyzed to create a digital measurement of gene expression levels at each location in each slice.
1.551 +Per slice, cellular spatial resolution is achieved. Using this method, a single physical slice can only be used
1.552 +to measure one single gene; many different mouse brains were needed in order to measure the expression of
1.553 +many genes.
1.554 +An automated nonlinear alignment procedure located the 2D data from the various slices in a single 3D
1.555 +coordinate system. In the final 3D coordinate system, voxels are cubes 200 microns on a side. There are
1.556 +67x41x58 = 159,326 voxels in the 3D coordinate system, of which 51,533 are in the brain[15].
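The grid arithmetic above can be checked with a short sketch. The dimensions and the in-brain voxel count come from the text; the row-major indexing scheme and the name `flat_index` are assumptions for illustration, since the text does not describe the storage layout of the ABA data.

```python
# Grid dimensions, voxel size, and in-brain voxel count as stated in the text.
GRID_SHAPE = (67, 41, 58)
VOXEL_SIDE_MICRONS = 200
VOXELS_IN_BRAIN = 51_533

total_voxels = GRID_SHAPE[0] * GRID_SHAPE[1] * GRID_SHAPE[2]
brain_fraction = VOXELS_IN_BRAIN / total_voxels

def flat_index(x, y, z, shape=GRID_SHAPE):
    """Hypothetical row-major position of voxel (x, y, z) in a flat array."""
    nx, ny, nz = shape
    if not (0 <= x < nx and 0 <= y < ny and 0 <= z < nz):
        raise IndexError("voxel coordinate outside the grid")
    return (x * ny + y) * nz + z

# Physical extent of the grid along each axis, in millimetres.
extent_mm = tuple(n * VOXEL_SIDE_MICRONS / 1000 for n in GRID_SHAPE)
```

Roughly a third of the voxels in the bounding grid fall inside the brain, so a flat per-voxel array would typically be paired with a brain mask.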
1.557 +Mus musculus is thought to contain about 22,000 protein-coding genes[28]. The ABA contains data on about
1.558 +20,000 genes in sagittal sections, out of which over 4,000 genes are also measured in coronal sections. Our
1.559 +dataset is derived from only the coronal subset of the ABA12.
1.560 +The ABA is not the only large public spatial gene expression dataset13. With the exception of the ABA,
1.561 +GenePaint, and EMAGE, most of the other resources have not (yet) extracted the expression intensity from the
1.562 +ISH images and registered the results into a single 3-D space, and to our knowledge only ABA and EMAGE
1.563 +make this form of data available for public download from the website14. Many of these resources focus on
1.564 +developmental gene expression.
1.565 Related work
1.566 -[14 ] describes the application of AGEA to the cortex. The paper describes interesting results on the structure of correlations
1.567 -between voxel gene expression profiles within a handful of cortical areas. However, this sort of analysis is not related to either
1.568 -of our aims, as it neither finds marker genes, nor does it suggest a cortical map based on gene expression data. Neither of
1.569 -the other components of AGEA can be applied to cortical areas; AGEA’s Gene Finder cannot be used to find marker genes
1.570 -for the cortical areas; and AGEA’s hierarchial clustering does not produce clusters corresponding to the cortical areas15.
1.571 -In summary, for all three aims, (a) only one of the previous projects explores combinations of marker genes, (b) there has
1.572 -been almost no comparison of different algorithms or scoring methods, and (c) there has been no work on computationally
1.573 -finding marker genes for cortical areas, or on finding a hierarchial clustering that will yield a map of cortical areas de novo
1.574 -from gene expression data.
1.575 -___________________
1.576 - 11Outside of isocortex, the number of layers varies.
1.577 - 12The sagittal data do not cover the entire cortex, and also have greater registration error[14]. Genes were selected by the Allen Institute for
1.578 -coronal sectioning based on, “classes of known neuroscientific interest... or through post hoc identification of a marked non-ubiquitous expression
1.579 -pattern”[14].
1.580 - 13Other such resources include GENSAT[8], GenePaint[26], its sister project GeneAtlas[5], BGEM[13], EMAGE[25], EurExpress (http:
1.581 -//www.eurexpress.org/ee/; EurExpress data are also entered into EMAGE), EADHB (http://www.ncl.ac.uk/ihg/EADHB/database/$EADHB_
1.582 -{database}$.html), MAMEP (http://mamep.molgen.mpg.de/index.php), Xenbase (http://xenbase.org/), ZFIN[20], Aniseed (http://
1.583 -aniseed-ibdm.univ-mrs.fr/), VisiGene (http://genome.ucsc.edu/cgi-bin/hgVisiGene ; includes data from some of the other listed data
1.584 -sources), GEISHA[4], Fruitfly.org[23], COMPARE (http://compare.ibdml.univ-mrs.fr/), GXD[19], GEO[3] (GXD and GEO contain spatial
1.585 -data but also non-spatial data. All GXD spatial data are also in EMAGE.)
1.586 - 14without prior offline registration
1.587 - 15In both cases, the cause is that pairwise correlations between the gene expression of voxels in different areas but the same layer are often stronger
1.588 -than pairwise correlations between the gene expression of voxels in different layers but the same area. Therefore, a pairwise voxel correlation
1.589 -clustering algorithm will tend to create clusters representing cortical layers, not areas (there may be clusters which presumably correspond to the
1.590 -intersection of a layer and an area, but since one area will have many layer-area intersection clusters, further work is needed to make sense of
1.591 -these). The reason that Gene Finder cannot the find marker genes for cortical areas is that, although the user chooses a seed voxel, Gene Finder
1.592 -chooses the ROI for which genes will be found, and it creates that ROI by (pairwise voxel correlation) clustering around the seed.
1.593 -Our project is guided by a concrete application with a well-specified criterion of success (how well we can find marker
1.594 -genes for / reproduce the layout of cortical areas), which will provide a solid basis for comparing different methods.
1.595 +[15] describes the application of AGEA to the cortex. The paper describes interesting results on the structure
1.596 +of correlations between voxel gene expression profiles within a handful of cortical areas. However, this sort
1.597 +of analysis is not related to either of our aims, as it neither finds marker genes, nor does it suggest a cortical
1.598 +map based on gene expression data. Neither of the other components of AGEA can be applied to cortical
1.599 +_________________________________________
1.600 + 11Outside of isocortex, the number of layers varies.
1.601 + 12The sagittal data do not cover the entire cortex, and also have greater registration error[15]. Genes were selected by the Allen
1.602 +Institute for coronal sectioning based on, “classes of known neuroscientific interest... or through post hoc identification of a marked
1.603 +non-ubiquitous expression pattern”[15].
1.604 + 13Other such resources include GENSAT[8], GenePaint[27], its sister project GeneAtlas[5], BGEM[14], EMAGE[26], EurExpress
1.605 +(http://www.eurexpress.org/ee/; EurExpress data are also entered into EMAGE), EADHB (http://www.ncl.ac.uk/ihg/EADHB/
1.606 +database/EADHB_database.html), MAMEP (http://mamep.molgen.mpg.de/index.php), Xenbase (http://xenbase.org/), ZFIN[21],
1.607 +Aniseed (http://aniseed-ibdm.univ-mrs.fr/), VisiGene (http://genome.ucsc.edu/cgi-bin/hgVisiGene ; includes data from some
1.608 +of the other listed data sources), GEISHA[4], Fruitfly.org[24], COMPARE (http://compare.ibdml.univ-mrs.fr/), GXD[20], GEO[3]
1.609 +(GXD and GEO contain spatial data but also non-spatial data. All GXD spatial data are also in EMAGE.)
1.610 + 14without prior offline registration
1.611 +areas; AGEA’s Gene Finder cannot be used to find marker genes for the cortical areas; and AGEA’s hierarchical
1.612 +clustering does not produce clusters corresponding to the cortical areas15.
1.613 +In summary, for all three aims, (a) only one of the previous projects explores combinations of marker genes,
1.614 +(b) there has been almost no comparison of different algorithms or scoring methods, and (c) there has been no
1.615 +work on computationally finding marker genes for cortical areas, or on finding a hierarchical clustering that will
1.616 +yield a map of cortical areas de novo from gene expression data.
1.617 +Our project is guided by a concrete application with a well-specified criterion of success (how well we can
1.618 +find marker genes for / reproduce the layout of cortical areas), which will provide a solid basis for comparing
1.619 +different methods.
1.620 Significance
1.621
1.622
1.623 -Figure 1: Top row: Genes Nfic and
1.624 -A930001M12Rik are the most correlated
1.625 -with area SS (somatosensory cortex). Bot-
1.626 -tom row: Genes C130038G02Rik and
1.627 -Cacna1i are those with the best fit using
1.628 -logistic regression. Within each picture, the
1.629 -vertical axis roughly corresponds to anterior
1.630 -at the top and posterior at the bottom, and
1.631 -the horizontal axis roughly corresponds to
1.632 -medial at the left and lateral at the right.
1.633 -The red outline is the boundary of region
1.634 -SS. Pixels are colored according to correla-
1.635 -tion, with red meaning high correlation and
1.636 -blue meaning low. The method developed in aim (1) will be applied to each cortical area to find
1.637 - a set of marker genes such that the combinatorial expression pattern of those
1.638 - genes uniquely picks out the target area. Finding marker genes will be useful
1.639 - for drug discovery as well as for experimentation because marker genes can be
1.640 - used to design interventions which selectively target individual cortical areas.
1.641 - The application of the marker gene finding algorithm to the cortex will
1.642 - also support the development of new neuroanatomical methods. In addition
1.643 - to finding markers for each individual cortical area, we will find a small panel
1.644 - of genes that can find many of the areal boundaries at once. This panel of
1.645 - marker genes will allow the development of an ISH protocol that will allow
1.646 - experimenters to more easily identify which anatomical areas are present in
1.647 - small samples of cortex.
1.648 - The method developed in aim (2) will provide a genoarchitectonic viewpoint
1.649 - that will contribute to the creation of a better map. The development of
1.650 - present-day cortical maps was driven by the application of histological stains.
1.651 - If a different set of stains had been available which identified a different set of
1.652 - features, then today’s cortical maps may have come out differently. It is likely
1.653 - that there are many repeated, salient spatial patterns in the gene expression
1.654 - which have not yet been captured by any stain. Therefore, cortical anatomy
1.655 - needs to incorporate what we can learn from looking at the patterns of gene
1.656 - expression.
1.657 - While we do not here propose to analyze human gene expression data, it is
1.658 - conceivable that the methods we propose to develop could be used to suggest
1.659 - modifications to the human cortical map as well. In fact, the methods we will
1.660 - develop will be applicable to other datasets beyond the brain. We will provide
1.661 - an open-source toolbox to allow other researchers to easily use our methods.
1.662 - With these methods, researchers with gene expression for any area of the body
1.663 - will be able to efficiently find marker genes for anatomical regions, or to use
1.664 - gene expression to discover new anatomical patterning. As described above,
1.665 -marker genes have a variety of uses in the development of drugs and experimental manipulations, and in the anatomical
1.666 -characterization of tissue samples. The discovery of new ways to carve up anatomical structures into regions may lead to
1.667 -the discovery of new anatomical subregions in various structures, which will widely impact all areas of biology.
1.668 +Figure 1: Top row: Genes Nfic
1.669 +and A930001M12Rik are the most
1.670 +correlated with area SS (somatosen-
1.671 +sory cortex). Bottom row: Genes
1.672 +C130038G02Rik and Cacna1i are
1.673 +those with the best fit using logistic
1.674 +regression. Within each picture, the
1.675 +vertical axis roughly corresponds to
1.676 +anterior at the top and posterior at the
1.677 +bottom, and the horizontal axis roughly
1.678 +corresponds to medial at the left and
1.679 +lateral at the right. The red outline is
1.680 +the boundary of region SS. Pixels are
1.681 +colored according to correlation, with
1.682 +red meaning high correlation and blue
1.683 +meaning low. The method developed in aim (1) will be applied to each cortical area to
1.684 + find a set of marker genes such that the combinatorial expression pat-
1.685 + tern of those genes uniquely picks out the target area. Finding marker
1.686 + genes will be useful for drug discovery as well as for experimentation
1.687 + because marker genes can be used to design interventions which se-
1.688 + lectively target individual cortical areas.
1.689 + The application of the marker gene finding algorithm to the cortex
1.690 + will also support the development of new neuroanatomical methods. In
1.691 + addition to finding markers for each individual cortical area, we will
1.692 + find a small panel of genes that can find many of the areal boundaries
1.693 + at once. This panel of marker genes will allow the development of an
1.694 + ISH protocol that will allow experimenters to more easily identify which
1.695 + anatomical areas are present in small samples of cortex.
1.696 + The method developed in aim (2) will provide a genoarchitectonic
1.697 + viewpoint that will contribute to the creation of a better map. The de-
1.698 + velopment of present-day cortical maps was driven by the application
1.699 + of histological stains. If a different set of stains had been available
1.700 + which identified a different set of features, then today’s cortical maps
1.701 + may have come out differently. It is likely that there are many repeated,
1.702 + salient spatial patterns in the gene expression which have not yet been
1.703 + captured by any stain. Therefore, cortical anatomy needs to incorpo-
1.704 + rate what we can learn from looking at the patterns of gene expression.
1.705 + While we do not here propose to analyze human gene expression
1.706 + data, it is conceivable that the methods we propose to develop could
1.707 + be used to suggest modifications to the human cortical map as well. In
1.708 + fact, the methods we will develop will be applicable to other datasets
1.709 + beyond the brain. We will provide an open-source toolbox to allow
1.710 + other researchers to easily use our methods. With these methods, re-
1.711 + searchers with gene expression for any area of the body will be able to
1.712 +efficiently find marker genes for anatomical regions, or to use gene expression to discover new anatomical pat-
1.713 +terning. As described above, marker genes have a variety of uses in the development of drugs and experimental
1.714 +manipulations, and in the anatomical characterization of tissue samples. The discovery of new ways to carve up
1.715 +anatomical structures into regions may lead to the discovery of new anatomical subregions in various structures,
1.716 +_________________________________________
1.717 + 15In both cases, the cause is that pairwise correlations between the gene expression of voxels in different areas but the same layer
1.718 +are often stronger than pairwise correlations between the gene expression of voxels in different layers but the same area. Therefore, a
1.719 +pairwise voxel correlation clustering algorithm will tend to create clusters representing cortical layers, not areas (there may be clusters
1.720 +which presumably correspond to the intersection of a layer and an area, but since one area will have many layer-area intersection
1.721 +clusters, further work is needed to make sense of these). The reason that Gene Finder cannot find marker genes for cortical areas
1.722 +is that, although the user chooses a seed voxel, Gene Finder chooses the ROI for which genes will be found, and it creates that ROI by
1.723 +(pairwise voxel correlation) clustering around the seed.
1.724 +which will widely impact all areas of biology.
1.725
1.726 -Figure 2: Gene Pitx2
1.727 -is selectively underex-
1.728 -pressed in area SS. Although our particular application involves the 3D spatial distribution of gene expression, we
1.729 - anticipate that the methods developed in aims (1) and (2) will not be limited to gene expression
1.730 - data, but rather will generalize to any sort of high-dimensional data over points located in a
1.731 - low-dimensional space.
1.732 - The approach: Preliminary Studies
1.733 +Figure 2: Gene Pitx2
1.734 +is selectively underex-
1.735 +pressed in area SS. Although our particular application involves the 3D spatial distribution of gene ex-
1.736 + pression, we anticipate that the methods developed in aims (1) and (2) will not be limited
1.737 + to gene expression data, but rather will generalize to any sort of high-dimensional data
1.738 + over points located in a low-dimensional space.
1.739 + The approach: Preliminary Studies
1.740 Format conversion between SEV, MATLAB, NIFTI
1.741 - We have created software to (politely) download all of the SEV files16 from the Allen Institute
1.742 - website. We have also created software to convert between the SEV, MATLAB, and NIFTI file
1.743 - formats, as well as some of Caret’s file formats.
1.744 + We have created software to (politely) download all of the SEV files16 from the Allen
1.745 + Institute website. We have also created software to convert between the SEV, MATLAB,
1.746 + and NIFTI file formats, as well as some of Caret’s file formats.
1.747 Flatmap of cortex
1.748 - We downloaded the ABA data and applied a mask to select only those voxels which belong to
1.749 - cerebral cortex. We divided the cortex into hemispheres.
1.750 -Using Caret[7], we created a mesh representation of the surface of the selected voxels. For each gene, and for each node
1.751 -of the mesh, we calculated an average of the gene expression of the voxels “underneath” that mesh node. We then flattened
1.752 -the cortex, creating a two-dimensional mesh.
1.753 -____
1.754 - 16SEV is a sparse format for spatial data. It is the format in which the ABA data is made available.
1.755 -
1.756 + We downloaded the ABA data and applied a mask to select only those voxels which
1.757 +belong to cerebral cortex. We divided the cortex into hemispheres.
1.758 +Using Caret[7], we created a mesh representation of the surface of the selected voxels. For each gene, and
1.759 +for each node of the mesh, we calculated an average of the gene expression of the voxels “underneath” that
1.760 +mesh node. We then flattened the cortex, creating a two-dimensional mesh.
1.761
1.762
1.763 -Figure 3: The top row shows the two genes
1.764 -which (individually) best predict area AUD,
1.765 -according to logistic regression. The bot-
1.766 -tom row shows the two genes which (indi-
1.767 -vidually) best match area AUD, according
1.768 -to gradient similarity. From left to right and
1.769 -top to bottom, the genes are Ssr1, Efcbp1,
1.770 -Ptk7, and Aph1a. We sampled the nodes of the irregular, flat mesh in order to create a regular
1.771 - grid of pixel values. We converted this grid into a MATLAB matrix.
1.772 - We manually traced the boundaries of each of 49 cortical areas from the
1.773 - ABA coronal reference atlas slides. We then converted these manual traces
1.774 - into Caret-format regional boundary data on the mesh surface. We projected
1.775 - the regions onto the 2-d mesh, and then onto the grid, and then we converted
1.776 - the region data into MATLAB format.
1.777 - At this point, the data are in the form of a number of 2-D matrices, all in
1.778 - registration, with the matrix entries representing a grid of points (pixels) over
1.779 - the cortical surface:
1.780 - ∙ A 2-D matrix whose entries represent the regional label associated with
1.781 - each surface pixel
1.782 - ∙ For each gene, a 2-D matrix whose entries represent the average expres-
1.783 - sion level underneath each surface pixel
1.784 - We created a normalized version of the gene expression data by subtracting
1.785 - each gene’s mean expression level (over all surface pixels) and dividing the
1.786 - expression level of each gene by its standard deviation.
1.787 - The features and the target area are both functions on the surface pix-
1.788 - els. They can be referred to as scalar fields over the space of surface pixels;
1.789 - alternately, they can be thought of as images which can be displayed on the
1.790 - flatmapped surface.
1.791 - To move beyond a single average expression level for each surface pixel, we
1.792 -plan to create a separate matrix for each cortical layer to represent the average expression level within that layer. Cortical
1.793 -layers are found at different depths in different parts of the cortex. In preparation for extracting the layer-specific datasets,
1.794 -we have extended Caret with routines that allow the depth of the ROI for volume-to-surface projection to vary.
1.795 -In the Research Plan, we describe how we will automatically locate the layer depths. For validation, we have manually
1.796 -demarcated the depth of the outer boundary of cortical layer 5 throughout the cortex.
1.797 +Figure 3: The top row shows the two
1.798 +genes which (individually) best predict
1.799 +area AUD, according to logistic regres-
1.800 +sion. The bottom row shows the two
1.801 +genes which (individually) best match
1.802 +area AUD, according to gradient sim-
1.803 +ilarity. From left to right and top to
1.804 +bottom, the genes are Ssr1, Efcbp1,
1.805 +Ptk7, and Aph1a. We sampled the nodes of the irregular, flat mesh in order to create
1.806 + a regular grid of pixel values. We converted this grid into a MATLAB
1.807 + matrix.
1.808 + We manually traced the boundaries of each of 49 cortical areas
1.809 + from the ABA coronal reference atlas slides. We then converted these
1.810 + manual traces into Caret-format regional boundary data on the mesh
1.811 + surface. We projected the regions onto the 2-d mesh, and then onto
1.812 + the grid, and then we converted the region data into MATLAB format.
1.813 + At this point, the data are in the form of a number of 2-D matrices,
1.814 + all in registration, with the matrix entries representing a grid of points
1.815 + (pixels) over the cortical surface:
1.816 + ∙ A 2-D matrix whose entries represent the regional label associ-
1.817 + ated with each surface pixel
1.818 + ∙ For each gene, a 2-D matrix whose entries represent the average
1.819 + expression level underneath each surface pixel
1.820 + We created a normalized version of the gene expression data by
1.821 + subtracting each gene’s mean expression level (over all surface pixels)
1.822 + and dividing the expression level of each gene by its standard deviation.
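The normalization step described above can be sketched as follows. This is a minimal illustration rather than the project's own code: the function name `normalize_gene`, the nested-list representation of a gene's 2-D matrix, and the use of the population standard deviation are all assumptions, since the text does not say which variant of the standard deviation was used.

```python
from statistics import mean, pstdev


def normalize_gene(image):
    """Z-score one gene's 2-D expression matrix over all surface pixels.

    `image` is a list of rows of floats (hypothetical representation).
    """
    values = [v for row in image for v in row]
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population SD; the source does not specify
    if sigma == 0:
        raise ValueError("gene has constant expression; cannot normalize")
    return [[(v - mu) / sigma for v in row] for row in image]


# Toy 2x2 "gene image" (made-up values, for illustration only).
gene = [[1.0, 2.0], [3.0, 6.0]]
normalized = normalize_gene(gene)
```

After this step each gene has zero mean and unit standard deviation over the surface pixels, so scores computed for different genes are on a comparable scale.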
1.823 + The features and the target area are both functions on the surface
1.824 + pixels. They can be referred to as scalar fields over the space of sur-
1.825 + face pixels; alternately, they can be thought of as images which can be
1.826 + displayed on the flatmapped surface.
1.827 +To move beyond a single average expression level for each surface pixel, we plan to create a separate matrix
1.828 +for each cortical layer to represent the average expression level within that layer. Cortical layers are found at
1.829 +different depths in different parts of the cortex. In preparation for extracting the layer-specific datasets, we have
1.830 +extended Caret with routines that allow the depth of the ROI for volume-to-surface projection to vary.
1.831 +In the Research Plan, we describe how we will automatically locate the layer depths. For validation, we have
1.832 +manually demarcated the depth of the outer boundary of cortical layer 5 throughout the cortex.
1.833 +_________________________________________
1.834 + 16SEV is a sparse format for spatial data. It is the format in which the ABA data is made available.
1.835 Feature selection and scoring methods
1.836 -Underexpression of a gene can serve as a marker Underexpression of a gene can sometimes serve as a marker. See,
1.837 -for example, Figure 2.
1.838 +Underexpression of a gene can serve as a marker Underexpression of a gene can sometimes serve as a
1.839 +marker. See, for example, Figure 2.
1.840
1.841
1.842 -Figure 4: Upper left: wwc1. Upper right:
1.843 -mtif2. Lower left: wwc1 + mtif2 (each
1.844 -pixel’s value on the lower left is the sum of
1.845 -the corresponding pixels in the upper row). Correlation Recall that the instances are surface pixels, and consider the
1.846 - problem of attempting to classify each instance as either a member of a partic-
1.847 - ular anatomical area, or not. The target area can be represented as a boolean
1.848 - mask over the surface pixels.
1.849 - One class of feature selection scoring methods contains methods which cal-
1.850 - culate some sort of “match” between each gene image and the target image.
1.851 - Those genes which match the best are good candidates for features.
1.852 - One of the simplest methods in this class is to use correlation as the match
1.853 - score. We calculated the correlation between each gene and each cortical area.
1.854 - The top row of Figure 1 shows the three genes most correlated with area SS.
1.855 - Conditional entropy An information-theoretic scoring method is to find
1.856 - features such that, if the features (gene expression levels) are known, uncer-
1.857 - tainty about the target (the regional identity) is reduced. Entropy measures
1.858 - uncertainty, so what we want is to find features such that the conditional dis-
1.859 - tribution of the target has minimal entropy. The distribution to which we are
1.860 - referring is the probability distribution over the population of surface pixels.
1.861 - The simplest way to use information theory is on discrete data, so we
1.862 - discretized our gene expression data by creating, for each gene, five thresholded
1.863 - boolean masks of the gene data. For each gene, we created a boolean mask
1.864 -of its expression levels using each of these thresholds: the mean of that gene, the mean minus one standard deviation, the
1.865 -mean minus two standard deviations, the mean plus one standard deviation, the mean plus two standard deviations.
1.866 -Now, for each region, we created and ran a forward stepwise procedure which attempted to find pairs of gene expression
1.867 -boolean masks such that the conditional entropy of the target area’s boolean mask, conditioned upon the pair of gene
1.868 -expression boolean masks, is minimized.
1.869 -This finds pairs of genes which are most informative (at least at these discretization thresholds) relative to the question,
1.870 -“Is this surface pixel a member of the target area?”. Its advantage over linear methods such as logistic regression is that it
1.871 -takes account of arbitrarily nonlinear relationships; for example, if the XOR of two variables predicts the target, conditional
1.872 -entropy would notice, whereas linear methods would not.
1.873 +Figure 4: Upper left: wwc1. Upper
1.874 +right: mtif2. Lower left: wwc1 + mtif2
1.875 +(each pixel’s value on the lower left is
1.876 +the sum of the corresponding pixels in
1.877 +the upper row). Correlation Recall that the instances are surface pixels, and con-
1.878 + sider the problem of attempting to classify each instance as either a
1.879 + member of a particular anatomical area, or not. The target area can be
1.880 + represented as a boolean mask over the surface pixels.
1.881 + One class of feature selection scoring methods contains methods
1.882 + which calculate some sort of “match” between each gene image and
1.883 + the target image. Those genes which match the best are good candi-
1.884 + dates for features.
1.885 + One of the simplest methods in this class is to use correlation as
1.886 + the match score. We calculated the correlation between each gene
1.887 + and each cortical area. The top row of Figure 1 shows the three genes
1.888 + most correlated with area SS.
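As a rough sketch of correlation used as a match score, the snippet below computes the Pearson correlation between one gene image and the boolean mask of a target area, then ranks genes by that score. The function name, the toy gene names `gene_a` and `gene_b`, and the nested-list data layout are hypothetical; the text does not state which correlation coefficient was used, so Pearson is an assumption.

```python
from math import sqrt


def correlation(gene_image, area_mask):
    """Pearson correlation between a gene image and a boolean area mask."""
    xs = [float(v) for row in gene_image for v in row]
    ys = [1.0 if m else 0.0 for row in area_mask for m in row]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0  # a constant image carries no information about the mask
    return cov / sqrt(vx * vy)


# Toy data: the target area is the top row of a 2x2 grid.
mask = [[True, True], [False, False]]
genes = {
    "gene_a": [[5, 4], [1, 0]],  # expressed mostly inside the area
    "gene_b": [[0, 1], [4, 5]],  # expressed mostly outside the area
}
ranked = sorted(genes, key=lambda g: correlation(genes[g], mask), reverse=True)
```

Ranking by the signed score favors genes that are overexpressed in the area. Because underexpression can also serve as a marker, ranking by the absolute value of the score is a reasonable variant.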
1.889 + Conditional entropy An information-theoretic scoring method is
1.890 + to find features such that, if the features (gene expression levels) are
1.891 + known, uncertainty about the target (the regional identity) is reduced.
1.892 + Entropy measures uncertainty, so what we want is to find features such
1.893 + that the conditional distribution of the target has minimal entropy. The
1.894 + distribution to which we are referring is the probability distribution over
1.895 +the population of surface pixels.
1.896 +The simplest way to use information theory is on discrete data, so we discretized our gene expression data
1.897 +by creating, for each gene, five thresholded boolean masks of the gene data. For each gene, we created a
1.898 +boolean mask of its expression levels using each of these thresholds: the mean of that gene, the mean minus
1.899 +one standard deviation, the mean minus two standard deviations, the mean plus one standard deviation, the
1.900 +mean plus two standard deviations.
1.901 +Now, for each region, we created and ran a forward stepwise procedure which attempted to find pairs of gene
1.902 +expression boolean masks such that the conditional entropy of the target area’s boolean mask, conditioned upon
1.903 +the pair of gene expression boolean masks, is minimized.
1.904 +This finds pairs of genes which are most informative (at least at these discretization thresholds) relative to the
1.905 +question, “Is this surface pixel a member of the target area?”. Its advantage over linear methods such as logistic
1.906 +regression is that it takes account of arbitrarily nonlinear relationships; for example, if the XOR of two variables
1.907 +predicts the target, conditional entropy would notice, whereas linear methods would not.
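The discretization and the conditional entropy score can be illustrated with a small sketch. The names `threshold_masks` and `conditional_entropy` are hypothetical, pixels are flattened into 1-D lists, and treating a mask as "expression greater than threshold" is an assumption. The forward stepwise search itself is not shown.

```python
from collections import Counter
from math import log2
from statistics import mean, pstdev


def threshold_masks(values):
    """Five boolean masks of one gene, thresholded at mean + k*SD for k = -2..2."""
    mu, sigma = mean(values), pstdev(values)
    return {k: [v > mu + k * sigma for v in values] for k in (-2, -1, 0, 1, 2)}


def conditional_entropy(target, mask_a, mask_b):
    """H(target | mask_a, mask_b) in bits, over the population of pixels."""
    n = len(target)
    joint = Counter(zip(mask_a, mask_b, target))
    cond = Counter(zip(mask_a, mask_b))
    h = 0.0
    for (a, b, _t), count in joint.items():
        h -= (count / n) * log2(count / cond[(a, b)])
    return h


# XOR toy case: neither mask alone predicts the target, but the pair does.
a = [False, False, True, True]
b = [False, True, False, True]
target = [x != y for x, y in zip(a, b)]
h_pair = conditional_entropy(target, a, b)    # the pair removes all uncertainty
h_single = conditional_entropy(target, a, a)  # mask `a` alone leaves it unchanged
```

In this toy case the pair of masks gives a conditional entropy of zero while either mask alone leaves a full bit of uncertainty, which is the kind of nonlinear relationship the paragraph above refers to.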
1.908 +Gradient similarity We noticed that the previous two scoring methods, which are pointwise, often found
1.909 +genes whose pattern of expression did not look similar in shape to the target region. For this reason we designed
1.910 +a non-pointwise local scoring method to detect when a gene had a pattern of expression which looked like it had
1.911 +a boundary whose shape is similar to the shape of the target region. We call this scoring method “gradient
1.912 +similarity”.
1.913 +One might say that gradient similarity attempts to measure how much the border of the area of gene expres-
1.914 +sion and the border of the target region overlap. However, since gene expression falls off continuously rather
1.915 +than jumping from its maximum value to zero, the spatial pattern of a gene’s expression often does not have a
1.916 +discrete border. Therefore, instead of looking for a discrete border, we look for large gradients. Gradient similarity
1.917 +is a symmetric function over two images (i.e. two scalar fields). It is high to the extent that matching pixels
1.918 +which have large values and large gradients also have gradients which are oriented in a similar direction. The
1.919 +formula is:
1.920 + $$\sum_{\text{pixel} \in \text{pixels}} \cos\bigl(\lvert \angle\nabla_1 - \angle\nabla_2 \rvert\bigr) \cdot \frac{\lvert\nabla_1\rvert + \lvert\nabla_2\rvert}{2} \cdot \frac{\text{pixel\_value}_1 + \text{pixel\_value}_2}{2}$$
1.924 +
1.925
1.926
1.927
1.928
1.929 -Figure 5: From left to right and top
1.930 -to bottom, single genes which roughly
1.931 -identify areas SS (somatosensory primary
1.932 -+ supplemental), SSs (supplemental so-
1.933 -matosensory), PIR (piriform), FRP (frontal
1.934 -pole), RSP (retrosplenial), COApm (Corti-
1.935 -cal amygdalar, posterior part, medial zone).
1.936 -Grouping some areas together, we have
1.937 -also found genes to identify the groups
1.938 +Figure 5: From left to right and top
1.939 +to bottom, single genes which roughly
1.940 +identify areas SS (somatosensory pri-
1.941 +mary + supplemental), SSs (supple-
1.942 +mental somatosensory), PIR (piriform),
1.943 +FRP (frontal pole), RSP (retrosple-
1.944 +nial), COApm (Cortical amygdalar, pos-
1.945 +terior part, medial zone). Grouping
1.946 +some areas together, we have also
1.947 +found genes to identify the groups
1.948 ACA+PL+ILA+DP+ORB+MO (anterior
1.949 -cingulate, prelimbic, infralimbic, dorsal pe-
1.950 -duncular, orbital, motor), posterior and lat-
1.951 -eral visual (VISpm, VISpl, VISI, VISp; pos-
1.952 -teromedial, posterolateral, lateral, and pri-
1.953 -mary visual; the posterior and lateral vi-
1.954 -sual area is distinguished from its neigh-
1.955 -bors, but not from the entire rest of the
1.956 -cortex). The genes are Pitx2, Aldh1a2,
1.957 -Ppfibp1, Slco1a5, Tshz2, Trhr, Col12a1,
1.958 -Ets1. Gradient similarity We noticed that the previous two scoring methods,
1.959 - which are pointwise, often found genes whose pattern of expression did not
1.960 - look similar in shape to the target region. For this reason we designed a
1.961 - non-pointwise local scoring method to detect when a gene had a pattern of
1.962 - expression which looked like it had a boundary whose shape is similar to the
1.963 - shape of the target region. We call this scoring method “gradient similarity”.
1.964 - One might say that gradient similarity attempts to measure how much the
1.965 - border of the area of gene expression and the border of the target region over-
1.966 - lap. However, since gene expression falls off continuously rather than jumping
1.967 - from its maximum value to zero, the spatial pattern of a gene’s expression often
1.968 - does not have a discrete border. Therefore, instead of looking for a discrete
1.969 - border, we look for large gradients. Gradient similarity is a symmetric function
1.970 - over two images (i.e. two scalar fields). It is high to the extent that matching
1.971 - pixels which have large values and large gradients also have gradients which
1.972 - are oriented in a similar direction. The formula is:
1.973 - $$\sum_{\text{pixel} \in \text{pixels}} \cos\bigl(\lvert \angle\nabla_1 - \angle\nabla_2 \rvert\bigr) \cdot \frac{\lvert\nabla_1\rvert + \lvert\nabla_2\rvert}{2} \cdot \frac{\text{pixel\_value}_1 + \text{pixel\_value}_2}{2}$$
1.977 - where ∇1 and ∇2 are the gradient vectors of the two images at the current
1.978 - pixel; ∠∇i is the angle of the gradient of image i at the current pixel; |∇i| is
1.979 - the magnitude of the gradient of image i at the current pixel; and pixel_valuei
1.980 - is the value of the current pixel in image i.
1.981 - The intuition is that we want to see if the borders of the pattern in the
1.982 - two images are similar; if the borders are similar, then both images will have
1.983 - corresponding pixels with large gradients (because this is a border) which are
1.984 - oriented in a similar direction (because the borders are similar).
1.985 +cingulate, prelimbic, infralimbic, dor-
1.986 +sal peduncular, orbital, motor), poste-
1.987 +rior and lateral visual (VISpm, VISpl,
1.988 +VISI, VISp; posteromedial, posterolat-
1.989 +eral, lateral, and primary visual; the
1.990 +posterior and lateral visual area is dis-
1.991 +tinguished from its neighbors, but not
1.992 +from the entire rest of the cortex). The
1.993 +genes are Pitx2, Aldh1a2, Ppfibp1,
1.994 +Slco1a5, Tshz2, Trhr, Col12a1, Ets1. where ∇1 and ∇2 are the gradient vectors of the two images at the
1.995 + current pixel; ∠∇i is the angle of the gradient of image i at the current
1.996 + pixel; |∇i| is the magnitude of the gradient of image i at the current
1.997 + pixel; and pixel_valuei is the value of the current pixel in image i.
1.998 + The intuition is that we want to see if the borders of the pattern in
1.999 + the two images are similar; if the borders are similar, then both images
1.1000 + will have corresponding pixels with large gradients (because this is a
1.1001 + border) which are oriented in a similar direction (because the borders
1.1002 + are similar).
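One way to read the gradient similarity formula is as a sum, over pixels, of the cosine of the angle between the two gradients, multiplied by the mean gradient magnitude and by the mean pixel value of the two images. The sketch below follows that reading. It is an illustration under stated assumptions, not the project's implementation: the function names are hypothetical, images are nested lists, gradients are estimated by central differences, and border pixels are skipped.

```python
from math import atan2, cos, hypot


def gradient(image, r, c):
    """Central-difference gradient (d/dcolumn, d/drow) at an interior pixel."""
    gx = (image[r][c + 1] - image[r][c - 1]) / 2.0
    gy = (image[r + 1][c] - image[r - 1][c]) / 2.0
    return gx, gy


def gradient_similarity(img1, img2):
    """Sum over interior pixels of cos(angle difference) * mean |gradient| * mean value."""
    rows, cols = len(img1), len(img1[0])
    total = 0.0
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            g1 = gradient(img1, r, c)
            g2 = gradient(img2, r, c)
            angle_diff = abs(atan2(g1[1], g1[0]) - atan2(g2[1], g2[0]))
            mean_magnitude = (hypot(*g1) + hypot(*g2)) / 2.0
            mean_value = (img1[r][c] + img2[r][c]) / 2.0
            total += cos(angle_diff) * mean_magnitude * mean_value
    return total


# Toy images: a left-to-right ramp and its mirror image.
ramp = [[c for c in range(4)] for _ in range(4)]
mirrored = [[3 - c for c in range(4)] for _ in range(4)]
same = gradient_similarity(ramp, ramp)          # gradients aligned
opposite = gradient_similarity(ramp, mirrored)  # gradients opposed
```

Aligned gradients give a positive score and opposed gradients a negative one. Where a gradient is exactly zero its angle is arbitrary (here `atan2(0, 0)` returns 0), so a practical version may want to handle flat regions explicitly.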
1.1003 Most of the genes in Figure 5 were identified via gradient similarity.
1.1004 - Gradient similarity provides information complementary to cor-
1.1005 - relation
1.1006 - To show that gradient similarity can provide useful information that cannot
1.1007 - be detected via pointwise analyses, consider Fig. 3. The top row of Fig. 3
1.1008 - displays the 3 genes which most match area AUD, according to a pointwise
1.1009 - method17. The bottom row displays the 3 genes which most match AUD ac-
1.1010 - cording to a method which considers local geometry18 The pointwise method
1.1011 - in the top row identifies genes which express more strongly in AUD than out-
1.1012 - side of it; its weakness is that this includes many areas which don’t have a
1.1013 - salient border matching the areal border. The geometric method identifies
1.1014 - genes whose salient expression border seems to partially line up with the bor-
1.1015 - der of AUD; its weakness is that this includes genes which don’t express over
1.1016 - the entire area. Genes which have high rankings using both pointwise and bor-
1.1017 - der criteria, such as Aph1a in the example, may be particularly good markers.
1.1018 - None of these genes are, individually, a perfect marker for AUD; we deliberately
1.1019 - chose a “difficult” area in order to better contrast pointwise with geometric
1.1020 - methods.
1.1021 - Areas which can be identified by single genes Using gradient simi-
1.1022 - larity, we have already found single genes which roughly identify some areas
1.1023 -and groupings of areas. For each of these areas, an example of a gene which roughly identifies it is shown in Figure 5. We
1.1024 -have not yet cross-verified these genes in other atlases.
1.1025 + Gradient similarity provides information complementary to
1.1026 + correlation
1.1027 + To show that gradient similarity can provide useful information that
1.1028 + cannot be detected via pointwise analyses, consider Fig. 3. The top
1.1029 + row of Fig. 3 displays the 3 genes which most match area AUD, ac-
1.1030 + cording to a pointwise method17. The bottom row displays the 3 genes
1.1031 + which most match AUD according to a method which considers local
1.1032 + geometry18. The pointwise method in the top row identifies genes which
1.1033 + express more strongly in AUD than outside of it; its weakness is that
1.1034 + this includes many areas which don’t have a salient border matching
1.1035 + the areal border. The geometric method identifies genes whose salient
1.1036 + expression border seems to partially line up with the border of AUD;
1.1037 + its weakness is that this includes genes which don’t express over the
1.1038 + entire area. Genes which have high rankings using both pointwise and
1.1039 + border criteria, such as Aph1a in the example, may be particularly good
1.1040 + markers. None of these genes are, individually, a perfect marker for
1.1041 + AUD; we deliberately chose a “difficult” area in order to better contrast
1.1042 + pointwise with geometric methods.
1.1043 + Areas which can be identified by single genes Using gradient
1.1044 + similarity, we have already found single genes which roughly identify
1.1045 + some areas and groupings of areas. For each of these areas, an ex-
1.1046 + ample of a gene which roughly identifies it is shown in Figure 5. We
1.1047 + have not yet cross-verified these genes in other atlases.
1.1048 + In addition, there are a number of areas which are almost identified
1.1049 + by single genes: COAa+NLOT (anterior part of cortical amygdalar area,
1.1050 + nucleus of the lateral olfactory tract), ENT (entorhinal), ACAv (ventral
1.1051 + anterior cingulate), VIS (visual), AUD (auditory).
1.1052 + These results validate our expectation that the ABA dataset can
1.1053 + be exploited to find marker genes for many cortical areas, while also
1.1054 + validating the relevancy of our new scoring method, gradient similarity.
1.1055 + Combinations of multiple genes are useful and necessary for
1.1056 + some areas
1.1057 + In Figure 4, we give an example of a cortical area which is not
1.1058 + marked by any single gene, but which can be identified combinatorially.
1.1059 +According to logistic regression, gene wwc1 is the best fit single gene for predicting whether or not a pixel on
1.1060 +the cortical surface belongs to the motor area (area MO). The upper-left picture in Figure 4 shows wwc1’s spatial
1.1061 _________________________________________
1.1062 - 17For each gene, a logistic regression in which the response variable was whether or not a surface pixel was within area AUD, and the predictor
1.1063 -variable was the value of the expression of the gene underneath that pixel. The resulting scores were used to rank the genes in terms of how well
1.1064 -they predict area AUD.
1.1065 - 18For each gene the gradient similarity between (a) a map of the expression of each gene on the cortical surface and (b) the shape of area AUD,
1.1066 -was calculated, and this was used to rank the genes.
1.1067 -In addition, there are a number of areas which are almost identified by single genes: COAa+NLOT (anterior part of
1.1068 -cortical amygdalar area, nucleus of the lateral olfactory tract), ENT (entorhinal), ACAv (ventral anterior cingulate), VIS
1.1069 -(visual), AUD (auditory).
1.1070 -These results validate our expectation that the ABA dataset can be exploited to find marker genes for many cortical
1.1071 -areas, while also validating the relevancy of our new scoring method, gradient similarity.
1.1072 -Combinations of multiple genes are useful and necessary for some areas
1.1073 -In Figure 4, we give an example of a cortical area which is not marked by any single gene, but which can be identified
1.1074 -combinatorially. Acccording to logistic regression, gene wwc1 is the best fit single gene for predicting whether or not a
1.1075 -pixel on the cortical surface belongs to the motor area (area MO). The upper-left picture in Figure 4 shows wwc1’s spatial
1.1076 -expression pattern over the cortex. The lower-right boundary of MO is represented reasonably well by this gene, but the
1.1077 -gene overshoots the upper-left boundary. This flattened 2-D representation does not show it, but the area corresponding
1.1078 -to the overshoot is the medial surface of the cortex. MO is only found on the dorsal surface. Gene mtif2 is shown in the
1.1079 -upper-right. Mtif2 captures MO’s upper-left boundary, but not its lower-right boundary. Mtif2 does not express very much
1.1080 -on the medial surface. By adding together the values at each pixel in these two figures, we get the lower-left image. This
1.1081 -combination captures area MO much better than any single gene.
1.1082 -This shows that our proposal to develop a method to find combinations of marker genes is both possible and necessary.
1.1083 -Feature selection integrated with prediction As noted earlier, in general, any classifier can be used for feature
1.1084 -selection by running it inside a stepwise wrapper. Also, some learning algorithms integrate soft constraints on number of
1.1085 -features used. Examples of both of these will be seen in the section “Multivariate supervised learning”.
1.1086 + 17For each gene, a logistic regression in which the response variable was whether or not a surface pixel was within area AUD, and the
1.1087 +predictor variable was the value of the expression of the gene underneath that pixel. The resulting scores were used to rank the genes
1.1088 +in terms of how well they predict area AUD.
1.1089 + 18For each gene the gradient similarity between (a) a map of the expression of each gene on the cortical surface and (b) the shape of
1.1090 +area AUD, was calculated, and this was used to rank the genes.
1.1091 +expression pattern over the cortex. The lower-right boundary of MO is represented reasonably well by this gene,
1.1092 +but the gene overshoots the upper-left boundary. This flattened 2-D representation does not show it, but the
1.1093 +area corresponding to the overshoot is the medial surface of the cortex. MO is only found on the dorsal surface.
1.1094 +Gene mtif2 is shown in the upper-right. Mtif2 captures MO’s upper-left boundary, but not its lower-right boundary.
1.1095 +Mtif2 does not express very much on the medial surface. By adding together the values at each pixel in these
1.1096 +two figures, we get the lower-left image. This combination captures area MO much better than any single gene.
1.1097 +This shows that our proposal to develop a method to find combinations of marker genes is both possible and
1.1098 +necessary.
1.1099 +Feature selection integrated with prediction As noted earlier, in general, any classifier can be used for fea-
1.1100 +ture selection by running it inside a stepwise wrapper. Also, some learning algorithms integrate soft constraints
1.1101 +on number of features used. Examples of both of these will be seen in the section “Multivariate supervised
1.1102 +learning”.
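The stepwise wrapper mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used for the pilot results: the classifier is a stand-in (a threshold on summed expression, echoing the pixelwise addition in Figure 4, rather than logistic regression), and the names `threshold_accuracy`, `forward_stepwise`, `summed_expression_score`, and the toy data are all hypothetical.

```python
from typing import Callable, List, Sequence, Tuple


def threshold_accuracy(values: Sequence[float], labels: Sequence[int]) -> float:
    """Best accuracy of the rule `value > cut`, trying every midpoint cut."""
    uniq = sorted(set(values))
    # With a single distinct value the only rule left is "predict all outside".
    cuts = [(a + b) / 2 for a, b in zip(uniq, uniq[1:])] or [uniq[0]]
    return max(
        sum((v > c) == bool(y) for v, y in zip(values, labels)) / len(labels)
        for c in cuts
    )


def forward_stepwise(
    n_features: int,
    score: Callable[[List[int]], float],
    max_features: int,
) -> Tuple[List[int], float]:
    """Greedily add the feature that most improves `score`; stop when none does."""
    selected: List[int] = []
    best = float("-inf")
    while len(selected) < max_features:
        candidates = [f for f in range(n_features) if f not in selected]
        if not candidates:
            break
        top = max(candidates, key=lambda f: score(selected + [f]))
        top_score = score(selected + [top])
        if top_score <= best:
            break
        selected.append(top)
        best = top_score
    return selected, best


# Toy data (invented): 8 surface pixels, the target area covers the middle 4.
labels = [0, 0, 1, 1, 1, 1, 0, 0]
genes = [
    [1, 1, 1, 1, 1, 1, 0, 0],  # gene 0: overshoots one boundary
    [0, 0, 1, 1, 1, 1, 1, 1],  # gene 1: overshoots the other boundary
    [0, 1, 0, 1, 0, 1, 0, 1],  # gene 2: unrelated to the area
]


def summed_expression_score(subset: List[int]) -> float:
    """Accuracy of thresholding the pixelwise sum of the chosen genes."""
    summed = [sum(genes[g][p] for g in subset) for p in range(len(labels))]
    return threshold_accuracy(summed, labels)


selected, accuracy = forward_stepwise(len(genes), summed_expression_score, 2)
print(selected, accuracy)
```

In this toy case neither gene 0 nor gene 1 alone delineates the area, but their sum does, which is the situation described for wwc1 and mtif2. Any other classifier could be substituted by replacing `summed_expression_score`.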
1.1103 Multivariate supervised learning
1.1104
1.1105
1.1106
1.1107
1.1108 -Figure 6: First row: the first 6 reduced dimensions, using PCA. Second
1.1109 -row: the first 6 reduced dimensions, using NNMF. Third row: the first
1.1110 -six reduced dimensions, using landmark Isomap. Bottom row: examples
1.1111 -of kmeans clustering applied to reduced datasets to find 7 clusters. Left:
1.1112 -19 of the major subdivisions of the cortex. Second from left: PCA. Third
1.1113 -from left: NNMF. Right: Landmark Isomap. Additional details: In the
1.1114 -third and fourth rows, 7 dimensions were found, but only 6 displayed. In
1.1115 -the last row: for PCA, 50 dimensions were used; for NNMF, 6 dimensions
1.1116 -were used; for landmark Isomap, 7 dimensions were used. Forward stepwise logistic regression Lo-
1.1117 - gistic regression is a popular method for pre-
1.1118 - dictive modeling of categorial data. As a pi-
1.1119 - lot run, for five cortical areas (SS, AUD, RSP,
1.1120 - VIS, and MO), we performed forward stepwise
1.1121 - logistic regression to find single genes, pairs of
1.1122 - genes, and triplets of genes which predict areal
1.1123 - identify. This is an example of feature selec-
1.1124 - tion integrated with prediction using a stepwise
1.1125 - wrapper. Some of the single genes found were
1.1126 - shown in various figures throughout this doc-
1.1127 - ument, and Figure 4 shows a combination of
1.1128 - genes which was found.
1.1129 - We felt that, for single genes, gradient simi-
1.1130 - larity did a better job than logistic regression at
1.1131 - capturing our subjective impression of a “good
1.1132 - gene”.
1.1133 - SVM on all genes at once
1.1134 - In order to see how well one can do when
1.1135 - looking at all genes at once, we ran a support
1.1136 - vector machine to classify cortical surface pix-
1.1137 - els based on their gene expression profiles. We
1.1138 - achieved classification accuracy of about 81%19.
1.1139 - This shows that the genes included in the ABA
1.1140 - dataset are sufficient to define much of cortical
1.1141 - anatomy. However, as noted above, a classifier
1.1142 - that looks at all the genes at once isn’t as prac-
1.1143 - tically useful as a classifier that uses only a few
1.1144 - genes.
1.1145 +Figure 6: First row: the first 6 reduced dimensions, using PCA. Sec-
1.1146 +ond row: the first 6 reduced dimensions, using NNMF. Third row:
1.1147 +the first six reduced dimensions, using landmark Isomap. Bottom
1.1148 +row: examples of k-means clustering applied to reduced datasets
1.1149 +to find 7 clusters. Left: 19 of the major subdivisions of the cortex.
1.1150 +Second from left: PCA. Third from left: NNMF. Right: Landmark
1.1151 +Isomap. Additional details: In the third and fourth rows, 7 dimen-
1.1152 +sions were found, but only 6 displayed. In the last row: for PCA,
1.1153 +50 dimensions were used; for NNMF, 6 dimensions were used; for
1.1154 +landmark Isomap, 7 dimensions were used. Forward stepwise logistic regression
1.1155 + Logistic regression is a popular method
1.1156 + for predictive modeling of categorical data.
1.1157 + As a pilot run, for five cortical areas (SS,
1.1158 + AUD, RSP, VIS, and MO), we performed
1.1159 + forward stepwise logistic regression to find
1.1160 + single genes, pairs of genes, and triplets
1.1161 + of genes which predict areal identity. This
1.1162 + is an example of feature selection inte-
1.1163 + grated with prediction using a stepwise
1.1164 + wrapper. Some of the single genes found
1.1165 + were shown in various figures throughout
1.1166 + this document, and Figure 4 shows a com-
1.1167 + bination of genes which was found.
1.1168 + We felt that, for single genes, gradi-
1.1169 + ent similarity did a better job than logistic
1.1170 + regression at capturing our subjective im-
1.1171 + pression of a “good gene”.
1.1172 + SVM on all genes at once
1.1173 + In order to see how well one can do
1.1174 + when looking at all genes at once, we ran
1.1175 + a support vector machine to classify corti-
1.1176 + cal surface pixels based on their gene ex-
1.1177 + pression profiles. We achieved classifica-
1.1178 + tion accuracy of about 81%19. This shows
1.1179 + that the genes included in the ABA dataset
1.1180 + are sufficient to define much of cortical
1.1181 + anatomy. However, as noted above, a clas-
1.1182 + sifier that looks at all the genes at once isn’t
1.1183 +as practically useful as a classifier that uses only a few genes.
1.1184 _________________________________________
1.1185 - 195-fold cross-validation.
1.1186 - Data-driven redrawing of the cor-
1.1187 - tical map
1.1188 -We have applied the following dimensionality reduction algorithms to reduce the dimensionality of the gene expression
1.1189 -profile associated with each voxel: Principal Components Analysis (PCA), Simple PCA (SPCA), Multi-Dimensional Scaling
1.1190 -(MDS), Isomap, Landmark Isomap, Laplacian eigenmaps, Local Tangent Space Alignment (LTSA), Hessian locally linear
1.1191 -embedding, Diffusion maps, Stochastic Neighbor Embedding (SNE), Stochastic Proximity Embedding (SPE), Fast Maximum
1.1192 -Variance Unfolding (FastMVU), Non-negative Matrix Factorization (NNMF). Space constraints prevent us from showing
1.1193 -many of the results, but as a sample, PCA, NNMF, and landmark Isomap are shown in the first, second, and third rows of
1.1194 -Figure 6.
1.1195 -After applying the dimensionality reduction, we ran clustering algorithms on the reduced data. To date we have tried
1.1196 -k-means and spectral clustering. The results of k-means after PCA, NNMF, and landmark Isomap are shown in the last
1.1197 -row of Figure 6. To compare, the leftmost picture on the bottom row of Figure 6 shows some of the major subdivisions of
1.1198 -cortex. These results clearly show that different dimensionality reduction techniques capture different aspects of the data
1.1199 -and lead to different clusterings, indicating the utility of our proposal to produce a detailed comparion of these techniques
1.1200 -as applied to the domain of genomic anatomy.
1.1201 + 195-fold cross-validation.
1.1202 +Data-driven redrawing of the cortical map
1.1203 +We have applied the following dimensionality reduction algorithms to reduce the dimensionality of the gene
1.1204 +expression profile associated with each pixel: Principal Components Analysis (PCA), Simple PCA (SPCA), Multi-
1.1205 +Dimensional Scaling (MDS), Isomap, Landmark Isomap, Laplacian eigenmaps, Local Tangent Space Alignment
1.1206 +(LTSA), Stochastic Proximity Embedding (SPE), Fast Maximum Variance Unfolding (FastMVU), Non-negative
1.1207 +Matrix Factorization (NNMF). Space constraints prevent us from showing many of the results, but as a sample,
1.1208 +PCA, NNMF, and landmark Isomap are shown in the first, second, and third rows of Figure 6.
1.1209 +After applying the dimensionality reduction, we ran clustering algorithms on the reduced data. To date we
1.1210 +have tried k-means and spectral clustering. The results of k-means after PCA, NNMF, and landmark Isomap are
1.1211 +shown in the last row of Figure 6. To compare, the leftmost picture on the bottom row of Figure 6 shows some
1.1212 +of the major subdivisions of cortex. These results clearly show that different dimensionality reduction techniques
1.1213 +capture different aspects of the data and lead to different clusterings, indicating the utility of our proposal to
1.1214 +produce a detailed comparison of these techniques as applied to the domain of genomic anatomy.
1.1215
1.1216 -Figure 7: Prototypes corresponding to sample gene clusters,
1.1217 -clustered by gradient similarity. Region boundaries for the
1.1218 -region that most matches each prototype are overlayed. Many areas are captured by clusters of genes We
1.1219 - also clustered the genes using gradient similarity to see if
1.1220 - the spatial regions defined by any clusters matched known
1.1221 - anatomical regions. Figure 7 shows, for ten sample gene
1.1222 - clusters, each cluster’s average expression pattern, compared
1.1223 - to a known anatomical boundary. This suggests that it is
1.1224 - worth attempting to cluster genes, and then to use the re-
1.1225 - sults to cluster voxels.
1.1226 +Figure 7: Prototypes corresponding to sample gene
1.1227 +clusters, clustered by gradient similarity. Region bound-
1.1228 +aries for the region that most matches each prototype
1.1229 +are overlaid. Many areas are captured by clusters of genes
1.1230 + We also clustered the genes using gradient similarity
1.1231 + to see if the spatial regions defined by any clusters
1.1232 + matched known anatomical regions. Figure 7 shows,
1.1233 + for ten sample gene clusters, each cluster’s average
1.1234 + expression pattern, compared to a known anatomical
1.1235 + boundary. This suggests that it is worth attempting to
1.1236 + cluster genes, and then to use the results to cluster
1.1237 + pixels.
1.1238 The approach: what we plan to do
1.1239 Flatmap cortex and segment cortical layers
1.1240 - There are multiple ways to flatten 3-D data into 2-D. We
1.1241 - will compare mappings from manifolds to planes which at-
1.1242 - tempt to preserve size (such as the one used by Caret[7])
1.1243 - with mappings which preserve angle (conformal maps). Our
1.1244 - method will include a statistical test that warns the user if
1.1245 -the assumption of 2-D structure seems to be wrong.
1.1246 -We have not yet made use of radial profiles. While the radial profiles may be used “raw”, for laminar structures like the
1.1247 -cortex another strategy is to group together voxels in the same cortical layer; each surface pixel would then be associated
1.1248 -with one expression level per gene per layer. We will develop a segmentation algorithm to automatically identify the layer
1.1249 -boundaries.
1.1250 + There are multiple ways to flatten 3-D data into 2-D.
1.1251 + We will compare mappings from manifolds to planes
1.1252 + which attempt to preserve size (such as the one used
1.1253 +by Caret[7]) with mappings which preserve angle (conformal maps). Our method will include a statistical test
1.1254 +that warns the user if the assumption of 2-D structure seems to be wrong.
1.1255 +We have not yet made use of radial profiles. While the radial profiles may be used “raw”, for laminar structures
1.1256 +like the cortex another strategy is to group together voxels in the same cortical layer; each surface pixel would
1.1257 +then be associated with one expression level per gene per layer. We will develop a segmentation algorithm to
1.1258 +automatically identify the layer boundaries.
1.1259 Develop algorithms that find genetic markers for anatomical regions
1.1260 -We will develop scoring methods for evaluating how good individual genes are at marking areas. We will compare pointwise,
1.1261 -geometric, and information-theoretic measures. We already developed one entirely new scoring method (gradient similarity),
1.1262 -but we may develop more. Scoring measures that we will explore will include the L1 norm, correlation, expression energy
1.1263 -ratio, conditional entropy, gradient similarity, Jaccard similarity, Dice similarity, Hough transform, and statistical tests such
1.1264 -as Student’s t-test, and the Mann-Whitney U test (a non-parametric test). In addition, any classifier induces a scoring
1.1265 -measure on genes by taking the prediction error when using that gene to predict the target.
1.1266 -Using some combination of these measures, we will develop a procedure to find single marker genes for anatomical regions:
1.1267 -for each cortical area, we will rank the genes by their ability to delineate each area. We will quantitatively compare the list
1.1268 -of single genes generated by our method to the lists generated by previous methods which are mentioned in Aim 1 Related
1.1269 -Work.
1.1270 -Some cortical areas have no single marker genes but can be identified by combinatorial coding. This requires multivariate
1.1271 -scoring measures and feature selection procedures. Many of the measures, such as expression energy, gradient similarity,
1.1272 -Jaccard, Dice, Hough, Student’s t, and Mann-Whitney U are univariate. We will extend these scoring measures for use
1.1273 -in multivariate feature selection, that is, for scoring how well combinations of genes, rather than individual genes, can
1.1274 -distinguish a target area. There are existing multivariate forms of some of the univariate scoring measures, for example,
1.1275 -Hotelling’s T-square is a multivariate analog of Student’s t.
1.1276 -We will develop a feature selection procedure for choosing the best small set of marker genes for a given anatomical
1.1277 -area. In addition to using the scoring measures that we develop, we will also explore (a) feature selection using a stepwise
1.1278 -wrapper over “vanilla” classifiers such as logistic regression, (b) supervised learning methods such as decision trees which
1.1279 -incrementally/greedily combine single gene markers into sets, and (c) supervised learning methods which use soft constraints
1.1280 -to minimize number of features used, such as sparse support vector machines.
1.1281 -Since errors of displacement and of shape may cause genes and target areas to match less than they should, we will
1.1282 -consider the robustness of feature selection methods in the presence of error. Some of these methods, such as the Hough
1.1283 -transform, are designed to be resistant in the presence of error, but many are not. We will consider extensions to scoring
1.1284 -measures that may improve their robustness; for example, a wrapper that runs a scoring method on small displacements
1.1285 -and distortions of the data adds robustness to registration error at the expense of computation time.
1.1286 -An area may be difficult to identify because the boundaries are misdrawn in the atlas, or because the shape of the natural
1.1287 -domain of gene expression corresponding to the area is different from the shape of the area as recognized by anatomists.
1.1288 -We will extend our procedure to handle difficult areas by combining areas or redrawing their boundaries. We will develop
1.1289 -extensions to our procedure which (a) detect when a difficult area could be fit if its boundary were redrawn slightly, and (b)
1.1290 -detect when a difficult area could be combined with adjacent areas to create a larger area which can be fit.
1.1291 -A future publication on the method that we develop in Aim 1 will review the scoring measures and quantitatively compare
1.1292 -their performance in order to provide a foundation for future research of methods of marker gene finding. We will measure
1.1293 -the robustness of the scoring measures as well as their absolute performance on our dataset.
1.1294 -Classifiers
1.1295 -We will explore and compare different classifiers. As noted above, this activity is not separate from the previous one,
1.1296 -because some supervised learning algorithms include feature selection, and any classifier can be combined with a stepwise
1.1297 -wrapper for use as a feature selection method. We will explore logistic regression (including spatial models[15]), decision
1.1298 -trees20 , sparse SVMs, generative mixture models (including naive bayes), kernel density estimation, instance-based learning
1.1299 -methods (such as k-nearest neighbor), genetic algorithms, and artificial neural networks.
1.1300 -Application to cortical areas
1.1301 -# confirm with EMAGE, GeneAtlas, GENSAT, etc, to fight overfitting, two hemis
1.1302 +Scoring measures and feature selection We will develop scoring methods for evaluating how good individual
1.1303 +genes are at marking areas. We will compare pointwise, geometric, and information-theoretic measures. We
1.1304 +already developed one entirely new scoring method (gradient similarity), but we may develop more. Scoring
1.1305 +measures that we will explore will include the L1 norm, correlation, expression energy ratio, conditional entropy,
1.1306 +gradient similarity, Jaccard similarity, Dice similarity, Hough transform, and statistical tests such as Student’s t-
1.1307 +test, and the Mann-Whitney U test (a non-parametric test). In addition, any classifier induces a scoring measure
1.1308 +on genes by taking the prediction error when using that gene to predict the target.
1.1309 +Using some combination of these measures, we will develop a procedure to find single marker genes for
1.1310 +anatomical regions: for each cortical area, we will rank the genes by their ability to delineate each area. We
1.1311 +will quantitatively compare the list of single genes generated by our method to the lists generated by previous
1.1312 +methods which are mentioned in Aim 1 Related Work.
1.1313 +Some cortical areas have no single marker genes but can be identified by combinatorial coding. This requires
1.1314 +multivariate scoring measures and feature selection procedures. Many of the measures, such as expression
1.1315 +energy, gradient similarity, Jaccard, Dice, Hough, Student’s t, and Mann-Whitney U are univariate. We will extend
1.1316 +these scoring measures for use in multivariate feature selection, that is, for scoring how well combinations of
1.1317 +genes, rather than individual genes, can distinguish a target area. There are existing multivariate forms of some
1.1318 +of the univariate scoring measures, for example, Hotelling’s T-square is a multivariate analog of Student’s t.
1.1319 +We will develop a feature selection procedure for choosing the best small set of marker genes for a given
1.1320 +anatomical area. In addition to using the scoring measures that we develop, we will also explore (a) feature
1.1321 +selection using a stepwise wrapper over “vanilla” classifiers such as logistic regression, (b) supervised learning
1.1322 +methods such as decision trees which incrementally/greedily combine single gene markers into sets, and (c)
1.1323 +supervised learning methods which use soft constraints to minimize number of features used, such as sparse
1.1324 +support vector machines.
1.1325 +Since errors of displacement and of shape may cause genes and target areas to match less than they should,
1.1326 +we will consider the robustness of feature selection methods in the presence of error. Some of these methods,
1.1327 +such as the Hough transform, are designed to be resistant in the presence of error, but many are not. We will
1.1328 +consider extensions to scoring measures that may improve their robustness; for example, a wrapper that runs a
1.1329 +scoring method on small displacements and distortions of the data adds robustness to registration error at the
1.1330 +expense of computation time.
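The displacement wrapper described above could look roughly like the sketch below. It is an assumption-laden illustration: maps are binary 2-D lists, only whole-pixel translations are tried (no distortions), Dice similarity is used as the example scoring method, and the names `shift`, `dice`, and `displacement_robust` are hypothetical.

```python
from typing import Callable, List

Grid = List[List[int]]


def shift(img: Grid, dy: int, dx: int, fill: int = 0) -> Grid:
    """Translate a 2-D map by (dy, dx); pixels shifted in from outside get `fill`."""
    h, w = len(img), len(img[0])
    out = [[fill] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w:
                out[ny][nx] = img[y][x]
    return out


def dice(a: Grid, b: Grid) -> float:
    """Dice similarity between two binary maps of the same shape."""
    inter = sum(p * q for ra, rb in zip(a, b) for p, q in zip(ra, rb))
    total = sum(map(sum, a)) + sum(map(sum, b))
    return 2 * inter / total if total else 1.0


def displacement_robust(
    score: Callable[[Grid, Grid], float],
    gene_map: Grid,
    area_mask: Grid,
    max_shift: int = 1,
) -> float:
    """Best score over all translations of the gene map within `max_shift` pixels."""
    shifts = range(-max_shift, max_shift + 1)
    return max(
        score(shift(gene_map, dy, dx), area_mask) for dy in shifts for dx in shifts
    )


# Toy example (invented): the expression domain is the area, displaced by one pixel.
area = [[0, 0, 0, 0], [0, 1, 1, 0], [0, 1, 1, 0], [0, 0, 0, 0]]
gene = [[0, 0, 0, 0], [0, 0, 1, 1], [0, 0, 1, 1], [0, 0, 0, 0]]

raw = dice(gene, area)
robust = displacement_robust(dice, gene, area, max_shift=1)
print(raw, robust)
```

The wrapper calls the scoring method (2 * max_shift + 1)^2 times, which is the computational cost the text refers to.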
1.1331 +An area may be difficult to identify because the boundaries are misdrawn in the atlas, or because the shape
1.1332 +of the natural domain of gene expression corresponding to the area is different from the shape of the area as
1.1333 +recognized by anatomists. We will extend our procedure to handle difficult areas by combining areas or redrawing
1.1334 +their boundaries. We will develop extensions to our procedure which (a) detect when a difficult area could be
1.1335 +fit if its boundary were redrawn slightly20, and (b) detect when a difficult area could be combined with adjacent
1.1336 +areas to create a larger area which can be fit.
1.1337 +A future publication on the method that we develop in Aim 1 will review the scoring measures and quantita-
1.1338 +tively compare their performance in order to provide a foundation for future research of methods of marker gene
1.1339 +finding. We will measure the robustness of the scoring measures as well as their absolute performance on our
1.1340 +dataset.
1.1341 +Classifiers We will explore and compare different classifiers. As noted above, this activity is not separate
1.1342 +from the previous one, because some supervised learning algorithms include feature selection, and any clas-
1.1343 +sifier can be combined with a stepwise wrapper for use as a feature selection method. We will explore logistic
1.1344 +regression (including spatial models[16]), decision trees21, sparse SVMs, generative mixture models (including
1.1345 +naive Bayes), kernel density estimation, instance-based learning methods (such as k-nearest neighbor), genetic
1.1346 +algorithms, and artificial neural networks.
1.1347 Develop algorithms to suggest a division of a structure into anatomical parts
1.1348 -1.Explore dimensionality reduction algorithms applied to pixels: including TODO
1.1349 -2.Explore dimensionality reduction algorithms applied to genes: including TODO
1.1350 -3.Explore clustering algorithms applied to pixels: including TODO
1.1351 -4.Explore clustering algorithms applied to genes: including gene shaving[9], TODO
1.1352 -5.Develop an algorithm to use dimensionality reduction and/or hierarchial clustering to create anatomical maps
1.1353 -6.Run this algorithm on the cortex: present a hierarchial, genoarchitectonic map of the cortex
1.1354 -# Linear discriminant analysis
1.1355 -# jbt, coclustering
1.1356 -# self-organizing map
1.1357 -# Linear discriminant analysis
1.1358 -# compare using clustering scores
1.1359 -# multivariate gradient similarity
1.1360 -# deep belief nets
1.1361 -Apply these algorithms to the cortex
1.1362 -Using the methods developed in Aim 1, we will present, for each cortical area, a short list of markers to identify that
1.1363 -area; and we will also present lists of “panels” of genes that can be used to delineate many areas at once. Using the methods
1.1364 -developed in Aim 2, we will present one or more hierarchial cortical maps. We will identify and explain how the statistical
1.1365 -structure in the gene expression data led to any unexpected or interesting features of these maps, and we will provide
1.1366 -biological hypotheses to interpret any new cortical areas, or groupings of areas, which are discovered.
1.1367 +Explore dimensionality reduction on gene expression profiles We have already described the application
1.1368 +of ten dimensionality reduction algorithms for the purpose of replacing the gene expression profiles, which are
1.1369 +vectors of about 4000 gene expression levels, with a smaller number of features. We plan to further explore
1.1370 +and interpret these results, as well as to apply other unsupervised learning algorithms, including independent
1.1371 +components analysis, self-organizing maps, and generative models such as deep Boltzmann machines. We
1.1372 +will explore ways to quantitatively compare the relevance of the different dimensionality reduction methods for
1.1373 +identifying cortical areal boundaries.
1.1374 +Explore dimensionality reduction on pixels Instead of applying dimensionality reduction to the gene ex-
1.1375 _________________________________________
1.1376 - 20Actually, we have already begun to explore decision trees. For each cortical area, we have used the C4.5 algorithm to find a decision tree for
1.1377 -that area. We achieved good classification accuracy on our training set, but the number of genes that appeared in each tree was too large. We
1.1378 -plan to implement a pruning procedure to generate trees that use fewer genes.
1.1379 + 20Not just any redrawing is acceptable, only those which appear to be justified as a natural spatial domain of gene expression by
1.1380 +multiple sources of evidence. Interestingly, the need to detect “natural spatial domains of gene expression” in a data-driven fashion
1.1381 +means that the methods of Aim 2 might be useful in achieving Aim 1, as well – particularly discriminative dimensionality reduction.
1.1382 + 21Actually, we have already begun to explore decision trees. For each cortical area, we have used the C4.5 algorithm to find a decision
1.1383 +tree for that area. We achieved good classification accuracy on our training set, but the number of genes that appeared in each tree was
1.1384 +too large. We plan to implement a pruning procedure to generate trees that use fewer genes.
1.1385 +pression profiles, the same techniques can be applied instead to the pixels22. It is possible that the features
1.1386 +generated in this way by some dimensionality reduction techniques will directly correspond to interesting spatial
1.1387 +regions.
1.1388 +Explore clustering and segmentation algorithms on pixels We will explore clustering and segmenta-
1.1389 +tion algorithms in order to segment the pixels into regions. We will explore k-means, spectral clustering, gene
1.1390 +shaving[9], recursive division clustering, multivariate generalizations of edge detectors, multivariate generaliza-
1.1391 +tions of watershed transformations, region growing, active contours, graph partitioning methods, and recursive
1.1392 +agglomerative clustering with various linkage functions. These methods can be combined with dimensionality
1.1393 +reduction.
1.1394 +Explore clustering on genes We have already shown that the procedure of clustering genes according to
1.1395 +gradient similarity, and then creating an averaged prototype of each cluster’s expression pattern, yields some
1.1396 +spatial patterns which match cortical areas. We will further explore the clustering of genes.
1.1397 +In addition to using the cluster expression prototypes directly to identify spatial regions, this might be useful
1.1398 +as a component of dimensionality reduction. For example, one could imagine clustering similar genes and then
1.1399 +replacing their expression levels with a single average expression level, thereby removing some redundancy from
1.1400 +the gene expression profiles. One could then perform clustering on pixels (possibly after a second dimensionality
1.1401 +reduction step) in order to identify spatial regions. It remains to be seen whether removal of redundancy would
1.1402 +help or hurt the ultimate goal of identifying interesting spatial regions.
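The redundancy-removal step described above can be sketched as follows. This is a minimal illustration only, not the proposal's implementation: it assumes the gene clusters have already been found by some other procedure, and the names `average_gene_clusters`, `expression`, and `gene_clusters` are hypothetical.

```python
def average_gene_clusters(expression, gene_clusters):
    """Replace each cluster of genes with its average expression level.

    expression    -- list of pixel rows; each row lists one level per gene
    gene_clusters -- gene_clusters[g] is the cluster id (0..k-1) of gene g
                     (assumed to be supplied by an earlier clustering step)
    Returns a list of pixel rows with one averaged feature per gene cluster.
    """
    n_clusters = max(gene_clusters) + 1
    sizes = [0] * n_clusters
    for cluster in gene_clusters:
        sizes[cluster] += 1
    if 0 in sizes:
        raise ValueError("cluster ids must cover 0..k-1 with no gaps")

    reduced = []
    for row in expression:
        if len(row) != len(gene_clusters):
            raise ValueError("each pixel row needs one level per gene")
        sums = [0.0] * n_clusters
        for level, cluster in zip(row, gene_clusters):
            sums[cluster] += level
        reduced.append([s / n for s, n in zip(sums, sizes)])
    return reduced


# Two pixels, three genes; genes 0 and 1 share a cluster, gene 2 is alone.
pixels = [[1.0, 3.0, 10.0],
          [2.0, 4.0, 20.0]]
reduced_pixels = average_gene_clusters(pixels, [0, 0, 1])
```

The reduced rows could then be passed to any pixel-clustering method in place of the full gene expression profiles.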
1.1403 +Explore co-clustering Some algorithms simultaneously incorporate clustering on instances
1.1404 +and on features (in our case, genes and pixels), for example, IRM[11]. These are called co-clustering or
1.1405 +biclustering algorithms.
1.1406 +Quantitatively compare different methods In order to tell which method is best for genomic anatomy, for
1.1407 +each experimental method we will compare the cortical map found by unsupervised learning to a cortical map
1.1408 +derived from the Allen Reference Atlas. In order to compare the experimental clustering with the reference
1.1409 +clustering, we will explore various quantitative metrics that purport to measure how similar two clusterings are,
1.1410 +such as Jaccard, Rand index, Fowlkes-Mallows, variation of information, Larsen, Van Dongen, and others.
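As an illustration of two of the metrics named above, the Rand index and the Jaccard index can both be computed by counting pairs of pixels according to whether each clustering places the pair in the same cluster. The sketch below is a simple quadratic-time version under that pair-counting definition; the function names are hypothetical, and it assumes each clustering is given as one label per pixel.

```python
from itertools import combinations


def pair_counts(labels_a, labels_b):
    """Count pixel pairs by whether each clustering groups them together."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both clusterings must label the same pixels")
    both = only_a = only_b = neither = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        if same_a and same_b:
            both += 1
        elif same_a:
            only_a += 1
        elif same_b:
            only_b += 1
        else:
            neither += 1
    return both, only_a, only_b, neither


def rand_index(labels_a, labels_b):
    """Fraction of pixel pairs on which the two clusterings agree."""
    both, only_a, only_b, neither = pair_counts(labels_a, labels_b)
    total = both + only_a + only_b + neither
    return (both + neither) / total if total else 1.0


def jaccard_index(labels_a, labels_b):
    """Agreement on co-clustered pairs, ignoring pairs split by both."""
    both, only_a, only_b, _ = pair_counts(labels_a, labels_b)
    denom = both + only_a + only_b
    return both / denom if denom else 1.0


# Hypothetical labels: an experimental map versus a reference map, 4 pixels.
experimental = [0, 0, 1, 1]
reference = [0, 0, 1, 2]
ri = rand_index(experimental, reference)
ji = jaccard_index(experimental, reference)
```

Both scores depend only on which pixels are grouped together, not on the label values themselves, so the experimental and reference maps do not need a shared naming of areas.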
1.1411 +Discriminative dimensionality reduction In addition to using a purely data-driven approach to identify
1.1412 +spatial regions, it might be useful to see how well the known regions can be reconstructed from a small number
1.1413 +of features, even if those features are chosen by using knowledge of the regions. For example, linear discriminant
1.1414 +analysis could be used as a dimensionality reduction technique in order to identify a few features which are the
1.1415 +best linear summary of gene expression profiles for the purpose of discriminating between regions. This reduced
1.1416 +feature set could then be used to cluster pixels into regions. Perhaps the resulting clusters will be similar to the
1.1417 +reference atlas, yet more faithful to natural spatial domains of gene expression than the reference atlas is.
1.1418 +Apply the new methods to the cortex
1.1419 +Using the methods developed in Aim 1, we will present, for each cortical area, a short list of markers to identify
1.1420 +that area; and we will also present lists of “panels” of genes that can be used to delineate many areas at once.
1.1421 +Because in most cases the ABA coronal dataset only contains one ISH per gene, it is possible for an unrelated
1.1422 +combination of genes to seem to identify an area when in fact it is only coincidence. There are two ways we will
1.1423 +validate our marker genes to guard against this. First, we will confirm that putative combinations of marker genes
1.1424 +express the same pattern in both hemispheres. Second, we will manually validate our final results on other gene
1.1425 +expression datasets such as EMAGE, GeneAtlas, and GENSAT.
1.1426 +Using the methods developed in Aim 2, we will present one or more hierarchical cortical maps. We will identify
1.1427 +and explain how the statistical structure in the gene expression data led to any unexpected or interesting features
1.1428 +_________________________________________
1.1429 + 22Consider a matrix whose rows represent pixel locations, and whose columns represent genes. An entry in this matrix represents the
1.1430 +gene expression level at a given pixel. One can look at this matrix as a collection of pixels, each corresponding to a vector of many gene
1.1431 +expression levels; or one can look at it as a collection of genes, each corresponding to a vector giving that gene’s expression at each
1.1432 +pixel. Similarly, dimensionality reduction can be used to replace a large number of genes with a small number of features, or it can be
1.1433 +used to replace a large number of pixels with a small number of features.
1.1434 +of these maps, and we will provide biological hypotheses to interpret any new cortical areas, or groupings of
1.1435 +areas, which are discovered.
1.1436 Timeline and milestones
1.1437 Finding marker genes
1.1438 -∙September-November 2009: Develop an automated mechanism for segmenting the cortical voxels into layers
1.1439 -∙November 2009 (milestone): Have completed construction of a flatmapped, cortical dataset with information for each
1.1440 -layer
1.1441 -∙October 2009-April 2010: Develop scoring methods and to test them in various supervised learning frameworks. Also
1.1442 -test out various dimensionality reduction schemes in combination with supervised learning. create or extend supervised
1.1443 -learning frameworks which use multivariate versions of the best scoring methods.
1.1444 -∙January 2010 (milestone): Submit a publication on single marker genes for cortical areas
1.1445 -∙February-July 2010: Continue to develop scoring methods and supervised learning frameworks. Explore the best way
1.1446 -to integrate radial profiles with supervised learning. Explore the best way to make supervised learning techniques
1.1447 -robust against incorrect labels (i.e. when the areas drawn on the input cortical map are slightly off). Quantitatively
1.1448 -compare the performance of different supervised learning techniques. Validate marker genes found in the ABA dataset
1.1449 -by checking against other gene expression datasets. Create documentation and unit tests for software toolbox for Aim
1.1450 -1. Respond to user bug reports for Aim 1 software toolbox.
1.1451 -∙June 2010 (milestone): Submit a paper describing a method fulfilling Aim 1. Release toolbox.
1.1452 -∙July 2010 (milestone): Submit a paper describing combinations of marker genes for each cortical area, and a small
1.1453 -number of marker genes that can, in combination, define most of the areas at once
1.1454 +September-November 2009: Develop an automated mechanism for segmenting the cortical voxels into layers
1.1455 +November 2009 (milestone): Have completed construction of a flatmapped, cortical dataset with information
1.1456 +for each layer
1.1457 +October 2009-April 2010: Develop scoring methods and test them in various supervised learning frameworks.
1.1458 +Also test out various dimensionality reduction schemes in combination with supervised learning. Create or extend
1.1459 +supervised learning frameworks which use multivariate versions of the best scoring methods.
1.1460 +January 2010 (milestone): Submit a publication on single marker genes for cortical areas
1.1461 +February-July 2010: Continue to develop scoring methods and supervised learning frameworks. Explore the
1.1462 +best way to integrate radial profiles with supervised learning. Explore the best way to make supervised learning
1.1463 +techniques robust against incorrect labels (i.e. when the areas drawn on the input cortical map are slightly
1.1464 +off). Quantitatively compare the performance of different supervised learning techniques. Validate marker genes
1.1465 +found in the ABA dataset by checking against other gene expression datasets. Create documentation and unit
1.1466 +tests for software toolbox for Aim 1. Respond to user bug reports for Aim 1 software toolbox.
1.1467 +June 2010 (milestone): Submit a paper describing a method fulfilling Aim 1. Release toolbox.
1.1468 +July 2010 (milestone): Submit a paper describing combinations of marker genes for each cortical area, and a
1.1469 +small number of marker genes that can, in combination, define most of the areas at once
1.1470 Revealing new ways to parcellate a structure into regions
1.1471 -∙June 2010-March 2011: Explore dimensionality reduction algorithms for Aim 2. Explore standard hierarchial clus-
1.1472 -tering algorithms, used in combination with dimensionality reduction, for Aim 2. Explore co-clustering algorithms.
1.1473 -Think about how radial profile information can be used for Aim 2. Adapt clustering algorithms to use radial profile
1.1474 -information. Quantitatively compare the performance of different dimensionality reduction and clustering techniques.
1.1475 -Quantitatively compare the value of different flatmapping methods and ways of representing radial profiles.
1.1476 -∙March 2011 (milestone): Submit a paper describing a method fulfilling Aim 2. Release toolbox.
1.1477 -∙February-May 2011: Using the methods developed for Aim 2, explore the genomic anatomy of the cortex. If new ways
1.1478 -of organizing the cortex into areas are discovered, read the literature and talk to people to learn about research related
1.1479 -to interpreting our results. Create documentation and unit tests for software toolbox for Aim 2. Respond to user bug
1.1480 -reports for Aim 2 software toolbox.
1.1481 -∙May 2011 (milestone): Submit a paper on the genomic anatomy of the cortex, using the methods developed in Aim 2
1.1482 -∙May-August 2011: Revisit Aim 1 to see if what was learned during Aim 2 can improve the methods for Aim 1. Follow
1.1483 -up on responses to our papers. Possibly submit another paper.
1.1484 +June 2010-March 2011: Explore dimensionality reduction algorithms for Aim 2. Explore standard hierarchical
1.1485 +clustering algorithms, used in combination with dimensionality reduction, for Aim 2. Explore co-clustering algo-
1.1486 +rithms. Think about how radial profile information can be used for Aim 2. Adapt clustering algorithms to use radial
1.1487 +profile information. Quantitatively compare the performance of different dimensionality reduction and clustering
1.1488 +techniques. Quantitatively compare the value of different flatmapping methods and ways of representing radial
1.1489 +profiles.
1.1490 +March 2011 (milestone): Submit a paper describing a method fulfilling Aim 2. Release toolbox.
1.1491 +February-May 2011: Using the methods developed for Aim 2, explore the genomic anatomy of the cortex. If
1.1492 +new ways of organizing the cortex into areas are discovered, read the literature and talk to people to learn about
1.1493 +research related to interpreting our results. Create documentation and unit tests for software toolbox for Aim 2.
1.1494 +Respond to user bug reports for Aim 2 software toolbox.
1.1495 +May 2011 (milestone): Submit a paper on the genomic anatomy of the cortex, using the methods developed in
1.1496 +Aim 2
1.1497 +May-August 2011: Revisit Aim 1 to see if what was learned during Aim 2 can improve the methods for Aim 1.
1.1498 +Follow up on responses to our papers. Possibly submit another paper.
1.1499 Bibliography & References Cited
1.1500 -[1]Chris Adamson, Leigh Johnston, Terrie Inder, Sandra Rees, Iven Mareels, and Gary Egan. A Tracking Approach to
1.1501 -Parcellation of the Cerebral Cortex, volume Volume 3749/2005 of Lecture Notes in Computer Science, pages 294–301.
1.1502 -Springer Berlin / Heidelberg, 2005.
1.1503 -[2]J. Annese, A. Pitiot, I. D. Dinov, and A. W. Toga. A myelo-architectonic method for the structural classification of
1.1504 -cortical areas. NeuroImage, 21(1):15–26, 2004.
1.1505 -[3]Tanya Barrett, Dennis B. Troup, Stephen E. Wilhite, Pierre Ledoux, Dmitry Rudnev, Carlos Evangelista, Irene F.
1.1506 -Kim, Alexandra Soboleva, Maxim Tomashevsky, and Ron Edgar. NCBI GEO: mining tens of millions of expression
1.1507 -profiles–database and tools update. Nucl. Acids Res., 35(suppl_1):D760–765, 2007.
1.1508 -[4]George W. Bell, Tatiana A. Yatskievych, and Parker B. Antin. GEISHA, a whole-mount in situ hybridization gene
1.1509 -expression screen in chicken embryos. Developmental Dynamics, 229(3):677–687, 2004.
1.1510 -[5]James P Carson, Tao Ju, Hui-Chen Lu, Christina Thaller, Mei Xu, Sarah L Pallas, Michael C Crair, Joe Warren, Wah
1.1511 -Chiu, and Gregor Eichele. A digital atlas to characterize the mouse brain transcriptome. PLoS Comput Biol, 1(4):e41,
1.1512 -2005.
1.1513 -[6]Mark H. Chin, Alex B. Geng, Arshad H. Khan, Wei-Jun Qian, Vladislav A. Petyuk, Jyl Boline, Shawn Levy, Arthur W.
1.1514 -Toga, Richard D. Smith, Richard M. Leahy, and Desmond J. Smith. A genome-scale map of expression for a mouse
1.1515 -brain section obtained using voxelation. Physiol. Genomics, 30(3):313–321, August 2007.
1.1516 -[7]D C Van Essen, H A Drury, J Dickson, J Harwell, D Hanlon, and C H Anderson. An integrated software suite for surface-
1.1517 -based analyses of cerebral cortex. Journal of the American Medical Informatics Association: JAMIA, 8(5):443–59, 2001.
1.1518 -PMID: 11522765.
1.1519 -[8]Shiaoching Gong, Chen Zheng, Martin L. Doughty, Kasia Losos, Nicholas Didkovsky, Uta B. Schambra, Norma J.
1.1520 -Nowak, Alexandra Joyner, Gabrielle Leblanc, Mary E. Hatten, and Nathaniel Heintz. A gene expression atlas of the
1.1521 -central nervous system based on bacterial artificial chromosomes. Nature, 425(6961):917–925, October 2003.
1.1522 -[9]Trevor Hastie, Robert Tibshirani, Michael Eisen, Ash Alizadeh, Ronald Levy, Louis Staudt, Wing Chan, David Botstein,
1.1523 -and Patrick Brown. ’Gene shaving’ as a method for identifying distinct sets of genes with similar expression patterns.
1.1524 -Genome Biology, 1(2):research0003.1–research0003.21, 2000.
1.1525 -[10]Jano Hemert and Richard Baldock. Matching Spatial Regions with Combinations of Interacting Gene Expression Pat-
1.1526 -terns, volume 13 of Communications in Computer and Information Science, pages 347–361. Springer Berlin Heidelberg,
1.1527 -2008.
1.1528 -[11]F. Kruggel, M. K. Brckner, Th. Arendt, C. J. Wiggins, and D. Y. von Cramon. Analyzing the neocortical fine-structure.
1.1529 -Medical Image Analysis, 7(3):251–264, September 2003.
1.1530 -[12]Erh-Fang Lee, Jyl Boline, and Arthur W. Toga. A High-Resolution anatomical framework of the neonatal mouse brain
1.1531 -for managing gene expression data. Frontiers in Neuroinformatics, 1:6, 2007. PMC2525996.
1.1532 -[13]Susan Magdaleno, Patricia Jensen, Craig L. Brumwell, Anna Seal, Karen Lehman, Andrew Asbury, Tony Cheung,
1.1533 -Tommie Cornelius, Diana M. Batten, Christopher Eden, Shannon M. Norland, Dennis S. Rice, Nilesh Dosooye, Sundeep
1.1534 -Shakya, Perdeep Mehta, and Tom Curran. BGEM: an in situ hybridization database of gene expression in the embryonic
1.1535 -and adult mouse nervous system. PLoS Biology, 4(4):e86 EP –, April 2006.
1.1536 -[14]Lydia Ng, Amy Bernard, Chris Lau, Caroline C Overly, Hong-Wei Dong, Chihchau Kuan, Sayan Pathak, Susan M
1.1537 -Sunkin, Chinh Dang, Jason W Bohland, Hemant Bokil, Partha P Mitra, Luis Puelles, John Hohmann, David J Anderson,
1.1538 -Ed S Lein, Allan R Jones, and Michael Hawrylycz. An anatomic gene expression atlas of the adult mouse brain. Nat
1.1539 -Neurosci, 12(3):356–362, March 2009.
1.1540 -[15]Christopher J. Paciorek. Computational techniques for spatial logistic regression with large data sets. Computational
1.1541 -Statistics & Data Analysis, 51(8):3631–3653, May 2007.
1.1542 -[16]George Paxinos and Keith B.J. Franklin. The Mouse Brain in Stereotaxic Coordinates. Academic Press, 2 edition, July
1.1543 -2001.
1.1544 -[17]A. Schleicher, N. Palomero-Gallagher, P. Morosan, S. Eickhoff, T. Kowalski, K. Vos, K. Amunts, and K. Zilles. Quanti-
1.1545 -tative architectural analysis: a new approach to cortical mapping. Anatomy and Embryology, 210(5):373–386, December
1.1546 -2005.
1.1547 -[18]Oliver Schmitt, Lars Hmke, and Lutz Dmbgen. Detection of cortical transition regions utilizing statistical analyses of
1.1548 -excess masses. NeuroImage, 19(1):42–63, May 2003.
1.1549 -[19]Constance M. Smith, Jacqueline H. Finger, Terry F. Hayamizu, Ingeborg J. McCright, Janan T. Eppig, James A.
1.1550 -Kadin, Joel E. Richardson, and Martin Ringwald. The mouse gene expression database (GXD): 2007 update. Nucl.
1.1551 -Acids Res., 35(suppl_1):D618–623, 2007.
1.1552 -[20]Judy Sprague, Leyla Bayraktaroglu, Dave Clements, Tom Conlin, David Fashena, Ken Frazer, Melissa Haendel, Dou-
1.1553 -glas G Howe, Prita Mani, Sridhar Ramachandran, Kevin Schaper, Erik Segerdell, Peiran Song, Brock Sprunger, Sierra
1.1554 -Taylor, Ceri E Van Slyke, and Monte Westerfield. The zebrafish information network: the zebrafish model organism
1.1555 -database. Nucleic Acids Research, 34(Database issue):D581–5, 2006. PMID: 16381936.
1.1556 -[21]Larry Swanson. Brain Maps: Structure of the Rat Brain. Academic Press, 3 edition, November 2003.
1.1557 -[22]Carol L. Thompson, Sayan D. Pathak, Andreas Jeromin, Lydia L. Ng, Cameron R. MacPherson, Marty T. Mortrud,
1.1558 -Allison Cusick, Zackery L. Riley, Susan M. Sunkin, Amy Bernard, Ralph B. Puchalski, Fred H. Gage, Allan R. Jones,
1.1559 -Vladimir B. Bajic, Michael J. Hawrylycz, and Ed S. Lein. Genomic anatomy of the hippocampus. Neuron, 60(6):1010–
1.1560 -1021, December 2008.
1.1561 -[23]Pavel Tomancak, Amy Beaton, Richard Weiszmann, Elaine Kwan, ShengQiang Shu, Suzanna E Lewis, Stephen
1.1562 -Richards, Michael Ashburner, Volker Hartenstein, Susan E Celniker, and Gerald M Rubin. Systematic determina-
1.1563 -tion of patterns of gene expression during drosophila embryogenesis. Genome Biology, 3(12):research008818814, 2002.
1.1564 -PMC151190.
1.1565 -[24]Jano van Hemert and Richard Baldock. Mining Spatial Gene Expression Data for Association Rules, volume 4414/2007
1.1566 -of Lecture Notes in Computer Science, pages 66–76. Springer Berlin / Heidelberg, 2007.
1.1567 -[25]Shanmugasundaram Venkataraman, Peter Stevenson, Yiya Yang, Lorna Richardson, Nicholas Burton, Thomas P. Perry,
1.1568 -Paul Smith, Richard A. Baldock, Duncan R. Davidson, and Jeffrey H. Christiansen. EMAGE edinburgh mouse atlas
1.1569 -of gene expression: 2008 update. Nucl. Acids Res., 36(suppl_1):D860–865, 2008.
1.1570 -[26]Axel Visel, Christina Thaller, and Gregor Eichele. GenePaint.org: an atlas of gene expression patterns in the mouse
1.1571 -embryo. Nucl. Acids Res., 32(suppl_1):D552–556, 2004.
1.1572 -[27]Robert H Waterston, Kerstin Lindblad-Toh, Ewan Birney, Jane Rogers, Josep F Abril, Pankaj Agarwal, Richa Agar-
1.1573 -wala, Rachel Ainscough, Marina Alexandersson, Peter An, Stylianos E Antonarakis, John Attwood, Robert Baertsch,
1.1574 -Jonathon Bailey, Karen Barlow, Stephan Beck, Eric Berry, Bruce Birren, Toby Bloom, Peer Bork, Marc Botcherby,
1.1575 -Nicolas Bray, Michael R Brent, Daniel G Brown, Stephen D Brown, Carol Bult, John Burton, Jonathan Butler,
1.1576 -Robert D Campbell, Piero Carninci, Simon Cawley, Francesca Chiaromonte, Asif T Chinwalla, Deanna M Church,
1.1577 -Michele Clamp, Christopher Clee, Francis S Collins, Lisa L Cook, Richard R Copley, Alan Coulson, Olivier Couronne,
1.1578 -James Cuff, Val Curwen, Tim Cutts, Mark Daly, Robert David, Joy Davies, Kimberly D Delehaunty, Justin Deri,
1.1579 -Emmanouil T Dermitzakis, Colin Dewey, Nicholas J Dickens, Mark Diekhans, Sheila Dodge, Inna Dubchak, Diane M
1.1580 -Dunn, Sean R Eddy, Laura Elnitski, Richard D Emes, Pallavi Eswara, Eduardo Eyras, Adam Felsenfeld, Ginger A
1.1581 -Fewell, Paul Flicek, Karen Foley, Wayne N Frankel, Lucinda A Fulton, Robert S Fulton, Terrence S Furey, Diane Gage,
1.1582 -Richard A Gibbs, Gustavo Glusman, Sante Gnerre, Nick Goldman, Leo Goodstadt, Darren Grafham, Tina A Graves,
1.1583 -Eric D Green, Simon Gregory, Roderic Guig, Mark Guyer, Ross C Hardison, David Haussler, Yoshihide Hayashizaki,
1.1584 -LaDeana W Hillier, Angela Hinrichs, Wratko Hlavina, Timothy Holzer, Fan Hsu, Axin Hua, Tim Hubbard, Adrienne
1.1585 -Hunt, Ian Jackson, David B Jaffe, L Steven Johnson, Matthew Jones, Thomas A Jones, Ann Joy, Michael Kamal,
1.1586 -Elinor K Karlsson, Donna Karolchik, Arkadiusz Kasprzyk, Jun Kawai, Evan Keibler, Cristyn Kells, W James Kent,
1.1587 -Andrew Kirby, Diana L Kolbe, Ian Korf, Raju S Kucherlapati, Edward J Kulbokas, David Kulp, Tom Landers, J P
1.1588 -Leger, Steven Leonard, Ivica Letunic, Rosie Levine, Jia Li, Ming Li, Christine Lloyd, Susan Lucas, Bin Ma, Donna R
1.1589 -Maglott, Elaine R Mardis, Lucy Matthews, Evan Mauceli, John H Mayer, Megan McCarthy, W Richard McCombie,
1.1590 -Stuart McLaren, Kirsten McLay, John D McPherson, Jim Meldrim, Beverley Meredith, Jill P Mesirov, Webb Miller,
1.1591 -Tracie L Miner, Emmanuel Mongin, Kate T Montgomery, Michael Morgan, Richard Mott, James C Mullikin, Donna M
1.1592 -Muzny, William E Nash, Joanne O Nelson, Michael N Nhan, Robert Nicol, Zemin Ning, Chad Nusbaum, Michael J
1.1593 -O’Connor, Yasushi Okazaki, Karen Oliver, Emma Overton-Larty, Lior Pachter, Gens Parra, Kymberlie H Pepin, Jane
1.1594 -Peterson, Pavel Pevzner, Robert Plumb, Craig S Pohl, Alex Poliakov, Tracy C Ponce, Chris P Ponting, Simon Potter,
1.1595 -Michael Quail, Alexandre Reymond, Bruce A Roe, Krishna M Roskin, Edward M Rubin, Alistair G Rust, Ralph San-
1.1596 -tos, Victor Sapojnikov, Brian Schultz, Jrg Schultz, Matthias S Schwartz, Scott Schwartz, Carol Scott, Steven Seaman,
1.1597 -Steve Searle, Ted Sharpe, Andrew Sheridan, Ratna Shownkeen, Sarah Sims, Jonathan B Singer, Guy Slater, Arian
1.1598 -Smit, Douglas R Smith, Brian Spencer, Arne Stabenau, Nicole Stange-Thomann, Charles Sugnet, Mikita Suyama,
1.1599 -Glenn Tesler, Johanna Thompson, David Torrents, Evanne Trevaskis, John Tromp, Catherine Ucla, Abel Ureta-Vidal,
1.1600 -Jade P Vinson, Andrew C Von Niederhausern, Claire M Wade, Melanie Wall, Ryan J Weber, Robert B Weiss, Michael C
1.1601 -Wendl, Anthony P West, Kris Wetterstrand, Raymond Wheeler, Simon Whelan, Jamey Wierzbowski, David Willey,
1.1602 -Sophie Williams, Richard K Wilson, Eitan Winter, Kim C Worley, Dudley Wyman, Shan Yang, Shiaw-Pyng Yang,
1.1603 -Evgeny M Zdobnov, Michael C Zody, and Eric S Lander. Initial sequencing and comparative analysis of the mouse
1.1604 -genome. Nature, 420(6915):520–62, December 2002. PMID: 12466850.
1.1605 +[1]Chris Adamson, Leigh Johnston, Terrie Inder, Sandra Rees, Iven Mareels, and Gary Egan. A Tracking
1.1606 +Approach to Parcellation of the Cerebral Cortex, volume 3749/2005 of Lecture Notes in Computer
1.1607 +Science, pages 294–301. Springer Berlin / Heidelberg, 2005.
1.1608 +[2]J. Annese, A. Pitiot, I. D. Dinov, and A. W. Toga. A myelo-architectonic method for the structural classification
1.1609 +of cortical areas. NeuroImage, 21(1):15–26, 2004.
1.1610 +[3]Tanya Barrett, Dennis B. Troup, Stephen E. Wilhite, Pierre Ledoux, Dmitry Rudnev, Carlos Evangelista,
1.1611 +Irene F. Kim, Alexandra Soboleva, Maxim Tomashevsky, and Ron Edgar. NCBI GEO: mining tens of millions
1.1612 +of expression profiles–database and tools update. Nucl. Acids Res., 35(suppl_1):D760–765, 2007.
1.1613 +[4]George W. Bell, Tatiana A. Yatskievych, and Parker B. Antin. GEISHA, a whole-mount in situ hybridization
1.1614 +gene expression screen in chicken embryos. Developmental Dynamics, 229(3):677–687, 2004.
1.1615 +[5]James P Carson, Tao Ju, Hui-Chen Lu, Christina Thaller, Mei Xu, Sarah L Pallas, Michael C Crair, Joe
1.1616 +Warren, Wah Chiu, and Gregor Eichele. A digital atlas to characterize the mouse brain transcriptome.
1.1617 +PLoS Comput Biol, 1(4):e41, 2005.
1.1618 +[6]Mark H. Chin, Alex B. Geng, Arshad H. Khan, Wei-Jun Qian, Vladislav A. Petyuk, Jyl Boline, Shawn Levy,
1.1619 +Arthur W. Toga, Richard D. Smith, Richard M. Leahy, and Desmond J. Smith. A genome-scale map of
1.1620 +expression for a mouse brain section obtained using voxelation. Physiol. Genomics, 30(3):313–321, August
1.1621 +2007.
1.1622 +[7]D C Van Essen, H A Drury, J Dickson, J Harwell, D Hanlon, and C H Anderson. An integrated software suite
1.1623 +for surface-based analyses of cerebral cortex. Journal of the American Medical Informatics Association:
1.1624 +JAMIA, 8(5):443–59, 2001. PMID: 11522765.
1.1625 +[8]Shiaoching Gong, Chen Zheng, Martin L. Doughty, Kasia Losos, Nicholas Didkovsky, Uta B. Scham-
1.1626 +bra, Norma J. Nowak, Alexandra Joyner, Gabrielle Leblanc, Mary E. Hatten, and Nathaniel Heintz. A
1.1627 +gene expression atlas of the central nervous system based on bacterial artificial chromosomes. Nature,
1.1628 +425(6961):917–925, October 2003.
1.1629 +[9]Trevor Hastie, Robert Tibshirani, Michael Eisen, Ash Alizadeh, Ronald Levy, Louis Staudt, Wing Chan,
1.1630 +David Botstein, and Patrick Brown. ’Gene shaving’ as a method for identifying distinct sets of genes with
1.1631 +similar expression patterns. Genome Biology, 1(2):research0003.1–research0003.21, 2000.
1.1632 +[10]Jano van Hemert and Richard Baldock. Matching Spatial Regions with Combinations of Interacting Gene
1.1633 +Expression Patterns, volume 13 of Communications in Computer and Information Science, pages 347–361.
1.1634 +Springer Berlin Heidelberg, 2008.
1.1635 +[11]C Kemp, JB Tenenbaum, TL Griffiths, T Yamada, and N Ueda. Learning systems of concepts with an infinite
1.1636 +relational model. In AAAI, 2006.
1.1637 +[12]F. Kruggel, M. K. Brückner, Th. Arendt, C. J. Wiggins, and D. Y. von Cramon. Analyzing the neocortical
1.1638 +fine-structure. Medical Image Analysis, 7(3):251–264, September 2003.
1.1639 +[13]Erh-Fang Lee, Jyl Boline, and Arthur W. Toga. A High-Resolution anatomical framework of the neonatal
1.1640 +mouse brain for managing gene expression data. Frontiers in Neuroinformatics, 1:6, 2007. PMC2525996.
1.1641 +[14]Susan Magdaleno, Patricia Jensen, Craig L. Brumwell, Anna Seal, Karen Lehman, Andrew Asbury, Tony
1.1642 +Cheung, Tommie Cornelius, Diana M. Batten, Christopher Eden, Shannon M. Norland, Dennis S. Rice,
1.1643 +Nilesh Dosooye, Sundeep Shakya, Perdeep Mehta, and Tom Curran. BGEM: an in situ hybridization
1.1644 +database of gene expression in the embryonic and adult mouse nervous system. PLoS Biology, 4(4):e86,
1.1645 +April 2006.
1.1646 +[15]Lydia Ng, Amy Bernard, Chris Lau, Caroline C Overly, Hong-Wei Dong, Chihchau Kuan, Sayan Pathak, Su-
1.1647 +san M Sunkin, Chinh Dang, Jason W Bohland, Hemant Bokil, Partha P Mitra, Luis Puelles, John Hohmann,
1.1648 +David J Anderson, Ed S Lein, Allan R Jones, and Michael Hawrylycz. An anatomic gene expression atlas
1.1649 +of the adult mouse brain. Nat Neurosci, 12(3):356–362, March 2009.
1.1650 +[16]Christopher J. Paciorek. Computational techniques for spatial logistic regression with large data sets. Com-
1.1651 +putational Statistics & Data Analysis, 51(8):3631–3653, May 2007.
1.1652 +[17]George Paxinos and Keith B.J. Franklin. The Mouse Brain in Stereotaxic Coordinates. Academic Press, 2nd
1.1653 +edition, July 2001.
1.1654 +[18]A. Schleicher, N. Palomero-Gallagher, P. Morosan, S. Eickhoff, T. Kowalski, K. Vos, K. Amunts, and
1.1655 +K. Zilles. Quantitative architectural analysis: a new approach to cortical mapping. Anatomy and Em-
1.1656 +bryology, 210(5):373–386, December 2005.
1.1657 +[19]Oliver Schmitt, Lars Hömke, and Lutz Dümbgen. Detection of cortical transition regions utilizing statistical
1.1658 +analyses of excess masses. NeuroImage, 19(1):42–63, May 2003.
1.1659 +[20]Constance M. Smith, Jacqueline H. Finger, Terry F. Hayamizu, Ingeborg J. McCright, Janan T. Eppig,
1.1660 +James A. Kadin, Joel E. Richardson, and Martin Ringwald. The mouse gene expression database (GXD):
1.1661 +2007 update. Nucl. Acids Res., 35(suppl_1):D618–623, 2007.
1.1662 +[21]Judy Sprague, Leyla Bayraktaroglu, Dave Clements, Tom Conlin, David Fashena, Ken Frazer, Melissa
1.1663 +Haendel, Douglas G Howe, Prita Mani, Sridhar Ramachandran, Kevin Schaper, Erik Segerdell, Peiran
1.1664 +Song, Brock Sprunger, Sierra Taylor, Ceri E Van Slyke, and Monte Westerfield. The zebrafish information
1.1665 +network: the zebrafish model organism database. Nucleic Acids Research, 34(Database issue):D581–5,
1.1666 +2006. PMID: 16381936.
1.1667 +[22]Larry Swanson. Brain Maps: Structure of the Rat Brain. Academic Press, 3rd edition, November 2003.
1.1668 +[23]Carol L. Thompson, Sayan D. Pathak, Andreas Jeromin, Lydia L. Ng, Cameron R. MacPherson, Marty T.
1.1669 +Mortrud, Allison Cusick, Zackery L. Riley, Susan M. Sunkin, Amy Bernard, Ralph B. Puchalski, Fred H.
1.1670 +Gage, Allan R. Jones, Vladimir B. Bajic, Michael J. Hawrylycz, and Ed S. Lein. Genomic anatomy of the
1.1671 +hippocampus. Neuron, 60(6):1010–1021, December 2008.
1.1672 +[24]Pavel Tomancak, Amy Beaton, Richard Weiszmann, Elaine Kwan, ShengQiang Shu, Suzanna E Lewis,
1.1673 +Stephen Richards, Michael Ashburner, Volker Hartenstein, Susan E Celniker, and Gerald M Rubin. Systematic
1.1674 +determination of patterns of gene expression during Drosophila embryogenesis. Genome Biology,
1.1675 +3(12):research0088.1–research0088.14, 2002. PMC151190.
1.1676 +[25]Jano van Hemert and Richard Baldock. Mining Spatial Gene Expression Data for Association Rules, volume
1.1677 +4414/2007 of Lecture Notes in Computer Science, pages 66–76. Springer Berlin / Heidelberg, 2007.
1.1678 +[26]Shanmugasundaram Venkataraman, Peter Stevenson, Yiya Yang, Lorna Richardson, Nicholas Burton,
1.1679 +Thomas P. Perry, Paul Smith, Richard A. Baldock, Duncan R. Davidson, and Jeffrey H. Christiansen.
1.1680 +EMAGE Edinburgh mouse atlas of gene expression: 2008 update. Nucl. Acids Res., 36(suppl_1):D860–865,
1.1681 +2008.
1.1682 +[27]Axel Visel, Christina Thaller, and Gregor Eichele. GenePaint.org: an atlas of gene expression patterns in
1.1683 +the mouse embryo. Nucl. Acids Res., 32(suppl_1):D552–556, 2004.
1.1684 +[28]Robert H Waterston, Kerstin Lindblad-Toh, Ewan Birney, Jane Rogers, Josep F Abril, Pankaj Agarwal, Richa
1.1685 +Agarwala, Rachel Ainscough, Marina Alexandersson, Peter An, Stylianos E Antonarakis, John Attwood,
1.1686 +Robert Baertsch, Jonathon Bailey, Karen Barlow, Stephan Beck, Eric Berry, Bruce Birren, Toby Bloom, Peer
1.1687 +Bork, Marc Botcherby, Nicolas Bray, Michael R Brent, Daniel G Brown, Stephen D Brown, Carol Bult, John
1.1688 +Burton, Jonathan Butler, Robert D Campbell, Piero Carninci, Simon Cawley, Francesca Chiaromonte, Asif T
1.1689 +Chinwalla, Deanna M Church, Michele Clamp, Christopher Clee, Francis S Collins, Lisa L Cook, Richard R
1.1690 +Copley, Alan Coulson, Olivier Couronne, James Cuff, Val Curwen, Tim Cutts, Mark Daly, Robert David, Joy
1.1691 +Davies, Kimberly D Delehaunty, Justin Deri, Emmanouil T Dermitzakis, Colin Dewey, Nicholas J Dickens,
1.1692 +Mark Diekhans, Sheila Dodge, Inna Dubchak, Diane M Dunn, Sean R Eddy, Laura Elnitski, Richard D Emes,
1.1693 +Pallavi Eswara, Eduardo Eyras, Adam Felsenfeld, Ginger A Fewell, Paul Flicek, Karen Foley, Wayne N
1.1694 +Frankel, Lucinda A Fulton, Robert S Fulton, Terrence S Furey, Diane Gage, Richard A Gibbs, Gustavo
1.1695 +Glusman, Sante Gnerre, Nick Goldman, Leo Goodstadt, Darren Grafham, Tina A Graves, Eric D Green,
1.1696 +Simon Gregory, Roderic Guigó, Mark Guyer, Ross C Hardison, David Haussler, Yoshihide Hayashizaki,
1.1697 +LaDeana W Hillier, Angela Hinrichs, Wratko Hlavina, Timothy Holzer, Fan Hsu, Axin Hua, Tim Hubbard,
1.1698 +Adrienne Hunt, Ian Jackson, David B Jaffe, L Steven Johnson, Matthew Jones, Thomas A Jones, Ann Joy,
1.1699 +Michael Kamal, Elinor K Karlsson, Donna Karolchik, Arkadiusz Kasprzyk, Jun Kawai, Evan Keibler, Cristyn
1.1700 +Kells, W James Kent, Andrew Kirby, Diana L Kolbe, Ian Korf, Raju S Kucherlapati, Edward J Kulbokas, David
1.1701 +Kulp, Tom Landers, J P Leger, Steven Leonard, Ivica Letunic, Rosie Levine, Jia Li, Ming Li, Christine Lloyd,
1.1702 +Susan Lucas, Bin Ma, Donna R Maglott, Elaine R Mardis, Lucy Matthews, Evan Mauceli, John H Mayer,
1.1703 +Megan McCarthy, W Richard McCombie, Stuart McLaren, Kirsten McLay, John D McPherson, Jim Meldrim,
1.1704 +Beverley Meredith, Jill P Mesirov, Webb Miller, Tracie L Miner, Emmanuel Mongin, Kate T Montgomery,
1.1705 +Michael Morgan, Richard Mott, James C Mullikin, Donna M Muzny, William E Nash, Joanne O Nelson,
1.1706 +Michael N Nhan, Robert Nicol, Zemin Ning, Chad Nusbaum, Michael J O’Connor, Yasushi Okazaki, Karen
1.1707 +Oliver, Emma Overton-Larty, Lior Pachter, Genís Parra, Kymberlie H Pepin, Jane Peterson, Pavel Pevzner,
1.1708 +Robert Plumb, Craig S Pohl, Alex Poliakov, Tracy C Ponce, Chris P Ponting, Simon Potter, Michael Quail,
1.1709 +Alexandre Reymond, Bruce A Roe, Krishna M Roskin, Edward M Rubin, Alistair G Rust, Ralph Santos,
1.1710 +Victor Sapojnikov, Brian Schultz, Jörg Schultz, Matthias S Schwartz, Scott Schwartz, Carol Scott, Steven
1.1711 +Seaman, Steve Searle, Ted Sharpe, Andrew Sheridan, Ratna Shownkeen, Sarah Sims, Jonathan B Singer,
1.1712 +Guy Slater, Arian Smit, Douglas R Smith, Brian Spencer, Arne Stabenau, Nicole Stange-Thomann, Charles
1.1713 +Sugnet, Mikita Suyama, Glenn Tesler, Johanna Thompson, David Torrents, Evanne Trevaskis, John Tromp,
1.1714 +Catherine Ucla, Abel Ureta-Vidal, Jade P Vinson, Andrew C Von Niederhausern, Claire M Wade, Melanie
1.1715 +Wall, Ryan J Weber, Robert B Weiss, Michael C Wendl, Anthony P West, Kris Wetterstrand, Raymond
1.1716 +Wheeler, Simon Whelan, Jamey Wierzbowski, David Willey, Sophie Williams, Richard K Wilson, Eitan Win-
1.1717 +ter, Kim C Worley, Dudley Wyman, Shan Yang, Shiaw-Pyng Yang, Evgeny M Zdobnov, Michael C Zody, and
1.1718 +Eric S Lander. Initial sequencing and comparative analysis of the mouse genome. Nature, 420(6915):520–
1.1719 +62, December 2002. PMID: 12466850.
1.1720
1.1721