Introduction

Massive new datasets obtained with techniques such as in situ hybridization (ISH), immunohistochemistry, in situ transgenic reporters, microarray voxelation, and others allow the expression levels of many genes at many locations to be compared. Our goal is to develop automated methods to relate spatial variation in gene expression to anatomy. We want to find marker genes for specific anatomical regions, and also to draw new anatomical maps based on gene expression patterns. We will validate these methods by applying them to 46 anatomical areas within the cerebral cortex, using the Allen Mouse Brain Atlas coronal dataset (ABA).

This project has three primary goals:

(1) develop an algorithm to screen spatial gene expression data for combinations of marker genes which selectively target anatomical regions.

(2) develop an algorithm to suggest new ways of carving up a structure into anatomically distinct regions, based on spatial patterns in gene expression.

(3) adapt our tools for the analysis of multi/hyperspectral imaging data from the Geographic Information Systems (GIS) community.

We will create a 2-D “flat map” dataset of the mouse cerebral cortex that contains a flattened version of the Allen Mouse Brain Atlas ISH data, as well as the boundaries of cortical anatomical areas. We will use this dataset to validate the methods developed in (1) and (2). In addition to its use in neuroscience, this dataset will be useful as a sample dataset for the machine learning community.
Although our particular application involves the 3-D spatial distribution of gene expression, the methods we will develop will generalize to any high-dimensional data over points located in a low-dimensional space. In particular, our methods could be applied to the analysis of multi/hyperspectral imaging data, or alternately to genome-wide sequencing data derived from sets of tissues and disease states.

All algorithms that we develop will be implemented in a GPL open-source software toolkit. The toolkit and the datasets will be published and freely available for others to use.

Background and related work

Cortical anatomy

The cortex is divided into areas and layers. Because of the cortical columnar organization, the parcellation of the cortex into areas can be drawn as a 2-D map on the surface of the cortex. In the third dimension, the boundaries between the areas continue downwards into the cortical depth, perpendicular to the surface. The layer boundaries run parallel to the surface. One can picture an area of the cortex as a slice of a six-layered cake¹.

It is known that different cortical areas have distinct roles in both normal functioning and in disease processes, yet there are no known marker genes for most cortical areas. When it is necessary to divide a tissue sample into cortical areas, this is a manual process that requires a skilled human to combine multiple visual cues and interpret them in the context of their approximate location upon the cortical surface.

¹Outside of isocortex, the number of layers varies.
Even the questions of how many areas should be recognized in the cortex, and what their arrangement is, are still not completely settled. A proposed division of the cortex into areas is called a cortical map. In the rodent, the lack of a single agreed-upon map can be seen by contrasting the recent maps given by Swanson[21] on the one hand, and Paxinos and Franklin[16] on the other. While the maps are certainly very similar in their general arrangement, significant differences remain.

The Allen Mouse Brain Atlas dataset

The Allen Mouse Brain Atlas (ABA) data[13] were produced by performing in situ hybridization on slices of male, 56-day-old C57BL/6J mouse brains. Pictures were taken of each processed slice, and these pictures were semi-automatically analyzed to create a digital measurement of gene expression levels at each location in the slice. Within each slice, cellular spatial resolution is achieved. Using this method, a single physical slice can only be used to measure one gene; many different mouse brains were needed in order to measure the expression of many genes.

Mus musculus is thought to contain about 22,000 protein-coding genes[26]. The ABA contains data on about 20,000 genes in sagittal sections, of which over 4,000 genes are also measured in coronal sections. Our dataset is derived from only the coronal subset of the ABA². An automated nonlinear alignment procedure located the 2-D data from the various slices in a single 3-D coordinate system. In the final 3-D coordinate system, voxels are cubes 200 microns on a side. There are 67×41×58 = 159,326 voxels, of which 51,533 are in the brain[15]. For each voxel and each gene, the expression energy[13] within that voxel is made available.
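To make this layout concrete, the sketch below shows our own illustrative NumPy arrangement of such a dataset (not the ABA's actual file format; the array names and random values are stand-ins): per-gene expression-energy grids plus a brain mask, reduced to the voxels-by-genes matrix used throughout this proposal.

```python
import numpy as np

# Illustrative stand-in for the ABA coronal data layout (not the real file format):
# one 67 x 41 x 58 grid of expression energies per gene, plus a brain mask.
GRID = (67, 41, 58)
rng = np.random.default_rng(0)

n_genes = 5                                 # stand-in for the ~4,000 coronal genes
expression = rng.random((n_genes,) + GRID)  # expression energy per gene, per voxel
brain_mask = rng.random(GRID) < 0.32        # stand-in for the 51,533 in-brain voxels

# Flatten to an "instances x features" matrix:
# one row per in-brain voxel, one column (feature) per gene.
X = expression[:, brain_mask].T
print(X.shape)
```

Every method discussed below, whether supervised classification, feature selection, or clustering, operates on a matrix of this voxels-by-genes shape.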
The ABA is not the only large public spatial gene expression dataset[8][25][5][14][24][4][23][20][3]. However, with the exception of the ABA, GenePaint[25], and EMAGE[24], most of the other resources have not (yet) extracted the expression intensity from the ISH images and registered the results into a single 3-D space.

The remainder of the background section is divided into three parts, one for each major goal.

Goal 1, From Areas to Genes: Given a map of regions, find genes that mark those regions

Machine learning terminology: classifiers

The task of looking for marker genes for known anatomical regions means that one is looking for a set of genes such that, if the expression levels of those genes are known, then the locations of the regions can be inferred.

If we define the regions so that they cover the entire anatomical structure to be subdivided, and restrict ourselves to looking at one voxel at a time, we may say that we are using gene expression in each voxel to assign that voxel to the proper area. We call this a classification task, because each voxel is being assigned to a class (namely, its region). An understanding of the relationship between the combination of gene expression levels and the locations of the regions may be expressed as a function. The input to this function is a voxel, along with the gene expression levels within that voxel; the output is the regional identity of the target voxel, that is, the region to which the target voxel belongs. We call this function a classifier. In general, the input to a classifier is called an instance, and the output is called a label (or a class label).

Our goal is not to produce a single classifier, but rather to develop an automated method for determining a classifier for any known anatomical structure. Therefore, we seek a procedure by which a gene expression dataset may be analyzed in concert with an anatomical atlas in order to produce a classifier. The initial gene expression dataset used in the construction of the classifier is called training data. In the machine learning literature, this sort of procedure may be thought of as a supervised learning task, defined as a task in which the goal is to learn a mapping from instances to labels, and the training data consist of a set of instances (voxels) for which the labels (regions) are known.

Each gene expression level is called a feature, and the selection of which genes³ to look at is called feature selection. Feature selection is one component of the task of learning a classifier.

One class of feature selection methods assigns some sort of score to each candidate gene. The top-ranked genes are then chosen. Some scoring measures can assign a score to a set of selected genes, not just to a single gene; in this case, a dynamic procedure may be used in which features are added and subtracted from the selected set depending on how much they raise the score. Such procedures are called “stepwise” or “greedy”.

²The sagittal data do not cover the entire cortex, and also have greater registration error[15]. Genes were selected by the Allen Institute for coronal sectioning based on “classes of known neuroscientific interest... or through post hoc identification of a marked non-ubiquitous expression pattern”[15].
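A minimal sketch of such a greedy ("stepwise") procedure, on synthetic data; the nearest-centroid classifier used here as the set-scoring function is a deliberately simple stand-in for whatever supervised learner is actually employed:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic training data: 200 voxels x 6 genes, two regions (labels 0 and 1).
# Genes 0 and 1 are informative; genes 2-5 are noise.
y = rng.integers(0, 2, size=200)
X = rng.normal(size=(200, 6))
X[:, 0] += y          # gene 0: overexpressed in region 1
X[:, 1] -= y          # gene 1: underexpressed in region 1

def score(genes):
    """Score a *set* of genes: training accuracy of a nearest-centroid
    classifier restricted to those genes (a simple stand-in learner)."""
    Xs = X[:, sorted(genes)]
    c0, c1 = Xs[y == 0].mean(axis=0), Xs[y == 1].mean(axis=0)
    pred = ((Xs - c1) ** 2).sum(axis=1) < ((Xs - c0) ** 2).sum(axis=1)
    return float(np.mean(pred == (y == 1)))

# Greedy forward selection: repeatedly add the gene that most raises the
# score of the selected set, up to a small budget of genes.
selected, best = set(), 0.0
for _ in range(2):
    gains = {g: score(selected | {g}) for g in range(6) if g not in selected}
    g, s = max(gains.items(), key=lambda kv: kv[1])
    if s <= best:
        break
    selected, best = selected | {g}, s

print(sorted(selected), best)
```

Because the score is attached to the set rather than to individual genes, this procedure can discover combinations (here, one overexpressed and one underexpressed gene) that single-gene ranking would evaluate in isolation.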
Although the classifier itself may only look at the gene expression data within each voxel before classifying that voxel, the algorithm which constructs the classifier may look over the entire dataset. We can categorize score-based feature selection methods depending on how the score is calculated. Often the score calculation consists of assigning a sub-score to each voxel, and then aggregating these sub-scores into a final score. If only information from nearby voxels is used to calculate a voxel's sub-score, then we say it is a local scoring method. If only information from the voxel itself is used to calculate a voxel's sub-score, then we say it is a pointwise scoring method.

Our Strategy for Goal 1

Key questions when choosing a learning method are: What are the instances? What are the features? How are the features chosen? Here are four principles that outline our answers to these questions.

Principle 1: Combinatorial gene expression

It is too much to hope that every anatomical region of interest will be identified by a single gene. For example, in the cortex, there are some areas which are not clearly delineated by any gene included in the ABA coronal dataset. However, at least some of these areas can be delineated by looking at combinations of genes (an example of an area for which multiple genes are necessary and sufficient is provided in Preliminary Results, Figure 4). Therefore, each instance should contain multiple features (genes).

Principle 2: Only look at combinations of small numbers of genes

When the classifier classifies a voxel, it is only allowed to look at the expression of the genes which have been selected as features.
The more data that are available to a classifier, the better it can do. Why not include every gene as a feature? The reason is that we wish to employ the classifier in situations in which it is not feasible to gather data about every gene. For example, if we want to use the expression of marker genes as a trigger for some regionally-targeted intervention, then our intervention must contain a molecular mechanism to check the expression level of each marker gene before it triggers. It is currently infeasible to design a molecular trigger that checks the level of more than a handful of genes. Therefore, we must select only a few genes as features.

The requirement to find combinations of only a small number of genes prevents us from straightforwardly applying many of the simplest techniques from the field of supervised machine learning. In the parlance of machine learning, our task combines feature selection with supervised learning.

Principle 3: Use geometry in feature selection

When doing feature selection with score-based methods, the simplest thing to do would be to score the performance of each voxel by itself and then combine these scores (pointwise scoring). A more powerful approach is to also use information about the geometric relations between each voxel and its neighbors; this requires non-pointwise, local scoring methods. See Preliminary Results, Figure 3 for evidence of the complementary nature of pointwise and local scoring methods.

³Strictly speaking, the features are gene expression levels, but we'll call them genes.
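The pointwise/local distinction can be sketched on a toy one-dimensional strip of surface pixels; both sub-scores below are illustrative stand-ins of our own, not the actual scoring functions we will use:

```python
import numpy as np

# Toy 1-D strip of 12 surface pixels; the target region is pixels 4..7.
target = np.zeros(12)
target[4:8] = 1.0

# A candidate gene: expressed in the region, plus one isolated noisy pixel.
gene = target.copy()
gene[0] = 1.0

# Pointwise sub-score: each pixel is scored using only that pixel.
pointwise = 1.0 - np.abs(gene - target)

# Local sub-score: each pixel is scored using its neighborhood as well,
# here by comparing neighborhood averages (a crude stand-in for geometric
# scores such as gradient similarity).
kernel = np.ones(3) / 3.0
local = 1.0 - np.abs(np.convolve(gene, kernel, mode="same")
                     - np.convolve(target, kernel, mode="same"))

# At the noisy pixel the two sub-scores disagree: the pointwise score sees
# a total mismatch, while the local score also reflects the (matching)
# background neighborhood around it.
print(pointwise[0], local[0])
```

Either vector of sub-scores would then be aggregated (for example, averaged) into a single score for the gene; the difference is only in what information each sub-score is allowed to consult.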
Principle 4: Work in 2-D whenever possible

There are many anatomical structures which are commonly characterized in terms of a two-dimensional manifold. When it is known that the structure one is looking for is two-dimensional, results may be improved by allowing the analysis algorithm to take advantage of this prior knowledge. In addition, it is easier for humans to visualize and work with 2-D data.

Goal 2, From Genes to Areas: Given gene expression data, discover a map of regions

Machine learning terminology: clustering

If one is given a dataset consisting merely of instances, with no class labels, then analysis of the dataset is referred to as unsupervised learning in the jargon of machine learning. One thing that can be done with such a dataset is to group instances together. A set of similar instances is called a cluster, and the activity of grouping the data into clusters is called clustering or cluster analysis.

The task of deciding how to carve up a structure into anatomical regions can be put into these terms. The instances are once again voxels (or pixels) along with their associated gene expression profiles. We make the assumption that voxels from the same anatomical region have similar gene expression profiles, at least compared to voxels from other regions. This means that clustering voxels is the same as finding potential regions; we seek a partitioning of the voxels into regions, that is, into clusters of voxels with similar gene expression.

It is desirable to determine not just one set of regions, but also how these regions relate to each other. The outcome of clustering may be a hierarchical tree of clusters, rather than a single set of clusters which partition the voxels.
This is called hierarchical clustering.

Similarity scores

A crucial choice when designing a clustering method is how to measure similarity, across either pairs of instances, or clusters, or both. There is much overlap between scoring methods for feature selection (discussed above under Goal 1) and scoring methods for similarity.

Dimensionality reduction

In this section, we discuss reducing the length of the per-pixel gene expression feature vector. By “dimension”, we mean the dimension of this vector, not the spatial dimension of the underlying data.

Figure 1: Top row: Genes Nfic and A930001M12Rik are the most correlated with area SS (somatosensory cortex). Bottom row: Genes C130038G02Rik and Cacna1i are those with the best fit using logistic regression. Within each picture, the vertical axis roughly corresponds to anterior at the top and posterior at the bottom, and the horizontal axis roughly corresponds to medial at the left and lateral at the right. The red outline is the boundary of region SS. Pixels are colored according to correlation, with red meaning high correlation and blue meaning low.

Unlike Goal 1, there is no externally-imposed need to select only a handful of informative genes for inclusion in the instances. However, some clustering algorithms perform better on small numbers of features⁴. There are techniques which “summarize” a larger number of features using a smaller number of features; these techniques go by the name of feature extraction or dimensionality reduction.
The small set of features that such a technique yields is called the reduced feature set. Note that the features in the reduced feature set do not necessarily correspond to genes; each feature in the reduced set may be any function of the set of gene expression levels.

Clustering genes rather than voxels

Although the ultimate goal is to cluster the instances (voxels or pixels), one strategy to achieve this goal is to first cluster the features (genes). There are two ways that clusters of genes could be used.

Gene clusters could be used as part of dimensionality reduction: rather than have one feature for each gene, we could have one reduced feature for each gene cluster.

Gene clusters could also be used to directly yield a clustering on instances. This is because many genes have an expression pattern which seems to pick out a single, spatially contiguous region. This suggests the following procedure: cluster together genes which pick out similar regions, and then use the most popular common regions as the final clusters. In Preliminary Results, Figure 7, we show that a number of anatomically recognized cortical regions, as well as some “superregions” formed by lumping together a few regions, are associated with gene clusters in this fashion.

Goal 3: Interoperability with multi/hyperspectral imaging analysis software

A typical color image associates each pixel with a vector of three values. Multispectral and hyperspectral images, however, are images which associate each pixel with a vector containing many values.
The different positions in the vector correspond to different bands of electromagnetic wavelengths⁵.

Some analysis techniques for hyperspectral imaging, especially preprocessing and calibration techniques, make use of the information that the different values captured at each pixel represent adjacent wavelengths of light, which can be combined to make a spectrum. Other analysis techniques ignore the interpretation of the values measured, and their relationship to each other within the electromagnetic spectrum, instead treating them blindly as completely separate features.

With both hyperspectral imaging and spatial gene expression data, each location in space is associated with more than three numerical feature values. The analysis of hyperspectral images can involve supervised classification and unsupervised learning. Often hyperspectral images come from satellites looking at the Earth, and it is desirable to classify what sort of objects occupy a given area of land. Sometimes detailed training data are not available, in which case it is desirable at least to cluster together those regions of land which contain similar objects.

⁴First, because the number of features in the reduced dataset is less than in the original dataset, the running time of clustering algorithms may be much less. Second, it is thought that some clustering algorithms may give better results on reduced data.

⁵In hyperspectral imaging, the bands are adjacent, and the number of different bands is larger. For conciseness, we discuss only hyperspectral imaging, but our methods are also well suited to multispectral imaging with many bands.
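The parallel can be made concrete: both kinds of data reduce to a pixels-by-features matrix, so the same unsupervised routine serves either one. Below is a toy k-means of our own (a real analysis would use a library implementation, such as those in Spectral Python), applied unchanged to a synthetic "gene expression" matrix and a synthetic "hyperspectral" matrix:

```python
import numpy as np

rng = np.random.default_rng(2)

def kmeans(X, k, iters=20):
    """Tiny k-means; serves a pixels-x-genes matrix or a
    pixels-x-wavelength-bands matrix equally well."""
    centers = X[np.linspace(0, len(X) - 1, k).astype(int)].copy()
    for _ in range(iters):
        # Squared distance from every pixel to every center.
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels

# Synthetic "gene expression": 100 pixels x 8 genes, two regions.
expr = np.vstack([rng.normal(0, 1, (50, 8)), rng.normal(4, 1, (50, 8))])
# Synthetic "hyperspectral image": 100 pixels x 30 bands, two cover types.
bands = np.vstack([rng.normal(0, 1, (50, 30)), rng.normal(4, 1, (50, 30))])

labels_expr = kmeans(expr, k=2)    # the same routine clusters
labels_bands = kmeans(bands, k=2)  # both kinds of data
```

Nothing in the routine refers to wavelengths or to genes; it sees only the pixels-by-features matrix, which is exactly why tools from the two fields can be shared.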
We believe that it may be possible for these two different fields to share some common computational tools. To this end, we intend to make use of existing hyperspectral imaging software when possible, and to develop new software in such a way as to make it easy to use for the purpose of hyperspectral image analysis, as well as for our primary purpose of spatial gene expression data analysis.

Related work

Figure 2: Gene Pitx2 is selectively underexpressed in area SS.

As noted above, the GIS community has developed tools for supervised classification and unsupervised clustering in the context of the analysis of hyperspectral imaging data. One tool is Spectral Python⁶. Spectral Python implements various supervised and unsupervised classification methods, as well as utility functions for loading, viewing, and saving spatial data. Although Spectral Python has feature extraction methods (such as principal components analysis) which create a small set of new features computed based on the original features, it does not have feature selection methods, that is, methods to select a small subset out of the original features (although feature selection in hyperspectral imaging has been investigated by others[19]).

There is a substantial body of work on the analysis of gene expression data. Most of this concerns gene expression data which are not fundamentally spatial⁷. Here we review only that work which concerns the automated analysis of spatial gene expression data with respect to anatomy.
Relating to Goal 1, GeneAtlas[5] and EMAGE[24] allow the user to construct a search query by demarcating regions and then specifying either the strength of expression or the name of another gene or dataset whose expression pattern is to be matched. Neither GeneAtlas nor EMAGE allows one to search for combinations of genes that define a region in concert.

Relating to Goal 2, EMAGE[24] allows the user to select a dataset from among a large number of alternatives, or by running a search query, and then to cluster the genes within that dataset. EMAGE clusters via hierarchical complete-linkage clustering.

[15] describes AGEA, the “Anatomic Gene Expression Atlas”. AGEA has three components. Gene Finder: the user selects a seed voxel and the system (1) chooses a cluster which includes the seed voxel, and (2) yields a list of genes which are overexpressed in that cluster. Correlation: the user selects a seed voxel and the system then shows the user how much correlation there is between the gene expression profile of the seed voxel and that of every other voxel. Clusters: AGEA includes a preset hierarchical clustering of voxels based on a recursive bifurcation algorithm with correlation as the similarity metric. AGEA has been applied to the cortex. The paper describes interesting results on the structure of correlations between voxel gene expression profiles within a handful of cortical areas.

⁶http://spectralpython.sourceforge.net/

⁷By “fundamentally spatial” we mean that there is information from a large number of spatial locations indexed by spatial coordinates; not just data which have only a few different locations or which are indexed by anatomical label.
However, that analysis neither looks for genes marking cortical areas, nor does it suggest a cortical map based on gene expression data. Neither of the other components of AGEA can be applied to cortical areas: AGEA's Gene Finder cannot be used to find marker genes for the cortical areas, and AGEA's hierarchical clustering does not produce clusters corresponding to the cortical areas⁸.

Figure 3: The top row shows the two genes which (individually) best predict area AUD, according to logistic regression. The bottom row shows the two genes which (individually) best match area AUD, according to gradient similarity. From left to right and top to bottom, the genes are Ssr1, Efcbp1, Ptk7, and Aph1a.

[6] looks at the mean expression level of genes within anatomical regions, and applies a Student's t-test to determine whether the mean expression level of a gene is significantly higher in the target region. This relates to our Goal 1. [6] also clusters genes, relating to our Goal 2. For each cluster, prototypical spatial expression patterns were created by averaging the genes in the cluster. The prototypes were analyzed manually, without clustering voxels.

These related works differ from our strategy for Goal 1 in at least three ways. First, they find only single genes, whereas we will also look for combinations of genes. Second, they usually can only use overexpression as a marker, whereas we will also search for underexpression.
Third, they use scores based on pointwise expres- bshanks@112: sion levels, whereas we will also use geometric scores bshanks@112: such as gradient similarity (described in Preliminary Re- bshanks@112: sults). Figures 4, 2, and 3 in the Preliminary Results bshanks@112: section contain evidence that each of our three choices bshanks@112: is the right one. bshanks@112: [10] describes a technique to find combinations of bshanks@112: marker genes to pick out an anatomical region. They bshanks@112: use an evolutionary algorithm to evolve logical operators which combine boolean (thresholded) bshanks@112: images in order to match a target image. They apply their technique for finding combinations of bshanks@112: marker genes for the purpose of clustering genes around a “seed gene”. bshanks@112: Relating to our Goal 2, some researchers have attempted to parcellate cortex on the basis of bshanks@112: non-gene expression data. For example, [17], [2], [18], and [1] associate spots on the cortex with bshanks@112: the radial profile9 of response to some stain ([12] uses MRI), extract features from this profile, and bshanks@112: then use similarity between surface pixels to cluster. bshanks@112: [22] describes an analysis of the anatomy of the hippocampus using the ABA dataset. In bshanks@112: addition to manual analysis, two clustering methods were employed, a modified Non-negative bshanks@112: Matrix Factorization (NNMF), and a hierarchical bifurcation clustering scheme using correlation as bshanks@112: ____________________________________ bshanks@112: 8In both cases, the cause is that pairwise correlations between the gene expression of voxels in different areas but bshanks@112: the same layer are often stronger than pairwise correlations between the gene expression of voxels in different layers bshanks@112: but the same area. Therefore, a pairwise voxel correlation clustering algorithm will tend to create clusters representing bshanks@112: cortical layers, not areas. 
bshanks@112: 9A radial profile is a profile along a line perpendicular to the cortical surface. bshanks@112: 7 bshanks@112: bshanks@112: similarity. The paper yielded impressive results, proving the usefulness of computational genomic bshanks@112: anatomy. We have run NNMF on the cortical dataset, and while the results are promising, other bshanks@112: methods may perform as well or better (see Preliminary Results, Figure 6). bshanks@112: Comparing previous work with our Goal 1, there has been fruitful work on finding marker genes, bshanks@112: but only one of the projects explored combinations of marker genes, and none of them compared bshanks@112: the results obtained by using different algorithms or scoring methods. Comparing previous work bshanks@112: with Goal 2, although some projects obtained clusterings, there has not been much comparison bshanks@112: between different algorithms or scoring methods, so it is likely that the best clustering method for bshanks@112: this application has not yet been found. Also, none of these projects did a separate dimensionality bshanks@112: reduction step before clustering pixels, or tried to cluster genes first in order to guide automated bshanks@112: clustering of pixels into spatial regions, or used co-clustering algorithms. bshanks@112: In summary, (a) only one of the previous projects explores combinations of marker genes, (b) bshanks@112: there has been almost no comparison of different algorithms or scoring methods, and (c) there bshanks@112: has been no work on computationally finding marker genes applied to cortical areas, or on finding bshanks@112: a hierarchical clustering that will yield a map of cortical areas de novo from gene expression data. 
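As a sketch of the correlation-based recursive bifurcation idea used by AGEA and [22] (our own simplified reconstruction, not their published algorithms): pick the two most anticorrelated voxels as seeds, assign every voxel to the seed it correlates with best, and recurse.

```python
import numpy as np

rng = np.random.default_rng(3)

def bifurcate(X, depth):
    """Recursive bifurcation with correlation as similarity (a simplified
    reconstruction): seeds are the two most anticorrelated voxels; each
    voxel joins the seed whose profile it correlates with best."""
    n = len(X)
    if depth == 0 or n < 4:
        return [np.arange(n)]
    C = np.corrcoef(X)                              # voxel-by-voxel correlation
    i, j = np.unravel_index(np.argmin(C), C.shape)  # most dissimilar pair
    side = C[:, i] >= C[:, j]
    leaves = []
    for mask in (side, ~side):
        idx = np.flatnonzero(mask)
        leaves.extend(idx[sub] for sub in bifurcate(X[idx], depth - 1))
    return leaves

# Toy data: 40 voxels x 10 genes drawn from two expression "regions".
pattern = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0], dtype=float)
X = np.vstack([rng.normal(0, 1, (20, 10)) + pattern,
               rng.normal(0, 1, (20, 10)) + (1 - pattern)])
leaves = bifurcate(X, depth=2)
print([len(leaf) for leaf in leaves])
```

What such a procedure recovers depends on which pairwise correlations dominate; per the observation in footnote 8, on real cortical data the dominant correlations tend to yield layers rather than areas.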
Our project is guided by a concrete application with a well-specified criterion of success (how well we can find marker genes for, and reproduce the layout of, cortical areas), which will provide a solid basis for comparing different methods.

Data sharing plan

Figure 4: Upper left: wwc1. Upper right: mtif2. Lower left: wwc1 + mtif2 (each pixel's value on the lower left is the sum of the corresponding pixels in the upper row).

We are enthusiastic about the sharing of methods and data, and at the conclusion of the project, we will make all of our data and computer source code publicly available, either in supplemental attachments to publications, or on a website. The source code will be released under the GNU General Public License. We intend to include a software program which, when run, will take as input the Allen Brain Atlas raw data, and produce as output all numbers and charts found in publications resulting from the project. Source code to be released will include extensions to Caret[7], an existing open-source scientific imaging program, and to Spectral Python. Data to be released will include the 2-D “flat map” dataset. This dataset will be submitted to a machine learning dataset repository.

Broader impacts

In addition to validating the usefulness of the algorithms, the application of these methods to cortex will produce immediate benefits, because there are currently no known genetic markers for most cortical areas.
The method developed in Goal 1 will be applied to each cortical area to find a set of marker genes such that the combinatorial expression pattern of those genes uniquely picks out the target area. Finding marker genes will be useful for drug discovery as well as for experimentation, because marker genes can be used to design interventions which selectively target individual cortical areas.

The application of the marker gene finding algorithm to the cortex will also support the development of new neuroanatomical methods. In addition to finding markers for each individual cortical area, we will find a small panel of genes that can delineate many of the areal boundaries at once. The method developed in Goal 2 will provide a genoarchitectonic viewpoint that will contribute to the creation of a better cortical map.

The methods we will develop will be applicable to other datasets beyond the brain, and even to datasets outside of biology. The software we develop will be useful for the analysis of hyperspectral images. Our project will draw attention to this area of overlap between neuroscience and GIS, and may lead to future collaborations between the two fields. The cortical dataset that we produce will be useful in the machine learning community as a sample dataset against which new algorithms can be tested, and its availability may lead to more interest in the design of machine learning algorithms that analyze spatial gene expression.

Preliminary Results

Format conversion between SEV, MATLAB, and NIFTI

We have created software to (politely) download all of the SEV files10 from the Allen Institute website.
We have also created software to convert between the SEV, MATLAB, and NIFTI file formats, as well as some of Caret's file formats.

Flatmap of cortex

We downloaded the ABA data and selected only those voxels which belong to cerebral cortex. We divided the cortex into hemispheres. Using Caret[7], we created a mesh representation of the surface of the selected voxels. For each gene, and for each node of the mesh, we calculated an average of the gene expression of the voxels "underneath" that mesh node. We then flattened the cortex, creating a two-dimensional mesh, and converted this grid into a MATLAB matrix. We manually traced the boundaries of each of 46 cortical areas from the ABA coronal reference atlas slides, and converted this region data into MATLAB format.

At this point, the data are in the form of a number of 2-D matrices, all in registration, with the matrix entries representing a grid of points (pixels) over the cortical surface. There is one 2-D matrix whose entries represent the regional label associated with each surface pixel, and, for each gene, a 2-D matrix whose entries represent the average expression level underneath each surface pixel. The features and the target area are both functions on the surface pixels. They can be regarded as scalar fields over the space of surface pixels; alternately, they can be thought of as images which can be displayed on the flatmapped surface.

Feature selection and scoring methods

Underexpression of a gene can serve as a marker

Underexpression of a gene can sometimes serve as a marker. For example, see Figure 2.
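To make this layout concrete, here is a minimal sketch (in NumPy) of the registered-matrix representation described above; the grid size, region IDs, and gene names are invented for illustration and are not taken from the actual dataset:

```python
import numpy as np

# Miniature stand-in for the flatmapped dataset: one 2-D label matrix and,
# per gene, a 2-D expression matrix, all in registration over the same
# grid of surface pixels. (Sizes and names here are hypothetical.)
H, W = 4, 6
rng = np.random.default_rng(0)

labels = np.zeros((H, W), dtype=int)   # regional label per surface pixel
labels[:, :3] = 1                      # pretend area 1 is the left half
labels[:, 3:] = 2                      # and area 2 is the right half

expression = {                         # one expression image per gene
    "geneA": rng.random((H, W)),
    "geneB": rng.random((H, W)),
}

# A target area is a boolean mask over the surface pixels; each gene image
# is a scalar field on the same grid, so a pointwise score reduces to
# comparing two flattened vectors of equal length.
target_mask = labels == 1
gene_vec = expression["geneA"].ravel()
mask_vec = target_mask.ravel().astype(float)
```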
Correlation

Recall that the instances are surface pixels, and consider the problem of attempting to classify each instance as either a member of a particular anatomical area, or not. The target area can be represented as a boolean mask over the surface pixels. We calculated the correlation between each gene and each cortical area. The top row of Figure 1 shows the three genes most correlated with area SS.

10SEV is a sparse format for spatial data. It is the format in which the ABA data is made available.

Conditional entropy

For each region, we created and ran a forward stepwise procedure which attempted to find pairs of genes such that the conditional entropy of the target area's boolean mask, conditioned upon the gene pair's thresholded expression levels, is minimized. This finds the pairs of genes which are most informative (at least at these threshold levels) relative to the question, "Is this surface pixel a member of the target area?". The advantage over linear methods such as logistic regression is that conditional entropy takes account of arbitrarily nonlinear relationships; for example, if the XOR of two variables predicts the target, conditional entropy would notice, whereas linear methods would not.

Gradient similarity

We noticed that the previous two scoring methods, which are pointwise, often found genes whose pattern of expression did not look similar in shape to the target region. For this reason we designed a non-pointwise scoring method that detects when a gene's expression pattern has a boundary whose shape is similar to the shape of the target region. We call this scoring method "gradient similarity".
The formula is:

    ∑_{pixel ∈ pixels} cos(∠∇1 − ∠∇2) ⋅ (|∇1| + |∇2|)/2 ⋅ (pixel_value1 + pixel_value2)/2

where ∇1 and ∇2 are the gradient vectors of the two images at the current pixel; ∠∇i is the angle of the gradient of image i at the current pixel; |∇i| is the magnitude of the gradient of image i at the current pixel; and pixel_valuei is the value of the current pixel in image i.

The intuition is that we want to see whether the borders of the patterns in the two images are similar; if the borders are similar, then both images will have corresponding pixels with large gradients (because this is a border) which are oriented in a similar direction (because the borders are similar).

Gradient similarity provides information complementary to correlation

To show that gradient similarity can provide useful information that cannot be detected via pointwise analyses, consider Fig. 3. The pointwise method in the top row identifies genes which express more strongly in AUD than outside of it; its weakness is that this includes many genes which lack a salient expression border matching the areal border. The geometric method identifies genes whose salient expression border seems to partially line up with the border of AUD; its weakness is that this includes genes which don't express over the entire area.

Areas which can be identified by single genes

Using gradient similarity, we have already found single genes which roughly identify some areas and groupings of areas. For each of these areas, an example of a gene which roughly identifies it is shown in Figure 5. We have not yet cross-verified these genes in other atlases.
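As a concrete reading of the gradient similarity formula above, the following sketch (NumPy) scores a pair of images; the finite-difference gradient estimation is our implementation choice here and is not specified in the text:

```python
import numpy as np

def gradient_similarity(img1, img2):
    """Sum over pixels of cos(angle difference between the two gradients)
    times the mean gradient magnitude times the mean pixel value."""
    gy1, gx1 = np.gradient(img1)       # finite-difference gradients
    gy2, gx2 = np.gradient(img2)
    angle1 = np.arctan2(gy1, gx1)      # gradient angle per pixel
    angle2 = np.arctan2(gy2, gx2)
    mag1 = np.hypot(gx1, gy1)          # gradient magnitude per pixel
    mag2 = np.hypot(gx2, gy2)
    # Where a gradient is zero its angle is arbitrary, but the zero
    # magnitude removes that pixel's contribution anyway.
    return np.sum(np.cos(angle1 - angle2)
                  * (mag1 + mag2) / 2.0
                  * (img1 + img2) / 2.0)
```

An image scored against itself rewards its own borders (cosine of zero angle difference), while an image scored against its complement has opposing gradients at the same border and is penalized.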
In addition, there are a number of areas which are almost identified by single genes: COAa+NLOT (anterior part of the cortical amygdalar area, plus the nucleus of the lateral olfactory tract), ENT (entorhinal), ACAv (ventral anterior cingulate), VIS (visual), and AUD (auditory).

These results validate our expectation that the ABA dataset can be exploited to find marker genes for many cortical areas, while also validating the relevance of our new scoring method, gradient similarity.

Figure 5: From left to right and top to bottom, single genes which roughly identify areas SS (somatosensory primary + supplemental), SSs (supplemental somatosensory), PIR (piriform), FRP (frontal pole), RSP (retrosplenial), and COApm (cortical amygdalar, posterior part, medial zone). Grouping some areas together, we have also found genes to identify the groups ACA+PL+ILA+DP+ORB+MO (anterior cingulate, prelimbic, infralimbic, dorsal peduncular, orbital, motor) and the posterior and lateral visual areas (VISpm, VISpl, VISl, VISp; posteromedial, posterolateral, lateral, and primary visual; this group is distinguished from its neighbors, but not from the entire rest of the cortex). The genes are Pitx2, Aldh1a2, Ppfibp1, Slco1a5, Tshz2, Trhr, Col12a1, and Ets1.

Combinations of multiple genes are useful and necessary for some areas

In Figure 4, we give an example of a cortical area which is not marked by any single gene, but which can be identified combinatorially.
According to logistic regression, gene wwc1 is the best-fitting single gene for predicting whether or not a pixel on the cortical surface belongs to the motor area (area MO). The upper-left picture in Figure 4 shows wwc1's spatial expression pattern over the cortex. The lower-right boundary of MO is represented reasonably well by this gene, but the gene overshoots the upper-left boundary. The flattened 2-D representation does not show it, but the area corresponding to the overshoot is the medial surface of the cortex, whereas MO is found only on the dorsal surface. Gene mtif2 is shown in the upper right. Mtif2 captures MO's upper-left boundary, but not its lower-right boundary, and does not express very much on the medial surface. By adding together the values at each pixel in these two figures, we get the lower-left image, a combination which captures area MO much better than any single gene.

This shows that our proposal to develop a method to find combinations of marker genes is both possible and necessary.

Multivariate supervised learning

Forward stepwise logistic regression

Logistic regression is a popular method for predictive modeling of categorical data. As a pilot run, for five cortical areas (SS, AUD, RSP, VIS, and MO), we performed forward stepwise logistic regression to find single genes, pairs of genes, and triplets of genes which predict areal identity. This is an example of feature selection integrated with prediction using a stepwise wrapper.
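The stepwise-wrapper idea can be sketched as follows. To keep the sketch dependency-free, the R² of an ordinary least-squares fit stands in for logistic regression as the model score, so this illustrates the wrapper pattern rather than the exact pilot procedure:

```python
import numpy as np

def fit_score(X, y):
    # Stand-in model quality: R^2 of a least-squares linear fit with
    # intercept. (The pilot run used logistic regression instead.)
    A = np.column_stack([X, np.ones(len(y))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1.0 - resid.var() / y.var()

def forward_stepwise(X, y, k):
    # Greedily add the gene (column of X) that most improves the score.
    chosen, remaining = [], list(range(X.shape[1]))
    for _ in range(k):
        best = max(remaining,
                   key=lambda j: fit_score(X[:, chosen + [j]], y))
        chosen.append(best)
        remaining.remove(best)
    return chosen
```

Running it with k = 1, 2, 3 yields single genes, pairs, and triplets, mirroring the pilot run's structure.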
Some of the single genes found were shown in various figures throughout this document, and Figure 4 shows a combination of genes which was found.

SVM on all genes at once

In order to see how well one can do when looking at all genes at once, we ran a support vector machine to classify cortical surface pixels based on their gene expression profiles. We achieved a classification accuracy of about 81%11. However, as noted above, a classifier that looks at all the genes at once isn't as practically useful as a classifier that uses only a few genes.

11Under 5-fold cross-validation.

Data-driven redrawing of the cortical map

We have applied the following dimensionality reduction algorithms to reduce the dimensionality of the gene expression profile associated with each pixel: Principal Components Analysis (PCA), Simple PCA, Multi-Dimensional Scaling, Isomap, Landmark Isomap, Laplacian eigenmaps, Local Tangent Space Alignment, Stochastic Proximity Embedding, Fast Maximum Variance Unfolding, and Non-negative Matrix Factorization (NNMF). Space constraints prevent us from showing many of the results, but as a sample, PCA, NNMF, and landmark Isomap are shown in the first, second, and third rows of Figure 6.

After applying the dimensionality reduction, we ran clustering algorithms on the reduced data. To date we have tried k-means and spectral clustering. The results of k-means after PCA, NNMF, and landmark Isomap are shown in the bottom row of Figure 6. For comparison, the leftmost picture in the bottom row of Figure 6 shows some of the major subdivisions of cortex.
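The reduce-then-cluster pipeline just described can be sketched as follows (a minimal NumPy illustration using PCA and Lloyd's k-means, the pairing shown in Figure 6; the deterministic farthest-point seeding is our choice for reproducibility and is not taken from the text):

```python
import numpy as np

def pca(X, d):
    # Project rows (pixels) onto the top-d principal components.
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:d].T

def kmeans(X, k, iters=20):
    # Deterministic farthest-point seeding, then plain Lloyd iterations.
    centers = [X[0]]
    for _ in range(k - 1):
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[d2.argmax()])
    centers = np.array(centers, dtype=float)
    for _ in range(iters):
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = d2.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return labels

# Pixels are rows, genes are columns: reduce each pixel's expression
# profile, then cluster the reduced profiles into putative regions.
```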
These results show that different dimensionality reduction techniques capture different aspects of the data and lead to different clusterings, indicating the utility of our proposal to produce a detailed comparison of these techniques as applied to the domain of genomic anatomy.

Many areas are captured by clusters of genes

We also clustered the genes using gradient similarity to see if the spatial regions defined by any clusters matched known anatomical regions. Figure 7 shows, for ten sample gene clusters, each cluster's average expression pattern, compared to a known anatomical boundary. This suggests that it is worth attempting to cluster genes, and then to use the results to cluster pixels.

Our plan: what remains to be done

Flatmap cortex and segment cortical layers

There are multiple ways to flatten 3-D data into 2-D. We will compare mappings from manifolds to planes which attempt to preserve size (such as the one used by Caret[7]) with mappings which preserve angle (conformal maps). We will also develop a segmentation algorithm to automatically identify the layer boundaries.

Develop algorithms that find genetic markers for anatomical regions

Scoring measures and feature selection

We will develop scoring methods for evaluating how good individual genes are at marking areas. We will compare pointwise, geometric, and information-theoretic measures. We have already developed one entirely new scoring method (gradient similarity), and we may develop more.
Scoring measures that we will explore include the L1 norm, correlation, the expression energy ratio, conditional entropy, gradient similarity, Jaccard similarity, Dice similarity, the Hough transform, and statistical tests such as Student's t-test and the Mann-Whitney U test (a non-parametric test). In addition, any classifier induces a scoring measure on genes, namely the prediction error when using that gene to predict the target.

Using some combination of these measures, we will develop a procedure to find single marker genes for anatomical regions: for each cortical area, we will rank the genes by their ability to delineate that area. We will quantitatively compare the list of single genes generated by our method to the lists generated by the methods mentioned in Related Work.

Figure 6: First row: the first 6 reduced dimensions, using PCA. Second row: the first 6 reduced dimensions, using NNMF. Third row: the first 6 reduced dimensions, using landmark Isomap. Bottom row: examples of k-means clustering applied to reduced datasets to find 7 clusters; left: 19 of the major subdivisions of the cortex; second from left: PCA; third from left: NNMF; right: landmark Isomap. Additional details: in the third and fourth rows, 7 dimensions were found, but only 6 are displayed; in the last row, 50 dimensions were used for PCA, 6 for NNMF, and 7 for landmark Isomap.

Some cortical areas have no single marker genes but can be identified by combinatorial coding. This requires multivariate scoring measures and feature selection procedures.
Many of the measures, such as expression energy, gradient similarity, Jaccard, Dice, Hough, Student's t, and Mann-Whitney U, are univariate. We will extend these scoring measures for use in multivariate feature selection, that is, for scoring how well combinations of genes, rather than individual genes, can distinguish a target area. There are existing multivariate forms of some of the univariate scoring measures; for example, Hotelling's T-square is a multivariate analog of Student's t.

We will develop a feature selection procedure for choosing the best small set of marker genes for a given anatomical area. In addition to using the scoring measures that we develop, we will also explore (a) feature selection using a stepwise wrapper over "vanilla" classifiers such as logistic regression, (b) supervised learning methods, such as decision trees, which incrementally/greedily combine single gene markers into sets, and (c) supervised learning methods which use soft constraints to minimize the number of features used, such as sparse support vector machines (SVMs).

Since errors of displacement and of shape may cause genes and target areas to match less well than they should, we will consider the robustness of feature selection methods in the presence of error. Some of these methods, such as the Hough transform, are designed to be resistant to error, but many are not.
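For instance, a two-sample Hotelling's T² for asking whether two groups of pixels differ in their mean multi-gene expression vector might look like this (a standard textbook formula sketched in NumPy, not code from our pipeline):

```python
import numpy as np

def hotelling_t2(X, Y):
    # Rows are pixels, columns are genes (at least two genes, so the
    # pooled covariance is a proper matrix).
    n1, n2 = len(X), len(Y)
    diff = X.mean(axis=0) - Y.mean(axis=0)
    # Pooled sample covariance of the two groups.
    Sp = ((n1 - 1) * np.cov(X, rowvar=False)
          + (n2 - 1) * np.cov(Y, rowvar=False)) / (n1 + n2 - 2)
    return (n1 * n2) / (n1 + n2) * diff @ np.linalg.solve(Sp, diff)
```

As with Student's t, larger values indicate group means that are harder to explain by chance, but the difference is now measured jointly across all genes.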
An area may be difficult to identify because the boundaries are misdrawn in the atlas, or because the shape of the natural domain of gene expression corresponding to the area is different from the shape of the area as recognized by anatomists. We will develop extensions to our procedure which (a) detect when a difficult area could be fit if its boundary were redrawn slightly12, and (b) detect when a difficult area could be combined with adjacent areas to create a larger area which can be fit.

12Not just any redrawing is acceptable, only those which appear to be justified as a natural spatial domain of gene expression by multiple sources of evidence. Interestingly, the need to detect "natural spatial domains of gene expression" in a data-driven fashion means that the methods of Goal 2 might be useful in achieving Goal 1 as well, particularly discriminative dimensionality reduction.

A future publication on the method that we develop in Goal 1 will review the scoring measures and quantitatively compare their performance, in order to provide a foundation for future research on methods of marker gene finding. We will measure the robustness of the scoring measures as well as their absolute performance on our dataset.

Develop algorithms to suggest a division of a structure into anatomical parts

Figure 7: Prototypes corresponding to sample gene clusters, clustered by gradient similarity. Region boundaries for the region that most matches each prototype are overlaid.
Dimensionality reduction on gene expression profiles

We have already described the application of ten dimensionality reduction algorithms for the purpose of replacing the gene expression profiles, which are vectors of about 4000 gene expression levels, with a smaller number of features. We plan to further explore and interpret these results, as well as to apply other unsupervised learning algorithms, including independent components analysis, self-organizing maps, and generative models such as deep Boltzmann machines. We will explore ways to quantitatively compare the relevance of the different dimensionality reduction methods for identifying cortical areal boundaries.

Dimensionality reduction on pixels

Instead of applying dimensionality reduction to the gene expression profiles, the same techniques can be applied to the pixels. It is possible that the features generated in this way by some dimensionality reduction techniques will directly correspond to interesting spatial regions.

Clustering and segmentation on pixels

We will explore clustering and image segmentation algorithms in order to segment the pixels into regions. We will explore k-means, spectral clustering, gene shaving[9], recursive division clustering, multivariate generalizations of edge detectors, multivariate generalizations of watershed transformations, region growing, active contours, graph partitioning methods, and recursive agglomerative clustering with various linkage functions. These methods can be combined with dimensionality reduction.
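As one example from this list, single-seed region growing on a multi-gene image can be sketched as follows (NumPy; the Euclidean-distance homogeneity criterion and the incremental mean update are our assumptions for illustration, not a committed design):

```python
from collections import deque

import numpy as np

def region_grow(img, seed, tol):
    # img: H x W x n_genes array; grow from `seed`, absorbing 4-neighbors
    # whose expression profile lies within `tol` of the running region mean.
    H, W, _ = img.shape
    in_region = np.zeros((H, W), dtype=bool)
    in_region[seed] = True
    mean = img[seed].astype(float)
    queue, count = deque([seed]), 1
    while queue:
        r, c = queue.popleft()
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if 0 <= nr < H and 0 <= nc < W and not in_region[nr, nc]:
                if np.linalg.norm(img[nr, nc] - mean) <= tol:
                    in_region[nr, nc] = True
                    count += 1
                    mean += (img[nr, nc] - mean) / count  # update region mean
                    queue.append((nr, nc))
    return in_region
```

Run once per seed, this yields one candidate region; a full segmentation would repeat it over unassigned pixels or combine it with the other methods listed above.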
Clustering on genes

We have already shown that the procedure of clustering genes according to gradient similarity, and then creating an averaged prototype of each cluster's expression pattern, yields some spatial patterns which match cortical areas (Figure 7). We will further explore the clustering of genes.

In addition to using the cluster expression prototypes directly to identify spatial regions, gene clustering might be useful as a component of dimensionality reduction. For example, one could imagine clustering similar genes and then replacing their expression levels with a single average expression level, thereby removing some redundancy from the gene expression profiles. One could then perform clustering on pixels (possibly after a second dimensionality reduction step) in order to identify spatial regions. It remains to be seen whether removal of redundancy would help or hurt the ultimate goal of identifying interesting spatial regions.

Co-clustering

We will explore algorithms which simultaneously incorporate clustering on instances and on features (in our case, pixels and genes), for example IRM[11]. These are called co-clustering or biclustering algorithms.

Compare different methods

In order to tell which method is best for genomic anatomy, for each experimental method we will compare the cortical map found by unsupervised learning to a cortical map derived from the Allen Reference Atlas.
We will explore various quantitative metrics that purport to measure how similar two clusterings are, such as the Jaccard index, the Rand index, Fowlkes-Mallows, variation of information, Larsen, Van Dongen, and others.

Discriminative dimensionality reduction

In addition to using a purely data-driven approach to identify spatial regions, it might be useful to see how well the known regions can be reconstructed from a small number of features, even if those features are chosen using knowledge of the regions. For example, linear discriminant analysis could be used as a dimensionality reduction technique to identify a few features which are the best linear summary of the gene expression profiles for the purpose of discriminating between regions. This reduced feature set could then be used to cluster pixels into regions. Perhaps the resulting clusters will be similar to the reference atlas, yet more faithful to the natural spatial domains of gene expression than the reference atlas is.

Apply the new methods to the cortex

Using the methods developed in Goal 1, we will present, for each cortical area, a short list of markers to identify that area; we will also present "panels" of genes that can be used to delineate many areas at once.

Because in most cases the ABA coronal dataset contains only one ISH per gene, it is possible for an unrelated combination of genes to seem to identify an area when in fact it is only coincidence. There are three ways we will validate our marker genes to guard against this. First, we will confirm that putative combinations of marker genes express the same pattern in both hemispheres.
Second, we will manually validate our final results on other gene expression datasets such as EMAGE, GeneAtlas, and GENSAT[8]. Third, we may conduct ISH experiments jointly with collaborators to get further data on genes of particular interest.

Using the methods developed in Goal 2, we will present one or more hierarchical cortical maps. We will identify and explain how the statistical structure in the gene expression data led to any unexpected or interesting features of these maps, and we will provide biological hypotheses to interpret any new cortical areas, or groupings of areas, which are discovered.

Apply the new methods to hyperspectral datasets

Our software will be able to read and write file formats common in the hyperspectral imaging community, such as Erdas LAN and ENVI, and it will be able to convert between the SEV and NIFTI formats from neuroscience and the ENVI format from GIS. The methods developed in Goals 1 and 2 will be implemented either as part of Spectral Python or as a separate tool that interoperates with Spectral Python. The methods will be run on hyperspectral satellite image datasets, and their performance will be compared to existing hyperspectral analysis techniques.

References Cited

[1] Chris Adamson, Leigh Johnston, Terrie Inder, Sandra Rees, Iven Mareels, and Gary Egan. A Tracking Approach to Parcellation of the Cerebral Cortex, volume 3749/2005 of Lecture Notes in Computer Science, pages 294–301. Springer Berlin / Heidelberg, 2005.

[2] J. Annese, A. Pitiot, I. D. Dinov, and A. W. Toga. A myelo-architectonic method for the structural classification of cortical areas. NeuroImage, 21(1):15–26, 2004.

[3] Tanya Barrett, Dennis B. Troup, Stephen E.
Wilhite, Pierre Ledoux, Dmitry Rudnev, Carlos Evangelista, Irene F. Kim, Alexandra Soboleva, Maxim Tomashevsky, and Ron Edgar. NCBI GEO: mining tens of millions of expression profiles – database and tools update. Nucl. Acids Res., 35(suppl_1):D760–765, 2007.

[4] George W. Bell, Tatiana A. Yatskievych, and Parker B. Antin. GEISHA, a whole-mount in situ hybridization gene expression screen in chicken embryos. Developmental Dynamics, 229(3):677–687, 2004.

[5] James P. Carson, Tao Ju, Hui-Chen Lu, Christina Thaller, Mei Xu, Sarah L. Pallas, Michael C. Crair, Joe Warren, Wah Chiu, and Gregor Eichele. A digital atlas to characterize the mouse brain transcriptome. PLoS Comput Biol, 1(4):e41, 2005.

[6] Mark H. Chin, Alex B. Geng, Arshad H. Khan, Wei-Jun Qian, Vladislav A. Petyuk, Jyl Boline, Shawn Levy, Arthur W. Toga, Richard D. Smith, Richard M. Leahy, and Desmond J. Smith. A genome-scale map of expression for a mouse brain section obtained using voxelation. Physiol. Genomics, 30(3):313–321, August 2007.

[7] D. C. Van Essen, H. A. Drury, J. Dickson, J. Harwell, D. Hanlon, and C. H. Anderson. An integrated software suite for surface-based analyses of cerebral cortex. Journal of the American Medical Informatics Association: JAMIA, 8(5):443–59, 2001. PMID: 11522765.

[8] Shiaoching Gong, Chen Zheng, Martin L. Doughty, Kasia Losos, Nicholas Didkovsky, Uta B. Schambra, Norma J. Nowak, Alexandra Joyner, Gabrielle Leblanc, Mary E. Hatten, and Nathaniel Heintz. A gene expression atlas of the central nervous system based on bacterial artificial chromosomes. Nature, 425(6961):917–925, October 2003.

[9] Trevor Hastie, Robert Tibshirani, Michael Eisen, Ash Alizadeh, Ronald Levy, Louis Staudt, Wing Chan, David Botstein, and Patrick Brown. 'Gene shaving' as a method for identifying distinct sets of genes with similar expression patterns. Genome Biology, 1(2):research0003.1–research0003.21, 2000.

[10] Jano van Hemert and Richard Baldock. Matching Spatial Regions with Combinations of Interacting Gene Expression Patterns, volume 13 of Communications in Computer and Information Science, pages 347–361. Springer Berlin Heidelberg, 2008.

[11] C. Kemp, J. B. Tenenbaum, T. L. Griffiths, T. Yamada, and N. Ueda. Learning systems of concepts with an infinite relational model. In AAAI, 2006.

[12] F. Kruggel, M. K. Brückner, Th. Arendt, C. J. Wiggins, and D. Y. von Cramon. Analyzing the neocortical fine-structure. Medical Image Analysis, 7(3):251–264, September 2003.

[13] Ed S. Lein, Michael J. Hawrylycz, Nancy Ao, Mikael Ayres, Amy Bensinger, Amy Bernard, Andrew F. Boe, Mark S. Boguski, Kevin S. Brockway, Emi J. Byrnes, Lin Chen, Li Chen, Tsuey-Ming Chen, Mei Chi Chin, Jimmy Chong, Brian E. Crook, Aneta Czaplinska, Chinh N. Dang, Suvro Datta, Nick R. Dee, Aimee L. Desaki, Tsega Desta, Ellen Diep, Tim A. Dolbeare, Matthew J. Donelan, Hong-Wei Dong, Jennifer G. Dougherty, Ben J. Duncan, Amanda J. Ebbert, Gregor Eichele, Lili K. Estin, Casey Faber, Benjamin A. Facer, Rick Fields, Shanna R. Fischer, Tim P. Fliss, Cliff Frensley, Sabrina N. Gates, Katie J. Glattfelder, Kevin R. Halverson, Matthew R. Hart, John G. Hohmann, Maureen P. Howell, Darren P. Jeung, Rebecca A. Johnson, Patrick T. Karr, Reena Kawal, Jolene M. Kidney, Rachel H. Knapik, Chihchau L. Kuan, James H. Lake, Annabel R.
Laramee, Kirk D. Larsen, Christopher Lau, Tracy A. Lemon, bshanks@112: Agnes J. Liang, Ying Liu, Lon T. Luong, Jesse Michaels, Judith J. Morgan, Rebecca J. Mor- bshanks@112: gan, Marty T. Mortrud, Nerick F. Mosqueda, Lydia L. Ng, Randy Ng, Geralyn J. Orta, Car- bshanks@112: oline C. Overly, Tu H. Pak, Sheana E. Parry, Sayan D. Pathak, Owen C. Pearson, Ralph B. bshanks@112: Puchalski, Zackery L. Riley, Hannah R. Rockett, Stephen A. Rowland, Joshua J. Royall, bshanks@112: Marcos J. Ruiz, Nadia R. Sarno, Katherine Schaffnit, Nadiya V. Shapovalova, Taz Sivisay, bshanks@112: Clifford R. Slaughterbeck, Simon C. Smith, Kimberly A. Smith, Bryan I. Smith, Andy J. Sodt, bshanks@112: Nick N. Stewart, Kenda-Ruth Stumpf, Susan M. Sunkin, Madhavi Sutram, Angelene Tam, bshanks@112: Carey D. Teemer, Christina Thaller, Carol L. Thompson, Lee R. Varnam, Axel Visel, Ray M. bshanks@112: Whitlock, Paul E. Wohnoutka, Crissa K. Wolkey, Victoria Y. Wong, Matthew Wood, Murat B. bshanks@112: Yaylaoglu, Rob C. Young, Brian L. Youngstrom, Xu Feng Yuan, Bin Zhang, Theresa A. Zwing- bshanks@112: man, and Allan R. Jones. Genome-wide atlas of gene expression in the adult mouse brain. bshanks@112: Nature, 445(7124):168–176, 2007. bshanks@112: [14] Susan Magdaleno, Patricia Jensen, Craig L. Brumwell, Anna Seal, Karen Lehman, Andrew bshanks@112: Asbury, Tony Cheung, Tommie Cornelius, Diana M. Batten, Christopher Eden, Shannon M. bshanks@112: Norland, Dennis S. Rice, Nilesh Dosooye, Sundeep Shakya, Perdeep Mehta, and Tom Cur- bshanks@112: ran. BGEM: an in situ hybridization database of gene expression in the embryonic and adult bshanks@112: mouse nervous system. PLoS Biology, 4(4):e86 EP –, April 2006. 
[15] Lydia Ng, Amy Bernard, Chris Lau, Caroline C. Overly, Hong-Wei Dong, Chihchau Kuan, Sayan Pathak, Susan M. Sunkin, Chinh Dang, Jason W. Bohland, Hemant Bokil, Partha P. Mitra, Luis Puelles, John Hohmann, David J. Anderson, Ed S. Lein, Allan R. Jones, and Michael Hawrylycz. An anatomic gene expression atlas of the adult mouse brain. Nat Neurosci, 12(3):356–362, March 2009.

[16] George Paxinos and Keith B. J. Franklin. The Mouse Brain in Stereotaxic Coordinates. Academic Press, 2nd edition, July 2001.

[17] A. Schleicher, N. Palomero-Gallagher, P. Morosan, S. Eickhoff, T. Kowalski, K. Vos, K. Amunts, and K. Zilles. Quantitative architectural analysis: a new approach to cortical mapping. Anatomy and Embryology, 210(5):373–386, December 2005.

[18] Oliver Schmitt, Lars Hömke, and Lutz Dümbgen. Detection of cortical transition regions utilizing statistical analyses of excess masses. NeuroImage, 19(1):42–63, May 2003.

[19] S. B. Serpico and L. Bruzzone. A new search algorithm for feature selection in hyperspectral remote sensing images. IEEE Transactions on Geoscience and Remote Sensing, 39(7):1360–1367, 2001.

[20] Constance M. Smith, Jacqueline H. Finger, Terry F. Hayamizu, Ingeborg J. McCright, Janan T. Eppig, James A. Kadin, Joel E. Richardson, and Martin Ringwald. The mouse gene expression database (GXD): 2007 update. Nucl. Acids Res., 35(suppl_1):D618–623, 2007.

[21] Larry Swanson. Brain Maps: Structure of the Rat Brain. Academic Press, 3rd edition, November 2003.

[22] Carol L. Thompson, Sayan D. Pathak, Andreas Jeromin, Lydia L. Ng, Cameron R. MacPherson, Marty T. Mortrud, Allison Cusick, Zackery L. Riley, Susan M. Sunkin, Amy Bernard, Ralph B. Puchalski, Fred H. Gage, Allan R. Jones, Vladimir B. Bajic, Michael J. Hawrylycz, and Ed S. Lein. Genomic anatomy of the hippocampus. Neuron, 60(6):1010–1021, December 2008.

[23] Pavel Tomancak, Amy Beaton, Richard Weiszmann, Elaine Kwan, ShengQiang Shu, Suzanna E. Lewis, Stephen Richards, Michael Ashburner, Volker Hartenstein, Susan E. Celniker, and Gerald M. Rubin. Systematic determination of patterns of gene expression during Drosophila embryogenesis. Genome Biology, 3(12):research0088.1–research0088.14, 2002. PMC151190.

[24] Shanmugasundaram Venkataraman, Peter Stevenson, Yiya Yang, Lorna Richardson, Nicholas Burton, Thomas P. Perry, Paul Smith, Richard A. Baldock, Duncan R. Davidson, and Jeffrey H. Christiansen. EMAGE – Edinburgh Mouse Atlas of Gene Expression: 2008 update. Nucl. Acids Res., 36(suppl_1):D860–865, 2008.

[25] Axel Visel, Christina Thaller, and Gregor Eichele. GenePaint.org: an atlas of gene expression patterns in the mouse embryo. Nucl. Acids Res., 32(suppl_1):D552–556, 2004.
[26] Robert H. Waterston, Kerstin Lindblad-Toh, Ewan Birney, Jane Rogers, Josep F. Abril, Pankaj Agarwal, Richa Agarwala, Rachel Ainscough, Marina Alexandersson, Peter An, Stylianos E. Antonarakis, John Attwood, Robert Baertsch, Jonathon Bailey, Karen Barlow, Stephan Beck, Eric Berry, Bruce Birren, Toby Bloom, Peer Bork, Marc Botcherby, Nicolas Bray, Michael R. Brent, Daniel G. Brown, Stephen D. Brown, Carol Bult, John Burton, Jonathan Butler, Robert D. Campbell, Piero Carninci, Simon Cawley, Francesca Chiaromonte, Asif T. Chinwalla, Deanna M. Church, Michele Clamp, Christopher Clee, Francis S. Collins, Lisa L. Cook, Richard R. Copley, Alan Coulson, Olivier Couronne, James Cuff, Val Curwen, Tim Cutts, Mark Daly, Robert David, Joy Davies, Kimberly D. Delehaunty, Justin Deri, Emmanouil T. Dermitzakis, Colin Dewey, Nicholas J. Dickens, Mark Diekhans, Sheila Dodge, Inna Dubchak, Diane M. Dunn, Sean R. Eddy, Laura Elnitski, Richard D. Emes, Pallavi Eswara, Eduardo Eyras, Adam Felsenfeld, Ginger A. Fewell, Paul Flicek, Karen Foley, Wayne N. Frankel, Lucinda A. Fulton, Robert S. Fulton, Terrence S. Furey, Diane Gage, Richard A. Gibbs, Gustavo Glusman, Sante Gnerre, Nick Goldman, Leo Goodstadt, Darren Grafham, Tina A. Graves, Eric D. Green, Simon Gregory, Roderic Guigó, Mark Guyer, Ross C. Hardison, David Haussler, Yoshihide Hayashizaki, LaDeana W. Hillier, Angela Hinrichs, Wratko Hlavina, Timothy Holzer, Fan Hsu, Axin Hua, Tim Hubbard, Adrienne Hunt, Ian Jackson, David B. Jaffe, L. Steven Johnson, Matthew Jones, Thomas A. Jones, Ann Joy, Michael Kamal, Elinor K. Karlsson, Donna Karolchik, Arkadiusz Kasprzyk, Jun Kawai, Evan Keibler, Cristyn Kells, W. James Kent, Andrew Kirby, Diana L. Kolbe, Ian Korf, Raju S. Kucherlapati, Edward J. Kulbokas, David Kulp, Tom Landers, J. P. Leger, Steven Leonard, Ivica Letunic, Rosie Levine, Jia Li, Ming Li, Christine Lloyd, Susan Lucas, Bin Ma, Donna R. Maglott, Elaine R. Mardis, Lucy Matthews, Evan Mauceli, John H. Mayer, Megan McCarthy, W. Richard McCombie, Stuart McLaren, Kirsten McLay, John D. McPherson, Jim Meldrim, Beverley Meredith, Jill P. Mesirov, Webb Miller, Tracie L. Miner, Emmanuel Mongin, Kate T. Montgomery, Michael Morgan, Richard Mott, James C. Mullikin, Donna M. Muzny, William E. Nash, Joanne O. Nelson, Michael N. Nhan, Robert Nicol, Zemin Ning, Chad Nusbaum, Michael J. O'Connor, Yasushi Okazaki, Karen Oliver, Emma Overton-Larty, Lior Pachter, Genís Parra, Kymberlie H. Pepin, Jane Peterson, Pavel Pevzner, Robert Plumb, Craig S. Pohl, Alex Poliakov, Tracy C. Ponce, Chris P. Ponting, Simon Potter, Michael Quail, Alexandre Reymond, Bruce A. Roe, Krishna M. Roskin, Edward M. Rubin, Alistair G. Rust, Ralph Santos, Victor Sapojnikov, Brian Schultz, Jörg Schultz, Matthias S. Schwartz, Scott Schwartz, Carol Scott, Steven Seaman, Steve Searle, Ted Sharpe, Andrew Sheridan, Ratna Shownkeen, Sarah Sims, Jonathan B. Singer, Guy Slater, Arian Smit, Douglas R. Smith, Brian Spencer, Arne Stabenau, Nicole Stange-Thomann, Charles Sugnet, Mikita Suyama, Glenn Tesler, Johanna Thompson, David Torrents, Evanne Trevaskis, John Tromp, Catherine Ucla, Abel Ureta-Vidal, Jade P. Vinson, Andrew C. Von Niederhausern, Claire M. Wade, Melanie Wall, Ryan J. Weber, Robert B. Weiss, Michael C. Wendl, Anthony P. West, Kris Wetterstrand, Raymond Wheeler, Simon Whelan, Jamey Wierzbowski, David Willey, Sophie Williams, Richard K. Wilson, Eitan Winter, Kim C. Worley, Dudley Wyman, Shan Yang, Shiaw-Pyng Yang, Evgeny M. Zdobnov, Michael C. Zody, and Eric S. Lander. Initial sequencing and comparative analysis of the mouse genome. Nature, 420(6915):520–562, December 2002. PMID: 12466850.