Specific aims

Massive new datasets obtained with techniques such as in situ hybridization (ISH) and BAC-transgenics allow the expression levels of many genes at many locations to be compared. Our goal is to develop automated methods to relate spatial variation in gene expression to anatomy. We want to find marker genes for specific anatomical regions, and also to draw new anatomical maps based on gene expression patterns. We have three specific aims:

(1) develop an algorithm to screen spatial gene expression data for combinations of marker genes which selectively target anatomical regions

(2) develop an algorithm to suggest new ways of carving up a structure into anatomical subregions, based on spatial patterns in gene expression

(3) create a 2-D "flat map" dataset of the mouse cerebral cortex that contains a flattened version of the Allen Mouse Brain Atlas ISH data, as well as the boundaries of cortical anatomical areas. Use this dataset to validate the methods developed in (1) and (2).

In addition to validating the usefulness of the algorithms, the application of these methods to cerebral cortex will produce immediate benefits, because there are currently no known genetic markers for many cortical areas. The results of the project will support the development of new ways to selectively target cortical areas, and will also support the development of a method for identifying the cortical areal boundaries present in small tissue samples.

All algorithms that we develop will be implemented in an open-source software toolkit. The toolkit, as well as the machine-readable datasets developed in aim (3), will be published and freely available for others to use.

Background and significance

Aim 1

Machine learning terminology: supervised learning

The task of looking for marker genes for anatomical subregions means that one is looking for a set of genes such that, if the expression levels of those genes are known, then the locations of the subregions can be inferred.

If we define the subregions so that they cover the entire anatomical structure to be divided, then instead of saying that we are using gene expression to find the locations of the subregions, we may say that we are using gene expression to determine to which subregion each voxel within the structure belongs. We call this a classification task, because each voxel is being assigned to a class (namely, its subregion).

Therefore, an understanding of the relationship between the combination of the genes' expression levels and the locations of the subregions may be expressed as a function. The input to this function is a voxel, along with the gene expression levels within that voxel; the output is the subregional identity of the target voxel, that is, the subregion to which the target voxel belongs. We call this function a classifier. In general, the input to a classifier is called an instance, and the output is called a label (or a class label).
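
To make the terminology concrete, the sketch below trains a classifier on a synthetic voxels-by-genes matrix. It is purely illustrative: the array shapes, the use of scikit-learn, and the choice of logistic regression are assumptions made for the example, not commitments of the proposed method.

```python
# Minimal illustration of the terminology above, using synthetic data in
# place of a real ISH dataset (shapes and classifier choice are arbitrary).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_voxels, n_genes = 1000, 50
X = rng.random((n_voxels, n_genes))     # instances: one row of expression levels per voxel
y = rng.integers(0, 3, size=n_voxels)   # labels: the subregion (0, 1, or 2) of each voxel

# Training data is used to construct ("train", "learn") the classifier.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
classifier = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# The trained classifier maps an instance (a voxel's expression levels)
# to a label (the predicted subregion).
predicted = classifier.predict(X_test)
print("held-out accuracy:", classifier.score(X_test, y_test))
```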

The object of aim 1 is not to produce a single classifier, but rather to develop an automated method for determining a classifier for any known anatomical structure. Therefore, we seek a procedure by which a gene expression dataset may be analyzed in concert with an anatomical atlas in order to produce a classifier. Such a procedure is a type of machine learning procedure. The construction of the classifier is called training (also learning), and the initial gene expression dataset used in the construction of the classifier is called training data.

In the machine learning literature, this sort of procedure may be thought of as a supervised learning task, defined as a task in which the goal is to learn a mapping from instances to labels, and the training data consists of a set of instances (voxels) for which the labels (subregions) are known.

Each gene expression level is called a feature, and the selection of which genes to include is called feature selection. Feature selection is one component of the task of learning a classifier. Some methods for learning classifiers start out with a separate feature selection phase, whereas other methods combine feature selection with other aspects of training.

One class of feature selection methods assigns some sort of score to each candidate gene. The top-ranked genes are then chosen. Some scoring measures can assign a score to a set of selected genes, not just to a single gene; in this case, a dynamic procedure may be used in which features are added to and subtracted from the selected set depending on how much they raise the score. Such procedures are called "stepwise" or "greedy".

Although the classifier itself may only look at the gene expression data within each voxel before classifying that voxel, the learning algorithm which constructs the classifier may look over the entire dataset. We can categorize score-based feature selection methods depending on how the score is calculated. Often the score calculation consists of assigning a sub-score to each voxel, and then aggregating these sub-scores into a final score (the aggregation is often a sum or a sum of squares). If only information from nearby voxels is used to calculate a voxel's sub-score, then we say it is a local scoring method. If only information from the voxel itself is used to calculate a voxel's sub-score, then we say it is a pointwise scoring method.
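
The sketch below illustrates the greedy procedure described above. It is only a sketch: the data are synthetic, and cross-validated classification accuracy is used as a stand-in for whatever set-level scoring measure is ultimately adopted.

```python
# Illustrative greedy ("forward stepwise") feature selection. The score of a
# candidate gene set is cross-validated classification accuracy; at each step
# we add whichever gene raises that score the most. Data are synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_voxels, n_genes = 500, 30
X = rng.random((n_voxels, n_genes))     # expression levels, voxels x genes
y = rng.integers(0, 2, size=n_voxels)   # subregion labels

def score_gene_set(gene_indices):
    """Score a candidate set of genes by cross-validated accuracy."""
    clf = LogisticRegression(max_iter=1000)
    return cross_val_score(clf, X[:, gene_indices], y, cv=5).mean()

selected, max_genes = [], 3             # keep the panel to a handful of genes
for _ in range(max_genes):
    candidates = [g for g in range(n_genes) if g not in selected]
    best = max(candidates, key=lambda g: score_gene_set(selected + [g]))
    selected.append(best)

print("greedily selected genes:", selected)
```

A backward step (dropping a previously selected gene when doing so raises the score) would turn this purely greedy loop into a full stepwise procedure.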

Key questions when choosing a learning method are: What are the instances? What are the features? How are the features chosen? Here are four principles that outline our answers to these questions.

Principle 1: Combinatorial gene expression

Above, we defined an "instance" as the combination of a voxel with the "associated gene expression data". In our case this refers to the expression levels of genes within the voxel, but should we include the expression levels of all genes, or only a few of them?

It is too much to hope that every anatomical region of interest will be identified by a single gene. For example, in the cortex, there are some areas which are not clearly delineated by any gene included in the Allen Brain Atlas (ABA) dataset. However, at least some of these areas can be delineated by looking at combinations of genes (an example of an area for which multiple genes are necessary and sufficient is provided in Preliminary Results).

Principle 2: Only look at combinations of small numbers of genes

When the classifier classifies a voxel, it is only allowed to look at the expression of the genes which have been selected as features. The more data that is available to a classifier, the better it can do. For example, perhaps there are weak correlations over many genes that add up to a strong signal. So, why not include every gene as a feature? The reason is that we wish to employ the classifier in situations in which it is not feasible to gather data about every gene. For example, if we want to use the expression of marker genes as a trigger for some regionally-targeted intervention, then our intervention must contain a molecular mechanism to check the expression level of each marker gene before it triggers. It is currently infeasible to design a molecular trigger that checks the levels of more than a handful of genes. Similarly, if the goal is to develop a procedure to do ISH on tissue samples in order to label their anatomy, then it is infeasible to label more than a few genes. Therefore, we must select only a few genes as features.

Principle 3: Use geometry in feature selection

When doing feature selection with score-based methods, the simplest approach is to score each voxel by itself and then combine these scores (pointwise scoring). A more powerful approach is to also use information about the geometric relations between each voxel and its neighbors; this requires non-pointwise, local scoring methods. See Preliminary Results for evidence of the complementary nature of pointwise and local scoring methods.

Principle 4: Work in 2-D whenever possible

There are many anatomical structures which are commonly characterized in terms of a two-dimensional manifold. When it is known that the structure one is looking for is two-dimensional, the results may be improved by allowing the analysis algorithm to take advantage of this prior knowledge. In addition, it is easier for humans to visualize and work with 2-D data. Therefore, when possible, the instances should represent pixels, not voxels.

Aim 2

Machine learning terminology: clustering

If one is given a dataset consisting merely of instances, with no class labels, then analysis of the dataset is referred to as unsupervised learning in the jargon of machine learning. One thing that can be done with such a dataset is to group instances together. A set of similar instances is called a cluster, and the activity of grouping the data into clusters is called clustering or cluster analysis.

The task of deciding how to carve up a structure into anatomical subregions can be put into these terms.
The instances are once again voxels (or pixels) along with their associated gene expression profiles. We make the assumption that voxels from the same subregion have similar gene expression profiles, at least compared to the other subregions. This means that clustering voxels is the same as finding potential subregions; we seek a partitioning of the voxels into subregions, that is, into clusters of voxels with similar gene expression.

It is desirable to determine not just one set of subregions, but also how these subregions relate to each other, if at all; perhaps some of the subregions are more similar to each other than to the rest, suggesting that, although at a fine spatial scale they could be considered separate, on a coarser spatial scale they could be grouped together into one large subregion. This suggests that the outcome of clustering may be a hierarchical tree of clusters, rather than a single set of clusters which partition the voxels. This is called hierarchical clustering.

Similarity scores

todo

Spatially contiguous clusters; image segmentation

We have shown that aim 2 is a type of clustering task. In fact, it is a special type of clustering task because we have an additional constraint on clusters: voxels grouped together into a cluster must be spatially contiguous. In Preliminary Results, we show that one can get reasonable results without enforcing this constraint; however, we plan to compare these results against other methods which guarantee contiguous clusters.

Perhaps the biggest source of contiguous clustering algorithms is the field of computer vision, which has produced a variety of image segmentation algorithms. Image segmentation is the task of partitioning the pixels in a digital image into clusters, usually contiguous clusters. Aim 2 is similar to an image segmentation task. There are two main differences. First, in our task there are thousands of color channels (one for each gene), rather than just three; there are, however, imaging tasks which use more than three colors, for example multispectral imaging and hyperspectral imaging, which are often used to process satellite imagery. A more crucial difference is that there are various cues which are appropriate for detecting sharp object boundaries in a visual scene but which are not appropriate for segmenting abstract spatial data such as gene expression. Although many image segmentation algorithms can be expected to work well for segmenting other sorts of spatially arranged data, some of these algorithms are specialized for visual images.
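
One way to obtain both a hierarchy of clusters and guaranteed spatial contiguity is agglomerative clustering restricted to merges between spatial neighbors. The sketch below is illustrative only; the flatmap grid, the gene count, and the choice of scikit-learn's Ward-linkage implementation are assumptions made for the example.

```python
# Illustrative sketch: hierarchical clustering of pixels with a spatial
# contiguity constraint, on a synthetic "pixels x genes" expression array.
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.feature_extraction.image import grid_to_graph

rng = np.random.default_rng(0)
nx, ny, n_genes = 40, 30, 20                 # hypothetical flatmap grid and gene count
expression = rng.random((nx, ny, n_genes))   # expression level of each gene at each pixel
X = expression.reshape(nx * ny, n_genes)     # one instance (row) per pixel

# Only pixels that are grid neighbors may be merged directly, so every
# cluster produced by the agglomeration is spatially contiguous.
connectivity = grid_to_graph(nx, ny)

model = AgglomerativeClustering(n_clusters=8, connectivity=connectivity,
                                linkage="ward")
labels = model.fit_predict(X).reshape(nx, ny)   # candidate subregion of each pixel
print("pixels per candidate subregion:", np.bincount(labels.ravel()))
```

Because the agglomeration builds a full merge tree, cutting the tree at different heights yields coarser or finer parcellations, which is the hierarchical structure discussed above.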

Dimensionality reduction

Unlike in aim 1, there is no externally imposed need to select only a handful of informative genes for inclusion in the instances. However, some clustering algorithms perform better on small numbers of features. There are techniques which "summarize" a larger number of features using a smaller number of features; these techniques go by the name of feature extraction or dimensionality reduction. The small set of features that such a technique yields is called the reduced feature set. After the reduced feature set is created, the instances may be replaced by reduced instances, which have as their features the reduced feature set rather than the original feature set of all gene expression levels. Note that the features in the reduced feature set do not necessarily correspond to genes; each feature in the reduced set may be any function of the set of gene expression levels.

Another use for dimensionality reduction is to visualize the relationships between subregions. For example, one might want to make a 2-D plot upon which each subregion is represented by a single point, with the property that subregions with similar gene expression profiles are nearby on the plot (that is, the property that the distance between pairs of points in the plot is proportional to some measure of dissimilarity in gene expression). It is likely that no arrangement of the points on a 2-D plane will exactly satisfy this property; however, dimensionality reduction techniques allow one to find arrangements of points that approximately satisfy it. Note that in this application, dimensionality reduction is being applied after clustering, whereas in the previous paragraph we were talking about using dimensionality reduction before clustering.

Clustering genes rather than voxels

Although the ultimate goal is to cluster the instances (voxels or pixels), one strategy to achieve this goal is to first cluster the features (genes). There are two ways that clusters of genes could be used.

Gene clusters could be used as part of dimensionality reduction: rather than have one feature for each gene, we could have one reduced feature for each gene cluster.

Gene clusters could also be used to directly yield a clustering on instances. This is because many genes have an expression pattern which seems to pick out a single, spatially contiguous subregion. Therefore, it seems likely that an anatomically interesting subregion will have multiple genes which each individually pick it out[1]. This suggests the following procedure: cluster together genes which pick out similar subregions, and then use the most commonly picked-out subregions as the final clusters. In Preliminary Results we show that a number of anatomically recognized cortical regions, as well as some "superregions" formed by lumping together a few regions, are associated with gene clusters in this fashion.

[1] This would seem to contradict our finding in aim 1 that some cortical areas are combinatorially coded by multiple genes. However, it is possible that the currently accepted cortical maps divide the cortex into subregions which are unnatural from the point of view of gene expression; perhaps there is some other way to map the cortex for which each subregion can be identified by single genes.
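
The sketch below illustrates both uses of gene clusters on synthetic data: genes are clustered by the similarity of their spatial expression maps, and each gene cluster's mean map then serves as one reduced feature (or, thresholded, as a candidate subregion). The use of k-means and all array names are assumptions made for the illustration.

```python
# Illustrative sketch: cluster the genes by their spatial expression
# patterns, then derive one reduced feature per gene cluster.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
n_pixels, n_genes = 1200, 200
X = rng.random((n_pixels, n_genes))      # expression, pixels x genes

# Clustering the *genes*: each gene is described by its expression map
# (a column of X), so we cluster the transposed matrix.
n_gene_clusters = 12
gene_clusters = KMeans(n_clusters=n_gene_clusters, n_init=10,
                       random_state=0).fit_predict(X.T)

# Dimensionality reduction: one reduced feature per gene cluster, here
# simply the mean expression of that cluster's genes at each pixel.
reduced = np.column_stack([X[:, gene_clusters == c].mean(axis=1)
                           for c in range(n_gene_clusters)])
print("reduced instances have shape", reduced.shape)   # (n_pixels, n_gene_clusters)

# Thresholding a cluster's mean map would instead propose a candidate
# subregion directly (the "commonly picked-out subregion" idea above).
```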

Aim 3

Background

The cortex is divided into areas and layers. To a first approximation, the parcellation of the cortex into areas can be drawn as a 2-D map on the surface of the cortex. In the third dimension, the boundaries between the areas continue downwards into the cortical depth, perpendicular to the surface. The layer boundaries run parallel to the surface. One can picture an area of the cortex as a slice of a many-layered cake.

Although it is known that different cortical areas have distinct roles in both normal functioning and in disease processes, there are no known marker genes for many cortical areas. When it is necessary to divide a tissue sample into cortical areas, this is a manual process that requires a skilled human to combine multiple visual cues and interpret them in the context of their approximate location upon the cortical surface.

Even the questions of how many areas should be recognized in cortex, and what their arrangement is, are still not completely settled. A proposed division of the cortex into areas is called a cortical map. In the rodent, the lack of a single agreed-upon map can be seen by contrasting the recent maps given by Swanson?? on the one hand, and Paxinos and Franklin?? on the other. While the maps are certainly very similar in their general arrangement, significant differences remain in the details.

Significance

The method developed in aim (1) will be applied to each cortical area to find a set of marker genes such that the combinatorial expression pattern of those genes uniquely picks out the target area. Finding marker genes will be useful for drug discovery as well as for experimentation, because marker genes can be used to design interventions which selectively target individual cortical areas.

The application of the marker gene finding algorithm to the cortex will also support the development of new neuroanatomical methods. In addition to finding markers for each individual cortical area, we will find a small panel of genes that can find many of the areal boundaries at once. This panel of marker genes will support the development of an ISH protocol that allows experimenters to more easily identify which anatomical areas are present in small samples of cortex.

The method developed in aim (2) will provide a genoarchitectonic viewpoint that will contribute to the creation of a better map. The development of present-day cortical maps was driven by the application of histological stains. It is conceivable that if a different set of stains had been available which identified a different set of features, then today's cortical maps would have come out differently. Since the number of classes of stains is small compared to the number of genes, it is likely that there are many repeated, salient spatial patterns in gene expression which have not yet been captured by any stain. Therefore, current ideas about cortical anatomy need to incorporate what we can learn from looking at the patterns of gene expression.

While we do not here propose to analyze human gene expression data, it is conceivable that the methods we propose to develop could be used to suggest modifications to the human cortical map as well.

Related work

todo

vs. AGEA – i wrote something on this but i'm going to rewrite it

Preliminary work

Format conversion between SEV, MATLAB, NIFTI

todo

Flatmap of cortex

todo

Using combinations of multiple genes is necessary and sufficient to delineate some cortical areas

Here we give an example of a cortical area which is not marked by any single gene, but which can be identified combinatorially. According to logistic regression, the gene Wwc1[2] is the best-fit single gene for predicting whether or not a pixel on the cortical surface belongs to the motor area (area MO). The upper left of Figure 1 shows Wwc1's spatial expression pattern over the cortex. The lower-right boundary of MO is represented reasonably well by this gene; however, the gene overshoots the upper-left boundary. This flattened 2-D representation does not show it, but the area corresponding to the overshoot is the medial surface of the cortex. MO is only found on the lateral surface (todo).

Gene Mtif2[3] is shown in the upper right of Figure 1. Mtif2 captures MO's upper-left boundary, but not its lower-right boundary. Mtif2 does not express very much on the medial surface. By adding together the values at each pixel in these two maps, we get the lower left of Figure 1. This combination captures area MO much better than any single gene.

[2] "WW, C2 and coiled-coil domain containing 1"; EntrezGene ID 211652

[3] "mitochondrial translational initiation factor 2"; EntrezGene ID 76784

Figure 1: Upper left: Wwc1. Upper right: Mtif2. Lower left: Wwc1 + Mtif2 (each pixel's value on the lower left is the sum of the corresponding pixels in the upper row). Within each picture, the vertical axis roughly corresponds to anterior at the top and posterior at the bottom, and the horizontal axis roughly corresponds to medial at the left and lateral at the right. The red outline is the boundary of region MO. Pixels are colored approximately according to the density of expressing cells underneath each pixel, with red meaning a lot of expression and blue meaning little.

Geometric and pointwise scoring methods provide complementary information

To show that local geometry can provide useful information that cannot be detected via pointwise analyses, consider Figure 2. The top row of Figure 2 displays the 3 genes which most match area AUD according to a pointwise method[4]. The bottom row displays the 3 genes which most match AUD according to a method which considers local geometry[5]. The pointwise method in the top row identifies genes which express more strongly in AUD than outside of it; its weakness is that this includes genes whose expression extends over areas which don't have a salient border matching the areal border. The geometric method identifies genes whose salient expression border seems to partially line up with the border of AUD; its weakness is that this includes genes which don't express over the entire area. Genes which have high rankings using both pointwise and border criteria, such as Aph1a in this example, may be particularly good markers. None of these genes are, individually, a perfect marker for AUD; we deliberately chose a "difficult" area in order to better contrast pointwise with geometric methods.

[4] For each gene, a logistic regression in which the response variable was whether or not a surface pixel was within area AUD, and the predictor variable was the value of the expression of the gene underneath that pixel. The resulting scores were used to rank the genes in terms of how well they predict area AUD.

[5] For each gene, the gradient similarity (see section ??) between (a) a map of the expression of that gene on the cortical surface and (b) the shape of area AUD was calculated, and this was used to rank the genes.

Figure 2: The top row shows the three genes which (individually) best predict area AUD, according to logistic regression. The bottom row shows the three genes which (individually) best match area AUD, according to gradient similarity. From left to right and top to bottom, the genes are Ssr1, Efcbp1, Aph1a, Ptk7, Aph1a again, and Lepr.
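
A sketch of the pointwise ranking described in footnote [4] is given below. The data are synthetic, and the use of mean log-likelihood as the per-gene score is our assumption for the illustration; the geometric ranking of footnote [5] would replace single_gene_score with a score computed from expression gradients.

```python
# Illustrative sketch of the pointwise single-gene ranking of footnote [4]:
# for each gene, fit a logistic regression predicting area membership of each
# surface pixel from that gene's expression alone, then rank the genes by fit
# quality. Data are synthetic; mean log-likelihood is used as the score here.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_pixels, n_genes = 2000, 100
expression = rng.random((n_pixels, n_genes))   # expression under each surface pixel
in_area = rng.integers(0, 2, size=n_pixels)    # 1 if the pixel lies within the area

def single_gene_score(g):
    """Pointwise score: fit quality of a one-gene logistic regression."""
    x = expression[:, [g]]
    clf = LogisticRegression().fit(x, in_area)
    p = clf.predict_proba(x)[:, 1]
    # mean log-likelihood of the observed labels under the fitted model
    return np.mean(in_area * np.log(p) + (1 - in_area) * np.log(1 - p))

ranking = sorted(range(n_genes), key=single_gene_score, reverse=True)
print("top 3 genes by pointwise score:", ranking[:3])
```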

Areas which can be identified by single genes

todo

Aim 1 (and Aim 3)

SVM on all genes at once

In order to see how well one can do when looking at all genes at once, we ran a support vector machine to classify cortical surface pixels based on their gene expression profiles. We achieved classification accuracy of about 81%[6]. As noted above, however, a classifier that looks at all the genes at once isn't practically useful.

The requirement to find combinations of only a small number of genes prevents us from straightforwardly applying many of the simplest techniques from the field of supervised machine learning. In the parlance of machine learning, our task combines feature selection with supervised learning.

[6] Using the Shogun SVM package (todo:cite), with parameters type=GMNPSVM (multi-class b-SVM), kernel = Gaussian with sigma = 0.1, C = 10, epsilon = 1e-1; these are the first parameters we tried, so presumably performance would improve with different choices of parameters. 5-fold cross-validation.
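
For reference, the sketch below re-creates this kind of all-genes baseline with scikit-learn's SVC standing in for the Shogun SVM we actually used; the data are synthetic, and the RBF-kernel parameters only loosely echo footnote [6], since the two packages are parameterized differently.

```python
# Illustrative "all genes at once" baseline: a multi-class SVM with a Gaussian
# (RBF) kernel, evaluated by 5-fold cross-validation. Synthetic data; SVC is a
# stand-in for the Shogun GMNPSVM used in our preliminary analysis.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_pixels, n_genes, n_areas = 1000, 200, 10
X = rng.random((n_pixels, n_genes))           # full expression profile of each pixel
y = rng.integers(0, n_areas, size=n_pixels)   # cortical area label of each pixel

clf = SVC(kernel="rbf", C=10.0, gamma="scale")   # multi-class handled internally
accuracy = cross_val_score(clf, X, y, cv=5).mean()
print(f"5-fold cross-validated accuracy: {accuracy:.2f}")
```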

Decision trees

todo

Aim 2 (and Aim 3)

Raw dimensionality reduction results

Dimensionality reduction plus K-means or spectral clustering

Many areas are captured by clusters of genes

todo

todo

Research plan

todo

amongst other things:

Develop algorithms that find genetic markers for anatomical regions

1. Develop scoring measures for evaluating how good individual genes are at marking areas: we will compare pointwise, geometric, and information-theoretic measures.

2. Develop a procedure to find single marker genes for anatomical regions: for each cortical area, by using or combining the scoring measures developed, we will rank the genes by their ability to delineate each area.

3. Extend the procedure to handle difficult areas by using combinatorial coding: for areas that cannot be identified by any single gene, identify them with a handful of genes. We will consider both (a) algorithms that incrementally/greedily combine single gene markers into sets, such as forward stepwise regression and decision trees, and also (b) supervised learning techniques which use soft constraints to minimize the number of features, such as sparse support vector machines (an illustrative sketch of this sparsity-based approach appears after these lists).

4. Extend the procedure to handle difficult areas by combining or redrawing the boundaries: an area may be difficult to identify because the boundaries are misdrawn, or because it does not "really" exist as a single area, at least on the genetic level. We will develop extensions to our procedure which (a) detect when a difficult area could be fit if its boundary were redrawn slightly, and (b) detect when a difficult area could be combined with adjacent areas to create a larger area which can be fit.

Apply these algorithms to the cortex

1. Create open source format conversion tools: we will create tools to bulk download the ABA dataset and to convert between SEV, NIFTI and MATLAB formats.

2. Flatmap the ABA cortex data: map the ABA data onto a plane and draw the cortical area boundaries onto it.

3. Find layer boundaries: cluster similar voxels together in order to automatically find the cortical layer boundaries.

4. Run the procedures that we developed on the cortex: we will present, for each area, a short list of markers to identify that area; and we will also present lists of "panels" of genes that can be used to delineate many areas at once.

Develop algorithms to suggest a division of a structure into anatomical parts

1. Explore dimensionality reduction algorithms applied to pixels: including TODO

2. Explore dimensionality reduction algorithms applied to genes: including TODO

3. Explore clustering algorithms applied to pixels: including TODO

4. Explore clustering algorithms applied to genes: including gene shaving, TODO

5. Develop an algorithm to use dimensionality reduction and/or hierarchical clustering to create anatomical maps

6. Run this algorithm on the cortex: present a hierarchical, genoarchitectonic map of the cortex
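
The sketch below illustrates the sparsity-based option referenced in item 3 of the first list above. It uses L1-regularized logistic regression as a stand-in for a sparse SVM, on synthetic data; the regularization strength and all names are assumptions made for the illustration.

```python
# Illustrative sketch of a soft sparsity constraint for marker selection:
# an L1 penalty pushes most gene weights to exactly zero, so only a small
# panel of genes is needed to identify the area. L1-regularized logistic
# regression stands in here for a sparse SVM; the data are synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_pixels, n_genes = 2000, 300
X = rng.random((n_pixels, n_genes))        # expression profiles of surface pixels
y = rng.integers(0, 2, size=n_pixels)      # 1 if the pixel lies in the target area

# Smaller C = stronger sparsity pressure = fewer genes with nonzero weight.
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.05).fit(X, y)

selected_genes = np.flatnonzero(clf.coef_[0])
print(f"{selected_genes.size} genes with nonzero weight:", selected_genes[:10])
```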

______________________________________________

stuff i dunno where to put yet (there is more scattered through grant-oldtext):

Principle 4: Work in 2-D whenever possible

In anatomy, the manifold of interest is usually either defined by a combination of two relevant anatomical axes (todo), or by the surface of the structure (as is the case with the cortex). In the former case, the manifold of interest is a plane, but in the latter case it is curved. If the manifold is curved, there are various methods for mapping the manifold into a plane.

The method that we will develop will begin by mapping the data onto a 2-D plane. Although the manifold that characterizes cortical areas is known to be the cortical surface, it remains to be seen which method of mapping the manifold into a plane is optimal for this application. We will compare mappings which attempt to preserve size (such as the one used by Caret??) with mappings which preserve angle (conformal maps).

Although there is much 2-D organization in anatomy, there are also structures whose shape is fundamentally 3-dimensional. If possible, we would like the method we develop to include a statistical test that warns the user if the assumption of 2-D structure seems to be wrong.