cg

diff grant.html @ 0:29eee29f9bc1
initial commit to hg version control repository
author: bshanks@bshanks-salk.dyndns.org
date: Sat Apr 11 19:12:32 2009 -0700 (16 years ago)
children: 7487ad7f5d8f
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/grant.html	Sat Apr 11 19:12:32 2009 -0700
@@ -0,0 +1,790 @@
+Specific aims
+            Massive new datasets obtained with techniques such as in situ hybridization
+            (ISH) and BAC-transgenics allow the expression levels of many genes at many
+            locations to be compared. Our goal is to develop automated methods to relate
+            spatial variation in gene expression to anatomy. We want to find marker genes
+            for specific anatomical regions, and also to draw new anatomical maps based on
+            gene expression patterns. We have three specific aims:
+             (1) develop an algorithm to screen spatial gene expression data for combina-
+                 tions of marker genes which selectively target anatomical regions
+             (2) develop an algorithm to suggest new ways of carving up a structure into
+                 anatomical subregions, based on spatial patterns in gene expression
+             (3) create a 2-D &#8220;flat map&#8221; dataset of the mouse cerebral cortex that contains
+                 a flattened version of the Allen Mouse Brain Atlas ISH data, as well as
+                 the boundaries of cortical anatomical areas.  Use this dataset to validate
+                 the methods developed in (1) and (2).
+               In addition to validating the usefulness of the algorithms, the application of
+            these methods to cerebral cortex will produce immediate benefits, because there
+            are currently no known genetic markers for many cortical areas.  The results
+            of the project will support the development of new ways to selectively target
+            cortical areas, and it will support the development of a method for identifying
+            the cortical areal boundaries present in small tissue samples.
+               All algorithms that we develop will be implemented in an open-source soft-
+            ware toolkit.  The toolkit, as well as the machine-readable datasets developed
+            in aim (3), will be published and freely available for others to use.
+             Background and significance
+             Aim 1
+             Machine learning terminology
+            The task of looking for marker genes for anatomical subregions means that one
+            is looking for a set of genes such that, if the expression level of those genes is
+            known, then the locations of the subregions can be inferred.
+               If we define the subregions so that they cover the entire anatomical structure
+            to be divided, then instead of saying that we are using gene expression to find
+            the locations of the subregions, we may say that we are using gene expression to
+            determine to which subregion each voxel within the structure belongs. We call
+            this a classification task, because each voxel is being assigned to a class (namely,
+            its subregion).
+               Therefore, an understanding of the relationship between the combination of
+            gene expression levels and the locations of the subregions may be expressed as
+            a function. The input to this function is a voxel, along with the gene expression
+            levels within that voxel;  the output is the subregional identity of the target
+            voxel, that is, the subregion to which the target voxel belongs.  We call this
+            function a classifier.  In general, the input to a classifier is called an instance,
+            and the output is called a label.
+               The object of aim 1 is not to produce a single classifier, but rather to develop
+            an automated method for determining a classifier for any known anatomical
+            structure.  Therefore, we seek a procedure by which a gene expression dataset
+            may be analyzed in concert with an anatomical atlas in order to produce a
+            classifier.  Such a procedure is a type of machine learning procedure.  The
+            construction of the classifier is called training (also learning), and the initial
+            gene expression dataset used in the construction of the classifier is called training
+            data.
+               In the machine learning literature, this sort of procedure may be thought
+            of as a supervised learning task, defined as a task in which the goal is to learn
+            a mapping from instances to labels, and the training data consists of a set of
+            instances (voxels) for which the labels (subregions) are known.
+               Each gene expression level is called a feature, and the selection of which
+            genes to include is called feature selection.  Feature selection is one component
+            of the task of learning a classifier.  Some methods for learning classifiers start
+            out with a separate feature selection phase, whereas other methods combine
+            feature selection with other aspects of training.
+               One class of feature selection methods assigns some sort of score to each
+            candidate gene. The top-ranked genes are then chosen. Some scoring measures
+            can assign a score to a set of selected genes, not just to a single gene; in this
+            case, a dynamic procedure may be used in which features are added to and
+            removed from the selected set depending on how much they raise the score. Such
+            procedures are called &#8220;stepwise&#8221; or &#8220;greedy&#8221;.
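+               As an illustration only, the following minimal Python sketch shows the additive
+            half of such a greedy procedure. The function name, the toy binary expression data,
+            and the accuracy-based set score are assumptions made for this example; they are
+            not the scoring measures proposed here, and the removal step is omitted for brevity.

```python
from typing import Callable, Dict, FrozenSet, List, Sequence


def greedy_forward_selection(
    candidates: Sequence[str],
    score: Callable[[FrozenSet[str]], float],
    max_genes: int,
) -> List[str]:
    """Repeatedly add the single gene that most raises the score of the selected set."""
    selected: List[str] = []
    best = score(frozenset())
    while len(selected) < max_genes:
        trials = [
            (score(frozenset(selected) | {gene}), gene)
            for gene in candidates
            if gene not in selected
        ]
        if not trials:
            break
        top_score, top_gene = max(trials)
        if top_score <= best:  # no remaining gene improves the score
            break
        selected.append(top_gene)
        best = top_score
    return selected


# Hypothetical toy data: 8 voxels, label 1 = inside the target subregion.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
expression: Dict[str, List[int]] = {
    "geneA": [1, 1, 1, 0, 0, 0, 0, 0],
    "geneB": [0, 0, 0, 1, 0, 0, 0, 0],
    "geneC": [1, 1, 1, 1, 1, 1, 1, 0],
}


def accuracy(genes: FrozenSet[str]) -> float:
    """Toy set score: predict 'inside' wherever any selected gene is expressed."""
    predicted = [
        int(any(expression[g][i] for g in genes)) for i in range(len(labels))
    ]
    return sum(p == y for p, y in zip(predicted, labels)) / len(labels)


chosen = greedy_forward_selection(list(expression), accuracy, max_genes=2)
```

+               In this toy case neither geneA nor geneB marks the subregion alone, but the
+            greedy search finds that the pair does.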
+               Although the classifier itself may only look at the gene expression data within
+            each voxel before classifying that voxel, the learning algorithm which constructs
+            the classifier may look over the entire dataset.  We can categorize score-based
+            feature selection methods depending on how the score is calculated.   Often
+            the score calculation consists of assigning a sub-score to each voxel, and then
+            aggregating these sub-scores into a final score (the aggregation is often a sum or
+            a sum of squares). If only information from nearby voxels is used to calculate a
+            voxel&#8217;s sub-score, then we say it is a local scoring method.  If only information
+            from the voxel itself is used to calculate a voxel&#8217;s sub-score, then we say it is a
+            pointwise scoring method.
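+               The distinction can be made concrete with a small Python sketch. Both functions
+            aggregate per-pixel sub-scores by a sum of squares (lower is better). The particular
+            sub-scores, the forward-difference neighborhood, and the toy maps are assumptions
+            chosen for illustration; they are not the scoring measures used in this proposal.

```python
from typing import List

Grid = List[List[float]]


def pointwise_score(expr: Grid, mask: Grid) -> float:
    """Each sub-score uses only the pixel itself: squared expression/label mismatch."""
    return sum(
        (e - m) ** 2
        for row_e, row_m in zip(expr, mask)
        for e, m in zip(row_e, row_m)
    )


def local_score(expr: Grid, mask: Grid) -> float:
    """Each sub-score also uses the right and lower neighbors: squared mismatch
    between the finite-difference gradients of the expression map and the mask."""
    rows, cols = len(mask), len(mask[0])
    total = 0.0
    for r in range(rows):
        for c in range(cols):
            for dr, dc in ((0, 1), (1, 0)):
                rr, cc = r + dr, c + dc
                if rr < rows and cc < cols:
                    d_expr = expr[rr][cc] - expr[r][c]
                    d_mask = mask[rr][cc] - mask[r][c]
                    total += (d_expr - d_mask) ** 2
    return total


# Hypothetical 2 x 4 maps; the target region is the right half.
mask = [[0, 0, 1, 1], [0, 0, 1, 1]]
# Correct border, but a uniform background offset of 0.5.
gene_offset = [[0.5, 0.5, 1.5, 1.5], [0.5, 0.5, 1.5, 1.5]]
# No offset, but expression covers only part of the region.
gene_partial = [[0, 0, 1, 0], [0, 0, 1, 0]]

pointwise = (pointwise_score(gene_offset, mask), pointwise_score(gene_partial, mask))
local = (local_score(gene_offset, mask), local_score(gene_partial, mask))
```

+               Here the two toy genes tie under the pointwise score, while the local score
+            separates them because only one of them has a border that follows the region.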
+               Key questions when choosing a learning method are: What are the instances?
+            What are the features?  How are the features chosen?  Here are four principles
+            that outline our answers to these questions.
+             Principle 1: Combinatorial gene expression
+            Above, we defined an &#8220;instance&#8221; as the combination of a voxel with the &#8220;asso-
+            ciated gene expression data&#8221;.  In our case this refers to the expression level of
+            genes within the voxel, but should we include the expression levels of all genes,
+            or only a few of them?
+               It is too much to hope that every anatomical region of interest will be iden-
+            tified by a single gene. For example, in the cortex, there are some areas which
+            are not clearly delineated by any gene included in the Allen Brain Atlas (ABA)
+            dataset.  However, at least some of these areas can be delineated by looking
+            at combinations of genes (an example of an area for which multiple genes are
+            necessary and sufficient is provided in Preliminary Results).
+             Principle 2: Only look at combinations of small numbers of genes
+            When the classifier classifies a voxel, it is only allowed to look at the expression of
+            the genes which have been selected as features. The more data that is available
+            to a classifier, the better that it can do.  For example, perhaps there are weak
+            correlations over many genes that add up to a strong signal. So, why not include
+            every gene as a feature? The reason is that we wish to employ the classifier in
+            situations in which it is not feasible to gather data about every gene.   For
+            example, if we want to use the expression of marker genes as a trigger for some
+            regionally-targeted intervention, then our intervention must contain a molecular
+            mechanism to check the expression level of each marker gene before it triggers.
+            It is currently infeasible to design a molecular trigger that checks the level of
+            more than a handful of genes. Similarly, if the goal is to develop a procedure to
+            do ISH on tissue samples in order to label their anatomy, then it is infeasible
+            to label more than a few genes.  Therefore, we must select only a few genes as
+            features.
+             Principle 3: Use geometry in feature selection
+            When doing feature selection with score-based methods, the simplest thing to
+            do would be to score the performance of each voxel by itself and then combine
+            these scores; this is pointwise scoring. A more powerful approach is to also use
+            information about the geometric relations between each voxel and its neighbors;
+            this requires non-pointwise, local scoring methods. See Preliminary Results for
+            evidence of the complementary nature of pointwise and local scoring methods.
+             Principle 4: Work in 2-D whenever possible
+            There are many anatomical structures which are commonly characterized in
+            terms of a two-dimensional manifold. When it is known that the structure that
+            one is looking for is two-dimensional, the results may be improved by allowing
+            the analysis algorithm to take advantage of this prior knowledge.  In addition,
+            it is easier for humans to visualize and work with 2-D data.
+               Therefore, when possible, the instances should represent pixels, not voxels.
+             Aim 3
+             Background
+            The cortex is divided into areas and layers.  To a first approximation, the par-
+            cellation of the cortex into areas can be drawn as a 2-D map on the surface
+            of the cortex.  In the third dimension, the boundaries between the areas con-
+            tinue downwards into the cortical depth, perpendicular to the surface. The layer
+            boundaries run parallel to the surface. One can picture an area of the cortex as
+            a slice of many-layered cake.
+               Although it is known that different cortical areas have distinct roles in both
+            normal functioning and in disease processes, there are no known marker genes
+            for many cortical areas.  When it is necessary to divide a tissue sample into
+            cortical areas, this is a manual process that requires a skilled human to combine
+            multiple visual cues and interpret them in the context of their approximate
+            location upon the cortical surface.
+               Even the questions of how many areas should be recognized in cortex, and
+            what their arrangement is, are still not completely settled. A proposed division
+            of the cortex into areas is called a cortical map.  In the rodent, the lack of a
+            single agreed-upon map can be seen by contrasting the recent maps given by
+            Swanson?? on the one hand, and Paxinos and Franklin?? on the other. While
+            the maps are certainly very similar in their general arrangement, significant
+            differences remain in the details.
+             Significance
+            The method developed in aim (1) will be applied to each cortical area to find
+            a set of marker genes such that the combinatorial expression pattern of those
+            genes uniquely picks out the target area.  Finding marker genes will be useful
+            for drug discovery as well as for experimentation because marker genes can be
+            used to design interventions which selectively target individual cortical areas.
+               The application of the marker gene finding algorithm to the cortex will
+            also support the development of new neuroanatomical methods. In addition to
+            finding markers for each individual cortical area, we will find a small panel
+            of genes that can find many of the areal boundaries at once.  This panel of
+            marker genes will allow the development of an ISH protocol that will allow
+            experimenters to more easily identify which anatomical areas are present in
+            small samples of cortex.
+               The method developed in aim (3) will provide a genoarchitectonic viewpoint
+            that will contribute to the creation of a better map. The development of present-
+            day cortical maps was driven by the application of histological stains.   It is
+            conceivable that if a different set of stains had been available which identified
+            a different set of features, then today&#8217;s cortical maps would have come out
+            differently. Since the number of classes of stains is small compared to the number
+            of genes, it is likely that there are many repeated, salient spatial patterns in
+            the gene expression which have not yet been captured by any stain. Therefore,
+            current ideas about cortical anatomy need to incorporate what we can learn
+            from looking at the patterns of gene expression.
+               While we do not here propose to analyze human gene expression data, it is
+            conceivable that the methods we propose to develop could be used to suggest
+            modifications to the human cortical map as well.
+             Related work
+             Preliminary work
+             Justification of principles 1 through 3
+             Principle 1: Combinatorial gene expression
+            Here we give an example of a cortical area which is not marked by any single
+            gene, but which can be identified combinatorially.  According to logistic
+            regression, gene wwc1 (footnote 1) is the best-fit single gene for predicting whether
+            or not a pixel on the cortical surface belongs to the motor area (area MO). The upper-left
+            picture in Figure 1 shows wwc1&#8217;s spatial expression pattern over the cortex. The
+            lower-right boundary of MO is represented reasonably well by this gene; however,
+            the gene overshoots the upper-left boundary. This flattened 2-D representation
+            does not show it, but the area corresponding to the overshoot is the medial
+            surface of the cortex. MO is only found on the lateral surface (todo).
+               Gene mtif2 (footnote 2) is shown in the upper right of Figure 1. Mtif2 captures MO&#8217;s
+            upper-left boundary, but not its lower-right boundary.  Mtif2 does not express
+            very much on the medial surface.  By adding together the values at each pixel
+            in these two figures, we get the lower-left of Figure 1. This combination captures
+            area MO much better than any single gene.
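+               The same effect can be reproduced in a toy Python sketch. The one-dimensional
+            expression profiles are synthetic stand-ins invented for this example (they are not
+            ABA values for wwc1 or mtif2), and thresholding is used here as a simple, assumed
+            substitute for the logistic regression fit.

```python
from typing import List, Sequence


def best_threshold_accuracy(values: Sequence[float], mask: Sequence[int]) -> float:
    """Best accuracy achievable by labeling a pixel 'inside' when value > threshold."""
    levels = sorted(set(values))
    # Candidate thresholds lie midway between consecutive observed levels.
    thresholds = [(a + b) / 2 for a, b in zip(levels, levels[1:])]
    best = 0.0
    for t in thresholds:
        predicted = [int(v > t) for v in values]
        hits = sum(p == m for p, m in zip(predicted, mask))
        best = max(best, hits / len(mask))
    return best


# Synthetic row of pixels; 1 = inside the target area.
region = [0, 0, 1, 1, 0, 0]
gene_1 = [1, 1, 1, 1, 0, 0]  # overshoots the boundary on the left
gene_2 = [0, 0, 1, 1, 1, 1]  # overshoots the boundary on the right
combined: List[int] = [a + b for a, b in zip(gene_1, gene_2)]

single_best = max(
    best_threshold_accuracy(gene_1, region),
    best_threshold_accuracy(gene_2, region),
)
combined_best = best_threshold_accuracy(combined, region)
```

+               Each synthetic gene alone misclassifies two of the six pixels, whereas the
+            pixelwise sum separates the region exactly.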
+             Principle 2: Only look at combinations of small numbers of genes
+            In order to see how well one can do when looking at all genes at once, we ran
+            a support vector machine to classify cortical surface pixels based on their gene
+            expression profiles. We achieved classification accuracy of about 81% (footnote 3). As noted
+            above, however, a classifier that looks at all the genes at once isn&#8217;t practically
+            useful.
+               The requirement to find combinations of only a small number of genes prevents
+            us from straightforwardly applying many of the simplest techniques from
+            the field of supervised machine learning.  In the parlance of machine learning,
+            our task combines feature selection with supervised learning.
+__________________________
+   1 &#8220;WW, C2 and coiled-coil domain containing 1&#8221;; EntrezGene ID 211652
+   2 &#8220;mitochondrial translational initiation factor 2&#8221;; EntrezGene ID 76784
+   3 Using the Shogun SVM package (todo:cite), with parameters type=GMNPSVM (multi-class
+b-SVM), kernel = gaussian with sigma = 0.1, c = 10, epsilon = 1e-1 &#8211; these are the
+first parameters we tried, so presumably performance would improve with different choices of
+parameters. Evaluated with 5-fold cross-validation.
+            Figure 1:  Upper left:  wwc1.  Upper right:  mtif2.  Lower left:  wwc1 + mtif2
+            (each pixel&#8217;s value on the lower left is the sum of the corresponding pixels in
+            the upper row).  Within each picture, the vertical axis roughly corresponds to
+            anterior at the top and posterior at the bottom, and the horizontal axis roughly
+            corresponds to medial at the left and lateral at the right.  The red outline is
+            the boundary of region MO. Pixels are colored approximately according to the
+            density of expressing cells underneath each pixel, with red meaning a lot of
+            expression and blue meaning little.
+            Figure 2: The top row shows the three genes which (individually) best predict
+            area AUD, according to logistic regression.  The bottom row shows the three
+            genes which (individually) best match area AUD, according to gradient similar-
+            ity. From left to right and top to bottom, the genes are Ssr1, Efcbp1, Aph1a,
+            Ptk7, Aph1a again, and Lepr.
+             Principle 3: Use geometry
+            To show that local geometry can provide useful information that cannot be
+            detected via pointwise analyses, consider Figure 2.  The top row of Figure 2 displays
+            the 3 genes which most match area AUD, according to a pointwise method (footnote 4). The
+            bottom row displays the 3 genes which most match AUD according to a method
+            which considers local geometry (footnote 5). The pointwise method in the top row identifies
+            genes which express more strongly in AUD than outside of it; its weakness is that
+            this includes many genes which don&#8217;t have a salient border matching the areal
+            border. The geometric method identifies genes whose salient expression border
+            seems to partially line up with the border of AUD; its weakness is that this
+            includes genes which don&#8217;t express over the entire area. Genes which have high
+            rankings using both pointwise and border criteria, such as Aph1a in the example,
+            may be particularly good markers.   None of these genes is,  individually,  a
+            perfect marker for AUD; we deliberately chose a &#8220;difficult&#8221; area in order to
+            better contrast pointwise with geometric methods.
+__________________________
+   4 For each gene, a logistic regression was fit in which the response variable was whether or not a
+surface pixel was within area AUD, and the predictor variable was the value of the expression
+of the gene underneath that pixel. The resulting scores were used to rank the genes in terms
+of how well they predict area AUD.
+   5 For each gene, the gradient similarity (see section ??) between (a) a map of the expression
+of that gene on the cortical surface and (b) the shape of area AUD was calculated, and this
+was used to rank the genes.
+             Principle 4: Work in 2-D whenever possible
+            In anatomy, the manifold of interest is usually either defined by a combination
+            of two relevant anatomical axes (todo), or by the surface of the structure (as is
+            the case with the cortex). In the former case, the manifold of interest is a plane,
+            but in the latter case it is curved.  If the manifold is curved, there are various
+            methods for mapping the manifold into a plane.
+               The method that we will develop will begin by mapping the data into a
+            2-D plane.  Although the manifold that characterizes cortical areas is known
+            to be the cortical surface, it remains to be seen which method of mapping the
+            manifold into a plane is optimal for this application. We will compare mappings
+            which attempt to preserve size (such as the one used by Caret??) with mappings
+            which preserve angle (conformal maps).
+               Although there is much 2-D organization in anatomy, there are also struc-
+            tures whose shape is fundamentally 3-dimensional.  If possible, we would like
+            the method we develop to include a statistical test that warns the user if the
+            assumption of 2-D structure seems to be wrong.
+               &#8212;&#8212;
+               Massive new datasets obtained with techniques such as in situ hybridization
+            (ISH) and BAC-transgenics allow the expression levels of many genes at many
+            locations to be compared.  This can be used to find marker genes for specific
+            anatomical structures, as well as to draw new anatomical maps.  Our goal is
+            to develop automated methods to relate spatial variation in gene expression to
+            anatomy. We have five specific aims:
+             (1) develop an algorithm to screen spatial gene expression data for combi-
+                 nations of marker genes which selectively target individual anatomical
+                 structures
+             (2) develop an algorithm to screen spatial gene expression data for combina-
+                 tions of marker genes which can be used to delineate most of the bound-
+                 aries between a number of anatomical structures at once
+             (3) develop an algorithm to suggest new ways of dividing a structure up into
+                 anatomical subregions, based on spatial patterns in gene expression
+             (4) create a flat (2-D) map of the mouse cerebral cortex that contains a flat-
+                 tened version of the Allen Mouse Brain Atlas ISH dataset, as well as the
+                 boundaries of anatomical areas within the cortex. For each cortical layer,
+                 a layer-specific flat dataset will be created. A single combined flat dataset
+                 will be created which averages information from all of the layers.  These
+                 datasets will be made available in both MATLAB and Caret formats.
+             (5) validate the methods developed in (1), (2) and (3) by applying them to
+                 the cerebral cortex datasets created in (4)
+               All algorithms that we develop will be implemented in an open-source soft-
+            ware toolkit. The toolkit, as well as the machine-readable datasets developed in
+            aim (4) and any other intermediate dataset we produce, will be published and
+            freely available for others to use.
+               In addition to developing generally useful methods, the application of these
+            methods to cerebral cortex will produce immediate benefits that are only one
+            step removed from clinical application, while also supporting the development
+            of new neuroanatomical techniques.  The method developed in aim (1) will be
+            applied to each cortical area to find a set of marker genes.  Currently, despite
+            the distinct roles of different cortical areas in both normal functioning and
+            disease processes, there are no known marker genes for many cortical areas.
+            Finding marker genes will be immediately useful for drug discovery as well as for
+            experimentation because once marker genes for an area are known, interventions
+            can be designed which selectively target that area.
+               The method developed in aim (2) will be used to find a small panel of genes
+            that can find most of the boundaries between areas in the cortex. Today, finding
+            cortical areal boundaries in a tissue sample is a manual process that requires a
+            skilled human to combine multiple visual cues over a large area of the cortical
+            surface. A panel of marker genes will allow the development of an ISH protocol
+            that will allow experimenters to more easily identify which anatomical areas are
+            present in small samples of cortex.
+               For each cortical layer, a layer-specific flat dataset will be created. A single
+            combined flat dataset will be created which averages information from all of
+            the layers. These datasets will be made available in both MATLAB and Caret
+            formats.
+               &#8212;-
+               New techniques allow the expression levels of many genes at many locations
+            to be compared. It is thought that even neighboring anatomical structures have
+            different gene expression profiles.  We propose to develop automated methods
+            to relate the spatial variation in gene expression to anatomy.  We will develop
+            two kinds of techniques:
+             (a) techniques to screen for combinations of marker genes which selectively
+                 target anatomical structures
+             (b) techniques to suggest new ways of dividing a structure up into anatomical
+                 subregions, based on the shapes of contours in the gene expression
+               The first kind of technique will be helpful for finding marker genes associated
+            with known anatomical features. The second kind of technique will be helpful in
+            creating new anatomical maps, maps which reflect differences in gene expression
+            the same way that existing maps reflect differences in histology.
+               We intend to develop our techniques using the adult mouse cerebral cortex
+            as a testbed.   The Allen Brain Atlas has collected a dataset containing the
+            expression level of about 4000 genes* over a set of over 150000 voxels, with a
+            spatial resolution of approximately 200 microns[?].
+               We expect to discover sets of marker genes that pick out specific cortical
+            areas.  This will allow the development of drugs and other interventions that
+            selectively target individual cortical areas.   Therefore our research will lead
+            to application in drug discovery, in the development of other targeted clinical
+            interventions, and in the development of new experimental techniques.
+               The best way to divide up rodent cortex into areas has not been completely
+            determined, as can be seen by the differences in the recent maps given by Swan-
+            son on the one hand, and Paxinos and Franklin on the other. It is likely that our
+            study, by showing which areal divisions naturally follow from gene expression
+            data, as opposed to traditional histological data, will contribute to the creation
+            of a better map. While we do not here propose to analyze human gene expres-
+            sion data, it is conceivable that the methods we propose to develop could be
+            used to suggest modifications to the human cortical map as well.
+               In the following, we will only be talking about coronal data.
+               The Allen Brain Atlas provides &#8220;Smoothed Energy Volumes&#8221;, which are
+               One type of artifact in the Allen Brain Atlas data is what we call a &#8220;slice
+            artifact&#8221;. We have noticed two types of slice artifacts in the dataset. The first
+            type, a &#8220;missing slice artifact&#8221;, occurs when the ISH procedure on a slice did
+            not come out well. In this case, the Allen Brain investigators excluded the slice
+            at issue from the dataset.  This means that no gene expression information is
+            available for that gene for the region of space covered by that slice. This results
+            in an expression level of zero being assigned to voxels covered by the slice. This
+            is partially but not completely ameliorated by the smoothing that is applied to
+            create the Smoothed Energy Volumes. The usual end result is that a region of
+            space which is shaped and oriented like a coronal slice is marked as having less
+            gene expression than surrounding regions.
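+               One simple way to flag candidate missing slice artifacts is sketched in Python.
+            It assumes the volume is indexed with the rostrocaudal (coronal slice) axis first, and
+            it uses a hypothetical cutoff (ratio=0.5) on each slice&#8217;s mean expression relative to
+            its two neighbors; neither the layout nor the cutoff comes from the Allen Brain Atlas
+            documentation, and the heuristic is only an illustration.

```python
from typing import List

Slice = List[List[float]]


def slice_means(volume: List[Slice]) -> List[float]:
    """Mean expression of each coronal slice (axis 0 assumed rostrocaudal)."""
    return [
        sum(sum(row) for row in sl) / (len(sl) * len(sl[0]))
        for sl in volume
    ]


def suspect_missing_slices(volume: List[Slice], ratio: float = 0.5) -> List[int]:
    """Indices of interior slices whose mean expression falls below `ratio`
    times the average of the two neighboring slices' means."""
    means = slice_means(volume)
    suspects = []
    for i in range(1, len(means) - 1):
        neighbor_mean = (means[i - 1] + means[i + 1]) / 2
        if neighbor_mean > 0 and means[i] < ratio * neighbor_mean:
            suspects.append(i)
    return suspects


def uniform_slice(value: float) -> Slice:
    """Hypothetical 2 x 2 slice with constant expression, for the toy volume."""
    return [[value, value], [value, value]]


# Toy volume: the middle slice is dimmed, as after smoothing over a dropped slice.
toy_volume = [uniform_slice(v) for v in (4.0, 4.0, 1.0, 4.0, 4.0)]
flagged = suspect_missing_slices(toy_volume)
```

+               A flagged slice is only a candidate: a genuinely low-expression region that
+            happens to be slab-shaped would be flagged as well.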
+               The second type of slice artifact is caused by the fact that all of the slices
+            have a consistent orientation.  Since there may be artifacts (such as how well
+            the ISH worked) which are constant within each slice but which vary between
+            different slices, the result is that ceteris paribus, when one compares the genetic
+            data of a voxel to another voxel within the same coronal plane, one would expect
+            to find more similarity than if one compared a voxel to another voxel displaced
+            along the rostrocaudal axis.
+               We are enthusiastic about the sharing of methods, data, and results, and
+            at the conclusion of the project, we will make all of our data and computer
+            source code publicly available.  Our goal is that replicating our results, or
+            applying the methods we develop to other targets, will be quick and easy for
+            other investigators. In order to aid in understanding and replicating our results,
+            we intend to include a software program which, when run, will take as input
+            the Allen Brain Atlas raw data, and produce as output all numbers and charts
+            found in publications resulting from the project.
+               To aid in the replication of our results, we will include a script which takes
+            as input the dataset in aim (3) and provides as output all of the tables and figures
+            in our publications.
+               We also expect to weigh in on the debate about how to best partition rodent
+            cortex
+               be useful for drug discovery as well
+               * Another 16000 genes are available, but they do not cover the entire cerebral
+            cortex with high spatial resolution.
+               User-definable ROIs; combinatorial gene expression; negative as well as
+            positive signal; use geometry; search for local boundaries if necessary; flatmapped
+             Specific aims
+            Develop algorithms that find genetic markers for anatomical regions
+              1. Develop scoring measures for evaluating how good individual genes are at
+                 marking areas:  we will compare pointwise, geometric, and information-
+                 theoretic measures.
+              2. Develop a procedure to find single marker genes for anatomical regions: for
+                 each cortical area, by using or combining the scoring measures developed,
+                 we will rank the genes by their ability to delineate each area.
+              3. Extend the procedure to handle difficult areas by using combinatorial cod-
+                 ing: for areas that cannot be identified by any single gene, identify them
+                 with a handful of genes. We will consider both (a) algorithms that incre-
+                 mentally/greedily combine single gene markers into sets, such as forward
+                 stepwise regression and decision trees, and also (b) supervised learning
+                 techniques which use soft constraints to minimize the number of features,
+                 such as sparse support vector machines.
+              4. Extend the procedure to handle difficult areas by combining or redrawing
+                 the boundaries:  An area may be difficult to identify because the bound-
+                 aries are misdrawn, or because it does not &#8220;really&#8221; exist as a single area,
+                 at least on the genetic level. We will develop extensions to our procedure
+                 which (a) detect when a difficult area could be fit if its boundary were
+                 redrawn slightly, and (b) detect when a difficult area could be combined
+                 with adjacent areas to create a larger area which can be fit.
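+               The incremental/greedy option in item 3 above can be sketched as follows.
+            This is a minimal illustration under stated assumptions, not the proposed
+            procedure itself: it assumes thresholded (0/1) expression per voxel, combines
+            genes by intersection only, and scores by plain accuracy. Forward stepwise
+            regression or decision trees would replace these simplifications. The gene
+            names and the function greedy_marker_set are hypothetical.

```python
def accuracy(pred, target):
    """Fraction of voxels where the predicted mask agrees with the target mask."""
    return sum(p == t for p, t in zip(pred, target)) / len(target)


def greedy_marker_set(genes, target, max_genes=3):
    """Greedily pick genes whose intersection (logical AND) best matches target.

    genes  : dict mapping a (hypothetical) gene name to a list of 0/1 values,
             one per voxel (thresholded expression; an assumption of this sketch)
    target : list of 0/1 values, 1 where the voxel lies inside the area
    """
    chosen = []
    pred = [1] * len(target)          # the empty intersection covers every voxel
    best = accuracy(pred, target)
    while len(chosen) < max_genes:
        candidates = []
        for name, expr in genes.items():
            if name in chosen:
                continue
            trial = [p & e for p, e in zip(pred, expr)]
            candidates.append((accuracy(trial, target), name, trial))
        if not candidates:
            break
        score, name, trial = max(candidates, key=lambda c: c[0])
        if score <= best:             # stop when no remaining gene improves the fit
            break
        chosen.append(name)
        pred, best = trial, score
    return chosen, best


# Toy data: 6 voxels; the area of interest is voxels 2 and 3.
target = [0, 0, 1, 1, 0, 0]
genes = {
    "geneA": [1, 1, 1, 1, 0, 0],   # expressed in the area and on one side of it
    "geneB": [0, 0, 1, 1, 1, 1],   # expressed in the area and on the other side
    "geneC": [1, 0, 0, 1, 0, 1],   # unrelated pattern
}
markers, score = greedy_marker_set(genes, target)
print(markers, score)
```

+               In this toy data neither geneA nor geneB marks the area alone, but their
+            intersection matches it exactly, which is the situation combinatorial coding
+            is meant to handle.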
+             Apply these algorithms to the cortex
+              1. Create open source format conversion tools:  we will create tools to bulk
+                 download the ABA dataset and to convert between SEV, NIFTI and MAT-
+                 LAB formats.
+              2. Flatmap the ABA cortex data: map the ABA data onto a plane and draw
+                 the cortical area boundaries onto it.
+              3. Find layer boundaries:  cluster similar voxels together in order to auto-
+                 matically find the cortical layer boundaries.
+              4. Run the procedures that we developed on the cortex: we will present, for
+                 each area, a short list of markers to identify that area; and we will also
+                 present lists of &#8220;panels&#8221; of genes that can be used to delineate many areas
+                 at once.
+
+            Develop algorithms to suggest a division of a structure into anatom-
+            ical parts
+              1. Explore dimensionality reduction algorithms applied to pixels:  including
+                 TODO
+              2. Explore dimensionality reduction algorithms applied to genes:  including
+                 TODO
+              3. Explore clustering algorithms applied to pixels: including TODO
+              4. Explore clustering algorithms applied to genes:  including gene shaving,
+                 TODO
+              5. Develop an algorithm to use dimensionality reduction and/or hierarchical
+                 clustering to create anatomical maps
+              6. Run this algorithm on the cortex: present a hierarchical, genoarchitectonic
+                 map of the cortex
+               gradient similarity is calculated as:
+               \[ \sum_{\text{pixels}} \cos\left(\left|\angle\nabla_1 - \angle\nabla_2\right|\right) \cdot \frac{|\nabla_1| + |\nabla_2|}{2} \cdot \frac{\text{pixel\_value}_1 + \text{pixel\_value}_2}{2} \]
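+               A minimal sketch of the gradient similarity score for two 2-D expression
+            images follows. The formula does not say how the gradient is estimated, so
+            the central differences and the restriction to interior pixels are
+            assumptions of this sketch, as are the function names.

```python
import math


def gradient(img, r, c):
    """Central-difference gradient (d/dcolumn, d/drow) at an interior pixel."""
    gx = (img[r][c + 1] - img[r][c - 1]) / 2.0
    gy = (img[r + 1][c] - img[r - 1][c]) / 2.0
    return gx, gy


def gradient_similarity(img1, img2):
    """Sum over interior pixels of
    cos(|angle1 - angle2|) * mean gradient magnitude * mean pixel value."""
    total = 0.0
    for r in range(1, len(img1) - 1):
        for c in range(1, len(img1[0]) - 1):
            gx1, gy1 = gradient(img1, r, c)
            gx2, gy2 = gradient(img2, r, c)
            # atan2(0, 0) is 0.0 in Python, so a flat pixel gets angle 0.
            angle_diff = abs(math.atan2(gy1, gx1) - math.atan2(gy2, gx2))
            mean_mag = (math.hypot(gx1, gy1) + math.hypot(gx2, gy2)) / 2.0
            mean_val = (img1[r][c] + img2[r][c]) / 2.0
            total += math.cos(angle_diff) * mean_mag * mean_val
    return total


# Toy 3x3 images: a left-to-right ramp, and its mirror image.
ramp = [[0.0, 1.0, 2.0]] * 3
mirrored = [[2.0, 1.0, 0.0]] * 3
same = gradient_similarity(ramp, ramp)          # gradients aligned
opposite = gradient_similarity(ramp, mirrored)  # gradients opposed
print(same, opposite)
```

+               Aligned gradients give a positive score and opposed gradients a negative
+            one; pixels where both images are flat contribute nothing because the mean
+            gradient magnitude there is zero.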
+               (todo) Technically, we say that an anatomical structure has a fundamen-
+            tally 2-D organization when there exists a commonly used, generic, anatomical
+            structure-preserving map from 3-D space to a 2-D manifold.
+               Related work:
+               The Allen Brain Institute has developed an interactive web interface called
+            AGEA which allows an investigator (1) to calculate lists of genes which are se-
+            lectively overexpressed in certain anatomical regions (ABA calls this the &#8220;Gene
+            Finder&#8221; function), (2) to visualize the correlation between the genetic profiles of
+            voxels in the dataset, and (3) to visualize a hierarchical clustering of voxels in
+            the dataset [?].  AGEA is an impressive and useful tool; however, it does not
+            solve the same problems that we propose to solve with this project.
+               First we describe AGEA&#8217;s &#8220;Gene Finder&#8221;, and then compare it to our pro-
+            posed method for finding marker genes.  AGEA&#8217;s Gene Finder first asks the
+            investigator to select a single &#8220;seed voxel&#8221; of interest. It then uses a clustering
+            method, combined with built-in knowledge of major anatomical structures, to
+            select two sets of voxels: an &#8220;ROI&#8221; and a &#8220;comparator region&#8221;*. The seed voxel
+            is always contained within the ROI, and the ROI is always contained within the
+            comparator region.  The comparator region is similar but not identical to the
+            set of voxels making up the major anatomical region containing the ROI. Gene
+            Finder then looks for genes which can distinguish the ROI from the comparator
+            region. Specifically, it finds genes for which the ratio (expression energy in the
+            ROI) / (expression energy in the comparator region) is high.
+               Informally, the Gene Finder first infers an ROI based on clustering the seed
+            voxel with other voxels.  Then, the Gene Finder finds genes which overexpress
+            in the ROI as compared to other voxels in the major anatomical region.
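+               The ratio described above can be sketched as follows. This is our own
+            illustration, not AGEA code: whether the energies are averaged or summed
+            over each region is not specified here, so the use of the mean is an
+            assumption, and the gene names and voxel indices are hypothetical.

```python
def mean(values):
    return sum(values) / len(values)


def energy_ratio(expr, roi, comparator):
    """Mean expression energy in the ROI divided by the mean in the comparator.

    expr       : list of non-negative expression energies, one per voxel
    roi        : set of voxel indices forming the ROI
    comparator : set of voxel indices forming the comparator region
                 (a superset of the ROI, as described above)
    """
    roi_mean = mean([expr[i] for i in roi])
    comp_mean = mean([expr[i] for i in comparator])
    if comp_mean == 0:
        return 0.0  # no expression anywhere in the comparator, hence none in the ROI
    return roi_mean / comp_mean


# Toy data: 6 voxels; the ROI is voxels 0-1, the comparator region is voxels 0-3.
roi = {0, 1}
comparator = {0, 1, 2, 3}
genes = {
    "geneX": [4, 4, 0, 0, 9, 9],   # overexpressed in the ROI within the comparator
    "geneY": [1, 1, 1, 1, 0, 0],   # uniform across the comparator
    "geneZ": [0, 0, 2, 2, 5, 5],   # absent from the ROI
}
ratios = {name: energy_ratio(expr, roi, comparator) for name, expr in genes.items()}
ranked = sorted(ratios, key=ratios.get, reverse=True)
print(ranked, ratios)
```

+               Expression outside the comparator region (voxels 4 and 5 here) does not
+            affect the ratio, which is why the choice of comparator region matters.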
+               There are three major differences between our approach and Gene Finder.
+
+               First, Gene Finder focuses on individual genes and individual ROIs in isola-
+            tion. This works well for regions which can be picked out from all other regions
+            by a single gene, but not all regions can be (todo). There are at least two ways
+            this can miss useful genes. First, a gene might be expressed in only part of a
+            region, while another gene is expressed in the rest of the region**. Second, a
+            gene might be expressed in a region and in none of its neighbors, but also in
+            other, non-neighboring regions. To take advantage of these types of genes, we
+            propose to find combinations of genes which, together, can identify the
+            boundaries of all subregions within the containing region.
+               Second, Gene Finder uses a pointwise metric, namely expression energy ratio,
+            to decide whether a gene is good for picking out a region. We have found better
+            results by using metrics which take into account not just single voxels, but also
+            the local geometry of neighboring voxels, such as the local gradient (todo).  In
+            addition, we have found that often the absence of gene expression can be used
+            as a marker, which will not be caught by Gene Finder&#8217;s expression energy ratio
+            (todo).
+               Third, Gene Finder chooses the ROI based only on the seed voxel.  This
+            often does not permit the user to query the ROI that they are interested in. For
+            example, in all of our tests of Gene Finder in cortex, the ROIs chosen tend to
+            be cortical layers, rather than cortical areas.
+               In summary, when Gene Finder picks the ROI that you want, and when this
+            ROI can be easily picked out from neighboring regions by single genes which
+            selectively overexpress in the ROI compared to the entire major anatomical re-
+            gion, Gene Finder will work. However, Gene Finder will not pick cortical areas
+            as ROIs, and even if it could, many cortical areas cannot be uniquely picked out
+            by the overexpression of any single gene.  By contrast, we will target cortical
+            areas, we will explore a variety of metrics which can complement the shortcom-
+            ings of expression energy ratio, and we will use the combinatorial expression of
+            genes to pick out cortical areas even when no individual gene will do.
+               * The terms &#8220;ROI&#8221; and &#8220;comparator region&#8221; are our own; the ABI calls
+            them the &#8220;local region&#8221; and the &#8220;larger anatomical context&#8221;. The ABI uses the
+            term &#8220;specificity comparator&#8221; to mean the major anatomic region containing
+            the ROI, which is not exactly identical to the comparator region.
+               ** In this case, the union of the area of expression of the two genes would
+            suffice; one could also imagine that there could be situations in which the in-
+            tersection of multiple genes would be needed, or a combination of unions and
+            intersections.
+               Now we describe AGEA&#8217;s hierarchical clustering, and compare it to our pro-
+            posal. The goal of AGEA&#8217;s hierarchical clustering is to generate a binary tree of
+            clusters, where a cluster is a collection of voxels.  AGEA begins by computing
+            the Pearson correlation between each pair of voxels. It then employs a recur-
+            sive divisive (top-down) hierarchical clustering procedure on the voxels, which
+            means that it starts with all of the voxels, divides them into clusters, and
+            then divides each cluster into smaller clusters, and so on***.  At each step,
+            the collection of voxels is partitioned into two smaller
+            clusters in a way that maximizes the following quantity:  average correlation
+            between all possible pairs of voxels containing one voxel from each cluster.
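+               The quantity named above can be computed for any candidate split as in
+            the following sketch. It only evaluates a given two-way split; how AGEA
+            searches over splits is not described here and is not reproduced. The
+            function names and toy profiles are our own.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (undefined if either is constant)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def between_cluster_mean_correlation(profiles, cluster_a, cluster_b):
    """Average correlation over all pairs with one voxel taken from each cluster."""
    pairs = [(i, j) for i in cluster_a for j in cluster_b]
    return sum(pearson(profiles[i], profiles[j]) for i, j in pairs) / len(pairs)


# Toy data: 4 voxels, each with a 3-gene expression profile.
profiles = [
    [1.0, 2.0, 3.0],
    [2.0, 4.0, 6.0],
    [3.0, 2.0, 1.0],
    [6.0, 4.0, 2.0],
]
split_1 = between_cluster_mean_correlation(profiles, [0, 1], [2, 3])
split_2 = between_cluster_mean_correlation(profiles, [0, 2], [1, 3])
print(split_1, split_2)
```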
+               There are three major differences between our approach and AGEA&#8217;s hier-
+            archical clustering.  First, AGEA&#8217;s clustering method separates cortical layers
+            before it separates cortical areas.
+               following procedure is used for the purpose of dividing a collection of voxels
+            into smaller clusters: partition the voxels into two sets, such that the following
+            quantity is maximized:
+               *** depending on which level of the tree is being created, the voxels are
+            subsampled in order to save time
+               does not allow the user to input anything other than a seed voxel; this means
+            that for each seed voxel, there is only one
+               The role of the &#8220;local region&#8221; is to serve as a region of interest for which
+            marker genes are desired; the role of the &#8220;larger anatomical context&#8221; is to be
+            the structure
+               There are two kinds of differences between AGEA and our project; differ-
+            ences that relate to the treatment of the cortex, and differences in the type of
+            generalizable methods being developed. As relates
+               indicate an ROI
+               &#8212;
+               The Allen Brain Institute has developed interactive tools called AGEA which
+            allow an investigator to explore simple correlation-based relationships between
+            voxels, genes, and clusters of voxels.  There have not yet been any studies
+            which describe the results of applying AGEA to the cerebral cortex; however,
+            we suspect that the AGEA metrics are not optimal for the task of relating
+            genes to cortical areas.  A voxel&#8217;s gene expression profile depends upon both
+            its cortical area and its cortical layer, but AGEA has no mechanism to
+            distinguish these two.  As a result, voxels in the same layer but different areas
+            are often clustered together by AGEA. As part of the project, we will compare
+            the performance of our techniques against AGEA&#8217;s.
+               Another difference between our techniques and AGEA&#8217;s is that AGEA allows
+            the user to enter only a voxel location, and then to either explore the rest of
+            the brain&#8217;s relationship to that particular voxel, or explore a partitioning of
+            the brain based on pairwise voxel correlation. If the user is interested not in a
+            single voxel, but rather an entire anatomical structure, AGEA will only succeed
+            to the extent that the selected voxel is a typical representative of the structure.
+            As discussed in the previous paragraph, this poses problems for structures like
+            cortical areas, which (because of their division into cortical layers) do not have
+            a single &#8220;typical representative&#8221;.
+               By contrast, in our system, the user will start by selecting, not a single voxel,
+            but rather, an anatomical superstructure to be divided into pieces (for example,
+            the cerebral cortex).  We expect that our methods will take into account not
+            just pairwise statistics between voxels, but also large-scale geometric features
+            (for example, the rapidity of change in gene expression as regional boundaries
+            are crossed) which optimize the discriminability of regions within the selected
+            superstructure.
+               &#8212;&#8211;
+               screen for combinations of marker genes which selectively target anatom-
+            ical structures pick delineate the boundaries between neighboring anatomical
+            structures. (b) techniques to screen for marker genes which pick out anatomical
+            structures of interest
+               ,  techniques  which:  (a)  screen  for  marker  genes  ,  and  (b)  suggest  new
+            anatomical maps based on
+               whose expression partitions the region of interest into its anatomical sub-
+            structures, and (b) use the natural contours of gene expression to suggest new
+            ways of dividing an organ into
+               The Allen Brain Atlas
+               &#8211;
+               to: brooksl@mail.nih.gov
+               Hi, I&#8217;m writing to confirm the applicability of a potential research project to
+            the challenge grant topic &#8220;New computational and statistical methods for the
+            analysis of large data sets from next-generation sequencing technologies&#8221;.
+               We want to develop methods for the analysis of gene expression datasets that
+            can be used to uncover the relationships between gene expression and anatomical
+            regions. Specifically, we want to develop techniques to (a) given a set of known
+            anatomical areas, identify genetic markers for each of these areas, and (b) given
+            an anatomical structure whose substructure is unknown, suggest a map, that
+            is, a division of the space into anatomical sub-structures, that represents the
+            boundaries inherent in the gene expression data.
+               We propose to develop our techniques on the Allen Brain Atlas mouse brain
+            gene expression dataset by finding genetic markers for anatomical areas within
+            the cerebral cortex.  The Allen Brain Atlas contains a registered 3-D map of
+            gene expression data with 200-micron voxel resolution which was created from
+            in situ hybridization data.  The dataset contains about 4000 genes which are
+            available at this resolution across the entire cerebral cortex.
+               Despite the distinct roles of different cortical areas in both normal function-
+            ing and disease processes, there are no known marker genes for many cortical
+            areas. This project will be immediately useful for both drug discovery and clini-
+            cal research because once the markers are known, interventions can be designed
+            which selectively target specific cortical areas.
+               The techniques we develop will be useful because they will be applicable to
+            the analysis of other anatomical areas, both in terms of finding marker genes
+            for known areas, and in terms of suggesting new anatomical subdivisions that
+            are based upon the gene expression data.
+               &#8212;-
+               It is likely that our study, by showing which areal divisions naturally fol-
+            low from gene expression data, as opposed to traditional histological data, will
+            contribute to the creation of
+               there are clear genetic or chemical markers known for only a few cortical
+            areas. This makes it difficult to target drugs to specific
+               As part of aims (1) and (5), we will discover sets of marker genes that pick
+            out specific cortical areas.  This will allow the development of drugs and other
+            interventions that selectively target individual cortical areas.  As part of aims
+            (2) and (5), we will also discover small panels of marker genes that can be used
+            to delineate most of the cortical areal map.
+               With aims (2) and (4), we
+               There are five principals
+               In addition to validating the usefulness of the algorithms, the application of
+            these methods to cerebral cortex will produce immediate benefits that are only
+            one step removed from clinical application.
+               todo: remember to check gensat, etc for validation (mention bias/variance)
+             Why it is useful to apply these methods to cortex
+            There is still room for debate as to exactly how the cortex should be parcellated
+            into areas.
+               The best way to divide up rodent cortex into areas has not been completely
+            determined,
+               not yet been accounted for in
+               that the expression of some genes will contain novel spatial patterns which
+            are not account
+               that a genoarchitectonic map
+               This principle is only applicable to aim 1 (marker genes). For aim 2 (partition
+            a structure into anatomical subregions), we plan to work with many genes at
+            once.
+               todo: aim 2 b+s?
+             Principle 5: Interoperate with existing tools
+            In order for our software to be as useful as possible for our users, it will be
+            able to import data from and export data to standard formats so that users can
+            use our software in tandem with tools created by other teams.  We will
+            support the following formats:  NIFTI (Neuroimaging Informatics Technology
+            Initiative), SEV (Allen Brain Institute Smoothed Energy Volume), and MAT-
+            LAB. This ensures that our users will not have to exclusively rely on our tools
+            when analyzing data. For example, users will be able to use the data visualiza-
+            tion and analysis capabilities of MATLAB and Caret alongside our software.
+
+               To our knowledge, there is no currently available software to convert between
+            these formats, so we will also provide a format conversion tool.  This may be
+            useful even for groups that don&#8217;t use any of our other software.
+               todo: is &#8220;marker gene&#8221; even a phrase that we should use at all?
+               note for aim 1 apps: combo of genes is for voxel, not within any single cell
+               , as when genetic markers allow the development of selective interventions;
+            the reason that one can be confident that the intervention is selective is that it
+            is only turned on when a certain combination of genes is turned on and off. The
+            result procedure is what assures us that when that combination is present, the
+            local tissue is probably part of a certain subregion.
+               The basic idea is that we want to find a procedure by
+               The task of finding genes that mark anatomical areas can be phrased in
+            terms of what the field of machine learning calls a &#8220;supervised learning&#8221; task.
+            The goal of this task is to learn a function (the &#8220;classifier&#8221;) which
+               If a person knows a combination of genes that mark an area, that implies
+            that the person can be told how strongly those genes are expressed in any voxel,
+            and the person can use this information to determine how
+               finding how to infer the areal identity of a voxel if given the gene expression
+            profile of that voxel.
+               For each voxel in the cortex, we want to start with data about the gene
+            expression
+               There are various ways to look for marker genes. We will define some terms,
+            and along the way we will describe a few design choices encountered in the
+            process of creating a marker gene finding method, and then we will present four
+            principles that describe which options we have chosen.
+               In developing a procedure for finding marker genes,  we are developing a
+            procedure that takes a dataset of experimental observations and produces a
+            result. One can think of the result as merely a list of genes, but really the result
+            is an understanding of a predictive relationship between, on the one hand, the
+            expression levels of genes, and, on the other hand, anatomical subregions.
+               One way to more formally define this understanding is to look at it as a
+            procedure. In this view, the result of the learning procedure is itself a procedure.
+            The result procedure provides a way to use the gene expression profiles of voxels
+            in a tissue sample in order to determine where the subregions are.
+               This result procedure can be used directly, as when an experimenter has
+            a tissue sample and needs to know what subregions are present in it,  and,
+            if multiple subregions are present,  where they each are.   Or it can be used
+            indirectly; imagine that the result procedure tells us that whenever a certain
+            combination of genes is expressed, the local tissue is probably part of a certain
+            subregion.  This means that we can then confidently develop an intervention
+            which is triggered only when that combination of genes is expressed; and to
+            the extent that the result procedure is reliable, we know that the intervention
+            will only be triggered in the target subregion.
+               it is helpful for the classifier to look at the global &#8220;shape&#8221; of gene expression
+            patterns over the whole structure, rather than just nearby voxels.
+               there is some small bit of additional information that can be gleaned from
+            knowing the
+             Design choices for a supervised learning procedure
+            After all,
+               there is a small correlation between the gene expression levels from distant
+            voxels and
+               Depending on how we intend to use the classifier, we may want to design it
+            so that
+               It is possible for many things to
+               The choice of which data is made part of an instance
+               what we seek is a procedure
+               partition the tissue sample into subregions.
+               each part of the anatomical structure
+               must be One way to rephrase this task is to say that, instead of searching
+            for the location of the subregions, we are looking to partition the tissue sample
+            into subregions.
+               We said that the result procedure provides &#8220;a way to use the gene expression
+            profiles of voxels in a tissue sample&#8221; in order to &#8220;determine where the subregions
+            are&#8221;.
+               Does the result procedure get as input all of the gene expression profiles
+            of each voxel in the entire tissue sample,  and produce as output all of the
+            subregional boundaries all at once?
+               Or are we given one voxel at a time,
+               In the jargon of the field of machine learning, the result procedure is called
+            a classifier.
+               single voxels, but rather groups of voxels, such that the groups can be placed
+            in some 2-D space. We will call such instances &#8220;pixels&#8221;.
+               We have been speaking as if instances necessarily correspond to single voxels.
+            But it is possible for instances to be groupings of many voxels, in which case
+            each grouping must be assigned the same label (that is, each voxel grouping
+            must stay inside a single anatomical subregion).
+               In some but not all cases, the groups are either rows or columns of voxels.
+            This is the case with the cerebral cortex, in which one may assume that columns
+            of voxels which run perpendicular to the cortical surface all share the same areal
+            identity. In the cortex, we call such an instance a &#8220;surface pixel&#8221;, because such
+            an instance represents the data associated with all voxels underneath a specific
+            patch of the cortical surface.
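+               One way to build surface pixels is sketched below. It assumes the volume
+            has already been resampled so that each column of voxels runs perpendicular
+            to the cortical surface, and it summarizes each column by its mean; both the
+            data layout and the choice of the mean are assumptions of this sketch.

```python
def surface_pixels(volume):
    """Collapse each column of voxels into one surface pixel by averaging over depth.

    volume[r][c] is the list of expression values for the column of voxels
    underneath surface position (r, c), ordered from the surface downward.
    """
    return [[sum(column) / len(column) for column in row] for row in volume]


# Toy volume: a 2x2 patch of cortical surface with 2 voxels of depth per column.
volume = [
    [[1.0, 3.0], [2.0, 2.0]],
    [[0.0, 4.0], [5.0, 7.0]],
]
flat = surface_pixels(volume)
print(flat)
```

+               Because every voxel in a column is assumed to share one areal identity,
+            each surface pixel can carry a single area label.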
+
+