cg
changeset 6:3c874c1cd837
| author | bshanks@bshanks.dyndns.org | 
|---|---|
| date | Sat Apr 11 19:53:38 2009 -0700 | 
| parents | 21ae2b01cc60 | 
| children | 075618f574d8 | 
| files | grant.html grant.odt grant.pdf grant.txt | 
   line diff
     1.1 --- a/grant.html	Sat Apr 11 19:50:21 2009 -0700
     1.2 +++ b/grant.html	Sat Apr 11 19:53:38 2009 -0700
     1.3 @@ -1,5 +1,6 @@
     1.4  Specific aims
     1.5 -            Massive new datasets obtained with techniques such as in situ hybridization
     1.6 +            todo2
     1.7 +               Massive new datasets obtained with techniques such as in situ hybridization
     1.8              (ISH) and BAC-transgenics allow the expression levels of many genes at many
     1.9              locations to be compared. Our goal is to develop automated methods to relate
    1.10              spatial variation in gene expression to anatomy. We want to find marker genes
    1.11 @@ -34,10 +35,10 @@
    1.12              determine to which subregion each voxel within the structure belongs. We call
    1.13              this a classification task, because each voxel is being assigned to a class (namely,
    1.14              its subregion).
    1.15 +                                            1
    1.16 +
    1.17                 Therefore, an understanding of the relationship between the combination of
    1.18              their expression levels and the locations of the subregions may be expressed as
    1.19 -                                            1
    1.20 -
    1.21              a function. The input to this function is a voxel, along with the gene expression
    1.22              levels within that voxel;  the output is the subregional identity of the target
    1.23              voxel, that is, the subregion to which the target voxel belongs.  We call this
    1.24 @@ -79,11 +80,11 @@
    1.25                 Key questions when choosing a learning method are: What are the instances?
    1.26              What are the features?  How are the features chosen?  Here are four principles
    1.27              that outline our answers to these questions.
    1.28 +                                            2
    1.29 +
    1.30               Principle 1: Combinatorial gene expression
    1.31              Above, we defined an “instance” as the combination of a voxel with the “asso-
    1.32              ciated gene expression data”.  In our case this refers to the expression level of
    1.33 -                                            2
    1.34 -
    1.35              genes within the voxel, but should we include the expression levels of all genes,
    1.36              or only a few of them?
    1.37                 It is too much to hope that every anatomical region of interest will be iden-
    1.38 @@ -121,10 +122,10 @@
    1.39              the analysis algorithm to take advantage of this prior knowledge.  In addition,
    1.40              it is easier for humans to visualize and work with 2-D data.
    1.41                 Therefore, when possible, the instances should represent pixels, not voxels.
    1.42 +                                            3
    1.43 +
    1.44               Aim 2
    1.45              todo
    1.46 -                                            3
    1.47 -
    1.48               Aim 3
    1.49               Background
    1.50              The cortex is divided into areas and layers.  To a first approximation, the par-
    1.51 @@ -164,11 +165,11 @@
    1.52              day cortical maps was driven by the application of histological stains.   It is
    1.53              conceivable that if a different set of stains had been available which identified
    1.54              a different set of features, then the today’s cortical maps would have come out
    1.55 +                                            4
    1.56 +
    1.57              differently. Since the number of classes of stains is small compared to the number
    1.58              of genes, it is likely that there are many repeated, salient spatial patterns in
    1.59              the gene expression which have not yet been captured by any stain. Therefore,
    1.60 -                                            4
    1.61 -
    1.62              current ideas about cortical anatomy need to incorporate what we can learn
    1.63              from looking at the patterns of gene expression.
    1.64                 While we do not here propose to analyze human gene expression data, it is
    1.65 @@ -199,9 +200,7 @@
    1.66              expression profiles. We achieved classification accuracy of about 81%3. As noted
    1.67              above, however, a classifier that looks at all the genes at once isn’t practically
    1.68              useful.
    1.69 -               The requirement to find combinations of only a small number of genes limits
    1.70 -            us from straightforwardly applying many of the most simple techniques from
    1.71 -__________________________
    1.72 +_____________________
    1.73     1“WW, C2 and coiled-coil domain containing 1”; EntrezGene ID 211652
    1.74      2“mitochondrial translational initiation factor 2”; EntrezGene ID 76784
    1.75      3Using the Shogun SVM package (todo:cite), with parameters type=GMNPSVM (multi-
    1.76 @@ -222,13 +221,8 @@
    1.77              expression and blue meaning little.
    1.78                                              6
    1.79  
    1.80 -                                                        
    1.81 -                                                        
    1.82 -            Figure 2: The top row shows the three genes which (individually) best predict
    1.83 -            area AUD, according to logistic regression.  The bottom row shows the three
    1.84 -            genes which (individually) best match area AUD, according to gradient similar-
    1.85 -            ity. From left to right and top to bottom, the genes are Ssr1, Efcbp1, Aph1a,
    1.86 -            Ptk7, Aph1a again, and Lepr
    1.87 +               The requirement to find combinations of only a small number of genes limits
    1.88 +            us from straightforwardly applying many of the most simple techniques from
    1.89              the field of supervised machine learning.  In the parlance of machine learning,
    1.90              our task combines feature selection with supervised learning.
    1.91               Principle 3: Use geometry
    1.92 @@ -246,16 +240,6 @@
    1.93              may be particularly good markers.   None of these genes are,  individually,  a
    1.94              perfect marker for AUD; we deliberately chose a “difficult” area in order to
    1.95              better contrast pointwise with geometric methods.
    1.96 -__________________________
    1.97 -   4For each gene, a logistic regression in which the response variable was whether or not a
    1.98 -surface pixel was within area AUD, and the predictor variable was the value of the expression
    1.99 -of the gene underneath that pixel. The resulting scores were used to rank the genes in terms
   1.100 -of how well they predict area AUD.
   1.101 -    5For each gene the gradient similarity (see section ??) between (a) a map of the expression
   1.102 -of each gene on the cortical surface and (b) the shape of area AUD, was calculated, and this
   1.103 -was used to rank the genes.
   1.104 -                                            7
   1.105 -
   1.106               Principle 4: Work in 2-D whenever possible
   1.107              In anatomy, the manifold of interest is usually either defined by a combination
   1.108              of two relevant anatomical axes (todo), or by the surface of the structure (as is
   1.109 @@ -273,6 +257,23 @@
   1.110              the method we develop to include a statistical test that warns the user if the
   1.111              assumption of 2-D structure seems to be wrong.
   1.112                 ——
   1.113 +____________________
   1.114 +   4For each gene, a logistic regression in which the response variable was whether or not a
   1.115 +surface pixel was within area AUD, and the predictor variable was the value of the expression
   1.116 +of the gene underneath that pixel. The resulting scores were used to rank the genes in terms
   1.117 +of how well they predict area AUD.
   1.118 +    5For each gene the gradient similarity (see section ??) between (a) a map of the expression
   1.119 +of each gene on the cortical surface and (b) the shape of area AUD, was calculated, and this
   1.120 +was used to rank the genes.
   1.121 +                                            7
   1.122 +
   1.123 +                                                        
   1.124 +                                                        
   1.125 +            Figure 2: The top row shows the three genes which (individually) best predict
   1.126 +            area AUD, according to logistic regression.  The bottom row shows the three
   1.127 +            genes which (individually) best match area AUD, according to gradient similar-
   1.128 +            ity. From left to right and top to bottom, the genes are Ssr1, Efcbp1, Aph1a,
   1.129 +            Ptk7, Aph1a again, and Lepr
   1.130                 Massive new datasets obtained with techniques such as in situ hybridization
   1.131              (ISH) and BAC-transgenics allow the expression levels of many genes at many
   1.132              locations to be compared.  This can be used to find marker genes for specific
   1.133 @@ -295,10 +296,10 @@
   1.134                   datasets will be made available in both MATLAB and Caret formats.
   1.135               (5) validate the methods developed in (1), (2) and (3) by applying them to
   1.136                   the cerebral cortex datasets created in (4)
   1.137 +                                            8
   1.138 +
   1.139                 All algorithms that we develop will be implemented in an open-source soft-
   1.140              ware toolkit. The toolkit, as well as the machine-readable datasets developed in
   1.141 -                                            8
   1.142 -
   1.143              aim (4) and any other intermediate dataset we produce, will be published and
   1.144              freely available for others to use.
   1.145                 In addition to developing generally useful methods, the application of these
   1.146 @@ -322,29 +323,29 @@
   1.147              combined flat dataset will be created which averages information from all of
   1.148              the layers. These datasets will be made available in both MATLAB and Caret
   1.149              formats.
   1.150 -               —-
   1.151 -               New techniques allow the expression levels of many genes at many locations
   1.152 -            to be compared. It is thought that even neighboring anatomical structures have
   1.153 -            different gene expression profiles.  We propose to develop automated methods
   1.154 -            to relate the spatial variation in gene expression to anatomy.  We will develop
   1.155 -            two kinds of techniques:
   1.156 -             (a) techniques to screen for combinations of marker genes which selectively
   1.157 -                 target anatomical structures
   1.158 -             (b) techniques to suggest new ways of dividing a structure up into anatomical
   1.159 -                 subregions, based on the shapes of contours in the gene expression
   1.160 -               The first kind of technique will be helpful for finding marker genes associated
   1.161 -            with known anatomical features. The second kind of technique will be helpful in
   1.162 -            creating new anatomical maps, maps which reflect differences in gene expression
   1.163 -            the same way that existing maps reflect differences in histology.
   1.164 -               We intend to develop our techniques using the adult mouse cerebral cortex
   1.165 -            as a testbed.   The Allen Brain Atlas has collected a dataset containing the
   1.166 -            expression level of about 4000 genes* over a set of over 150000 voxels, with a
   1.167 -            spatial resolution of approximately 200 microns[?].
   1.168 +___________________________________________________________
   1.169 +    New techniques allow the expression levels of many genes at many locations
   1.170 +to be compared. It is thought that even neighboring anatomical structures have
   1.171 +different gene expression profiles.  We propose to develop automated methods
   1.172 +to relate the spatial variation in gene expression to anatomy.  We will develop
   1.173 +two kinds of techniques:
   1.174 +  (a) techniques to screen for combinations of marker genes which selectively
   1.175 +       target anatomical structures
   1.176 +  (b) techniques to suggest new ways of dividing a structure up into anatomical
   1.177 +       subregions, based on the shapes of contours in the gene expression
   1.178 +    The first kind of technique will be helpful for finding marker genes associated
   1.179 +with known anatomical features. The second kind of technique will be helpful in
   1.180 +creating new anatomical maps, maps which reflect differences in gene expression
   1.181 +the same way that existing maps reflect differences in histology.
   1.182 +    We intend to develop our techniques using the adult mouse cerebral cortex
   1.183 +as a testbed.   The Allen Brain Atlas has collected a dataset containing the
   1.184 +expression level of about 4000 genes* over a set of over 150000 voxels, with a
   1.185 +spatial resolution of approximately 200 microns[?].
   1.186 +                                            9
   1.187 +
   1.188                 We expect to discover sets of marker genes that pick out specific cortical
   1.189              areas.  This will allow the development of drugs and other interventions that
   1.190              selectively target individual cortical areas.   Therefore our research will lead
   1.191 -                                            9
   1.192 -
   1.193              to application in drug discovery, in the development of other targeted clinical
   1.194              interventions, and in the development of new experimental techniques.
   1.195                 The best way to divide up rodent cortex into areas has not been completely
   1.196 @@ -388,11 +389,11 @@
   1.197              in our publications .
   1.198                 We also expect to weigh in on the debate about how to best partition rodent
   1.199              cortex
   1.200 +                                            10
   1.201 +
   1.202                 be useful for drug discovery as well
   1.203                 * Another 16000 genes are available, but they do not cover the entire cerebral
   1.204              cortex with high spatial resolution.
   1.205 -                                            10
   1.206 -
   1.207                 User-definable ROIs Combinatorial gene expression Negative as well as pos-
   1.208              itive signal Use geometry Search for local boundaries if necessary Flatmapped
   1.209               Specific aims
   1.210 @@ -427,10 +428,10 @@
   1.211                   matically find the cortical layer boundaries.
   1.212                4. Run the procedures that we developed on the cortex: we will present, for
   1.213                   each area, a short list of markers to identify that area; and we will also
   1.214 +                                            11
   1.215 +
   1.216                   present lists of “panels” of genes that can be used to delineate many areas
   1.217                   at once.
   1.218 -                                            11
   1.219 -
   1.220              Develop algorithms to suggest a division of a structure into anatom-
   1.221              ical parts
   1.222                1. Explore dimensionality reduction algorithms applied to pixels:  including
   1.223 @@ -471,12 +472,12 @@
   1.224              Finder then looks for genes which can distinguish the ROI from the comparator
   1.225              region. Specifically, it finds genes for which the ratio (expression energy in the
   1.226              ROI) / (expression energy in the comparator region) is high.
   1.227 +                                            12
   1.228 +
   1.229                 Informally, the Gene Finder first infers an ROI based on clustering the seed
   1.230              voxel with other voxels.  Then, the Gene Finder finds genes which overexpress
   1.231              in the ROI as compared to other voxels in the major anatomical region.
   1.232                 There are three major differences between our approach and Gene Finder.
   1.233 -                                            12
   1.234 -
   1.235                 First, Gene Finder focuses on individual genes and individual ROIs in isola-
   1.236              tion. This is great for regions which can be picked out from all other regions by a
   1.237              single gene, but not all of them can (todo). There are at least two ways this can
   1.238 @@ -519,12 +520,12 @@
   1.239              posal. The goal of AGEA’s hierarchial clustering is to generate a binary tree of
   1.240              clusters, where a cluster is a collection of voxels.  AGEA begins by computing
   1.241              the Pearson correlation between each pair of voxels. They then employ a recur-
   1.242 +                                            13
   1.243 +
   1.244              sive divisive (top-down) hierarchial clustering procedure on the voxels, which
   1.245              means that they start with all of the voxels, and then they divide them into clus-
   1.246              ters, and then within each cluster, they divide that cluster into smaller clusters,
   1.247              etc***.  At each step, the collection of voxels is partitioned into two smaller
   1.248 -                                            13
   1.249 -
   1.250              clusters in a way that maximizes the following quantity:  average correlation
   1.251              between all possible pairs of voxels containing one voxel from each cluster.
   1.252                 There are three major differences between our approach and AGEA’s hier-
   1.253 @@ -567,12 +568,12 @@
   1.254              the performance of our techniques against AGEA’s.
   1.255                 Another difference between our techniques and AGEA’s is that AGEA allows
   1.256              the user to enter only a voxel location, and then to either explore the rest of
   1.257 +                                            14
   1.258 +
   1.259              the brain’s relationship to that particular voxel, or explore a partitioning of
   1.260              the brain based on pairwise voxel correlation. If the user is interested not in a
   1.261              single voxel, but rather an entire anatomical structure, AGEA will only succeed
   1.262              to the extent that the selected voxel is a typical representative of the structure.
   1.263 -                                            14
   1.264 -
   1.265              As discussed in the previous paragraph, this poses problems for structures like
   1.266              cortical areas, which (because of their division into cortical layers) do not have
   1.267              a single “typical representative”.
   1.268 @@ -615,55 +616,55 @@
   1.269                 Despite the distinct roles of different cortical areas in both normal function-
   1.270              ing and disease processes, there are no known marker genes for many cortical
   1.271              areas. This project will be immediately useful for both drug discovery and clini-
   1.272 +                                            15
   1.273 +
   1.274              cal research because once the markers are known, interventions can be designed
   1.275              which selectively target specific cortical areas.
   1.276                 This techniques we develop will be useful because they will be applicable to
   1.277              the analysis of other anatomical areas, both in terms of finding marker genes
   1.278 -                                            15
   1.279 -
   1.280              for known areas, and in terms of suggesting new anatomical subdivisions that
   1.281              are based upon the gene expression data.
   1.282 -               —-
   1.283 -               It is likely that our study, by showing which areal divisions naturally fol-
   1.284 -            low from gene expression data, as opposed to traditional histological data, will
   1.285 -            contribute to the creation of
   1.286 -               there are clear genetic or chemical markers known for only a few cortical
   1.287 -            areas. This makes it difficult to target drugs to specific
   1.288 -               As part of aims (1) and (5), we will discover sets of marker genes that pick
   1.289 -            out specific cortical areas.  This will allow the development of drugs and other
   1.290 -            interventions that selectively target individual cortical areas.  As part of aims
   1.291 -            (2) and (5), we will also discover small panels of marker genes that can be used
   1.292 -            to delineate most of the cortical areal map.
   1.293 -               With aims (2) and (4), we
   1.294 -               There are five principals
   1.295 -               In addition to validating the usefulness of the algorithms, the application of
   1.296 -            these methods to cerebral cortex will produce immediate benefits that are only
   1.297 -            one step removed from clinical application.
   1.298 -               todo: remember to check gensat, etc for validation (mention bias/variance)
   1.299 -             Why it is useful to apply these methods to cortex
   1.300 -            There is still room for debate as to exactly how the cortex should be parcellated
   1.301 -            into areas.
   1.302 -               The best way to divide up rodent cortex into areas has not been completely
   1.303 -            determined,
   1.304 -               not yet been accounted for in
   1.305 -               that the expression of some genes will contain novel spatial patterns which
   1.306 -            are not account
   1.307 -               that a genoarchitectonic map
   1.308 -               This principle is only applicable to aim 1 (marker genes). For aim 2 (partition
   1.309 -            a structure in into anatomical subregions), we plan to work with many genes at
   1.310 -            once.
   1.311 -               tood: aim 2 b+s?
   1.312 -             Principle 5: Interoperate with existing tools
   1.313 -            In order for our software to be as useful as possible for our users, it will be
   1.314 -            able to import and export data to standard formats so that users can use our
   1.315 -            software in tandem with other software tools created by other teams.  We will
   1.316 -            support the following formats:  NIFTI (Neuroimaging Informatics Technology
   1.317 +_______________________________
   1.318 +    It is likely that our study, by showing which areal divisions naturally fol-
   1.319 +low from gene expression data, as opposed to traditional histological data, will
   1.320 +contribute to the creation of
   1.321 +    there are clear genetic or chemical markers known for only a few cortical
   1.322 +areas. This makes it difficult to target drugs to specific
   1.323 +    As part of aims (1) and (5), we will discover sets of marker genes that pick
   1.324 +out specific cortical areas.  This will allow the development of drugs and other
   1.325 +interventions that selectively target individual cortical areas.  As part of aims
   1.326 +(2) and (5), we will also discover small panels of marker genes that can be used
   1.327 +to delineate most of the cortical areal map.
   1.328 +    With aims (2) and (4), we
   1.329 +    There are five principals
   1.330 +    In addition to validating the usefulness of the algorithms, the application of
   1.331 +these methods to cerebral cortex will produce immediate benefits that are only
   1.332 +one step removed from clinical application.
   1.333 +    todo: remember to check gensat, etc for validation (mention bias/variance)
   1.334 + Why it is useful to apply these methods to cortex
   1.335 +There is still room for debate as to exactly how the cortex should be parcellated
   1.336 +into areas.
   1.337 +    The best way to divide up rodent cortex into areas has not been completely
   1.338 +determined,
   1.339 +    not yet been accounted for in
   1.340 +    that the expression of some genes will contain novel spatial patterns which
   1.341 +are not account
   1.342 +    that a genoarchitectonic map
   1.343 +    This principle is only applicable to aim 1 (marker genes). For aim 2 (partition
   1.344 +a structure in into anatomical subregions), we plan to work with many genes at
   1.345 +once.
   1.346 +    tood: aim 2 b+s?
   1.347 + Principle 5: Interoperate with existing tools
   1.348 +In order for our software to be as useful as possible for our users, it will be
   1.349 +able to import and export data to standard formats so that users can use our
   1.350 +software in tandem with other software tools created by other teams.  We will
   1.351 +support the following formats:  NIFTI (Neuroimaging Informatics Technology
   1.352 +                                            16
   1.353 +
   1.354              Initiative), SEV (Allen Brain Institute Smoothed Energy Volume), and MAT-
   1.355              LAB. This ensures that our users will not have to exclusively rely on our tools
   1.356              when analyzing data. For example, users will be able to use the data visualiza-
   1.357              tion and analysis capabilities of MATLAB and Caret alongside our software.
   1.358 -                                            16
   1.359 -
   1.360                 To our knowledge, there is no currently available software to convert between
   1.361              these formats, so we will also provide a format conversion tool.  This may be
   1.362              useful even for groups that don’t use any of our other software.
   1.363 @@ -705,13 +706,13 @@
   1.364              combination of genes are expressed, the local tissue is probably part of a certain
   1.365              subregion.  This means that we can then confidentally develop an intervention
   1.366              which is triggered only when that combination of genes are expressed; and to
   1.367 +                                            17
   1.368 +
   1.369              the extent that the result procedure is reliable, we know that the intervention
   1.370              will only be triggered in the target subregion.
   1.371                 We said that the result procedure provides “a way to use the gene expression
   1.372              profiles of voxels in a tissue sample” in order to “determine where the subregions
   1.373              are”.
   1.374 -                                            17
   1.375 -
   1.376                 Does the result procedure get as input all of the gene expression profiles
   1.377              of each voxel in the entire tissue sample,  and produce as output all of the
   1.378              subregional boundaries all at once?
   1.379 @@ -751,12 +752,12 @@
   1.380              if multiple subregions are present,  where they each are.   Or it can be used
   1.381              indirectly; imagine that the result procedure tells us that whenever a certain
   1.382              combination of genes are expressed, the local tissue is probably part of a certain
   1.383 +                                            18
   1.384 +
   1.385              subregion.  This means that we can then confidentally develop an intervention
   1.386              which is triggered only when that combination of genes are expressed; and to
   1.387              the extent that the result procedure is reliable, we know that the intervention
   1.388              will only be triggered in the target subregion.
   1.389 -                                            18
   1.390 -
   1.391                 We said that the result procedure provides “a way to use the gene expression
   1.392              profiles of voxels in a tissue sample” in order to “determine where the subregions
   1.393              are”.
     2.1 Binary file grant.odt has changed
     3.1 Binary file grant.pdf has changed
     4.1 --- a/grant.txt	Sat Apr 11 19:50:21 2009 -0700
     4.2 +++ b/grant.txt	Sat Apr 11 19:53:38 2009 -0700
     4.3 @@ -1,5 +1,7 @@
     4.4  == Specific aims ==
     4.5  
     4.6 +todo2
     4.7 +
     4.8  Massive new datasets obtained with techniques such as in situ hybridization (ISH) and BAC-transgenics allow the expression levels of many genes at many locations to be compared. Our goal is to develop automated methods to relate spatial variation in gene expression to anatomy. We want to find marker genes for specific anatomical regions, and also to draw new anatomical maps based on gene expression patterns. We have three specific aims:
     4.9  
    4.10  (1) develop an algorithm to screen spatial gene expression data for combinations of marker genes which selectively target anatomical regions
