nsf
changeset 26:9d0cc9c66ecd
.
| author | bshanks@bshanks.dyndns.org | 
|---|---|
| date | Mon Apr 13 03:22:01 2009 -0700 |
| parents | 8ff9b7b5c242 | 
| children | 5db0420abbb6 | 
| files | grant.html grant.odt grant.pdf grant.txt | 
   line diff
     1.1 --- a/grant.html	Mon Apr 13 03:21:04 2009 -0700
     1.2 +++ b/grant.html	Mon Apr 13 03:22:01 2009 -0700
     1.3 @@ -22,6 +22,8 @@
     1.4                 All algorithms that we develop will be implemented in an open-source soft-
     1.5              ware toolkit.  The toolkit, as well as the machine-readable datasets developed
     1.6              in aim (3), will be published and freely available for others to use.
     1.7 +                                            1
     1.8 +
     1.9               Background and significance
    1.10               Aim 1
    1.11              Machine learning terminology: supervised learning
    1.12 @@ -35,8 +37,6 @@
    1.13              this a classification task, because each voxel is being assigned to a class (namely,
    1.14              its subregion).
    1.15                 Therefore, an understanding of the relationship between the combination of
    1.16 -                                            1
    1.17 -
    1.18              their expression levels and the locations of the subregions may be expressed as
    1.19              a function. The input to this function is a voxel, along with the gene expression
    1.20              levels within that voxel;  the output is the subregional identity of the target
    1.21 @@ -68,6 +68,8 @@
    1.22              procedures are called “stepwise” or “greedy”.
    1.23                 Although the classifier itself may only look at the gene expression data within
    1.24              each voxel before classifying that voxel, the learning algorithm which constructs
    1.25 +                                            2
    1.26 +
    1.27              the classifier may look over the entire dataset.  We can categorize score-based
    1.28              feature selection methods depending on how the score of calculated.   Often
    1.29              the score calculation consists of assigning a sub-score to each voxel, and then
    1.30 @@ -83,8 +85,6 @@
    1.31                 Above, we defined an “instance” as the combination of a voxel with the
    1.32              “associated gene expression data”. In our case this refers to the expression level
    1.33              of genes within the voxel, but should we include the expression levels of all
    1.34 -                                            2
    1.35 -
    1.36              genes, or only a few of them?
    1.37                 It is too much to hope that every anatomical region of interest will be iden-
    1.38              tified by a single gene. For example, in the cortex, there are some areas which
    1.39 @@ -116,6 +116,8 @@
    1.40              evidence of the complementary nature of pointwise and local scoring methods.
    1.41                 Principle 4: Work in 2-D whenever possible
    1.42                 There are many anatomical structures which are commonly characterized in
    1.43 +                                            3
    1.44 +
    1.45              terms of a two-dimensional manifold. When it is known that the structure that
    1.46              one is looking for is two-dimensional, the results may be improved by allowing
    1.47              the analysis algorithm to take advantage of this prior knowledge.  In addition,
    1.48 @@ -128,8 +130,6 @@
    1.49              of machine learning. One thing that you can do with such a dataset is to group
    1.50              instances together. A set of similar instances is called a cluster, and the activity
    1.51              of finding grouping the data into clusters is called clustering or cluster analysis.
    1.52 -                                            3
    1.53 -
    1.54                 The task of deciding how to carve up a structure into anatomical subregions
    1.55              can be put into these terms.  The instances are once again voxels (or pixels)
    1.56              along with their associated gene expression profiles.  We make the assumption
    1.57 @@ -162,6 +162,8 @@
    1.58              image into clusters, usually contiguous clusters.  Aim 2 is similar to an image
    1.59              segmentation task. There are two main differences; in our task, there are thou-
    1.60              sands of color channels (one for each gene), rather than just three.  There are
    1.61 +                                            4
    1.62 +
    1.63              imaging tasks which use more than three colors, however, for example multispec-
    1.64              tral imaging and hyperspectral imaging, which are often used to process satellite
    1.65              imagery. A more crucial difference is that there are various cues which are ap-
    1.66 @@ -176,8 +178,6 @@
    1.67              algorithms perform better on small numbers of features.  There are techniques
    1.68              which “summarize” a larger number of features using a smaller number of fea-
    1.69              tures; these techniques go by the name of feature extraction or dimensionality
    1.70 -                                            4
    1.71 -
    1.72              reduction.  The small set of features that such a technique yields is called the
    1.73              reduced feature set. After the reduced feature set is created, the instances may
    1.74              be replaced by reduced instances, which have as their features the reduced fea-
    1.75 @@ -208,6 +208,8 @@
    1.76              This is because many genes have an expression pattern which seems to pick
    1.77              out a single, spatially continguous subregion. Therefore, it seems likely that an
    1.78              anatomically interesting subregion will have multiple genes which each individ-
    1.79 +                                            5
    1.80 +
    1.81              ually pick it out1. This suggests the following procedure: cluster together genes
    1.82              which pick out similar subregions, and then to use the more popular common
    1.83              subregions as the final clusters. In the Preliminary Data we show that a num-
    1.84 @@ -216,14 +218,6 @@
    1.85              this fashion.
    1.86               Aim 3
    1.87              Background
    1.88 -_______________
    1.89 -   1This would seem to contradict our finding in aim 1 that some cortical areas are combina-
    1.90 -torially coded by multiple genes.  However, it is possible that the currently accepted cortical
    1.91 -maps divide the cortex into subregions which are unnatural from the point of view of gene
    1.92 -expression; perhaps there is some other way to map the cortex for which each subregion can
    1.93 -be identified by single genes.
    1.94 -                                            5
    1.95 -
    1.96                 The cortex is divided into areas and layers.  To a first approximation, the
    1.97              parcellation of the cortex into areas can be drawn as a 2-D map on the surface of
    1.98              the cortex.  In the third dimension, the boundaries between the areas continue
    1.99 @@ -254,6 +248,14 @@
   1.100              finding markers for each individual cortical areas, we will find a small panel
   1.101              of genes that can find many of the areal boundaries at once.  This panel of
   1.102              marker genes will allow the development of an ISH protocol that will allow
   1.103 +__________________________
   1.104 +   1This would seem to contradict our finding in aim 1 that some cortical areas are combina-
   1.105 +torially coded by multiple genes.  However, it is possible that the currently accepted cortical
   1.106 +maps divide the cortex into subregions which are unnatural from the point of view of gene
   1.107 +expression; perhaps there is some other way to map the cortex for which each subregion can
   1.108 +be identified by single genes.
   1.109 +                                            6
   1.110 +
   1.111              experimenters to more easily identify which anatomical areas are present in
   1.112              small samples of cortex.
   1.113                 The method developed in aim (3) will provide a genoarchitectonic viewpoint
   1.114 @@ -269,8 +271,6 @@
   1.115                 While we do not here propose to analyze human gene expression data, it is
   1.116              conceivable that the methods we propose to develop could be used to suggest
   1.117              modifications to the human cortical map as well.
   1.118 -                                            6
   1.119 -
   1.120               Related work
   1.121              There does not appear to be much work on the automated analysis of spatial
   1.122              gene expression data.
   1.123 @@ -297,23 +297,26 @@
   1.124              yielded impressive results, proving the usefulness of such research. We have run
   1.125              NNMF on the cortical dataset and while the results are promising (see Prelim-
   1.126              inary Data), we think that it will be possible to find a better method2 (we also
   1.127 +__________________________
   1.128 +   2We ran “vanilla” NNMF, whereas the paper under discussion used a modified method.
   1.129 +Their main modification consisted of adding a soft spatial contiguity constraint.  However,
   1.130 +on our dataset,  NNMF naturally produced spatially contiguous clusters,  so no additional
   1.131 +                                            7
   1.132 +
   1.133              think that more automation of the parts that this paper’s authors did manually
   1.134              will be possible).
   1.135                 and [?] describes AGEA. todo
   1.136 +__________________________
   1.137 +constraint was needed. The paper under discussion mentions that they also tried a hierarchial
   1.138 +variant of NNMF, but since they didn’t report its results, we assume that those result were
   1.139 +not any more impressive than the results of the non-hierarchial variant.
   1.140 +                                            8
   1.141 +
   1.142               Preliminary work
   1.143               Format conversion between SEV, MATLAB, NIFTI
   1.144              todo
   1.145               Flatmap of cortex
   1.146              todo
   1.147 -_______________________
   1.148 -   2We ran “vanilla” NNMF, whereas the paper under discussion used a modified method.
   1.149 -Their main modification consisted of adding a soft spatial contiguity constraint.  However,
   1.150 -on our dataset,  NNMF naturally produced spatially contiguous clusters,  so no additional
   1.151 -constraint was needed. The paper under discussion mentions that they also tried a hierarchial
   1.152 -variant of NNMF, but since they didn’t report its results, we assume that those result were
   1.153 -not any more impressive than the results of the non-hierarchial variant.
   1.154 -                                            7
   1.155 -
   1.156                 Using combinations of multiple genes is necessary and sufficient to
   1.157              delineate some cortical areas
   1.158                 Here we give an example of a cortical area which is not marked by any
   1.159 @@ -343,15 +346,7 @@
   1.160              genes which express more strongly in AUD than outside of it; its weakness is that
   1.161              this includes many areas which don’t have a salient border matching the areal
   1.162              border. The geometric method identifies genes whose salient expression border
   1.163 -            seems to partially line up with the border of AUD; its weakness is that this
   1.164 -            includes genes which don’t express over the entire area. Genes which have high
   1.165 -            rankings using both pointwise and border criteria, such as Aph1a in the example,
   1.166 -            may be particularly good markers.   None of these genes are,  individually,  a
   1.167 -            perfect marker for AUD; we deliberately chose a “difficult” area in order to
   1.168 -            better contrast pointwise with geometric methods.
   1.169 -               Areas which can be identified by single genes
   1.170 -               todo
   1.171 -____________________
   1.172 +__________________________
   1.173     3“WW, C2 and coiled-coil domain containing 1”; EntrezGene ID 211652
   1.174      4“mitochondrial translational initiation factor 2”; EntrezGene ID 76784
   1.175      5For each gene, a logistic regression in which the response variable was whether or not a
   1.176 @@ -361,7 +356,7 @@
   1.177      6For each gene the gradient similarity (see section ??) between (a) a map of the expression
   1.178  of each gene on the cortical surface and (b) the shape of area AUD, was calculated, and this
   1.179  was used to rank the genes.
   1.180 -                                            8
   1.181 +                                            9
   1.182  
   1.183                                          
   1.184              
   1.185 @@ -373,6 +368,8 @@
   1.186              the boundary of region MO. Pixels are colored approximately according to the
   1.187              density of expressing cells underneath each pixel, with red meaning a lot of
   1.188              expression and blue meaning little.
   1.189 +                                            10
   1.190 +
   1.191                                                          
   1.192                                                          
   1.193              Figure 2: The top row shows the three genes which (individually) best predict
   1.194 @@ -380,8 +377,14 @@
   1.195              genes which (individually) best match area AUD, according to gradient similar-
   1.196              ity. From left to right and top to bottom, the genes are Ssr1, Efcbp1, Aph1a,
   1.197              Ptk7, Aph1a again, and Lepr
   1.198 -                                            9
   1.199 -
   1.200 +            seems to partially line up with the border of AUD; its weakness is that this
   1.201 +            includes genes which don’t express over the entire area. Genes which have high
   1.202 +            rankings using both pointwise and border criteria, such as Aph1a in the example,
   1.203 +            may be particularly good markers.   None of these genes are,  individually,  a
   1.204 +            perfect marker for AUD; we deliberately chose a “difficult” area in order to
   1.205 +            better contrast pointwise with geometric methods.
   1.206 +               Areas which can be identified by single genes
   1.207 +               todo
   1.208               Specific to Aim 1 (and Aim 3)
   1.209              Forward stepwise logistic regression todo
   1.210                 SVM on all genes at once
   1.211 @@ -396,6 +399,10 @@
   1.212              our task combines feature selection with supervised learning.
   1.213                 Decision trees
   1.214                 todo
   1.215 +____________________
   1.216 +   75-fold cross-validation.
   1.217 +                                            11
   1.218 +
   1.219               Specific to Aim 2 (and Aim 3)
   1.220              Raw dimensionality reduction results
   1.221                 todo
   1.222 @@ -404,6 +411,8 @@
   1.223                 Many areas are captured by clusters of genes
   1.224                 todo
   1.225                 todo
   1.226 +                                            12
   1.227 +
   1.228               Research plan
   1.229              todo amongst other things:
   1.230                 Develop algorithms that find genetic markers for anatomical re-
   1.231 @@ -419,10 +428,6 @@
   1.232                   with a handful of genes. We will consider both (a) algorithms that incre-
   1.233                   mentally/greedily combine single gene markers into sets, such as forward
   1.234                   stepwise regression and decision trees, and also (b) supervised learning
   1.235 -__________________________
   1.236 -   75-fold cross-validation.
   1.237 -                                            10
   1.238 -
   1.239                   techniques which use soft constraints to minimize the number of features,
   1.240                   such as sparse support vector machines.
   1.241                4. Extend the procedure to handle difficult areas by combining or redrawing
   1.242 @@ -446,6 +451,8 @@
   1.243                   at once.
   1.244                 Develop algorithms to suggest a division of a structure into anatom-
   1.245              ical parts
   1.246 +                                            13
   1.247 +
   1.248                1. Explore dimensionality reduction algorithms applied to pixels:  including
   1.249                   TODO
   1.250                2. Explore dimensionality reduction algorithms applied to genes:  including
   1.251 @@ -457,9 +464,8 @@
   1.252                   clustering to create anatomical maps
   1.253                6. Run this algorithm on the cortex: present a hierarchial, genoarchitectonic
   1.254                   map of the cortex
   1.255 -                                            11
   1.256 -
   1.257 -            _______________________________________________________________________________________________________ stuff i dunno where to put yet (there is more scattered through grant-
   1.258 +______________________________________________
   1.259 +    stuff  i  dunno  where  to  put  yet  (there  is  more  scattered  through  grant-
   1.260  oldtext):
   1.261      Principle 4: Work in 2-D whenever possible
   1.262      In anatomy, the manifold of interest is usually either defined by a combina-
   1.263 @@ -484,6 +490,6 @@
   1.264  app2 has examples of genetic targeting to specific anatomical regions
   1.265      —
   1.266      note:
   1.267 -                                            12
   1.268 -
   1.269 -
   1.270 +                                            14
   1.271 +
   1.272 +
     2.1 Binary file grant.odt has changed
     3.1 Binary file grant.pdf has changed
     4.1 --- a/grant.txt	Mon Apr 13 03:21:04 2009 -0700
     4.2 +++ b/grant.txt	Mon Apr 13 03:22:01 2009 -0700
     4.3 @@ -13,6 +13,7 @@
     4.4  All algorithms that we develop will be implemented in an open-source software toolkit. The toolkit, as well as the machine-readable datasets developed in aim (3), will be published and freely available for others to use. 
     4.5  
     4.6  
     4.7 +\newpage
     4.8  
     4.9  == Background and significance ==
    4.10  
    4.11 @@ -151,6 +152,8 @@
    4.12  
    4.13  
    4.14  
    4.15 +\newpage
    4.16 +
    4.17  == Preliminary work ==
    4.18  
    4.19  === Format conversion between SEV, MATLAB, NIFTI ===
    4.20 @@ -254,6 +257,9 @@
    4.21  
    4.22  todo
    4.23  
    4.24 +
    4.25 +
    4.26 +\newpage
    4.27  == Research plan ==
    4.28  
    4.29  todo amongst other things:
