nsf
changeset 18:5d6dfc57654a
| author | bshanks@bshanks.dyndns.org | 
|---|---|
| date | Sun Apr 12 15:34:12 2009 -0700 | 
| parents | ff9b47f2c7d3 | 
| children | 717d4025b861 | 
| files | grant.html grant.odt grant.pdf grant.txt | 
   line diff
     1.1 --- a/grant.html	Sun Apr 12 04:01:58 2009 -0700
     1.2 +++ b/grant.html	Sun Apr 12 15:34:12 2009 -0700
     1.3 @@ -145,7 +145,10 @@
     1.4              outcome of clustering may be a hierarchial tree of clusters, rather than a single
     1.5              set of clusters which partition the voxels. This is called hierarchial clustering.
     1.6                 Similarity scores
     1.7 -               todo
     1.8 +               A crucial choice when designing a clustering method is how to measure
     1.9 +            similarity, across either pairs of instances, or clusters, or both.  There is much
    1.10 +            overlap between scoring methods for feature selection (discussed above under
    1.11 +            Aim 1) and scoring methods for similarity.
    1.12                 Spatially contiguous clusters; image segmentation
    1.13                 We have shown that aim 2 is a type of clustering task.   In fact,  it is a
    1.14              special type of clustering task because we have an additional constraint on
    1.15 @@ -173,11 +176,11 @@
    1.16              algorithms perform better on small numbers of features.  There are techniques
    1.17              which “summarize” a larger number of features using a smaller number of fea-
    1.18              tures; these techniques go by the name of feature extraction or dimensionality
    1.19 +                                            4
    1.20 +
    1.21              reduction.  The small set of features that such a technique yields is called the
    1.22              reduced feature set. After the reduced feature set is created, the instances may
    1.23              be replaced by reduced instances, which have as their features the reduced fea-
    1.24 -                                            4
    1.25 -
    1.26              ture set rather than the original feature set of all gene expression levels.  Note
    1.27              that the features in the reduced feature set do not necessarily correspond to
    1.28              genes; each feature in the reduced set may be any function of the set of gene
    1.29 @@ -213,11 +216,7 @@
    1.30              this fashion.
    1.31               Aim 3
    1.32              Background
    1.33 -               The cortex is divided into areas and layers.  To a first approximation, the
    1.34 -            parcellation of the cortex into areas can be drawn as a 2-D map on the surface of
    1.35 -            the cortex.  In the third dimension, the boundaries between the areas continue
    1.36 -            downwards into the cortical depth,  perpendicular to the surface.   The layer
    1.37 -__________________________
    1.38 +_______________
    1.39     1This would seem to contradict our finding in aim 1 that some cortical areas are combina-
    1.40  torially coded by multiple genes.  However, it is possible that the currently accepted cortical
    1.41  maps divide the cortex into subregions which are unnatural from the point of view of gene
    1.42 @@ -225,6 +224,10 @@
    1.43  be identified by single genes.
    1.44                                              5
    1.45  
    1.46 +               The cortex is divided into areas and layers.  To a first approximation, the
    1.47 +            parcellation of the cortex into areas can be drawn as a 2-D map on the surface of
    1.48 +            the cortex.  In the third dimension, the boundaries between the areas continue
    1.49 +            downwards into the cortical depth,  perpendicular to the surface.   The layer
    1.50              boundaries run parallel to the surface. One can picture an area of the cortex as
    1.51              a slice of many-layered cake.
    1.52                 Although it is known that different cortical areas have distinct roles in both
    1.53 @@ -266,14 +269,50 @@
    1.54                 While we do not here propose to analyze human gene expression data, it is
    1.55              conceivable that the methods we propose to develop could be used to suggest
    1.56              modifications to the human cortical map as well.
    1.57 +                                            6
    1.58 +
    1.59               Related work
    1.60 -            todo
    1.61 +            There does not appear to be much work on the automated analysis of spatial
    1.62 +            gene expression data.
    1.63 +               There is a substantial body of work on the analysis of gene expression data,
    1.64 +            however, most of this concerns gene expression data which is not fundamentally
    1.65 +            spatial, for example, microarray datasets.  In some cases, a few locations have
    1.66 +            been sampled, but such a dataset is still of a fundamentally different character
    1.67 +            than a dataset containing a large grid of sampling points distributed over space.
    1.68 +            In relating gene expression to anatomy, it is the spatial aspects of the problem
    1.69 +            which are the most important.
    1.70 +               As noted above, there has been much work on both supervised learning and
    1.71 +            clustering, and there are many available algorithms for each.  Many of these
    1.72 +            algorithms are flexible enough to accommodate new scoring measures; and the
    1.73 +            performance of most of the algorithms is greatly affected by preprocessing and
    1.74 +            by the choice of which representation to use for feature values. We think it likely
    1.75 +            that for this application, the development of domain-specific scoring measures
    1.76 +            (such as gradient similarity, which is discussed in Preliminary Work) will be
    1.77 +            necessary in order to achieve the best results. In essence, the machine learning
    1.78 +            community has provided algorithms, but the scientist must provide a framework
    1.79 +            for representing the problem domain, and the way that this framework is set
    1.80 +            up has a large impact on performance. Creating a good framework can require
    1.81 +            creatively reconceptualizing the problem domain, and is not merely a mechanical
    1.82 +            “fine-tuning” of numerical parameters.  Therefore, the completion of Aims 1
    1.83 +            and 2 involves more than just reimplementing an existing algorithm, and more
    1.84 +            than just choosing between a set of existing algorithms, and will constitute a
    1.85 +            substantial contribution to biology.
    1.86 +               We are aware of one other effort to computationally analyze spatial gene
    1.87 +            expression data.
    1.88 +               In the Preliminary Work, we show that
    1.89 +               The creation of a domain-specific scoring measure may be required in order
    1.90 +            to achieve good performance, and it is not impossible that the algorithms them-
    1.91 +            selves will have to be extended.  We plan to test out existing algorithms and
    1.92 +            scoring measures,
    1.93 +               Therefore, we anticipate
    1.94 +               Therefore, it is unclear which of the
    1.95 +               todo
    1.96                 vs. AGEA – i wrote something on this but i’m going to rewrite it
    1.97 -                                            6
    1.98 -
    1.99               Preliminary work
   1.100               Format conversion between SEV, MATLAB, NIFTI
   1.101              todo
   1.102 +                                            7
   1.103 +
   1.104               Flatmap of cortex
   1.105              todo
   1.106                 Using combinations of multiple genes is necessary and sufficient to
   1.107 @@ -305,6 +344,12 @@
   1.108              genes which express more strongly in AUD than outside of it; its weakness is that
   1.109              this includes many areas which don’t have a salient border matching the areal
   1.110              border. The geometric method identifies genes whose salient expression border
   1.111 +            seems to partially line up with the border of AUD; its weakness is that this
   1.112 +            includes genes which don’t express over the entire area. Genes which have high
   1.113 +            rankings using both pointwise and border criteria, such as Aph1a in the example,
   1.114 +            may be particularly good markers.   None of these genes are,  individually,  a
   1.115 +            perfect marker for AUD; we deliberately chose a “difficult” area in order to
   1.116 +            better contrast pointwise with geometric methods.
   1.117  __________________________
   1.118     2“WW, C2 and coiled-coil domain containing 1”; EntrezGene ID 211652
   1.119      3“mitochondrial translational initiation factor 2”; EntrezGene ID 76784
   1.120 @@ -315,7 +360,7 @@
   1.121      5For each gene the gradient similarity (see section ??) between (a) a map of the expression
   1.122  of each gene on the cortical surface and (b) the shape of area AUD, was calculated, and this
   1.123  was used to rank the genes.
   1.124 -                                            7
   1.125 +                                            8
   1.126  
   1.127                                          
   1.128              
   1.129 @@ -327,8 +372,6 @@
   1.130              the boundary of region MO. Pixels are colored approximately according to the
   1.131              density of expressing cells underneath each pixel, with red meaning a lot of
   1.132              expression and blue meaning little.
   1.133 -                                            8
   1.134 -
   1.135                                                          
   1.136                                                          
   1.137              Figure 2: The top row shows the three genes which (individually) best predict
   1.138 @@ -336,15 +379,11 @@
   1.139              genes which (individually) best match area AUD, according to gradient similar-
   1.140              ity. From left to right and top to bottom, the genes are Ssr1, Efcbp1, Aph1a,
   1.141              Ptk7, Aph1a again, and Lepr
   1.142 -            seems to partially line up with the border of AUD; its weakness is that this
   1.143 -            includes genes which don’t express over the entire area. Genes which have high
   1.144 -            rankings using both pointwise and border criteria, such as Aph1a in the example,
   1.145 -            may be particularly good markers.   None of these genes are,  individually,  a
   1.146 -            perfect marker for AUD; we deliberately chose a “difficult” area in order to
   1.147 -            better contrast pointwise with geometric methods.
   1.148 +                                            9
   1.149 +
   1.150                 Areas which can be identified by single genes
   1.151                 todo
   1.152 -             Aim 1 (and Aim 3)
   1.153 +             Specific to Aim 1 (and Aim 3)
   1.154              Forward stepwise logistic regression todo
   1.155                 SVM on all genes at once
   1.156                 In order to see how well one can do when looking at all genes at once, we
   1.157 @@ -354,27 +393,18 @@
   1.158              practically useful.
   1.159                 The requirement to find combinations of only a small number of genes limits
   1.160              us from straightforwardly applying many of the most simple techniques from
   1.161 -__________________________
   1.162 -   6Using the Shogun SVM package (todo:cite), with parameters type=GMNPSVM (multi-
   1.163 -class b-SVM), kernal = gaussian with sigma = 0.1, c = 10, epsilon = 1e-1 – these are the
   1.164 -first parameters we tried, so presumably performance would improve with different choices of
   1.165 -parameters. 5-fold cross-validation.
   1.166 -                                            9
   1.167 -
   1.168              the field of supervised machine learning.  In the parlance of machine learning,
   1.169              our task combines feature selection with supervised learning.
   1.170                 Decision trees
   1.171                 todo
   1.172 -             Aim 2 (and Aim 3)
   1.173 -             Raw dimensionality reduction results
   1.174 -             Dimensionality reduction plus K-means or spectral clus-
   1.175 -            tering
   1.176 -            Many areas are captured by clusters of genes
   1.177 +             Specific to Aim 2 (and Aim 3)
   1.178 +            Raw dimensionality reduction results
   1.179 +               Dimensionality reduction plus K-means or spectral clustering
   1.180 +               Many areas are captured by clusters of genes
   1.181                 todo
   1.182                 todo
   1.183               Research plan
   1.184 -            todo
   1.185 -               amongst other thigns:
   1.186 +            todo amongst other things:
   1.187                 Develop algorithms that find genetic markers for anatomical re-
   1.188              gions
   1.189                1. Develop scoring measures for evaluating how good individual genes are at
   1.190 @@ -387,6 +417,10 @@
   1.191                   ing: for areas that cannot be identified by any single gene, identify them
   1.192                   with a handful of genes. We will consider both (a) algorithms that incre-
   1.193                   mentally/greedily combine single gene markers into sets, such as forward
   1.194 +__________________________
   1.195 +   65-fold cross-validation.
   1.196 +                                            10
   1.197 +
   1.198                   stepwise regression and decision trees, and also (b) supervised learning
   1.199                   techniques which use soft constraints to minimize the number of features,
   1.200                   such as sparse support vector machines.
   1.201 @@ -397,8 +431,6 @@
   1.202                   which (a) detect when a difficult area could be fit if its boundary were
   1.203                   redrawn slightly, and (b) detect when a difficult area could be combined
   1.204                   with adjacent areas to create a larger area which can be fit.
   1.205 -                                            10
   1.206 -
   1.207                 Apply these algorithms to the cortex
   1.208                1. Create open source format conversion tools:  we will create tools to bulk
   1.209                   download the ABA dataset and to convert between SEV, NIFTI and MAT-
   1.210 @@ -424,8 +456,9 @@
   1.211                   clustering to create anatomical maps
   1.212                6. Run this algorithm on the cortex: present a hierarchial, genoarchitectonic
   1.213                   map of the cortex
   1.214 -______________________________________________
   1.215 -    stuff  i  dunno  where  to  put  yet  (there  is  more  scattered  through  grant-
   1.216 +                                            11
   1.217 +
   1.218 +            _______________________________________________________________________________________________________ stuff i dunno where to put yet (there is more scattered through grant-
   1.219  oldtext):
   1.220      Principle 4: Work in 2-D whenever possible
   1.221      In anatomy, the manifold of interest is usually either defined by a combina-
   1.222 @@ -436,16 +469,14 @@
   1.223      The method that we will develop will begin by mapping the data into a
   1.224  2-D plane.  Although the manifold that characterized cortical areas is known
   1.225  to be the cortical surface, it remains to be seen which method of mapping the
   1.226 -                                            11
   1.227 -
   1.228 -            manifold into a plane is optimal for this application. We will compare mappings
   1.229 -            which attempt to preserve size (such as the one used by Caret??) with mappings
   1.230 -            which preserve angle (conformal maps).
   1.231 -               Although there is much 2-D organization in anatomy, there are also struc-
   1.232 -            tures whose shape is fundamentally 3-dimensional.  If possible, we would like
   1.233 -            the method we develop to include a statistical test that warns the user if the
   1.234 -            assumption of 2-D structure seems to be wrong.
   1.235 -               todo: replace aim # bullet pts with #s
   1.236 +manifold into a plane is optimal for this application. We will compare mappings
   1.237 +which attempt to preserve size (such as the one used by Caret??) with mappings
   1.238 +which preserve angle (conformal maps).
   1.239 +    Although there is much 2-D organization in anatomy, there are also struc-
   1.240 +tures whose shape is fundamentally 3-dimensional.  If possible, we would like
   1.241 +the method we develop to include a statistical test that warns the user if the
   1.242 +assumption of 2-D structure seems to be wrong.
   1.243 +    todo: replace aim # bullet pts with #s
   1.244                                              12
   1.245  
   1.246  
     2.1 Binary file grant.odt has changed
     3.1 Binary file grant.pdf has changed
     4.1 --- a/grant.txt	Sun Apr 12 04:01:58 2009 -0700
     4.2 +++ b/grant.txt	Sun Apr 12 15:34:12 2009 -0700
     4.3 @@ -79,8 +79,7 @@
     4.4  
     4.5  **Similarity scores**
     4.6  
     4.7 -
     4.8 -todo
     4.9 +A crucial choice when designing a clustering method is how to measure similarity, across either pairs of instances, or clusters, or both. There is much overlap between scoring methods for feature selection (discussed above under Aim 1) and scoring methods for similarity. 
    4.10  
    4.11  
    4.12  **Spatially contiguous clusters; image segmentation**
    4.13 @@ -137,6 +136,23 @@
    4.14  
    4.15  
    4.16  === Related work ===
    4.17 +There does not appear to be much work on the automated analysis of spatial gene expression data. 
    4.18 +
    4.19 +There is a substantial body of work on the analysis of gene expression data, however, most of this concerns gene expression data which is not fundamentally spatial, for example, microarray datasets. In some cases, a few locations have been sampled, but such a dataset is still of a fundamentally different character than a dataset containing a large grid of sampling points distributed over space. In relating gene expression to anatomy, it is the spatial aspects of the problem which are the most important.
    4.20 +
    4.21 +As noted above, there has been much work on both supervised learning and clustering, and there are many available algorithms for each. Many of these algorithms are flexible enough to accommodate new scoring measures; and the performance of most of the algorithms is greatly affected by preprocessing and by the choice of which representation to use for feature values. We think it likely that for this application, the development of domain-specific scoring measures (such as gradient similarity, which is discussed in Preliminary Work) will be necessary in order to achieve the best results. In essence, the machine learning community has provided algorithms, but the scientist must provide a framework for representing the problem domain, and the way that this framework is set up has a large impact on performance. Creating a good framework can require creatively reconceptualizing the problem domain, and is not merely a mechanical "fine-tuning" of numerical parameters. Therefore, the completion of Aims 1 and 2 involves more than just reimplementing an existing algorithm, and more than just choosing between a set of existing algorithms, and will constitute a substantial contribution to biology.
    4.22 +
    4.23 +We are aware of one other effort to computationally analyze spatial gene expression data. 
    4.24 +
    4.25 +
    4.26 +In the Preliminary Work, we show that 
    4.27 +
    4.28 +The creation of a domain-specific scoring measure may be required in order to achieve good performance, and it is not impossible that the algorithms themselves will have to be extended. We plan to test out existing algorithms and scoring measures, 
    4.29 +
    4.30 +Therefore, we anticipate 
    4.31 +
    4.32 +Therefore, it is unclear which of the 
    4.33 +
    4.34  todo
    4.35  
    4.36  vs. AGEA -- i wrote something on this but i'm going to rewrite it
    4.37 @@ -199,14 +215,14 @@
    4.38  todo
    4.39  
    4.40  
    4.41 -=== Aim 1 (and Aim 3) ===
    4.42 +=== Specific to Aim 1 (and Aim 3) ===
    4.43  **Forward stepwise logistic regression**
    4.44  todo
    4.45  
    4.46  
    4.47  **SVM on all genes at once**
    4.48  
    4.49 -In order to see how well one can do when looking at all genes at once, we ran a support vector machine to classify cortical surface pixels based on their gene expression profiles. We achieved classification accuracy of about 81%\footnote{Using the Shogun SVM package (todo:cite), with parameters type=GMNPSVM (multiclass b-SVM), kernal = gaussian with sigma = 0.1, c = 10, epsilon = 1e-1 -- these are the first parameters we tried, so presumably performance would improve with different choices of parameters. 5-fold cross-validation.}. As noted above, however, a classifier that looks at all the genes at once isn't practically useful. 
    4.50 +In order to see how well one can do when looking at all genes at once, we ran a support vector machine to classify cortical surface pixels based on their gene expression profiles. We achieved classification accuracy of about 81%\footnote{5-fold cross-validation.}. As noted above, however, a classifier that looks at all the genes at once isn't practically useful. 
    4.51  
    4.52  The requirement to find combinations of only a small number of genes limits us from straightforwardly applying many of the most simple techniques from the field of supervised machine learning. In the parlance of machine learning, our task combines feature selection with supervised learning.
    4.53  
    4.54 @@ -217,12 +233,12 @@
    4.55  todo
    4.56  
    4.57  
    4.58 -=== Aim 2 (and Aim 3) ===
    4.59 -
    4.60 -=== Raw dimensionality reduction results ===
    4.61 -
    4.62 -
    4.63 -=== Dimensionality reduction plus K-means or spectral clustering ===
    4.64 +=== Specific to Aim 2 (and Aim 3) ===
    4.65 +
    4.66 +**Raw dimensionality reduction results**
    4.67 +
    4.68 +
    4.69 +**Dimensionality reduction plus K-means or spectral clustering**
    4.70  
    4.71  
    4.72  
    4.73 @@ -244,9 +260,7 @@
    4.74  
    4.75  == Research plan ==
    4.76  
    4.77 -todo
    4.78 -
    4.79 -amongst other thigns:
    4.80 +todo amongst other things:
    4.81  
    4.82  
    4.83  **Develop algorithms that find genetic markers for anatomical regions**
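
Editorial illustration (not part of the changeset): the research plan in this diff names forward stepwise regression as one greedy way to combine single-gene markers into small sets. The sketch below shows only the greedy selection loop, under several assumptions that do not come from the grant text: a nearest-centroid classifier stands in for logistic regression, the score is plain training accuracy rather than cross-validated accuracy, and the three "genes" and three "areas" are made-up toy values. The names `nearest_centroid_accuracy` and `forward_stepwise` are hypothetical.

```python
from statistics import mean


def nearest_centroid_accuracy(rows, labels, features):
    """Training accuracy of a nearest-centroid classifier restricted to `features`.

    Stand-in scorer (assumption): the grant mentions logistic regression, which
    is not implemented here.
    """
    classes = sorted(set(labels))
    centroids = {
        c: [mean(r[f] for r, y in zip(rows, labels) if y == c) for f in features]
        for c in classes
    }
    correct = 0
    for r, y in zip(rows, labels):
        point = [r[f] for f in features]
        # Ties go to the first class in sorted order.
        predicted = min(
            classes,
            key=lambda c: sum((p - q) ** 2 for p, q in zip(point, centroids[c])),
        )
        correct += predicted == y
    return correct / len(rows)


def forward_stepwise(rows, labels, max_features):
    """Greedily add the feature that most improves the score; stop when none does."""
    selected = []
    best_score = 0.0
    candidates = list(range(len(rows[0])))
    while candidates and len(selected) < max_features:
        scored = [
            (nearest_centroid_accuracy(rows, labels, selected + [f]), f)
            for f in candidates
        ]
        top_score = max(s for s, _ in scored)
        if top_score <= best_score:
            break  # no remaining feature improves the fit
        top_feature = next(f for s, f in scored if s == top_score)  # lowest index wins ties
        selected.append(top_feature)
        candidates.remove(top_feature)
        best_score = top_score
    return selected, best_score


# Hypothetical toy data: each row is one pixel, each column one gene.
# Area A expresses gene 0, area B expresses gene 1, area C expresses neither;
# gene 2 carries no information about the area.
rows = [
    [1.0, 0.0, 0.5], [0.9, 0.1, 0.4],  # area A
    [0.0, 1.0, 0.5], [0.1, 0.9, 0.4],  # area B
    [0.0, 0.0, 0.5], [0.1, 0.1, 0.4],  # area C
]
labels = ["A", "A", "B", "B", "C", "C"]

selected, score = forward_stepwise(rows, labels, max_features=3)
print(selected, score)  # prints: [0, 1] 1.0
```

In this toy case no single gene separates all three areas (the best single-gene accuracy is 4/6), while the pair of genes 0 and 1 does, and the uninformative gene 2 is never added. A real analysis would presumably replace the scorer with a cross-validated one.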
