cg
changeset 26:9d0cc9c66ecd
.
author | bshanks@bshanks.dyndns.org |
---|---|
date | Mon Apr 13 03:22:01 2009 -0700 |
parents | 8ff9b7b5c242 |
children | 5db0420abbb6 |
files | grant.html grant.odt grant.pdf grant.txt |
1.1 --- a/grant.html Mon Apr 13 03:21:04 2009 -0700
1.2 +++ b/grant.html Mon Apr 13 03:22:01 2009 -0700
1.3 @@ -22,6 +22,8 @@
1.4 All algorithms that we develop will be implemented in an open-source soft-
1.5 ware toolkit. The toolkit, as well as the machine-readable datasets developed
1.6 in aim (3), will be published and freely available for others to use.
1.7 + 1
1.8 +
1.9 Background and significance
1.10 Aim 1
1.11 Machine learning terminology: supervised learning
1.12 @@ -35,8 +37,6 @@
1.13 this a classification task, because each voxel is being assigned to a class (namely,
1.14 its subregion).
1.15 Therefore, an understanding of the relationship between the combination of
1.16 - 1
1.17 -
1.18 their expression levels and the locations of the subregions may be expressed as
1.19 a function. The input to this function is a voxel, along with the gene expression
1.20 levels within that voxel; the output is the subregional identity of the target
1.21 @@ -68,6 +68,8 @@
1.22 procedures are called “stepwise” or “greedy”.
1.23 Although the classifier itself may only look at the gene expression data within
1.24 each voxel before classifying that voxel, the learning algorithm which constructs
1.25 + 2
1.26 +
1.27 the classifier may look over the entire dataset. We can categorize score-based
1.28 feature selection methods depending on how the score of calculated. Often
1.29 the score calculation consists of assigning a sub-score to each voxel, and then
1.30 @@ -83,8 +85,6 @@
1.31 Above, we defined an “instance” as the combination of a voxel with the
1.32 “associated gene expression data”. In our case this refers to the expression level
1.33 of genes within the voxel, but should we include the expression levels of all
1.34 - 2
1.35 -
1.36 genes, or only a few of them?
1.37 It is too much to hope that every anatomical region of interest will be iden-
1.38 tified by a single gene. For example, in the cortex, there are some areas which
1.39 @@ -116,6 +116,8 @@
1.40 evidence of the complementary nature of pointwise and local scoring methods.
1.41 Principle 4: Work in 2-D whenever possible
1.42 There are many anatomical structures which are commonly characterized in
1.43 + 3
1.44 +
1.45 terms of a two-dimensional manifold. When it is known that the structure that
1.46 one is looking for is two-dimensional, the results may be improved by allowing
1.47 the analysis algorithm to take advantage of this prior knowledge. In addition,
1.48 @@ -128,8 +130,6 @@
1.49 of machine learning. One thing that you can do with such a dataset is to group
1.50 instances together. A set of similar instances is called a cluster, and the activity
1.51 of finding grouping the data into clusters is called clustering or cluster analysis.
1.52 - 3
1.53 -
1.54 The task of deciding how to carve up a structure into anatomical subregions
1.55 can be put into these terms. The instances are once again voxels (or pixels)
1.56 along with their associated gene expression profiles. We make the assumption
1.57 @@ -162,6 +162,8 @@
1.58 image into clusters, usually contiguous clusters. Aim 2 is similar to an image
1.59 segmentation task. There are two main differences; in our task, there are thou-
1.60 sands of color channels (one for each gene), rather than just three. There are
1.61 + 4
1.62 +
1.63 imaging tasks which use more than three colors, however, for example multispec-
1.64 tral imaging and hyperspectral imaging, which are often used to process satellite
1.65 imagery. A more crucial difference is that there are various cues which are ap-
1.66 @@ -176,8 +178,6 @@
1.67 algorithms perform better on small numbers of features. There are techniques
1.68 which “summarize” a larger number of features using a smaller number of fea-
1.69 tures; these techniques go by the name of feature extraction or dimensionality
1.70 - 4
1.71 -
1.72 reduction. The small set of features that such a technique yields is called the
1.73 reduced feature set. After the reduced feature set is created, the instances may
1.74 be replaced by reduced instances, which have as their features the reduced fea-
1.75 @@ -208,6 +208,8 @@
1.76 This is because many genes have an expression pattern which seems to pick
1.77 out a single, spatially continguous subregion. Therefore, it seems likely that an
1.78 anatomically interesting subregion will have multiple genes which each individ-
1.79 + 5
1.80 +
1.81 ually pick it out1. This suggests the following procedure: cluster together genes
1.82 which pick out similar subregions, and then to use the more popular common
1.83 subregions as the final clusters. In the Preliminary Data we show that a num-
1.84 @@ -216,14 +218,6 @@
1.85 this fashion.
1.86 Aim 3
1.87 Background
1.88 -_______________
1.89 - 1This would seem to contradict our finding in aim 1 that some cortical areas are combina-
1.90 -torially coded by multiple genes. However, it is possible that the currently accepted cortical
1.91 -maps divide the cortex into subregions which are unnatural from the point of view of gene
1.92 -expression; perhaps there is some other way to map the cortex for which each subregion can
1.93 -be identified by single genes.
1.94 - 5
1.95 -
1.96 The cortex is divided into areas and layers. To a first approximation, the
1.97 parcellation of the cortex into areas can be drawn as a 2-D map on the surface of
1.98 the cortex. In the third dimension, the boundaries between the areas continue
1.99 @@ -254,6 +248,14 @@
1.100 finding markers for each individual cortical areas, we will find a small panel
1.101 of genes that can find many of the areal boundaries at once. This panel of
1.102 marker genes will allow the development of an ISH protocol that will allow
1.103 +__________________________
1.104 + 1This would seem to contradict our finding in aim 1 that some cortical areas are combina-
1.105 +torially coded by multiple genes. However, it is possible that the currently accepted cortical
1.106 +maps divide the cortex into subregions which are unnatural from the point of view of gene
1.107 +expression; perhaps there is some other way to map the cortex for which each subregion can
1.108 +be identified by single genes.
1.109 + 6
1.110 +
1.111 experimenters to more easily identify which anatomical areas are present in
1.112 small samples of cortex.
1.113 The method developed in aim (3) will provide a genoarchitectonic viewpoint
1.114 @@ -269,8 +271,6 @@
1.115 While we do not here propose to analyze human gene expression data, it is
1.116 conceivable that the methods we propose to develop could be used to suggest
1.117 modifications to the human cortical map as well.
1.118 - 6
1.119 -
1.120 Related work
1.121 There does not appear to be much work on the automated analysis of spatial
1.122 gene expression data.
1.123 @@ -297,23 +297,26 @@
1.124 yielded impressive results, proving the usefulness of such research. We have run
1.125 NNMF on the cortical dataset and while the results are promising (see Prelim-
1.126 inary Data), we think that it will be possible to find a better method2 (we also
1.127 +__________________________
1.128 + 2We ran “vanilla” NNMF, whereas the paper under discussion used a modified method.
1.129 +Their main modification consisted of adding a soft spatial contiguity constraint. However,
1.130 +on our dataset, NNMF naturally produced spatially contiguous clusters, so no additional
1.131 + 7
1.132 +
1.133 think that more automation of the parts that this paper’s authors did manually
1.134 will be possible).
1.135 and [?] describes AGEA. todo
1.136 +__________________________
1.137 +constraint was needed. The paper under discussion mentions that they also tried a hierarchial
1.138 +variant of NNMF, but since they didn’t report its results, we assume that those result were
1.139 +not any more impressive than the results of the non-hierarchial variant.
1.140 + 8
1.141 +
1.142 Preliminary work
1.143 Format conversion between SEV, MATLAB, NIFTI
1.144 todo
1.145 Flatmap of cortex
1.146 todo
1.147 -_______________________
1.148 - 2We ran “vanilla” NNMF, whereas the paper under discussion used a modified method.
1.149 -Their main modification consisted of adding a soft spatial contiguity constraint. However,
1.150 -on our dataset, NNMF naturally produced spatially contiguous clusters, so no additional
1.151 -constraint was needed. The paper under discussion mentions that they also tried a hierarchial
1.152 -variant of NNMF, but since they didn’t report its results, we assume that those result were
1.153 -not any more impressive than the results of the non-hierarchial variant.
1.154 - 7
1.155 -
1.156 Using combinations of multiple genes is necessary and sufficient to
1.157 delineate some cortical areas
1.158 Here we give an example of a cortical area which is not marked by any
1.159 @@ -343,15 +346,7 @@
1.160 genes which express more strongly in AUD than outside of it; its weakness is that
1.161 this includes many areas which don’t have a salient border matching the areal
1.162 border. The geometric method identifies genes whose salient expression border
1.163 - seems to partially line up with the border of AUD; its weakness is that this
1.164 - includes genes which don’t express over the entire area. Genes which have high
1.165 - rankings using both pointwise and border criteria, such as Aph1a in the example,
1.166 - may be particularly good markers. None of these genes are, individually, a
1.167 - perfect marker for AUD; we deliberately chose a “difficult” area in order to
1.168 - better contrast pointwise with geometric methods.
1.169 - Areas which can be identified by single genes
1.170 - todo
1.171 -____________________
1.172 +__________________________
1.173 3“WW, C2 and coiled-coil domain containing 1”; EntrezGene ID 211652
1.174 4“mitochondrial translational initiation factor 2”; EntrezGene ID 76784
1.175 5For each gene, a logistic regression in which the response variable was whether or not a
1.176 @@ -361,7 +356,7 @@
1.177 6For each gene the gradient similarity (see section ??) between (a) a map of the expression
1.178 of each gene on the cortical surface and (b) the shape of area AUD, was calculated, and this
1.179 was used to rank the genes.
1.180 - 8
1.181 + 9
1.182
1.183
1.184
1.185 @@ -373,6 +368,8 @@
1.186 the boundary of region MO. Pixels are colored approximately according to the
1.187 density of expressing cells underneath each pixel, with red meaning a lot of
1.188 expression and blue meaning little.
1.189 + 10
1.190 +
1.191
1.192
1.193 Figure 2: The top row shows the three genes which (individually) best predict
1.194 @@ -380,8 +377,14 @@
1.195 genes which (individually) best match area AUD, according to gradient similar-
1.196 ity. From left to right and top to bottom, the genes are Ssr1, Efcbp1, Aph1a,
1.197 Ptk7, Aph1a again, and Lepr
1.198 - 9
1.199 -
1.200 + seems to partially line up with the border of AUD; its weakness is that this
1.201 + includes genes which don’t express over the entire area. Genes which have high
1.202 + rankings using both pointwise and border criteria, such as Aph1a in the example,
1.203 + may be particularly good markers. None of these genes are, individually, a
1.204 + perfect marker for AUD; we deliberately chose a “difficult” area in order to
1.205 + better contrast pointwise with geometric methods.
1.206 + Areas which can be identified by single genes
1.207 + todo
1.208 Specific to Aim 1 (and Aim 3)
1.209 Forward stepwise logistic regression todo
1.210 SVM on all genes at once
1.211 @@ -396,6 +399,10 @@
1.212 our task combines feature selection with supervised learning.
1.213 Decision trees
1.214 todo
1.215 +____________________
1.216 + 75-fold cross-validation.
1.217 + 11
1.218 +
1.219 Specific to Aim 2 (and Aim 3)
1.220 Raw dimensionality reduction results
1.221 todo
1.222 @@ -404,6 +411,8 @@
1.223 Many areas are captured by clusters of genes
1.224 todo
1.225 todo
1.226 + 12
1.227 +
1.228 Research plan
1.229 todo amongst other things:
1.230 Develop algorithms that find genetic markers for anatomical re-
1.231 @@ -419,10 +428,6 @@
1.232 with a handful of genes. We will consider both (a) algorithms that incre-
1.233 mentally/greedily combine single gene markers into sets, such as forward
1.234 stepwise regression and decision trees, and also (b) supervised learning
1.235 -__________________________
1.236 - 75-fold cross-validation.
1.237 - 10
1.238 -
1.239 techniques which use soft constraints to minimize the number of features,
1.240 such as sparse support vector machines.
1.241 4. Extend the procedure to handle difficult areas by combining or redrawing
1.242 @@ -446,6 +451,8 @@
1.243 at once.
1.244 Develop algorithms to suggest a division of a structure into anatom-
1.245 ical parts
1.246 + 13
1.247 +
1.248 1. Explore dimensionality reduction algorithms applied to pixels: including
1.249 TODO
1.250 2. Explore dimensionality reduction algorithms applied to genes: including
1.251 @@ -457,9 +464,8 @@
1.252 clustering to create anatomical maps
1.253 6. Run this algorithm on the cortex: present a hierarchial, genoarchitectonic
1.254 map of the cortex
1.255 - 11
1.256 -
1.257 - _______________________________________________________________________________________________________ stuff i dunno where to put yet (there is more scattered through grant-
1.258 +______________________________________________
1.259 + stuff i dunno where to put yet (there is more scattered through grant-
1.260 oldtext):
1.261 Principle 4: Work in 2-D whenever possible
1.262 In anatomy, the manifold of interest is usually either defined by a combina-
1.263 @@ -484,6 +490,6 @@
1.264 app2 has examples of genetic targeting to specific anatomical regions
1.265 —
1.266 note:
1.267 - 12
1.268 -
1.269 -
1.270 + 14
1.271 +
1.272 +
2.1 Binary file grant.odt has changed
3.1 Binary file grant.pdf has changed
4.1 --- a/grant.txt Mon Apr 13 03:21:04 2009 -0700
4.2 +++ b/grant.txt Mon Apr 13 03:22:01 2009 -0700
4.3 @@ -13,6 +13,7 @@
4.4 All algorithms that we develop will be implemented in an open-source software toolkit. The toolkit, as well as the machine-readable datasets developed in aim (3), will be published and freely available for others to use.
4.5
4.6
4.7 +\newpage
4.8
4.9 == Background and significance ==
4.10
4.11 @@ -151,6 +152,8 @@
4.12
4.13
4.14
4.15 +\newpage
4.16 +
4.17 == Preliminary work ==
4.18
4.19 === Format conversion between SEV, MATLAB, NIFTI ===
4.20 @@ -254,6 +257,9 @@
4.21
4.22 todo
4.23
4.24 +
4.25 +
4.26 +\newpage
4.27 == Research plan ==
4.28
4.29 todo amongst other things: