diff grant.html @ 99:a48955c639d4

author | bshanks@bshanks.dyndns.org |
---|---|
date | Wed Apr 22 06:43:51 2009 -0700 |
parents | a75c226cbdd6 |
children | fa7c0a924e7a |
1.1 --- a/grant.html Wed Apr 22 06:23:09 2009 -0700
1.2 +++ b/grant.html Wed Apr 22 06:43:51 2009 -0700
1.3 @@ -77,7 +77,7 @@
1.4 calculate a voxel’s sub-score, then we say it is a local scoring method. If only information from the voxel itself is
1.5 used to calculate a voxel’s sub-score, then we say it is a pointwise scoring method.
1.6 Both gene expression data and anatomical atlases have errors, due to a variety of factors. Individual subjects
1.7 -have idiosyncratic anatomy. Subjects may be improperly registred to the atlas. The method used to measure
1.8 +have idiosyncratic anatomy. Subjects may be improperly registered to the atlas. The method used to measure
1.9 gene expression may be noisy. The atlas may have errors. It is even possible that some areas in the anatomical
1.10 atlas are “wrong” in that they do not have the same shape as the natural domains of gene expression to which
1.11 they correspond. These sources of error can affect the displacement and the shape of both the gene expression
1.12 @@ -175,7 +175,7 @@
1.13 gene, and then, for each anatomical structure of interest, computing
1.14 what proportion of this structure is covered by the gene’s spatial region.
1.15 GeneAtlas[5] and EMAGE [26] allow the user to construct a search
1.16 - query by demarcating regions and then specifing either the strength of
1.17 + query by demarcating regions and then specifying either the strength of
1.18 expression or the name of another gene or dataset whose expression
1.19 pattern is to be matched. Neither GeneAtlas nor EMAGE allow one to
1.20 search for combinations of genes that define a region in concert but not separately.
1.21 @@ -195,8 +195,8 @@
1.22 one.
1.23 [10 ] describes a technique to find combinations of marker genes to pick out an anatomical region. They use
1.24 an evolutionary algorithm to evolve logical operators which combine boolean (thresholded) images in order to
1.25 -match a target image. Their match score is Jaccard similarity.
1.26 -_________________________________________
1.27 +match a target image.
1.28 +_____________________
1.29 2By “fundamentally spatial” we mean that there is information from a large number of spatial locations indexed by spatial coordinates;
1.30 not just data which have only a few different locations or which is indexed by anatomical label.
1.31 In summary, there has been fruitful work on finding marker genes, but only one of the previous projects
1.32 @@ -226,8 +226,8 @@
1.33 into clusters of voxels with similar gene expression.
1.34 It is desirable to determine not just one set of regions, but also how
1.35 these regions relate to each other. The outcome of clustering may be
1.36 - a hierarchial tree of clusters, rather than a single set of clusters which
1.37 -partition the voxels. This is called hierarchial clustering.
1.38 + a hierarchical tree of clusters, rather than a single set of clusters which
1.39 +partition the voxels. This is called hierarchical clustering.
1.40 Similarity scores A crucial choice when designing a clustering method is how to measure similarity, across
1.41 either pairs of instances, or clusters, or both. There is much overlap between scoring methods for feature
1.42 selection (discussed above under Aim 1) and scoring methods for similarity.
1.43 @@ -290,7 +290,7 @@
1.44 feature for each gene cluster.
1.45 Gene clusters could also be used to directly yield a clustering on
1.46 instances. This is because many genes have an expression pattern
1.47 - which seems to pick out a single, spatially continguous region. This
1.48 + which seems to pick out a single, spatially contiguous region. This
1.49 suggests the following procedure: cluster together genes which pick
1.50 out similar regions, and then to use the more popular common regions
1.51 as the final clusters. In Preliminary Studies, Figure 7, we show that a
1.52 @@ -306,31 +306,30 @@
1.53 [23] describes an analysis of the anatomy of the hippocampus us-
1.54 ing the ABA dataset. In addition to manual analysis, two clustering
1.55 methods were employed, a modified Non-negative Matrix Factoriza-
1.56 - tion (NNMF), and a hierarchial recursive bifurcation clustering scheme
1.57 + tion (NNMF), and a hierarchical recursive bifurcation clustering scheme
1.58 based on correlation as the similarity score. The paper yielded impres-
1.59 sive results, proving the usefulness of computational genomic anatomy.
1.60 We have run NNMF on the cortical dataset
1.61 - AGEA[15] includes a preset hierarchial clustering of voxels based
1.62 + AGEA[15] includes a preset hierarchical clustering of voxels based
1.63 on a recursive bifurcation algorithm with correlation as the similarity
1.64 metric. EMAGE[26] allows the user to select a dataset from among a
1.65 large number of alternatives, or by running a search query, and then to
1.66 - cluster the genes within that dataset. EMAGE clusters via hierarchial
1.67 - complete linkage clustering with un-centred correlation as the similarity
1.68 - score.
1.69 - [6] clustered genes. For each cluster, prototypical spatial expres-
1.70 - sion patterns were created by averaging the genes in the cluster. The
1.71 - prototypes were analyzed manually, without clustering voxels.
1.72 + cluster the genes within that dataset. EMAGE clusters via hierarchical
1.73 + complete linkage clustering.
1.74 + [6] clusters genes. For each cluster, prototypical spatial expression
1.75 + patterns were created by averaging the genes in the cluster. The pro-
1.76 + totypes were analyzed manually, without clustering voxels.
1.77 [10] applies their technique for finding combinations of marker
1.78 genes for the purpose of clustering genes around a “seed gene”.
1.79 In summary, although these projects obtained clusterings, there has
1.80 not been much comparison between different algorithms or scoring
1.81 methods, so it is likely that the best clustering method for this appli-
1.82 -cation has not yet been found. The projects using gene expression on cortex did not attempt to make use of
1.83 + cation has not yet been found. The projects using gene expression on
1.84 +cortex did not attempt to make use of the radial profile of gene expression. Also, none of these projects did a
1.85 _________________________________________
1.86 5A radial profile is a profile along a line perpendicular to the cortical surface.
1.87 -the radial profile of gene expression. Also, none of these projects did a separate dimensionality reduction step
1.88 -before clustering pixels, none tried to cluster genes first in order to guide automated clustering of pixels into
1.89 -spatial regions, and none used co-clustering algorithms.
1.90 +separate dimensionality reduction step before clustering pixels, none tried to cluster genes first in order to guide
1.91 +automated clustering of pixels into spatial regions, and none used co-clustering algorithms.
1.92 Aim 3: apply the methods developed to the cerebral cortex
1.93
1.94
1.95 @@ -400,11 +399,11 @@
1.96 of correlations between voxel gene expression profiles within a handful of cortical areas. However, this sort
1.97 of analysis is not related to either of our aims, as it neither finds marker genes, nor does it suggest a cortical
1.98 map based on gene expression data. Neither of the other components of AGEA can be applied to cortical
1.99 -areas; AGEA’s Gene Finder cannot be used to find marker genes for the cortical areas; and AGEA’s hierarchial
1.100 +areas; AGEA’s Gene Finder cannot be used to find marker genes for the cortical areas; and AGEA’s hierarchical
1.101 clustering does not produce clusters corresponding to the cortical areas8.
1.102 In summary, for all three aims, (a) only one of the previous projects explores combinations of marker genes,
1.103 (b) there has been almost no comparison of different algorithms or scoring methods, and (c) there has been no
1.104 -work on computationally finding marker genes for cortical areas, or on finding a hierarchial clustering that will
1.105 +work on computationally finding marker genes for cortical areas, or on finding a hierarchical clustering that will
1.106 yield a map of cortical areas de novo from gene expression data.
1.107 Our project is guided by a concrete application with a well-specified criterion of success (how well we can
1.108 find marker genes for / reproduce the layout of cortical areas), which will provide a solid basis for comparing
1.109 @@ -414,7 +413,7 @@
1.110 Figure 7: Prototypes corresponding to sample gene
1.111 clusters, clustered by gradient similarity. Region bound-
1.112 aries for the region that most matches each prototype
1.113 -are overlayed. The method developed in aim (1) will be applied to
1.114 +are overlaid. The method developed in aim (1) will be applied to
1.115 each cortical area to find a set of marker genes such
1.116 that the combinatorial expression pattern of those
1.117 genes uniquely picks out the target area. Finding
1.118 @@ -529,7 +528,7 @@
1.119 cortical areas, while also validating the relevancy of our new scoring method, gradient similarity.
1.120 Combinations of multiple genes are useful and necessary for some areas
1.121 In Figure 4, we give an example of a cortical area which is not marked by any single gene, but which
1.122 -can be identified combinatorially. Acccording to logistic regression, gene wwc1 is the best fit single gene for
1.123 +can be identified combinatorially. According to logistic regression, gene wwc1 is the best fit single gene for
1.124 predicting whether or not a pixel on the cortical surface belongs to the motor area (area MO). The upper-left
1.125 picture in Figure 4 shows wwc1’s spatial expression pattern over the cortex. The lower-right boundary of MO is
1.126 represented reasonably well by this gene, but the gene overshoots the upper-left boundary. This flattened 2-D
1.127 @@ -542,7 +541,7 @@
1.128 necessary.
1.129 Multivariate supervised learning
1.130 Forward stepwise logistic regression Logistic regression is a popular method for predictive modeling of cate-
1.131 -gorial data. As a pilot run, for five cortical areas (SS, AUD, RSP, VIS, and MO), we performed forward stepwise
1.132 +gorical data. As a pilot run, for five cortical areas (SS, AUD, RSP, VIS, and MO), we performed forward stepwise
1.133 logistic regression to find single genes, pairs of genes, and triplets of genes which predict areal identify. This is
1.134 an example of feature selection integrated with prediction using a stepwise wrapper. Some of the single genes
1.135 found were shown in various figures throughout this document, and Figure 4 shows a combination of genes
1.136 @@ -565,7 +564,7 @@
1.137 shown in the last row of Figure 6. To compare, the leftmost picture on the bottom row of Figure 6 shows some
1.138 of the major subdivisions of cortex. These results clearly show that different dimensionality reduction techniques
1.139 capture different aspects of the data and lead to different clusterings, indicating the utility of our proposal to
1.140 -produce a detailed comparion of these techniques as applied to the domain of genomic anatomy.
1.141 +produce a detailed comparison of these techniques as applied to the domain of genomic anatomy.
1.142 Many areas are captured by clusters of genes We also clustered the genes using gradient similarity to
1.143 see if the spatial regions defined by any clusters matched known anatomical regions. Figure 7 shows, for ten
1.144 sample gene clusters, each cluster’s average expression pattern, compared to a known anatomical boundary.
1.145 @@ -605,7 +604,7 @@
1.146 selection using a stepwise wrapper over “vanilla” classifiers such as logistic regression, (b) supervised learning
1.147 methods such as decision trees which incrementally/greedily combine single gene markers into sets, and (c)
1.148 supervised learning methods which use soft constraints to minimize number of features used, such as sparse
1.149 -support vector machines.
1.150 +support vector machines (SVMs).
1.151 Since errors of displacement and of shape may cause genes and target areas to match less than they should,
1.152 we will consider the robustness of feature selection methods in the presence of error. Some of these methods,
1.153 such as the Hough transform, are designed to be resistant in the presence of error, but many are not. We will
1.154 @@ -644,7 +643,7 @@
1.155 to quantitatively compare the relevance of the different dimensionality reduction methods for identifying cortical
1.156 areal boundaries.
1.157 Dimensionality reduction on pixels Instead of applying dimensionality reduction to the gene expression
1.158 -profiles, the same techniques can be applied instead to the pixels13. It is possible that the features generated in
1.159 +profiles, the same techniques can be applied instead to the pixels. It is possible that the features generated in
1.160 this way by some dimensionality reduction techniques will directly correspond to interesting spatial regions.
1.161 Clustering and segmentation on pixels We will explore clustering and segmentation algorithms in order to
1.162 segment the pixels into regions. We will explore k-means, spectral clustering, gene shaving[9], recursive division
1.163 @@ -660,7 +659,7 @@
1.164 the gene expression profiles. One could then perform clustering on pixels (possibly after a second dimensionality
1.165 reduction step) in order to identify spatial regions. It remains to be seen whether removal of redundancy would
1.166 help or hurt the ultimate goal of identifying interesting spatial regions.
1.167 -Co-clustering There are some algorithms which simultaineously incorporate clustering on instances and on
1.168 +Co-clustering There are some algorithms which simultaneously incorporate clustering on instances and on
1.169 features (in our case, genes and pixels), for example, IRM[11]. These are called co-clustering or biclustering
1.170 algorithms.
1.171 Radial profiles We wil explore the use of the radial profile of gene expression under each pixel.
1.172 @@ -675,12 +674,6 @@
1.173 best linear summary of gene expression profiles for the purpose of discriminating between regions. This reduced
1.174 feature set could then be used to cluster pixels into regions. Perhaps the resulting clusters will be similar to the
1.175 reference atlas, yet more faithful to natural spatial domains of gene expression than the reference atlas is.
1.176 -_________________________________________
1.177 - 13Consider a matrix whose rows represent pixel locations, and whose columns represent genes. An entry in this matrix represents the
1.178 -gene expression level at a given pixel. One can look at this matrix as a collection of pixels, each corresponding to a vector of many gene
1.179 -expression levels; or one can look at it as a collection of genes, each corresponding to a vector giving that gene’s expression at each
1.180 -pixel. Similarly, dimensionality reduction can be used to replace a large number of genes with a small number of features, or it can be
1.181 -used to replace a large number of pixels with a small number of features.
1.182 Apply the new methods to the cortex
1.183 Using the methods developed in Aim 1, we will present, for each cortical area, a short list of markers to identify
1.184 that area; and we will also present lists of “panels” of genes that can be used to delineate many areas at once.
1.185 @@ -689,7 +682,7 @@
1.186 validate our marker genes to guard against this. First, we will confirm that putative combinations of marker genes
1.187 express the same pattern in both hemispheres. Second, we will manually validate our final results on other gene
1.188 expression datasets such as EMAGE, GeneAtlas, and GENSAT[8].
1.189 -Using the methods developed in Aim 2, we will present one or more hierarchial cortical maps. We will identify
1.190 +Using the methods developed in Aim 2, we will present one or more hierarchical cortical maps. We will identify
1.191 and explain how the statistical structure in the gene expression data led to any unexpected or interesting features
1.192 of these maps, and we will provide biological hypotheses to interpret any new cortical areas, or groupings of
1.193 areas, which are discovered.
1.194 @@ -698,35 +691,26 @@
1.195 September-November 2009: Develop an automated mechanism for segmenting the cortical voxels into layers
1.196 November 2009 (milestone): Have completed construction of a flatmapped, cortical dataset with information
1.197 for each layer
1.198 -October 2009-April 2010: Develop scoring methods and to test them in various supervised learning frameworks.
1.199 -Also test out various dimensionality reduction schemes in combination with supervised learning. create or extend
1.200 -supervised learning frameworks which use multivariate versions of the best scoring methods.
1.201 +October 2009-April 2010: Develop scoring methods, dimensionality reduction, and supervised learning meth-
1.202 +ods.
1.203 January 2010 (milestone): Submit a publication on single marker genes for cortical areas
1.204 -February-July 2010: Continue to develop scoring methods and supervised learning frameworks. Explore the
1.205 -best way to integrate radial profiles with supervised learning. Explore the best way to make supervised learning
1.206 -techniques robust against incorrect labels (i.e. when the areas drawn on the input cortical map are slightly
1.207 -off). Quantitatively compare the performance of different supervised learning techniques. Validate marker genes
1.208 -found in the ABA dataset by checking against other gene expression datasets. Create documentation and unit
1.209 -tests for software toolbox for Aim 1. Respond to user bug reports for Aim 1 software toolbox.
1.210 +February-July 2010: Continue to develop scoring methods and supervised learning frameworks. Extend tech-
1.211 +niques for robustness. Compare the performance of techniques. Validate marker genes. Prepare software
1.212 +toolbox for Aim 1.
1.213 June 2010 (milestone): Submit a paper describing a method fulfilling Aim 1. Release toolbox.
1.214 July 2010 (milestone): Submit a paper describing combinations of marker genes for each cortical area, and a
1.215 small number of marker genes that can, in combination, define most of the areas at once
1.216 Revealing new ways to parcellate a structure into regions
1.217 -June 2010-March 2011: Explore dimensionality reduction algorithms for Aim 2. Explore standard hierarchial
1.218 -clustering algorithms, used in combination with dimensionality reduction, for Aim 2. Explore co-clustering algo-
1.219 -rithms. Think about how radial profile information can be used for Aim 2. Adapt clustering algorithms to use radial
1.220 -profile information. Quantitatively compare the performance of different dimensionality reduction and clustering
1.221 -techniques. Quantitatively compare the value of different flatmapping methods and ways of representing radial
1.222 -profiles.
1.223 +June 2010-March 2011: Explore dimensionality reduction algorithms for Aim 2. Explore clustering algorithms.
1.224 +Adapt clustering algorithms to use radial profile information. Compare the performance of techniques.
1.225 March 2011 (milestone): Submit a paper describing a method fulfilling Aim 2. Release toolbox.
1.226 February-May 2011: Using the methods developed for Aim 2, explore the genomic anatomy of the cortex. If
1.227 -new ways of organizing the cortex into areas are discovered, read the literature and talk to people to learn about
1.228 -research related to interpreting our results. Create documentation and unit tests for software toolbox for Aim 2.
1.229 -Respond to user bug reports for Aim 2 software toolbox.
1.230 +new ways of organizing the cortex into areas are discovered, interpret the results. Prepare software toolbox for
1.231 +Aim 2.
1.232 May 2011 (milestone): Submit a paper on the genomic anatomy of the cortex, using the methods developed in
1.233 Aim 2
1.234 May-August 2011: Revisit Aim 1 to see if what was learned during Aim 2 can improve the methods for Aim 1.
1.235 -Follow up on responses to our papers. Possibly submit another paper.
1.236 +Possibly submit another paper.
1.237 Bibliography & References Cited
1.238 [1]Chris Adamson, Leigh Johnston, Terrie Inder, Sandra Rees, Iven Mareels, and Gary Egan. A Tracking
1.239 Approach to Parcellation of the Cerebral Cortex, volume Volume 3749/2005 of Lecture Notes in Computer