cg

changeset 46:a44e9ad61efa

.
author bshanks@bshanks.dyndns.org
date Wed Apr 15 13:57:53 2009 -0700
parents 354ea5edb5f6
children 33c10c13f9a3
files grant.html grant.odt grant.pdf grant.txt
line diff
1.1 --- a/grant.html Wed Apr 15 03:20:19 2009 -0700 1.2 +++ b/grant.html Wed Apr 15 13:57:53 2009 -0700 1.3 @@ -109,24 +109,27 @@ 1.4 will also look for combinations of genes. Second, gene finder can only use overexpression as a marker, whereas we will also 1.5 search for underexpression. Third, Gene Finder uses a simple pointwise score4, whereas we will also use geometric scores 1.6 such as gradient similarity. The Preliminary Data section contains evidence that each of our three choices is the right one. 1.7 -[10 ] todo 1.8 -[4 ] todo 1.9 -In summary, none of the previous projects explores combinations of marker genes, and none of their publications compare 1.10 -the results obtained by using different algorithms or scoring methods. 1.11 +[11 ] todo 1.12 +[4 ] describes a technique to find combinations of marker genes to pick out an anatomical region. They use an evolutionary 1.13 +algorithm to evolve logical operators which combine boolean (thresholded) images in order to match a target image. Their 1.14 +match score is Jaccard similarity, which is equal to the number of true pixels in the intersection of the two images, divided 1.15 +by the number of pixels in their union. 1.16 +In summary, only one of the previous projects explores combinations of marker genes, and none of their publications 1.17 +compare the results obtained by using different algorithms or scoring methods. 1.18 Aim 2 1.19 Machine learning terminology: clustering 1.20 If one is given a dataset consisting merely of instances, with no class labels, then analysis of the dataset is referred to as 1.21 unsupervised learning in the jargon of machine learning. One thing that you can do with such a dataset is to group instances 1.22 -together. A set of similar instances is called a cluster, and the activity of finding grouping the data into clusters is called 1.23 -clustering or cluster analysis. 
1.24 -The task of deciding how to carve up a structure into anatomical regions can be put into these terms. The instances are 1.25 -once again voxels (or pixels) along with their associated gene expression profiles. We make the assumption that voxels from 1.26 _________________________________________ 1.27 2By “fundamentally spatial” we mean that there is information from a large number of spatial locations; not just data which has only a few 1.28 different locations. 1.29 3See Preliminary Data for an example of an area which cannot be marked by any single gene in the dataset, but which can be marked by a 1.30 combination. 1.31 4“Expression energy ratio”, which captures overexpression. 1.32 +together. A set of similar instances is called a cluster, and the activity of finding grouping the data into clusters is called 1.33 +clustering or cluster analysis. 1.34 +The task of deciding how to carve up a structure into anatomical regions can be put into these terms. The instances are 1.35 +once again voxels (or pixels) along with their associated gene expression profiles. We make the assumption that voxels from 1.36 the same region have similar gene expression profiles, at least compared to the other regions. This means that clustering 1.37 voxels is the same as finding potential regions; we seek a partitioning of the voxels into regions, that is, into clusters of voxels 1.38 with similar gene expression. 1.39 @@ -177,30 +180,34 @@ 1.40 Gene clusters could also be used to directly yield a clustering on instances. This is because many genes have an expression 1.41 pattern which seems to pick out a single, spatially continguous region. Therefore, it seems likely that an anatomically 1.42 interesting region will have multiple genes which each individually pick it out5. 
This suggests the following procedure: 1.43 +_________________________________________ 1.44 + 5This would seem to contradict our finding in aim 1 that some cortical areas are combinatorially coded by multiple genes. However, it is 1.45 +possible that the currently accepted cortical maps divide the cortex into regions which are unnatural from the point of view of gene expression; 1.46 +perhaps there is some other way to map the cortex for which each region can be identified by single genes. Another possibility is that, although 1.47 cluster together genes which pick out similar regions, and then to use the more popular common regions as the final clusters. 1.48 In the Preliminary Data we show that a number of anatomically recognized cortical regions, as well as some “superregions” 1.49 formed by lumping together a few regions, are associated with gene clusters in this fashion. 1.50 -_________________________________________ 1.51 - 5This would seem to contradict our finding in aim 1 that some cortical areas are combinatorially coded by multiple genes. However, it is 1.52 -possible that the currently accepted cortical maps divide the cortex into regions which are unnatural from the point of view of gene expression; 1.53 -perhaps there is some other way to map the cortex for which each region can be identified by single genes. Another possibility is that, although 1.54 -the cluster prototype fits an anatomical region, the individual genes are each somewhat different from the prototype. 1.55 Related work 1.56 -We are aware of three existing efforts to cluster spatial gene expression data. 1.57 +We are aware of four existing efforts to cluster spatial gene expression data. 1.58 [9 ] describes an analysis of the anatomy of the hippocampus using the ABA dataset. 
In addition to manual analysis, 1.59 two clustering methods were employed, a modified Non-negative Matrix Factorization (NNMF), and a hierarchial recursive 1.60 bifurcation clustering scheme based on correlation as the similarity score. The paper yielded impressive results, proving 1.61 the usefulness of computational genomic anatomy. We have run NNMF on the cortical dataset6 and while the results are 1.62 promising (see Preliminary Data), we think that it will be possible to find an even better method. 1.63 +In an interesting twist, [4] applies their technique for finding combinations of marker genes for the purpose of clustering 1.64 +genes around a “seed gene”. The way they do this is by using the pattern of expression of the seed gene as the target image, 1.65 +and then searching for other genes which can be combined to reproduce this pattern. Those other genes which are found 1.66 +are considered to be related to the seed. The same team also describes a method[10] for finding “association rules” such as, 1.67 +“if this voxel is expressed in by any gene, then that voxel is probably also expressed in by the same gene”. This could be 1.68 +useful as part of a procedure for clustering voxels. 1.69 AGEA’s[6] hierarchial clustering differs from our Aim 2 in at least two ways. First, AGEA uses perhaps the simplest 1.70 possible similarity score (correlation), and does no dimensionality reduction before calculating similarity. While it is possible 1.71 that a more complex system will not do any better than this, we believe further exploration of alternative methods of scoring 1.72 and dimensionality reduction is warranted. Second, AGEA did not look at clusters of genes; in Preliminary Data we have 1.73 shown that clusters of genes may identify interesting spatial regions such as cortical areas. 
1.74 -[10 ] todo 1.75 -In summary, although these projects obtained hierarchial clusterings, there has not been much comparison between 1.76 -different algorithms or scoring methods, so it is likely that the best clustering method for this application has not yet been 1.77 -found. 1.78 +[11 ] todo 1.79 +In summary, although these projects obtained clusterings, there has not been much comparison between different algo- 1.80 +rithms or scoring methods, so it is likely that the best clustering method for this application has not yet been found. 1.81 Aim 3 1.82 Background 1.83 The cortex is divided into areas and layers. To a first approximation, the parcellation of the cortex into areas can 1.84 @@ -225,22 +232,22 @@ 1.85 Next, an automated nonlinear alignment procedure located the 2D data from the various slices in a single 3D coordinate 1.86 system. In the final 3D coordinate system, voxels are cubes with 200 microns on a side. There are 67x41x58 = 159,326 1.87 voxels in the 3D coordinate system, of which 51,533 are in the brain[6]. 1.88 -Mus musculus, the common house mouse, is thought to contain about 22,000 protein-coding genes[12]. The ABA contains 1.89 +Mus musculus, the common house mouse, is thought to contain about 22,000 protein-coding genes[13]. The ABA contains 1.90 data on about 20,000 genes in sagittal sections, out of which over 4,000 genes are also measured in coronal sections. Our 1.91 -dataset is derived from only the coronal subset of the ABA, because the sagittal data does not cover the entire cortex, 1.92 -and has greater registration error[6]. Genes were selected by the Allen Institute for coronal sectioning based on, “classes of 1.93 -known neuroscientific interest... or through post hoc identification of a marked non-ubiquitous expression pattern”[6]. 1.94 -The ABA is not the only large public spatial gene expression dataset. 
Other such resources include GENSAT[3], 1.95 -GenePaint[11], its sister project GeneAtlas[1], BGEM[5], EMAGE[?], EurExpress (http://www.eurexpress.org/ee/; Eu- 1.96 -rExpress data is also entered into EMAGE), todo. With the exception of the ABA, GenePaint, and EMAGE, most of these 1.97 -resources, have not (yet) extracted the expression intensity from the ISH images and registered the results into a single 3-D 1.98 -space, and only ABA and EMAGE make this form of data available for public download from the website. Many of these 1.99 -resources focus on developmental gene expression. 1.100 -Significance 1.101 -___________________________ 1.102 - 6We ran “vanilla” NNMF, whereas the paper under discussion used a modified method. Their main modification consisted of adding a soft 1.103 +dataset is derived from only the coronal subset of the ABA, because the sagittal data does not cover the entire cortex, and 1.104 +_________________________________________ 1.105 +the cluster prototype fits an anatomical region, the individual genes are each somewhat different from the prototype. 1.106 + 6We ran “vanilla” NNMF, whereas the paper under discussion used a modified method. Their main modification consisted of adding a soft 1.107 spatial contiguity constraint. However, on our dataset, NNMF naturally produced spatially contiguous clusters, so no additional constraint was 1.108 needed. The paper under discussion also mentions that they tried a hierarchial variant of NNMF, which we have not yet tried. 1.109 +also has greater registration error[6]. Genes were selected by the Allen Institute for coronal sectioning based on, “classes of 1.110 +known neuroscientific interest... or through post hoc identification of a marked non-ubiquitous expression pattern”[6]. 1.111 +TheABA is not the only large public spatial gene expression dataset. Other such resources include GENSAT[3], 1.112 +GenePaint[12], its sister project GeneAtlas[1], BGEM[5], EMAGE[11], EurExpress7, todo. 
With the exception of the ABA, 1.113 +GenePaint, and EMAGE, most of these resources, have not (yet) extracted the expression intensity from the ISH images 1.114 +and registered the results into a single 3-D space, and only ABA and EMAGE make this form of data available for public 1.115 +download from the website. Many of these resources focus on developmental gene expression. 1.116 +Significance 1.117 The method developed in aim (1) will be applied to each cortical area to find a set of marker genes such that the 1.118 combinatorial expression pattern of those genes uniquely picks out the target area. Finding marker genes will be useful for 1.119 drug discovery as well as for experimentation because marker genes can be used to design interventions which selectively 1.120 @@ -260,21 +267,21 @@ 1.121 develop could be used to suggest modifications to the human cortical map as well. 1.122 Related work 1.123 [6 ] describes the application of AGEA to the cortex. The paper describes interesting results on the structure of correlations 1.124 -between voxel gene expression profiles within a handful of cortical areas. However, this sort of analysis is not related to 1.125 -either of our aims, as it neither finds marker genes, nor does it suggest a cortical map based on gene expression data. Neither 1.126 -of the other components of AGEA can be applied to cortical areas; AGEA’s Gene Finder cannot be used to find marker 1.127 -genes for most cortical areas; and AGEA’s hierarchial clustering does not produce clusters corresponding to most cortical 1.128 -areas7 . 1.129 -In summary, for all three aims, (a) none of the previous projects explores combinations of marker genes, (b) there has 1.130 +between voxel gene expression profiles within a handful of cortical areas. However, this sort of analysis is not related to either 1.131 +of our aims, as it neither finds marker genes, nor does it suggest a cortical map based on gene expression data. 
Neither of 1.132 +the other components of AGEA can be applied to cortical areas; AGEA’s Gene Finder cannot be used to find marker genes 1.133 +for the cortical areas; and AGEA’s hierarchial clustering does not produce clusters corresponding to the cortical areas8. 1.134 +In summary, for all three aims, (a) only one of the previous projects explores combinations of marker genes, (b) there has 1.135 been almost no comparison of different algorithms or scoring methods, and (c) there has been no work on computationally 1.136 finding marker genes for cortical areas, or on finding a hierarchial clustering that will yield a map of cortical areas de novo 1.137 from gene expression data. 1.138 Our project is guided by a concrete application with a well-specified criterion of success (how well we can find marker 1.139 genes for / reproduce the layout of cortical areas), which will provide a solid basis for comparing different methods. 1.140 _________________________________________ 1.141 - 7In both cases, the root cause is that pairwise correlations between the gene expression of voxels in different areas but the same layer are 1.142 + 7http://www.eurexpress.org/ee/; EurExpress data is also entered into EMAGE 1.143 + 8In both cases, the root cause is that pairwise correlations between the gene expression of voxels in different areas but the same layer are 1.144 often stronger than pairwise correlations between the gene expression of voxels in different layers but the same area. Therefore, a pairwise voxel 1.145 -correlation clustering algorithm will often create clusters representing cortical layers, not areas. This is why the hierarchial clustering does not 1.146 +correlation clustering algorithm will tend to create clusters representing cortical layers, not areas. 
This is why the hierarchial clustering does not 1.147 find most cortical areas (there are clusters which presumably correspond to the intersection of a layer and an area, but since one area will have 1.148 many layer-area intersection clusters, further work is needed to make sense of these). The reason that Gene Finder cannot find marker genes for 1.149 most cortical areas is that in Gene Finder, although the user chooses a seed voxel, Gene Finder chooses the ROI for which genes will be found, 1.150 @@ -310,7 +317,7 @@ 1.151 demarcated the depth of the outer boundary of cortical layer 5 throughout the cortex. 1.152 Feature selection and scoring methods 1.153 Correlation Recall that the instances are surface pixels, and consider the problem of attempting to classify each instance 1.154 -as either a member of a particular anatomical area, or not. The target area can be represented as a binary mask over the 1.155 +as either a member of a particular anatomical area, or not. The target area can be represented as a boolean mask over the 1.156 surface pixels. 1.157 One class of feature selection scoring method are those which calculate some sort of “match” between each gene image 1.158 and the target image. Those genes which match the best are good candidates for features. 1.159 @@ -322,12 +329,12 @@ 1.160 so what we want is to find features such that the conditional distribution of the target has minimal entropy. The distribution 1.161 to which we are referring is the probability distribution over the population of surface pixels. 1.162 The simplest way to use information theory is on discrete data, so we discretized our gene expression data by creating, 1.163 -for each gene, five thresholded binary masks of the gene data. For each gene, we created a binary mask of its expression 1.164 +for each gene, five thresholded boolean masks of the gene data. 
For each gene, we created a boolean mask of its expression 1.165 levels using each of these thresholds: the mean of that gene, the mean minus one standard deviation, the mean minus two 1.166 standard deviations, the mean plus one standard deviation, the mean plus two standard deviations. 1.167 Now, for each region, we created and ran a forward stepwise procedure which attempted to find pairs of gene expression 1.168 -binary masks such that the conditional entropy of the target area’s binary mask, conditioned upon the pair of gene expression 1.169 -binary masks, is minimized. 1.170 +boolean masks such that the conditional entropy of the target area’s boolean mask, conditioned upon the pair of gene 1.171 +expression boolean masks, is minimized. 1.172 This finds pairs of genes which are most informative (at least at these discretization thresholds) relative to the question, 1.173 “Is this surface pixel a member of the target area?”. 1.174 1.175 @@ -359,24 +366,24 @@ 1.176 similar direction (because the borders are similar). 1.177 Gradient similarity provides information complementary to correlation 1.178 To show that gradient similarity can provide useful information that cannot be detected via pointwise analyses, consider 1.179 -Fig. . The top row of Fig. displays the 3 genes which most match area AUD, according to a pointwise method8. The bottom 1.180 -row displays the 3 genes which most match AUD according to a method which considers local geometry9 The pointwise 1.181 -method in the top row identifies genes which express more strongly in AUD than outside of it; its weakness is that this 1.182 -includes many areas which don’t have a salient border matching the areal border. The geometric method identifies genes 1.183 -whose salient expression border seems to partially line up with the border of AUD; its weakness is that this includes genes 1.184 -which don’t express over the entire area. 
Genes which have high rankings using both pointwise and border criteria, such as 1.185 -Aph1a in the example, may be particularly good markers. None of these genes are, individually, a perfect marker for AUD; 1.186 -we deliberately chose a “difficult” area in order to better contrast pointwise with geometric methods. 1.187 +Fig. . The top row of Fig. displays the 3 genes which most match area AUD, according to a pointwise method9. The 1.188 +bottom row displays the 3 genes which most match AUD according to a method which considers local geometry10 The 1.189 +pointwise method in the top row identifies genes which express more strongly in AUD than outside of it; its weakness is 1.190 +that this includes many areas which don’t have a salient border matching the areal border. The geometric method identifies 1.191 +genes whose salient expression border seems to partially line up with the border of AUD; its weakness is that this includes 1.192 +genes which don’t express over the entire area. Genes which have high rankings using both pointwise and border criteria, 1.193 +such as Aph1a in the example, may be particularly good markers. None of these genes are, individually, a perfect marker 1.194 +for AUD; we deliberately chose a “difficult” area in order to better contrast pointwise with geometric methods. 1.195 Combinations of multiple genes are useful 1.196 Here we give an example of a cortical area which is not marked by any single gene, but which can be identified combi- 1.197 -natorially. according to logistic regression, gene wwc110 is the best fit single gene for predicting whether or not a pixel on 1.198 -_________________________________________ 1.199 - 8For each gene, a logistic regression in which the response variable was whether or not a surface pixel was within area AUD, and the predictor 1.200 +natorially. 
according to logistic regression, gene wwc111 is the best fit single gene for predicting whether or not a pixel on 1.201 +_________________________________________ 1.202 + 9For each gene, a logistic regression in which the response variable was whether or not a surface pixel was within area AUD, and the predictor 1.203 variable was the value of the expression of the gene underneath that pixel. The resulting scores were used to rank the genes in terms of how well 1.204 they predict area AUD. 1.205 - 9For each gene the gradient similarity (see section ??) between (a) a map of the expression of each gene on the cortical surface and (b) the 1.206 + 10For each gene the gradient similarity (see section ??) between (a) a map of the expression of each gene on the cortical surface and (b) the 1.207 shape of area AUD, was calculated, and this was used to rank the genes. 1.208 - 10“WW, C2 and coiled-coil domain containing 1”; EntrezGene ID 211652 1.209 + 11“WW, C2 and coiled-coil domain containing 1”; EntrezGene ID 211652 1.210 1.211 1.212 1.213 @@ -389,7 +396,7 @@ 1.214 pattern over the cortex. The lower-right boundary of MO is represented reasonably well by this gene, however the gene 1.215 overshoots the upper-left boundary. This flattened 2-D representation does not show it, but the area corresponding to the 1.216 overshoot is the medial surface of the cortex. MO is only found on the lateral surface (todo). 1.217 -Gene mtif211 is shown in figure the upper-right of Fig. . Mtif2 captures MO’s upper-left boundary, but not its lower-right 1.218 +Gene mtif212 is shown in figure the upper-right of Fig. . Mtif2 captures MO’s upper-left boundary, but not its lower-right 1.219 boundary. Mtif2 does not express very much on the medial surface. By adding together the values at each pixel in these 1.220 two figures, we get the lower-left of Figure . This combination captures area MO much better than any single gene. 
1.221 Areas which can be identified by single genes 1.222 @@ -400,7 +407,7 @@ 1.223 Forward stepwise logistic regression todo 1.224 SVM on all genes at once 1.225 In order to see how well one can do when looking at all genes at once, we ran a support vector machine to classify cortical 1.226 -surface pixels based on their gene expression profiles. We achieved classification accuracy of about 81%12. As noted above, 1.227 +surface pixels based on their gene expression profiles. We achieved classification accuracy of about 81%13. As noted above, 1.228 however, a classifier that looks at all the genes at once isn’t practically useful. 1.229 The requirement to find combinations of only a small number of genes limits us from straightforwardly applying many 1.230 of the most simple techniques from the field of supervised machine learning. In the parlance of machine learning, our task 1.231 @@ -412,8 +419,8 @@ 1.232 todo 1.233 (might want to incld nnMF since mentioned above) 1.234 _________________________________________ 1.235 - 11“mitochondrial translational initiation factor 2”; EntrezGene ID 76784 1.236 - 125-fold cross-validation. 1.237 + 12“mitochondrial translational initiation factor 2”; EntrezGene ID 76784 1.238 + 135-fold cross-validation. 1.239 Dimensionality reduction plus K-means or spectral clustering 1.240 Many areas are captured by clusters of genes 1.241 todo 1.242 @@ -469,8 +476,9 @@ 1.243 [3]Shiaoching Gong, Chen Zheng, Martin L. Doughty, Kasia Losos, Nicholas Didkovsky, Uta B. Schambra, Norma J. 1.244 Nowak, Alexandra Joyner, Gabrielle Leblanc, Mary E. Hatten, and Nathaniel Heintz. A gene expression atlas of the 1.245 central nervous system based on bacterial artificial chromosomes. Nature, 425(6961):917–925, October 2003. 1.246 -[4]Jano Hemert and Richard Baldock. Matching Spatial Regions with Combinations of Interacting Gene Expression 1.247 -Patterns, pages 347–361. 2008. 1.248 +[4]Jano Hemert and Richard Baldock. 
Matching Spatial Regions with Combinations of Interacting Gene Expression Pat- 1.249 +terns, volume 13 of Communications in Computer and Information Science, pages 347–361. Springer Berlin Heidelberg, 1.250 +2008. 1.251 [5]Susan Magdaleno, Patricia Jensen, Craig L. Brumwell, Anna Seal, Karen Lehman, Andrew Asbury, Tony Cheung, 1.252 Tommie Cornelius, Diana M. Batten, Christopher Eden, Shannon M. Norland, Dennis S. Rice, Nilesh Dosooye, Sundeep 1.253 Shakya, Perdeep Mehta, and Tom Curran. BGEM: an in situ hybridization database of gene expression in the embryonic 1.254 @@ -486,12 +494,13 @@ 1.255 Allison Cusick, Zackery L. Riley, Susan M. Sunkin, Amy Bernard, Ralph B. Puchalski, Fred H. Gage, Allan R. Jones, 1.256 Vladimir B. Bajic, Michael J. Hawrylycz, and Ed S. Lein. Genomic anatomy of the hippocampus. Neuron, 60(6):1010– 1.257 1021, December 2008. 1.258 -[10]Shanmugasundaram Venkataraman, Peter Stevenson, Yiya Yang, Lorna Richardson, Nicholas Burton, Thomas P. Perry, 1.259 +[10]Jano van Hemert and Richard Baldock. Mining Spatial Gene Expression Data for Association Rules, pages 66–76. 2007. 1.260 +[11]Shanmugasundaram Venkataraman, Peter Stevenson, Yiya Yang, Lorna Richardson, Nicholas Burton, Thomas P. Perry, 1.261 Paul Smith, Richard A. Baldock, Duncan R. Davidson, and Jeffrey H. Christiansen. EMAGE edinburgh mouse atlas 1.262 of gene expression: 2008 update. Nucl. Acids Res., 36(suppl_1):D860–865, 2008. 1.263 -[11]Axel Visel, Christina Thaller, and Gregor Eichele. GenePaint.org: an atlas of gene expression patterns in the mouse 1.264 +[12]Axel Visel, Christina Thaller, and Gregor Eichele. GenePaint.org: an atlas of gene expression patterns in the mouse 1.265 embryo. Nucl. Acids Res., 32(suppl_1):D552–556, 2004. 
1.266 -[12]Robert H Waterston, Kerstin Lindblad-Toh, Ewan Birney, Jane Rogers, Josep F Abril, Pankaj Agarwal, Richa Agar- 1.267 +[13]Robert H Waterston, Kerstin Lindblad-Toh, Ewan Birney, Jane Rogers, Josep F Abril, Pankaj Agarwal, Richa Agar- 1.268 wala, Rachel Ainscough, Marina Alexandersson, Peter An, Stylianos E Antonarakis, John Attwood, Robert Baertsch, 1.269 Jonathon Bailey, Karen Barlow, Stephan Beck, Eric Berry, Bruce Birren, Toby Bloom, Peer Bork, Marc Botcherby, 1.270 Nicolas Bray, Michael R Brent, Daniel G Brown, Stephen D Brown, Carol Bult, John Burton, Jonathan Butler,
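Reviewer note on the grant.html hunks above: the information-theoretic feature selection they touch (five thresholded boolean masks per gene, then a forward stepwise search for a pair of masks that minimizes the conditional entropy of the target area's mask) can be sketched roughly as follows. This is a minimal illustration, not the author's code. The function names, the use of the population standard deviation, the strict `>` comparison, and the particular reading of "forward stepwise" (best single mask first, then best partner) are all assumptions made for the example.

```python
import math
from collections import Counter
from statistics import mean, pstdev
from typing import List, Sequence, Tuple


def threshold_masks(expr: Sequence[float]) -> List[List[bool]]:
    """Five boolean masks for one gene, thresholded at the mean,
    mean - 1 sd, mean - 2 sd, mean + 1 sd and mean + 2 sd.
    Assumption: population sd and a strict '>' comparison."""
    m, s = mean(expr), pstdev(expr)
    thresholds = [m, m - s, m - 2 * s, m + s, m + 2 * s]
    return [[v > t for v in expr] for t in thresholds]


def conditional_entropy(target: Sequence[bool],
                        mask_a: Sequence[bool],
                        mask_b: Sequence[bool]) -> float:
    """H(target | mask_a, mask_b) in bits, estimated over the pixels."""
    n = len(target)
    joint = Counter(zip(mask_a, mask_b, target))  # counts of (a, b, t)
    cond = Counter(zip(mask_a, mask_b))           # counts of (a, b)
    h = 0.0
    for (a, b, _t), count in joint.items():
        h -= (count / n) * math.log2(count / cond[(a, b)])
    return h


def forward_stepwise_pair(target: Sequence[bool],
                          masks: Sequence[Sequence[bool]]) -> Tuple[int, int]:
    """Hypothetical forward stepwise search: choose the single most
    informative mask, then the partner that lowers the entropy most."""
    first = min(range(len(masks)),
                key=lambda i: conditional_entropy(target, masks[i], masks[i]))
    second = min((j for j in range(len(masks)) if j != first),
                 key=lambda j: conditional_entropy(target, masks[first], masks[j]))
    return first, second
```

A lower conditional entropy means the pair of masks leaves less uncertainty about whether a surface pixel belongs to the target area; a value of 0 bits means the pair determines membership exactly at these thresholds.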
2.1 Binary file grant.odt has changed
3.1 Binary file grant.pdf has changed
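Reviewer note: the text added in this changeset describes the match score of [4] as Jaccard similarity, the number of true pixels in the intersection of two boolean images divided by the number of pixels in their union. A minimal sketch of that definition is given below. Representing images as flat sequences of booleans, the function name, and the convention for two all-false images are assumptions of this example and are not taken from the cited paper.

```python
from typing import Sequence


def jaccard_similarity(image_a: Sequence[bool], image_b: Sequence[bool]) -> float:
    """Jaccard similarity of two boolean (thresholded) images of equal size,
    each given as a flat sequence of pixels."""
    if len(image_a) != len(image_b):
        raise ValueError("images must have the same number of pixels")
    # Pixels that are true in both images.
    intersection = sum(1 for a, b in zip(image_a, image_b) if a and b)
    # Pixels that are true in at least one image.
    union = sum(1 for a, b in zip(image_a, image_b) if a or b)
    # Assumed convention: two all-false images count as a perfect match.
    return 1.0 if union == 0 else intersection / union


target = [True, True, False, False]
candidate = [True, False, True, False]
score = jaccard_similarity(target, candidate)  # 1 shared pixel / 3 pixels in the union
```

The score ranges from 0 (no overlap) to 1 (identical masks), so it penalizes both missed target pixels and expression outside the target.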
4.1 --- a/grant.txt Wed Apr 15 03:20:19 2009 -0700 4.2 +++ b/grant.txt Wed Apr 15 13:57:53 2009 -0700 4.3 @@ -94,9 +94,9 @@ 4.4 \cite{venkataraman_emage_2008} todo 4.5 4.6 4.7 -\cite{hemert_matching_2008} todo 4.8 - 4.9 -In summary, none of the previous projects explores combinations of marker genes, and none of their publications compare the results obtained by using different algorithms or scoring methods. 4.10 +\cite{hemert_matching_2008} describes a technique to find combinations of marker genes to pick out an anatomical region. They use an evolutionary algorithm to evolve logical operators which combine boolean (thresholded) images in order to match a target image. Their match score is Jaccard similarity, which is equal to the number of true pixels in the intersection of the two images, divided by the number of pixels in their union. 4.11 + 4.12 +In summary, only one of the previous projects explores combinations of marker genes, and none of their publications compare the results obtained by using different algorithms or scoring methods. 4.13 4.14 4.15 4.16 @@ -144,7 +144,7 @@ 4.17 4.18 4.19 === Related work === 4.20 -We are aware of three existing efforts to cluster spatial gene expression data. 4.21 +We are aware of four existing efforts to cluster spatial gene expression data. 4.22 4.23 4.24 \cite{thompson_genomic_2008} describes an analysis of the anatomy of 4.25 @@ -159,6 +159,7 @@ 4.26 4.27 %% todo \cite{thompson_genomic_2008} reports that both mNNMF and hierarchial mNNMF clustering were useful, and that hierarchial recursive bifurcation gave similar results. 4.28 4.29 +In an interesting twist, \cite{hemert_matching_2008} applies their technique for finding combinations of marker genes for the purpose of clustering genes around a "seed gene". The way they do this is by using the pattern of expression of the seed gene as the target image, and then searching for other genes which can be combined to reproduce this pattern. 
Those other genes which are found are considered to be related to the seed. The same team also describes a method\cite{van_hemert_mining_2007} for finding "association rules" such as, "if this voxel is expressed in by any gene, then that voxel is probably also expressed in by the same gene". This could be useful as part of a procedure for clustering voxels. 4.30 4.31 4.32 AGEA's\cite{ng_anatomic_2009} hierarchial clustering differs from our Aim 2 in at least two ways. First, AGEA uses perhaps the simplest possible similarity score (correlation), and does no dimensionality reduction before calculating similarity. While it is possible that a more complex system will not do any better than this, we believe further exploration of alternative methods of scoring and dimensionality reduction is warranted. Second, AGEA did not look at clusters of genes; in Preliminary Data we have shown that clusters of genes may identify interesting spatial regions such as cortical areas. 4.33 @@ -166,7 +167,7 @@ 4.34 \cite{venkataraman_emage_2008} todo 4.35 4.36 4.37 -In summary, although these projects obtained hierarchial clusterings, there has not been much comparison between different algorithms or scoring methods, so it is likely that the best clustering method for this application has not yet been found. 4.38 +In summary, although these projects obtained clusterings, there has not been much comparison between different algorithms or scoring methods, so it is likely that the best clustering method for this application has not yet been found. 4.39 4.40 4.41 4.42 @@ -186,9 +187,9 @@ 4.43 4.44 Next, an automated nonlinear alignment procedure located the 2D data from the various slices in a single 3D coordinate system. In the final 3D coordinate system, voxels are cubes with 200 microns on a side. There are 67x41x58 \= 159,326 voxels in the 3D coordinate system, of which 51,533 are in the brain\cite{ng_anatomic_2009}. 
4.45 4.46 -Mus musculus, the common house mouse, is thought to contain about 22,000 protein-coding genes\cite{waterston_initial_2002}. The ABA contains data on about 20,000 genes in sagittal sections, out of which over 4,000 genes are also measured in coronal sections. Our dataset is derived from only the coronal subset of the ABA, because the sagittal data does not cover the entire cortex, and has greater registration error\cite{ng_anatomic_2009}. Genes were selected by the Allen Institute for coronal sectioning based on, "classes of known neuroscientific interest... or through post hoc identification of a marked non-ubiquitous expression pattern"\cite{ng_anatomic_2009}. 4.47 - 4.48 -The ABA is not the only large public spatial gene expression dataset. Other such resources include GENSAT\cite{gong_gene_2003}, GenePaint\cite{visel_genepaint.org:atlas_2004}, its sister project GeneAtlas\cite{carson_data_2005}, BGEM\cite{magdaleno_bgem:in_2006}, EMAGE\cite{?}, EurExpress (http://www.eurexpress.org/ee/; EurExpress data is also entered into EMAGE), todo. With the exception of the ABA, GenePaint, and EMAGE, most of these resources, have not (yet) extracted the expression intensity from the ISH images and registered the results into a single 3-D space, and only ABA and EMAGE make this form of data available for public download from the website. Many of these resources focus on developmental gene expression. 4.49 +Mus musculus, the common house mouse, is thought to contain about 22,000 protein-coding genes\cite{waterston_initial_2002}. The ABA contains data on about 20,000 genes in sagittal sections, out of which over 4,000 genes are also measured in coronal sections. Our dataset is derived from only the coronal subset of the ABA, because the sagittal data does not cover the entire cortex, and also has greater registration error\cite{ng_anatomic_2009}. Genes were selected by the Allen Institute for coronal sectioning based on, "classes of known neuroscientific interest... 
or through post hoc identification of a marked non-ubiquitous expression pattern"\cite{ng_anatomic_2009}. 4.50 + 4.51 +The ABA is not the only large public spatial gene expression dataset. Other such resources include GENSAT\cite{gong_gene_2003}, GenePaint\cite{visel_genepaint.org:atlas_2004}, its sister project GeneAtlas\cite{carson_data_2005}, BGEM\cite{magdaleno_bgem:in_2006}, EMAGE\cite{venkataraman_emage_2008}, EurExpress\footnote{http://www.eurexpress.org/ee/; EurExpress data is also entered into EMAGE}, todo. With the exception of the ABA, GenePaint, and EMAGE, most of these resources, have not (yet) extracted the expression intensity from the ISH images and registered the results into a single 3-D space, and only ABA and EMAGE make this form of data available for public download from the website. Many of these resources focus on developmental gene expression. 4.52 4.53 4.54 4.55 @@ -205,10 +206,12 @@ 4.56 4.57 === Related work === 4.58 4.59 -\cite{ng_anatomic_2009} describes the application of AGEA to the cortex. The paper describes interesting results on the structure of correlations between voxel gene expression profiles within a handful of cortical areas. However, this sort of analysis is not related to either of our aims, as it neither finds marker genes, nor does it suggest a cortical map based on gene expression data. Neither of the other components of AGEA can be applied to cortical areas; AGEA's Gene Finder cannot be used to find marker genes for most cortical areas; and AGEA's hierarchial clustering does not produce clusters corresponding to most cortical areas\footnote{In both cases, the root cause is that pairwise correlations between the gene expression of voxels in different areas but the same layer are often stronger than pairwise correlations between the gene expression of voxels in different layers but the same area. 
Therefore, a pairwise voxel correlation clustering algorithm will often create clusters representing cortical layers, not areas. This is why the hierarchial clustering does not find most cortical areas (there are clusters which presumably correspond to the intersection of a layer and an area, but since one area will have many layer-area intersection clusters, further work is needed to make sense of these). The reason that Gene Finder cannot find marker genes for most cortical areas is that in Gene Finder, although the user chooses a seed voxel, Gene Finder chooses the ROI for which genes will be found, and it creates that ROI by (pairwise voxel correlation) clustering around the seed.}. 4.60 - 4.61 - 4.62 -In summary, for all three aims, (a) none of the previous projects explores combinations of marker genes, (b) there has been almost no comparison of different algorithms or scoring methods, and (c) there has been no work on computationally finding marker genes for cortical areas, or on finding a hierarchial clustering that will yield a map of cortical areas de novo from gene expression data. 4.63 +\cite{ng_anatomic_2009} describes the application of AGEA to the cortex. The paper describes interesting results on the structure of correlations between voxel gene expression profiles within a handful of cortical areas. However, this sort of analysis is not related to either of our aims, as it neither finds marker genes, nor does it suggest a cortical map based on gene expression data. 
Neither of the other components of AGEA can be applied to cortical areas; AGEA's Gene Finder cannot be used to find marker genes for the cortical areas; and AGEA's hierarchial clustering does not produce clusters corresponding to the cortical areas\footnote{In both cases, the root cause is that pairwise correlations between the gene expression of voxels in different areas but the same layer are often stronger than pairwise correlations between the gene expression of voxels in different layers but the same area. Therefore, a pairwise voxel correlation clustering algorithm will tend to create clusters representing cortical layers, not areas. This is why the hierarchial clustering does not find most cortical areas (there are clusters which presumably correspond to the intersection of a layer and an area, but since one area will have many layer-area intersection clusters, further work is needed to make sense of these). The reason that Gene Finder cannot find marker genes for most cortical areas is that in Gene Finder, although the user chooses a seed voxel, Gene Finder chooses the ROI for which genes will be found, and it creates that ROI by (pairwise voxel correlation) clustering around the seed.}. 4.64 + 4.65 + 4.66 +%% Most of the projects which have been discussed have been done by the same groups that develop the public datasets. Although these projects make their algorithms available for use on their own website, none of them have released an open-source software toolkit; instead, users are restricted to using the provided algorithms only on their own dataset. 4.67 + 4.68 +In summary, for all three aims, (a) only one of the previous projects explores combinations of marker genes, (b) there has been almost no comparison of different algorithms or scoring methods, and (c) there has been no work on computationally finding marker genes for cortical areas, or on finding a hierarchial clustering that will yield a map of cortical areas de novo from gene expression data. 
4.69 4.70 Our project is guided by a concrete application with a well-specified criterion of success (how well we can find marker genes for \begin{latex}/\end{latex} reproduce the layout of cortical areas), which will provide a solid basis for comparing different methods. 4.71 4.72 @@ -255,7 +258,7 @@ 4.73 4.74 4.75 \vspace{0.3cm}**Correlation** 4.76 -Recall that the instances are surface pixels, and consider the problem of attempting to classify each instance as either a member of a particular anatomical area, or not. The target area can be represented as a binary mask over the surface pixels. 4.77 +Recall that the instances are surface pixels, and consider the problem of attempting to classify each instance as either a member of a particular anatomical area, or not. The target area can be represented as a boolean mask over the surface pixels. 4.78 4.79 One class of feature selection scoring method are those which calculate some sort of "match" between each gene image and the target image. Those genes which match the best are good candidates for features. 4.80 4.81 @@ -266,9 +269,9 @@ 4.82 \vspace{0.3cm}**Conditional entropy** 4.83 An information-theoretic scoring method is to find features such that, if the features (gene expression levels) are known, uncertainty about the target (the regional identity) is reduced. Entropy measures uncertainty, so what we want is to find features such that the conditional distribution of the target has minimal entropy. The distribution to which we are referring is the probability distribution over the population of surface pixels. 4.84 4.85 -The simplest way to use information theory is on discrete data, so we discretized our gene expression data by creating, for each gene, five thresholded binary masks of the gene data. 
For each gene, we created a binary mask of its expression levels using each of these thresholds: the mean of that gene, the mean minus one standard deviation, the mean minus two standard deviations, the mean plus one standard deviation, the mean plus two standard deviations. 4.86 - 4.87 -Now, for each region, we created and ran a forward stepwise procedure which attempted to find pairs of gene expression binary masks such that the conditional entropy of the target area's binary mask, conditioned upon the pair of gene expression binary masks, is minimized. 4.88 +The simplest way to use information theory is on discrete data, so we discretized our gene expression data by creating, for each gene, five thresholded boolean masks of the gene data. For each gene, we created a boolean mask of its expression levels using each of these thresholds: the mean of that gene, the mean minus one standard deviation, the mean minus two standard deviations, the mean plus one standard deviation, the mean plus two standard deviations. 4.89 + 4.90 +Now, for each region, we created and ran a forward stepwise procedure which attempted to find pairs of gene expression boolean masks such that the conditional entropy of the target area's boolean mask, conditioned upon the pair of gene expression boolean masks, is minimized. 4.91 4.92 This finds pairs of genes which are most informative (at least at these discretization thresholds) relative to the question, "Is this surface pixel a member of the target area?". 4.93
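The conditional entropy scoring described in the hunk above can be sketched as follows. This is a minimal illustration, not the project's implementation. The function names (`threshold_masks`, `conditional_entropy`, `forward_stepwise_pair`) and the toy expression values are hypothetical. The strict `>` comparison, the use of the population standard deviation, and the greedy form of the forward stepwise search are assumptions, since the text does not specify them.

```python
import math
from collections import Counter
from statistics import mean, pstdev


def threshold_masks(expr):
    """Five boolean masks of one gene's per-pixel expression levels,
    thresholded at mean + k*sd for k in -2..2 (the thresholds named in the text).
    Assumption: strict '>' and population standard deviation."""
    m, sd = mean(expr), pstdev(expr)
    return {k: tuple(x > m + k * sd for x in expr) for k in (-2, -1, 0, 1, 2)}


def conditional_entropy(target, masks):
    """H(target | masks) in bits, over the population of surface pixels.
    `masks` must contain at least one boolean mask."""
    n = len(target)
    joint = Counter(zip(target, *masks))  # counts of (target, mask values...)
    cond = Counter(zip(*masks))           # counts of (mask values...)
    h = 0.0
    for key, count in joint.items():
        h -= (count / n) * math.log2(count / cond[key[1:]])
    return h


def forward_stepwise_pair(target, candidates):
    """Greedy forward selection (an assumed variant): pick the single mask with
    the lowest conditional entropy, then the mask that lowers it most given the first."""
    chosen = []
    for _ in range(2):
        best = min(
            (name for name in candidates if name not in chosen),
            key=lambda name: conditional_entropy(
                target, [candidates[c] for c in chosen] + [candidates[name]]
            ),
        )
        chosen.append(best)
    return chosen, conditional_entropy(target, [candidates[c] for c in chosen])


# Toy data (hypothetical): 8 surface pixels; the target area is where both
# gene_a and gene_b are high. gene_c is uninformative about the target.
expression = {
    "gene_a": [9, 9, 9, 9, 1, 1, 1, 1],
    "gene_b": [9, 9, 1, 1, 9, 9, 1, 1],
    "gene_c": [5, 4, 5, 4, 5, 4, 5, 4],
}
target = (True, True, False, False, False, False, False, False)

# One candidate feature per (gene, threshold) combination.
candidates = {
    (gene, k): mask
    for gene, expr in expression.items()
    for k, mask in threshold_masks(expr).items()
}

pair, entropy = forward_stepwise_pair(target, candidates)
print(pair, entropy)
```

In this toy case neither gene alone determines the target, but the selected pair does, which is the situation the pairwise search is meant to detect. A greedy search can miss pairs whose members are individually uninformative. An exhaustive search over all pairs would avoid that at a higher computational cost.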