cg
changeset 51:3ebb8f4ea921
author | bshanks@bshanks.dyndns.org |
---|---|
date | Fri Apr 17 12:47:51 2009 -0700 (16 years ago) |
parents | 0669519bc685 |
children | 074e2be60b38 |
files | grant.html grant.odt grant.pdf grant.txt |
1.1 --- a/grant.html Thu Apr 16 14:50:46 2009 -0700
1.2 +++ b/grant.html Fri Apr 17 12:47:51 2009 -0700
1.3 @@ -90,13 +90,12 @@
1.4 reconceptualizing the problem domain, and is not merely a mechanical “fine-tuning” of numerical parameters. For example,
1.5 we believe that domain-specific scoring measures (such as gradient similarity, which is discussed in Preliminary Work) may
1.6 be necessary in order to achieve the best results in this application.
1.7 -We are aware of four existing efforts to find marker genes using spatial gene expression data using automated methods.
1.8 -[1 ] describes GeneAtlas. GeneAtlas allows the user to construct a search query by freely demarcating one or two 2-D
1.9 -regions on sagittal slices, and then to specify either the strength of expression or the name of another gene whose expression
1.10 -pattern is to be matched. GeneAtlas differs from our Aim 1 in at least two ways. First, GeneAtlas finds only single genes,
1.11 -whereas we will also look for combinations of genes3. Second, at least for the custom spatial search, Gene Atlas appears to
1.12 -use a simple pointwise scoring method (strength of expression), whereas we will also use geometric metrics such as gradient
1.13 -similarity.
1.14 +We are aware of five existing efforts to find marker genes using spatial gene expression data using automated methods.
1.15 +GeneAtlas[1] and EMAGE [11] allow the user to construct a search query by demarcating regions and then specifying
1.16 +either the strength of expression or the name of another gene or dataset whose expression pattern is to be matched. For
1.17 +the similarity score (match score), GeneAtlas appears to use strength of expression, and EMAGE uses Jaccard similarity,
1.18 +which is equal to the number of true pixels in the intersection of the two images, divided by the number of pixels in their
1.19 +union. Neither GeneAtlas nor EMAGE allows one to search for combinations of genes that together match a region.
1.20 [6 ] describes AGEA, “Anatomic Gene Expression Atlas”. AGEA has three components:
1.21 * Gene Finder: The user selects a seed voxel and the system (1) chooses a cluster which includes the seed voxel, (2)
1.22 yields a list of genes which are overexpressed in that cluster. (note: the ABA website also contains pre-prepared lists of
1.23 @@ -107,15 +106,18 @@
1.24 with correlation as the similarity metric.
1.25 Gene Finder is different from our Aim 1 in at least three ways. First, Gene Finder finds only single genes, whereas we
1.26 will also look for combinations of genes. Second, Gene Finder can only use overexpression as a marker, whereas we will also
1.27 -search for underexpression. Third, Gene Finder uses a simple pointwise score4, whereas we will also use geometric scores
1.28 +search for underexpression. Third, Gene Finder uses a simple pointwise score3, whereas we will also use geometric scores
1.29 such as gradient similarity. The Preliminary Data section contains evidence that each of our three choices is the right one.
1.30 -[11 ] todo
1.31 +[? ] looks at the mean expression level of genes within anatomical regions, and applies a Student’s t-test with Bonferroni
1.32 +correction to determine whether the mean expression level of a gene is significantly higher in the target region. Like AGEA,
1.33 +this is a pointwise measure (only the mean expression level per pixel is being analyzed), it is not being used to look for
1.34 +underexpression, and does not look for combinations of genes.
1.35 [4 ] describes a technique to find combinations of marker genes to pick out an anatomical region. They use an evolutionary
1.36 algorithm to evolve logical operators which combine boolean (thresholded) images in order to match a target image. Their
1.37 -match score is Jaccard similarity, which is equal to the number of true pixels in the intersection of the two images, divided
1.38 -by the number of pixels in their union.
1.39 -In summary, only one of the previous projects explores combinations of marker genes, and none of their publications
1.40 -compare the results obtained by using different algorithms or scoring methods.
1.41 +match score is Jaccard similarity.
1.42 +In summary, there has been fruitful work on finding marker genes; however, only one of the previous projects explores
1.43 +combinations of marker genes, and none of these publications compare the results obtained by using different algorithms or
1.44 +scoring methods.
1.45 Aim 2
1.46 Machine learning terminology: clustering
1.47 If one is given a dataset consisting merely of instances, with no class labels, then analysis of the dataset is referred to as
1.48 @@ -123,9 +125,7 @@
1.49 _________________________________________
1.50 2By “fundamentally spatial” we mean that there is information from a large number of spatial locations indexed by spatial coordinates; not
1.51 just data which has only a few different locations or which is indexed by anatomical label.
1.52 - 3See Preliminary Data for an example of an area which cannot be marked by any single gene in the dataset, but which can be marked by a
1.53 -combination.
1.54 - 4“Expression energy ratio”, which captures overexpression.
1.55 + 3“Expression energy ratio”, which captures overexpression.
1.56 together. A set of similar instances is called a cluster, and the activity of grouping the data into clusters is called
1.57 clustering or cluster analysis.
1.58 The task of deciding how to carve up a structure into anatomical regions can be put into these terms. The instances are
1.59 @@ -156,7 +156,8 @@
1.60 sharp object boundaries in a visual scene but which are not appropriate for segmenting abstract spatial data such as gene
1.61 expression. Although many image segmentation algorithms can be expected to work well for segmenting other sorts of
1.62 spatially arranged data, some of these algorithms are specialized for visual images.
1.63 -Dimensionality reduction
1.64 +Dimensionality reduction In this section, we discuss reducing the length of the per-pixel gene expression feature
1.65 +vector. By “dimension”, we mean the dimension of this vector, not the spatial dimension of the underlying data.
1.66 Unlike aim 1, there is no externally-imposed need to select only a handful of informative genes for inclusion in the
1.67 instances. However, some clustering algorithms perform better on small numbers of features. There are techniques which
1.68 “summarize” a larger number of features using a smaller number of features; these techniques go by the name of feature
1.69 @@ -165,49 +166,51 @@
1.70 the reduced feature set rather than the original feature set of all gene expression levels. Note that the features in the reduced
1.71 feature set do not necessarily correspond to genes; each feature in the reduced set may be any function of the set of gene
1.72 expression levels.
1.73 -Another use for dimensionality reduction is to visualize the relationships between regions. For example, one might want
1.74 -to make a 2-D plot upon which each region is represented by a single point, and with the property that regions with similar
1.75 -gene expression profiles should be nearby on the plot (that is, the property that distance between pairs of points in the plot
1.76 -should be proportional to some measure of dissimilarity in gene expression). It is likely that no arrangement of the points on
1.77 -a 2-D plan will exactly satisfy this property – however, dimensionality reduction techniques allow one to find arrangements
1.78 -of points that approximately satisfy that property. Note that in this application, dimensionality reduction is being applied
1.79 -after clustering; whereas in the previous paragraph, we were talking about using dimensionality reduction before clustering.
1.80 +Dimensionality reduction before clustering is useful on large datasets. First, because the number of features in the
1.81 +reduced data set is less than in the original data set, the running time of clustering algorithms may be much less. Second,
1.82 +it is thought that some clustering algorithms may give better results on reduced data.
1.83 +Another use for dimensionality reduction is to visualize the relationships between regions after clustering. For example,
1.84 +one might want to make a 2-D plot upon which each region is represented by a single point, and with the property that regions
1.85 +with similar gene expression profiles should be nearby on the plot (that is, the property that distance between pairs of points
1.86 +in the plot should be proportional to some measure of dissimilarity in gene expression). It is likely that no arrangement of
1.87 +the points on a 2-D plane will exactly satisfy this property – however, dimensionality reduction techniques allow one to find
1.88 +arrangements of points that approximately satisfy that property. Note that in this application, dimensionality reduction
1.89 +is being applied after clustering; whereas in the previous paragraph, we were talking about using dimensionality reduction
1.90 +before clustering.
1.91 Clustering genes rather than voxels
1.92 Although the ultimate goal is to cluster the instances (voxels or pixels), one strategy to achieve this goal is to first cluster
1.93 the features (genes). There are two ways that clusters of genes could be used.
1.94 Gene clusters could be used as part of dimensionality reduction: rather than have one feature for each gene, we could
1.95 have one reduced feature for each gene cluster.
1.96 Gene clusters could also be used to directly yield a clustering on instances. This is because many genes have an expression
1.97 -pattern which seems to pick out a single, spatially continguous region. Therefore, it seems likely that an anatomically
1.98 -interesting region will have multiple genes which each individually pick it out5. This suggests the following procedure:
1.99 -_________________________________________
1.100 - 5This would seem to contradict our finding in aim 1 that some cortical areas are combinatorially coded by multiple genes. However, it is
1.101 -possible that the currently accepted cortical maps divide the cortex into regions which are unnatural from the point of view of gene expression;
1.102 -perhaps there is some other way to map the cortex for which each region can be identified by single genes. Another possibility is that, although
1.103 +pattern which seems to pick out a single, spatially contiguous region. Therefore, it seems likely that an anatomically
1.104 +interesting region will have multiple genes which each individually pick it out4. This suggests the following procedure:
1.105 cluster together genes which pick out similar regions, and then use the more popular common regions as the final clusters.
1.106 In the Preliminary Data we show that a number of anatomically recognized cortical regions, as well as some “superregions”
1.107 formed by lumping together a few regions, are associated with gene clusters in this fashion.
1.108 +The task of clustering both the instances and the features is called co-clustering, and there are a number of co-clustering
1.109 +algorithms.
1.110 Related work
1.111 -We are aware of four existing efforts to cluster spatial gene expression data.
1.112 +We are aware of five existing efforts to cluster spatial gene expression data.
1.113 [9 ] describes an analysis of the anatomy of the hippocampus using the ABA dataset. In addition to manual analysis,
1.114 two clustering methods were employed, a modified Non-negative Matrix Factorization (NNMF), and a hierarchical recursive
1.115 bifurcation clustering scheme based on correlation as the similarity score. The paper yielded impressive results, proving
1.116 -the usefulness of computational genomic anatomy. We have run NNMF on the cortical dataset6 and while the results are
1.117 +the usefulness of computational genomic anatomy. We have run NNMF on the cortical dataset5 and while the results are
1.118 promising (see Preliminary Data), we think that it will be possible to find an even better method.
1.119 +AGEA’s[6] hierarchical clustering was described above. EMAGE[11] allows the user to select a dataset from among a
1.120 +large number of alternatives, or by running a search query, and then to cluster the genes within that dataset. Clustering is
1.121 +hierarchical complete linkage clustering with un-centred correlation as the similarity score.
1.122 +todo [?]
1.123 In an interesting twist, [4] applies their technique for finding combinations of marker genes for the purpose of clustering
1.124 genes around a “seed gene”. The way they do this is by using the pattern of expression of the seed gene as the target image,
1.125 and then searching for other genes which can be combined to reproduce this pattern. Those other genes which are found
1.126 are considered to be related to the seed. The same team also describes a method[10] for finding “association rules” such as,
1.127 “if this voxel is expressed in by any gene, then that voxel is probably also expressed in by the same gene”. This could be
1.128 useful as part of a procedure for clustering voxels.
1.129 -AGEA’s[6] hierarchial clustering differs from our Aim 2 in at least two ways. First, AGEA uses perhaps the simplest
1.130 -possible similarity score (correlation), and does no dimensionality reduction before calculating similarity. While it is possible
1.131 -that a more complex system will not do any better than this, we believe further exploration of alternative methods of scoring
1.132 -and dimensionality reduction is warranted. Second, AGEA did not look at clusters of genes; in Preliminary Data we have
1.133 -shown that clusters of genes may identify interesting spatial regions such as cortical areas.
1.134 -[11 ] todo
1.135 In summary, although these projects obtained clusterings, there has not been much comparison between different algo-
1.136 -rithms or scoring methods, so it is likely that the best clustering method for this application has not yet been found.
1.137 +rithms or scoring methods, so it is likely that the best clustering method for this application has not yet been found. Also,
1.138 +none of these projects did a separate dimensionality reduction step before clustering pixels, or tried to cluster genes first in
1.139 +order to guide the clustering of pixels into spatial regions, or used co-clustering algorithms.
1.140 Aim 3
1.141 Background
1.142 The cortex is divided into areas and layers. To a first approximation, the parcellation of the cortex into areas can
1.143 @@ -227,6 +230,14 @@
1.144 The Allen Mouse Brain Atlas (ABA) data was produced by doing in-situ hybridization on slices of male, 56-day-old
1.145 C57BL/6J mouse brains. Pictures were taken of the processed slice, and these pictures were semi-automatically analyzed
1.146 in order to create a digital measurement of gene expression levels at each location in each slice. Per slice, cellular spatial
1.147 +_________________________________________
1.148 + 4This would seem to contradict our finding in aim 1 that some cortical areas are combinatorially coded by multiple genes. However, it is
1.149 +possible that the currently accepted cortical maps divide the cortex into regions which are unnatural from the point of view of gene expression;
1.150 +perhaps there is some other way to map the cortex for which each region can be identified by single genes. Another possibility is that, although
1.151 +the cluster prototype fits an anatomical region, the individual genes are each somewhat different from the prototype.
1.152 + 5We ran “vanilla” NNMF, whereas the paper under discussion used a modified method. Their main modification consisted of adding a soft
1.153 +spatial contiguity constraint. However, on our dataset, NNMF naturally produced spatially contiguous clusters, so no additional constraint was
1.154 +needed. The paper under discussion also mentions that they tried a hierarchical variant of NNMF, which we have not yet tried.
1.155 resolution is achieved. Using this method, a single physical slice can only be used to measure one single gene; many different
1.156 mouse brains were needed in order to measure the expression of many genes.
1.157 Next, an automated nonlinear alignment procedure located the 2D data from the various slices in a single 3D coordinate
1.158 @@ -235,19 +246,14 @@
1.159 Mus musculus, the common house mouse, is thought to contain about 22,000 protein-coding genes[13]. The ABA contains
1.160 data on about 20,000 genes in sagittal sections, out of which over 4,000 genes are also measured in coronal sections. Our
1.161 dataset is derived from only the coronal subset of the ABA, because the sagittal data does not cover the entire cortex, and
1.162 -_________________________________________
1.163 -the cluster prototype fits an anatomical region, the individual genes are each somewhat different from the prototype.
1.164 - 6We ran “vanilla” NNMF, whereas the paper under discussion used a modified method. Their main modification consisted of adding a soft
1.165 -spatial contiguity constraint. However, on our dataset, NNMF naturally produced spatially contiguous clusters, so no additional constraint was
1.166 -needed. The paper under discussion also mentions that they tried a hierarchial variant of NNMF, which we have not yet tried.
1.167 also has greater registration error[6]. Genes were selected by the Allen Institute for coronal sectioning based on, “classes of
1.168 known neuroscientific interest... or through post hoc identification of a marked non-ubiquitous expression pattern”[6].
1.169 -TheABA is not the only large public spatial gene expression dataset. Other such resources include GENSAT[3],
1.170 -GenePaint[12], its sister project GeneAtlas[1], BGEM[5], EMAGE[11], EurExpress7, EADHB8, MAMEP9, Xenbase10,
1.171 -ZFIN[? ], Aniseed11, VisiGene12, GEISHA[?], Fruitfly.org[?], COMPARE[?] todo. With the exception of the ABA, Gene-
1.172 -Paint, and EMAGE, most of these resources, have not (yet) extracted the expression intensity from the ISH images and
1.173 -registered the results into a single 3-D space, and only ABA and EMAGE make this form of data available for public
1.174 -download from the website. Many of these resources focus on developmental gene expression.
1.175 +The ABA is not the only large public spatial gene expression dataset. Other such resources include GENSAT[3],
1.176 +GenePaint[12], its sister project GeneAtlas[1], BGEM[5], EMAGE[11], EurExpress6, EADHB7, MAMEP8, Xenbase9, ZFIN[?],
1.177 +Aniseed10, VisiGene11, GEISHA[?], Fruitfly.org[?], COMPARE[?] todo. With the exception of the ABA, GenePaint, and
1.178 +EMAGE, most of these resources have not (yet) extracted the expression intensity from the ISH images and registered the
1.179 +results into a single 3-D space, and only ABA and EMAGE make this form of data available for public download from the
1.180 +website12. Many of these resources focus on developmental gene expression.
1.181 Significance
1.182 The method developed in aim (1) will be applied to each cortical area to find a set of marker genes such that the
1.183 combinatorial expression pattern of those genes uniquely picks out the target area. Finding marker genes will be useful for
1.184 @@ -276,15 +282,14 @@
1.185 been almost no comparison of different algorithms or scoring methods, and (c) there has been no work on computationally
1.186 finding marker genes for cortical areas, or on finding a hierarchical clustering that will yield a map of cortical areas de novo
1.187 from gene expression data.
1.188 -Our project is guided by a concrete application with a well-specified criterion of success (how well we can find marker
1.189 -genes for / reproduce the layout of cortical areas), which will provide a solid basis for comparing different methods.
1.190 -_________________________________________
1.191 - 7http://www.eurexpress.org/ee/; EurExpress data is also entered into EMAGE
1.192 - 8http://www.ncl.ac.uk/ihg/EADHB/database/EADHB_database.html
1.193 - 9http://mamep.molgen.mpg.de/index.php
1.194 - 10http://xenbase.org/
1.195 - 11http://aniseed-ibdm.univ-mrs.fr/
1.196 - 12http://genome.ucsc.edu/cgi-bin/hgVisiGene ; includes data from some the other listed data sources
1.197 +___________________
1.198 + 6http://www.eurexpress.org/ee/; EurExpress data is also entered into EMAGE
1.199 + 7http://www.ncl.ac.uk/ihg/EADHB/database/EADHB_database.html
1.200 + 8http://mamep.molgen.mpg.de/index.php
1.201 + 9http://xenbase.org/
1.202 + 10http://aniseed-ibdm.univ-mrs.fr/
1.203 + 11http://genome.ucsc.edu/cgi-bin/hgVisiGene ; includes data from some of the other listed data sources
1.204 + 12without prior offline registration
1.205 13In both cases, the root cause is that pairwise correlations between the gene expression of voxels in different areas but the same layer are
1.206 often stronger than pairwise correlations between the gene expression of voxels in different layers but the same area. Therefore, a pairwise voxel
1.207 correlation clustering algorithm will tend to create clusters representing cortical layers, not areas. This is why the hierarchical clustering does not
1.208 @@ -292,6 +297,8 @@
1.209 many layer-area intersection clusters, further work is needed to make sense of these). The reason that Gene Finder cannot find marker genes for
1.210 most cortical areas is that in Gene Finder, although the user chooses a seed voxel, Gene Finder chooses the ROI for which genes will be found,
1.211 and it creates that ROI by (pairwise voxel correlation) clustering around the seed.
1.212 +Our project is guided by a concrete application with a well-specified criterion of success (how well we can find marker
1.213 +genes for / reproduce the layout of cortical areas), which will provide a solid basis for comparing different methods.
1.214 Preliminary work
1.215 Format conversion between SEV, MATLAB, NIFTI
1.216 We have created software to (politely) download all of the SEV files from the Allen Institute website. We have also created
1.217 @@ -458,6 +465,7 @@
1.218 genetic level. We will develop extensions to our procedure which (a) detect when a difficult area could be fit if its
1.219 boundary were redrawn slightly, and (b) detect when a difficult area could be combined with adjacent areas to create
1.220 a larger area which can be fit.
1.221 +# Linear discriminant analysis
1.222 Apply these algorithms to the cortex
1.223 1.Create open source format conversion tools: we will create tools to bulk download the ABA dataset and to convert
1.224 between SEV, NIFTI and MATLAB formats.
1.225 @@ -472,6 +480,9 @@
1.226 4.Explore clustering algorithms applied to genes: including gene shaving, TODO
1.227 5.Develop an algorithm to use dimensionality reduction and/or hierarchical clustering to create anatomical maps
1.228 6.Run this algorithm on the cortex: present a hierarchical, genoarchitectonic map of the cortex
1.229 +# Linear discriminant analysis
1.230 +# jbt, coclustering
1.231 +# self-organizing map
1.232 Bibliography & References Cited
1.233 [1]J. Carson, T. Ju, C. Thaller, M. Bello, I. Kakadiaris, J. Warren, G. Eichele, and W. Chiu. Data mining in situ gene
1.234 expression patterns at cellular resolution. In Computational Systems Bioinformatics Conference, 2005. Workshops and
2.1 Binary file grant.odt has changed
3.1 Binary file grant.pdf has changed
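The revised Related Work text in this changeset defines Jaccard similarity as the number of true pixels in the intersection of two boolean images divided by the number of pixels in their union. The following is a minimal sketch of that definition only, not code from the grant, EMAGE, or [4]. The names `jaccard_similarity` and `threshold`, the nested-list image representation, the cutoff value used in the usage note, and the convention that two all-false images score 1.0 are all assumptions made for illustration.

```python
from typing import List, Sequence

Mask = Sequence[Sequence[bool]]


def threshold(image: Sequence[Sequence[float]], cutoff: float) -> List[List[bool]]:
    """Turn an expression image into a boolean image (hypothetical helper)."""
    return [[value >= cutoff for value in row] for row in image]


def jaccard_similarity(a: Mask, b: Mask) -> float:
    """Return |A and B| / |A or B| for two boolean images of the same shape."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("images must have the same shape")

    intersection = 0
    union = 0
    for row_a, row_b in zip(a, b):
        for pixel_a, pixel_b in zip(row_a, row_b):
            if pixel_a and pixel_b:
                intersection += 1
            if pixel_a or pixel_b:
                union += 1

    # Assumption: two images with no true pixels are treated as a perfect match.
    return intersection / union if union else 1.0
```

As a usage example, thresholding the image `[[0.9, 0.8, 0.1], [0.7, 0.2, 0.0]]` at an assumed cutoff of 0.5 and comparing it with the target `[[True, True, False], [False, False, False]]` gives an intersection of 2 pixels and a union of 3 pixels, so the score is 2/3.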
4.1 --- a/grant.txt Thu Apr 16 14:50:46 2009 -0700
4.2 +++ b/grant.txt Fri Apr 17 12:47:51 2009 -0700
4.3 @@ -71,9 +71,11 @@
4.4
4.5 As noted above, there has been much work on both supervised learning and there are many available algorithms for each. However, the algorithms require the scientist to provide a framework for representing the problem domain, and the way that this framework is set up has a large impact on performance. Creating a good framework can require creatively reconceptualizing the problem domain, and is not merely a mechanical "fine-tuning" of numerical parameters. For example, we believe that domain-specific scoring measures (such as gradient similarity, which is discussed in Preliminary Work) may be necessary in order to achieve the best results in this application.
4.6
4.7 -We are aware of four existing efforts to find marker genes using spatial gene expression data using automated methods.
4.8 -
4.9 -\cite{carson_data_2005} describes GeneAtlas. GeneAtlas allows the user to construct a search query by freely demarcating one or two 2-D regions on sagittal slices, and then to specify either the strength of expression or the name of another gene whose expression pattern is to be matched. GeneAtlas differs from our Aim 1 in at least two ways. First, GeneAtlas finds only single genes, whereas we will also look for combinations of genes\footnote{See Preliminary Data for an example of an area which cannot be marked by any single gene in the dataset, but which can be marked by a combination.}. Second, at least for the custom spatial search, Gene Atlas appears to use a simple pointwise scoring method (strength of expression), whereas we will also use geometric metrics such as gradient similarity.
4.10 +We are aware of five existing efforts to find marker genes using spatial gene expression data using automated methods.
4.11 +
4.12 +%%GeneAtlas\cite{carson_data_2005} allows the user to construct a search query by freely demarcating one or two 2-D regions on sagittal slices, and then to specify either the strength of expression or the name of another gene whose expression pattern is to be matched.
4.13 +
4.14 +GeneAtlas\cite{carson_data_2005} and EMAGE \cite{venkataraman_emage_2008} allow the user to construct a search query by demarcating regions and then specifying either the strength of expression or the name of another gene or dataset whose expression pattern is to be matched. For the similarity score (match score), GeneAtlas appears to use strength of expression, and EMAGE uses Jaccard similarity, which is equal to the number of true pixels in the intersection of the two images, divided by the number of pixels in their union. Neither GeneAtlas nor EMAGE allows one to search for combinations of genes that together match a region.
4.15
4.16 \cite{ng_anatomic_2009} describes AGEA, "Anatomic Gene Expression
4.17 Atlas". AGEA has three
4.18 @@ -91,14 +93,11 @@
4.19
4.20 Gene Finder is different from our Aim 1 in at least three ways. First, Gene Finder finds only single genes, whereas we will also look for combinations of genes. Second, Gene Finder can only use overexpression as a marker, whereas we will also search for underexpression. Third, Gene Finder uses a simple pointwise score\footnote{"Expression energy ratio", which captures overexpression.}, whereas we will also use geometric scores such as gradient similarity. The Preliminary Data section contains evidence that each of our three choices is the right one.
4.21
4.22 -\cite{venkataraman_emage_2008} todo
4.23 -
4.24 -
4.25 -\cite{chin_genome-scale_2007} uses a Student's t-test with Bonferroni correction to determine whether a gene is overexpressed in a specific anatomical region.
4.26 -
4.27 -\cite{hemert_matching_2008} describes a technique to find combinations of marker genes to pick out an anatomical region. They use an evolutionary algorithm to evolve logical operators which combine boolean (thresholded) images in order to match a target image. Their match score is Jaccard similarity, which is equal to the number of true pixels in the intersection of the two images, divided by the number of pixels in their union.
4.28 -
4.29 -In summary, only one of the previous projects explores combinations of marker genes, and none of their publications compare the results obtained by using different algorithms or scoring methods.
4.30 +\cite{chin_genome-scale_2007} looks at the mean expression level of genes within anatomical regions, and applies a Student's t-test with Bonferroni correction to determine whether the mean expression level of a gene is significantly higher in the target region. Like AGEA, this is a pointwise measure (only the mean expression level per pixel is being analyzed), it is not being used to look for underexpression, and does not look for combinations of genes.
4.31 +
4.32 +\cite{hemert_matching_2008} describes a technique to find combinations of marker genes to pick out an anatomical region. They use an evolutionary algorithm to evolve logical operators which combine boolean (thresholded) images in order to match a target image. Their match score is Jaccard similarity.
4.33 +
4.34 +In summary, there has been fruitful work on finding marker genes; however, only one of the previous projects explores combinations of marker genes, and none of these publications compare the results obtained by using different algorithms or scoring methods.
4.35
4.36
4.37
4.38 @@ -128,11 +127,13 @@
4.39
4.40
4.41 \vspace{0.3cm}**Dimensionality reduction**
4.42 -
4.43 +In this section, we discuss reducing the length of the per-pixel gene expression feature vector. By "dimension", we mean the dimension of this vector, not the spatial dimension of the underlying data.
4.44
4.45 Unlike aim 1, there is no externally-imposed need to select only a handful of informative genes for inclusion in the instances. However, some clustering algorithms perform better on small numbers of features. There are techniques which "summarize" a larger number of features using a smaller number of features; these techniques go by the name of feature extraction or dimensionality reduction. The small set of features that such a technique yields is called the __reduced feature set__. After the reduced feature set is created, the instances may be replaced by __reduced instances__, which have as their features the reduced feature set rather than the original feature set of all gene expression levels. Note that the features in the reduced feature set do not necessarily correspond to genes; each feature in the reduced set may be any function of the set of gene expression levels.
4.46
4.47 -Another use for dimensionality reduction is to visualize the relationships between regions. For example, one might want to make a 2-D plot upon which each region is represented by a single point, and with the property that regions with similar gene expression profiles should be nearby on the plot (that is, the property that distance between pairs of points in the plot should be proportional to some measure of dissimilarity in gene expression). It is likely that no arrangement of the points on a 2-D plan will exactly satisfy this property -- however, dimensionality reduction techniques allow one to find arrangements of points that approximately satisfy that property. Note that in this application, dimensionality reduction is being applied after clustering; whereas in the previous paragraph, we were talking about using dimensionality reduction before clustering.
4.48 +Dimensionality reduction before clustering is useful on large datasets. First, because the number of features in the reduced data set is less than in the original data set, the running time of clustering algorithms may be much less. Second, it is thought that some clustering algorithms may give better results on reduced data.
4.49 +
4.50 +Another use for dimensionality reduction is to visualize the relationships between regions after clustering. For example, one might want to make a 2-D plot upon which each region is represented by a single point, and with the property that regions with similar gene expression profiles should be nearby on the plot (that is, the property that distance between pairs of points in the plot should be proportional to some measure of dissimilarity in gene expression). It is likely that no arrangement of the points on a 2-D plane will exactly satisfy this property -- however, dimensionality reduction techniques allow one to find arrangements of points that approximately satisfy that property. Note that in this application, dimensionality reduction is being applied after clustering; whereas in the previous paragraph, we were talking about using dimensionality reduction before clustering.
4.51
4.52
4.53 \vspace{0.3cm}**Clustering genes rather than voxels**
4.54 @@ -144,10 +145,10 @@
4.55
4.56 Gene clusters could also be used to directly yield a clustering on instances. This is because many genes have an expression pattern which seems to pick out a single, spatially contiguous region. Therefore, it seems likely that an anatomically interesting region will have multiple genes which each individually pick it out\footnote{This would seem to contradict our finding in aim 1 that some cortical areas are combinatorially coded by multiple genes. However, it is possible that the currently accepted cortical maps divide the cortex into regions which are unnatural from the point of view of gene expression; perhaps there is some other way to map the cortex for which each region can be identified by single genes. Another possibility is that, although the cluster prototype fits an anatomical region, the individual genes are each somewhat different from the prototype.}. This suggests the following procedure: cluster together genes which pick out similar regions, and then use the more popular common regions as the final clusters. In the Preliminary Data we show that a number of anatomically recognized cortical regions, as well as some "superregions" formed by lumping together a few regions, are associated with gene clusters in this fashion.
4.57
4.58 +The task of clustering both the instances and the features is called co-clustering, and there are a number of co-clustering algorithms.
4.59
4.60 === Related work ===
4.61 -We are aware of four existing efforts to cluster spatial gene expression data.
4.62 -
4.63 +We are aware of five existing efforts to cluster spatial gene expression data.
4.64
4.65 \cite{thompson_genomic_2008} describes an analysis of the anatomy of
4.66 the hippocampus using the ABA dataset. In addition to manual analysis,
4.67 @@ -156,20 +157,16 @@
4.68
4.69 %% In addition, this paper described a visual screening of the data, specifically, a visual analysis of 6000 genes with the primary purpose of observing how the spatial pattern of their expression coincided with the regions that had been identified by NNMF. We propose to do this sort of screening automatically, which would yield an objective, quantifiable result, rather than qualitative observations.
4.70
4.71 -
4.72 -
4.73 -
4.74 -%% todo \cite{thompson_genomic_2008} reports that both mNNMF and hierarchical mNNMF clustering were useful, and that hierarchical recursive bifurcation gave similar results.
4.75 +%% \cite{thompson_genomic_2008} reports that both mNNMF and hierarchical mNNMF clustering were useful, and that hierarchical recursive bifurcation gave similar results.
4.76 +
4.77 +
4.78 +AGEA's\cite{ng_anatomic_2009} hierarchical clustering was described above. EMAGE\cite{venkataraman_emage_2008} allows the user to select a dataset, either from among a large number of alternatives or by running a search query, and then to cluster the genes within that dataset. EMAGE's clustering is hierarchical complete-linkage clustering with un-centred correlation as the similarity score.
4.79 +
4.80 +todo \cite{chin_genome-scale_2007}
4.81
4.82 In an interesting twist, \cite{hemert_matching_2008} applies their technique for finding combinations of marker genes for the purpose of clustering genes around a "seed gene". The way they do this is by using the pattern of expression of the seed gene as the target image, and then searching for other genes which can be combined to reproduce this pattern. Those other genes which are found are considered to be related to the seed. The same team also describes a method\cite{van_hemert_mining_2007} for finding "association rules" such as, "if this voxel is expressed in by any gene, then that voxel is probably also expressed in by the same gene". This could be useful as part of a procedure for clustering voxels.
4.83
4.84 -
4.85 -AGEA's\cite{ng_anatomic_2009} hierarchical clustering differs from our Aim 2 in at least two ways. First, AGEA uses perhaps the simplest possible similarity score (correlation), and does no dimensionality reduction before calculating similarity. While it is possible that a more complex system will not do any better than this, we believe further exploration of alternative methods of scoring and dimensionality reduction is warranted. Second, AGEA did not look at clusters of genes; in Preliminary Data we have shown that clusters of genes may identify interesting spatial regions such as cortical areas.
4.86 -
4.87 -\cite{venkataraman_emage_2008} todo
4.88 -
4.89 -
4.90 -In summary, although these projects obtained clusterings, there has not been much comparison between different algorithms or scoring methods, so it is likely that the best clustering method for this application has not yet been found.
4.91 +In summary, although these projects obtained clusterings, there has not been much comparison between different algorithms or scoring methods, so it is likely that the best clustering method for this application has not yet been found. Also, none of these projects did a separate dimensionality reduction step before clustering pixels, or tried to cluster genes first in order to guide the clustering of pixels into spatial regions, or used co-clustering algorithms.
4.92
4.93
4.94
4.95 @@ -191,7 +188,7 @@
4.96
4.97 Mus musculus, the common house mouse, is thought to contain about 22,000 protein-coding genes\cite{waterston_initial_2002}. The ABA contains data on about 20,000 genes in sagittal sections, out of which over 4,000 genes are also measured in coronal sections. Our dataset is derived from only the coronal subset of the ABA, because the sagittal data does not cover the entire cortex, and also has greater registration error\cite{ng_anatomic_2009}. Genes were selected by the Allen Institute for coronal sectioning based on, "classes of known neuroscientific interest... or through post hoc identification of a marked non-ubiquitous expression pattern"\cite{ng_anatomic_2009}.
4.98
4.99 -The ABA is not the only large public spatial gene expression dataset. Other such resources include GENSAT\cite{gong_gene_2003}, GenePaint\cite{visel_genepaint.org:atlas_2004}, its sister project GeneAtlas\cite{carson_data_2005}, BGEM\cite{magdaleno_bgem:in_2006}, EMAGE\cite{venkataraman_emage_2008}, EurExpress\footnote{http://www.eurexpress.org/ee/; EurExpress data is also entered into EMAGE}, EADHB\footnote{http://www.ncl.ac.uk/ihg/EADHB/database/EADHB_database.html}, MAMEP\footnote{http://mamep.molgen.mpg.de/index.php}, Xenbase\footnote{http://xenbase.org/}, ZFIN\cite{sprague_zebrafish_2006}, Aniseed\footnote{http://aniseed-ibdm.univ-mrs.fr/}, VisiGene\footnote{http://genome.ucsc.edu/cgi-bin/hgVisiGene ; includes data from some of the other listed data sources}, GEISHA\cite{bell_geisha_2004}, Fruitfly.org\cite{tomancak_systematic_2002}, COMPARE\cite{http://compare.ibdml.univ-mrs.fr/} todo. With the exception of the ABA, GenePaint, and EMAGE, most of these resources, have not (yet) extracted the expression intensity from the ISH images and registered the results into a single 3-D space, and only ABA and EMAGE make this form of data available for public download from the website. Many of these resources focus on developmental gene expression.
4.100 +The ABA is not the only large public spatial gene expression dataset. Other such resources include GENSAT\cite{gong_gene_2003}, GenePaint\cite{visel_genepaint.org:atlas_2004}, its sister project GeneAtlas\cite{carson_data_2005}, BGEM\cite{magdaleno_bgem:in_2006}, EMAGE\cite{venkataraman_emage_2008}, EurExpress\footnote{http://www.eurexpress.org/ee/; EurExpress data is also entered into EMAGE}, EADHB\footnote{http://www.ncl.ac.uk/ihg/EADHB/database/EADHB_database.html}, MAMEP\footnote{http://mamep.molgen.mpg.de/index.php}, Xenbase\footnote{http://xenbase.org/}, ZFIN\cite{sprague_zebrafish_2006}, Aniseed\footnote{http://aniseed-ibdm.univ-mrs.fr/}, VisiGene\footnote{http://genome.ucsc.edu/cgi-bin/hgVisiGene ; includes data from some of the other listed data sources}, GEISHA\cite{bell_geisha_2004}, Fruitfly.org\cite{tomancak_systematic_2002}, COMPARE\footnote{http://compare.ibdml.univ-mrs.fr/} todo. With the exception of the ABA, GenePaint, and EMAGE, most of these resources have not (yet) extracted the expression intensity from the ISH images and registered the results into a single 3-D space, and only ABA and EMAGE make this form of data available for public download from the website\footnote{without prior offline registration}. Many of these resources focus on developmental gene expression.
4.101
4.102
4.103
4.104 @@ -411,7 +408,6 @@
4.105 # Linear discriminant analysis
4.106
4.107
4.108 -
4.109 \vspace{0.3cm}**Apply these algorithms to the cortex**
4.110
4.111 # Create open source format conversion tools: we will create tools to bulk download the ABA dataset and to convert between SEV, NIFTI and MATLAB formats.
4.112 @@ -432,6 +428,9 @@
4.113
4.114 # Linear discriminant analysis
4.115
4.116 +# jbt, coclustering
4.117 +
4.118 +# self-organizing map
4.119
4.120 \newpage
4.121