cg
diff grant.txt @ 84:d89a99c9ea9a
author | bshanks@bshanks.dyndns.org |
---|---|
date | Tue Apr 21 00:54:22 2009 -0700 |
parents | 8808b945e2f7 |
children | da8f81785211 |
line diff
1.1 --- a/grant.txt Mon Apr 20 17:33:37 2009 -0700
1.2 +++ b/grant.txt Tue Apr 21 00:54:22 2009 -0700
1.3 @@ -10,11 +10,13 @@
1.4
1.5 (1) develop an algorithm to screen spatial gene expression data for combinations of marker genes which selectively target anatomical regions\\
1.6
1.7 -(2) develop an algorithm to suggest new ways of carving up a structure into anatomical regions, based on spatial patterns in gene expression\\
1.8 +(2) develop an algorithm to suggest new ways of carving up a structure into anatomically distinct regions, based on spatial patterns in gene expression\\
1.9
1.10 (3) create a 2-D "flat map" dataset of the mouse cerebral cortex that contains a flattened version of the Allen Mouse Brain Atlas ISH data, as well as the boundaries of cortical anatomical areas. This will involve extending the functionality of Caret, an existing open-source scientific imaging program. Use this dataset to validate the methods developed in (1) and (2).\\
1.11
1.12 -In addition to validating the usefulness of the algorithms, the application of these methods to cerebral cortex will produce immediate benefits, because there are currently no known genetic markers for many cortical areas. The results of the project will support the development of new ways to selectively target cortical areas, and it will support the development of a method for identifying the cortical areal boundaries present in small tissue samples.
1.13 +Although our particular application involves the 3D spatial distribution of gene expression, we anticipate that the methods developed in aims (1) and (2) will generalize to any sort of high-dimensional data over points located in a low-dimensional space.
1.14 +
1.15 +In terms of the application of the methods to cerebral cortex, aim (1) is to go from cortical areas to marker genes, and aim (2) is to let the gene profile define the cortical areas. In addition to validating the usefulness of the algorithms, the application of these methods to cortex will produce immediate benefits, because there are currently no known genetic markers for most cortical areas. The results of the project will support the development of new ways to selectively target cortical areas, and it will support the development of a method for identifying the cortical areal boundaries present in small tissue samples.
1.16
1.17 All algorithms that we develop will be implemented in a GPL open-source software toolkit. The toolkit, as well as the machine-readable datasets developed in aim (3), will be published and freely available for others to use.
1.18
1.19 @@ -23,11 +25,11 @@
1.20
1.21 == Background and significance ==
1.22
1.23 -=== Aim 1 ===
1.24 -
1.25 -\vspace{0.3cm}**Machine learning terminology: supervised learning**
1.26 -
1.27 -The task of looking for marker genes for anatomical regions means that one is looking for a set of genes such that, if the expression level of those genes is known, then the locations of the regions can be inferred.
1.28 +=== Aim 1: Given a map of regions, find genes that mark the regions ===
1.29 +
1.30 +After defining terms, we will describe a set of principles which determine our strategy for completing this aim.
1.31 +
1.32 +\vspace{0.3cm}**Machine learning terminology: supervised learning** The task of looking for marker genes for known anatomical regions means that one is looking for a set of genes such that, if the expression level of those genes is known, then the locations of the regions can be inferred.
1.33
1.34 If we define the regions so that they cover the entire anatomical structure to be divided, then instead of saying that we are using gene expression to find the locations of the regions, we may say that we are using gene expression to determine to which region each voxel within the structure belongs. We call this a __classification task__, because each voxel is being assigned to a class (namely, its region).
1.35
1.36 @@ -47,11 +49,13 @@
1.37
1.38
1.39 \vspace{0.3cm}**Principle 1: Combinatorial gene expression**
1.40 +
1.41 It is too much to hope that every anatomical region of interest will be identified by a single gene. For example, in the cortex, there are some areas which are not clearly delineated by any gene included in the Allen Brain Atlas (ABA) dataset. However, at least some of these areas can be delineated by looking at combinations of genes (an example of an area for which multiple genes are necessary and sufficient is provided in Preliminary Studies, Figure \ref{MOcombo}). Therefore, each instance should contain multiple features (genes).
1.42
1.43
1.44 \vspace{0.3cm}**Principle 2: Only look at combinations of small numbers of genes**
1.45 -When the classifier classifies a voxel, it is only allowed to look at the expression of the genes which have been selected as features. The more data that is available to a classifier, the better that it can do. For example, perhaps there are weak correlations over many genes that add up to a strong signal. So, why not include every gene as a feature? The reason is that we wish to employ the classifier in situations in which it is not feasible to gather data about every gene. For example, if we want to use the expression of marker genes as a trigger for some regionally-targeted intervention, then our intervention must contain a molecular mechanism to check the expression level of each marker gene before it triggers. It is currently infeasible to design a molecular trigger that checks the level of more than a handful of genes. Similarly, if the goal is to develop a procedure to do ISH on tissue samples in order to label their anatomy, then it is infeasible to label more than a few genes. Therefore, we must select only a few genes as features.
1.46 +
1.47 +When the classifier classifies a voxel, it is only allowed to look at the expression of the genes which have been selected as features. The more data that are available to a classifier, the better that it can do. For example, perhaps there are weak correlations over many genes that add up to a strong signal. So, why not include every gene as a feature? The reason is that we wish to employ the classifier in situations in which it is not feasible to gather data about every gene. For example, if we want to use the expression of marker genes as a trigger for some regionally-targeted intervention, then our intervention must contain a molecular mechanism to check the expression level of each marker gene before it triggers. It is currently infeasible to design a molecular trigger that checks the level of more than a handful of genes. Similarly, if the goal is to develop a procedure to do ISH on tissue samples in order to label their anatomy, then it is infeasible to label more than a few genes. Therefore, we must select only a few genes as features.
1.48
1.49 The requirement to find combinations of only a small number of genes prevents us from straightforwardly applying many of the simplest techniques from the field of supervised machine learning. In the parlance of machine learning, our task combines feature selection with supervised learning.
1.50
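To make the combination of feature selection and supervised learning concrete, the following is a minimal sketch under stated assumptions, not the method proposed in this grant. It exhaustively scores every combination of at most `max_genes` genes using a nearest-centroid classifier. The function names, the choice of classifier, and the toy data are all hypothetical and chosen only for illustration.

```python
from itertools import combinations

def centroid_classifier_accuracy(voxels, labels, genes):
    """Training accuracy of a nearest-centroid classifier that sees only `genes`."""
    # Accumulate per-region sums of expression over the selected genes.
    sums, counts = {}, {}
    for expr, region in zip(voxels, labels):
        acc = sums.setdefault(region, [0.0] * len(genes))
        for k, g in enumerate(genes):
            acc[k] += expr[g]
        counts[region] = counts.get(region, 0) + 1
    centroids = {r: [s / counts[r] for s in sums[r]] for r in sums}

    def predict(expr):
        # Assign the voxel to the region whose centroid is closest.
        return min(
            centroids,
            key=lambda r: sum((expr[g] - c) ** 2 for g, c in zip(genes, centroids[r])),
        )

    correct = sum(predict(expr) == region for expr, region in zip(voxels, labels))
    return correct / len(voxels)

def best_gene_subset(voxels, labels, max_genes=2):
    """Exhaustively search all gene subsets of size <= max_genes."""
    n_genes = len(voxels[0])
    best = (0.0, ())
    for size in range(1, max_genes + 1):
        for genes in combinations(range(n_genes), size):
            score = centroid_classifier_accuracy(voxels, labels, genes)
            if score > best[0]:
                best = (score, genes)
    return best

# Hypothetical toy data: region "A" is where genes 0 and 1 are both expressed;
# gene 2 is uninformative. Each voxel is a tuple of expression levels.
voxels = [(1, 1, 0), (1, 1, 1), (1, 0, 0), (1, 0, 1), (0, 1, 0), (0, 1, 1), (0, 0, 0)]
labels = ["A", "A", "B", "B", "B", "B", "B"]

score, genes = best_gene_subset(voxels, labels, max_genes=2)
print(score, genes)
```

On this toy dataset no single gene separates the two regions, but the pair of genes 0 and 1 does, which is the situation described under Principle 1. Exhaustive search is feasible only because the number of genes per combination is small; with thousands of candidate genes, even pairs require millions of evaluations, which is one reason feature selection needs care.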
1.51 @@ -71,7 +75,7 @@
1.52
1.53
1.54 === Related work ===
1.55 -There is a substantial body of work on the analysis of gene expression data, most of this concerns gene expression data which is not fundamentally spatial\footnote{By "__fundamentally__ spatial" we mean that there is information from a large number of spatial locations indexed by spatial coordinates; not just data which has only a few different locations or which is indexed by anatomical label.}.
1.56 +There is a substantial body of work on the analysis of gene expression data; most of it concerns gene expression data which are not fundamentally spatial\footnote{By "__fundamentally__ spatial" we mean that there is information from a large number of spatial locations indexed by spatial coordinates; not just data which have only a few different locations or which are indexed by anatomical label.}.
1.57
1.58 As noted above, there has been much work on both supervised learning and feature selection, and there are many available algorithms for each. However, the algorithms require the scientist to provide a framework for representing the problem domain, and the way that this framework is set up has a large impact on performance. Creating a good framework can require creatively reconceptualizing the problem domain, and is not merely a mechanical "fine-tuning" of numerical parameters. For example, we believe that domain-specific scoring measures (such as gradient similarity, which is discussed in Preliminary Studies) may be necessary in order to achieve the best results in this application.
1.59
1.60 @@ -92,31 +96,31 @@
1.61 cluster which includes the seed voxel, (2) yields a list of genes
1.62 which are overexpressed in that cluster. (note: the ABA website also contains pre-prepared lists of overexpressed genes for selected structures)
1.63
1.64 -\item Correlation: The user selects a seed voxel and
1.65 -the shows the user how much correlation there is between the gene
1.66 +\item Correlation: The user selects a seed voxel and the system
1.67 +then shows the user how much correlation there is between the gene
1.68 expression profile of the seed voxel and every other voxel.
1.69
1.70 \item Clusters: will be described later
1.71 \end{itemize}
1.72
1.73 -Gene Finder is different from our Aim 1 in at least three ways. First, Gene Finder finds only single genes, whereas we will also look for combinations of genes. Second, gene finder can only use overexpression as a marker, whereas we will also search for underexpression. Third, Gene Finder uses a simple pointwise score\footnote{"Expression energy ratio", which captures overexpression.}, whereas we will also use geometric scores such as gradient similarity. Figures \ref{MOcombo}, \ref{hole}, and \ref{AUDgeometry} in the Preliminary Studies section contains evidence that each of our three choices is the right one.
1.74 +Gene Finder is different from our Aim 1 in at least three ways. First, Gene Finder finds only single genes, whereas we will also look for combinations of genes. Second, Gene Finder can only use overexpression as a marker, whereas we will also search for underexpression. Third, Gene Finder uses a simple pointwise score\footnote{"Expression energy ratio", which captures overexpression.}, whereas we will also use geometric scores such as gradient similarity (described in Preliminary Studies). Figures \ref{MOcombo}, \ref{hole}, and \ref{AUDgeometry} in the Preliminary Studies section contain evidence that each of our three choices is the right one.
1.75
1.76 \cite{chin_genome-scale_2007} looks at the mean expression level of genes within anatomical regions, and applies a Student's t-test with Bonferroni correction to determine whether the mean expression level of a gene is significantly higher in the target region. Like AGEA, this measure is pointwise (only the mean expression level per pixel is being analyzed), is not used to look for underexpression, and does not look for combinations of genes.
1.77
1.78 \cite{hemert_matching_2008} describes a technique to find combinations of marker genes to pick out an anatomical region. They use an evolutionary algorithm to evolve logical operators which combine boolean (thresholded) images in order to match a target image. Their match score is Jaccard similarity.
1.79
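The match score mentioned above can be illustrated with a short sketch. It thresholds two gene images into boolean masks, combines them with a logical AND, and scores each mask against a target region using Jaccard similarity. This is only an illustration of the scoring step: the evolutionary search over logical operators is not implemented here, and the function names, the threshold, and the toy images are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity |A and B| / |A or B| of two equally sized boolean masks."""
    inter = sum(x and y for x, y in zip(a, b))
    union = sum(x or y for x, y in zip(a, b))
    # Two empty masks are treated as identical.
    return inter / union if union else 1.0

def threshold(image, t):
    """Convert a flattened expression image into a boolean mask."""
    return [v > t for v in image]

# Hypothetical flattened expression images for two genes, and a target region.
gene_a = [0.9, 0.8, 0.7, 0.1, 0.2, 0.1]
gene_b = [0.1, 0.9, 0.8, 0.9, 0.1, 0.2]
target = [False, True, True, False, False, False]

mask_a = threshold(gene_a, 0.5)
mask_b = threshold(gene_b, 0.5)
combo = [x and y for x, y in zip(mask_a, mask_b)]  # logical AND of the two masks

print(jaccard(mask_a, target), jaccard(mask_b, target), jaccard(combo, target))
```

In this toy case each gene alone overlaps the target only partially, while the AND of the two thresholded images matches it exactly.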
1.80 -In summary, there has been fruitful work on finding marker genes, however, only one of the previous projects explores combinations of marker genes, and none of these publications compare the results obtained by using different algorithms or scoring methods.
1.81 -
1.82 -
1.83 -
1.84 -
1.85 -=== Aim 2 ===
1.86 +In summary, there has been fruitful work on finding marker genes, but only one of the previous projects explores combinations of marker genes, and none of these publications compare the results obtained by using different algorithms or scoring methods.
1.87 +
1.88 +
1.89 +
1.90 +
1.91 +=== Aim 2: From gene expression data, discover a map of regions ===
1.92
1.93 \vspace{0.3cm}**Machine learning terminology: clustering**
1.94
1.95 -If one is given a dataset consisting merely of instances, with no class labels, then analysis of the dataset is referred to as __unsupervised learning__ in the jargon of machine learning. One thing that you can do with such a dataset is to group instances together. A set of similar instances is called a __cluster__, and the activity of finding grouping the data into clusters is called clustering or cluster analysis.
1.96 -
1.97 -The task of deciding how to carve up a structure into anatomical regions can be put into these terms. The instances are once again voxels (or pixels) along with their associated gene expression profiles. We make the assumption that voxels from the same region have similar gene expression profiles, at least compared to the other regions. This means that clustering voxels is the same as finding potential regions; we seek a partitioning of the voxels into regions, that is, into clusters of voxels with similar gene expression.
1.98 +If one is given a dataset consisting merely of instances, with no class labels, then analysis of the dataset is referred to as __unsupervised learning__ in the jargon of machine learning. One thing that you can do with such a dataset is to group instances together. A set of similar instances is called a __cluster__, and the activity of grouping the data into clusters is called __clustering__ or __cluster analysis__.
1.99 +
1.100 +The task of deciding how to carve up a structure into anatomical regions can be put into these terms. The instances are once again voxels (or pixels) along with their associated gene expression profiles. We make the assumption that voxels from the same anatomical region have similar gene expression profiles, at least compared to the other regions. This means that clustering voxels is the same as finding potential regions; we seek a partitioning of the voxels into regions, that is, into clusters of voxels with similar gene expression.
1.101
1.102 It is desirable to determine not just one set of regions, but also how these regions relate to each other, if at all; perhaps some of the regions are more similar to each other than to the rest, suggesting that, although at a fine spatial scale they could be considered separate, on a coarser spatial scale they could be grouped together into one large region. This suggests the outcome of clustering may be a hierarchical tree of clusters, rather than a single set of clusters which partition the voxels. This is called hierarchical clustering.
1.103
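As an illustration of how a tree of clusters arises, here is a minimal sketch of agglomerative clustering with average linkage over Euclidean distance. Both choices are assumptions made for the example, not the algorithm this proposal commits to, and the function names and toy profiles are hypothetical. The sketch ignores spatial contiguity entirely.

```python
def euclidean(p, q):
    """Euclidean distance between two expression profiles."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def average_linkage(c1, c2, profiles):
    """Mean pairwise distance between the members of two clusters."""
    total = sum(euclidean(profiles[i], profiles[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))

def agglomerate(profiles):
    """Return the sequence of merges, from the finest scale to the coarsest."""
    clusters = [(i,) for i in range(len(profiles))]
    merges = []
    while len(clusters) > 1:
        # Find the pair of current clusters with the smallest linkage distance.
        pairs = [
            (average_linkage(a, b, profiles), a, b)
            for idx, a in enumerate(clusters)
            for b in clusters[idx + 1:]
        ]
        _, a, b = min(pairs)
        clusters = [c for c in clusters if c not in (a, b)] + [tuple(sorted(a + b))]
        merges.append((a, b))
    return merges

# Hypothetical two-gene expression profiles for five voxels.
profiles = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (12.0, 0.0), (30.0, 0.0)]
merges = agglomerate(profiles)
for a, b in merges:
    print(a, "+", b)
```

Reading the merges in order gives the hierarchy: voxels 0 and 1 form one fine-scale cluster, voxels 2 and 3 another, these two clusters join at a coarser scale, and voxel 4 joins last.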
1.104 @@ -129,9 +133,9 @@
1.105 \vspace{0.3cm}**Spatially contiguous clusters; image segmentation**
1.106
1.107
1.108 -We have shown that aim 2 is a type of clustering task. In fact, it is a special type of clustering task because we have an additional constraint on clusters; voxels grouped together into a cluster must be spatially contiguous. In Preliminary Studies, we show that one can get reasonable results without enforcing this constraint, however, we plan to compare these results against other methods which guarantee contiguous clusters.
1.109 -
1.110 -Perhaps the biggest source of continguous clustering algorithms is the field of computer vision, which has produced a variety of image segmentation algorithms. Image segmentation is the task of partitioning the pixels in a digital image into clusters, usually contiguous clusters. Aim 2 is similar to an image segmentation task. There are two main differences; in our task, there are thousands of color channels (one for each gene), rather than just three. There are imaging tasks which use more than three colors, however, for example multispectral imaging and hyperspectral imaging, which are often used to process satellite imagery. A more crucial difference is that there are various cues which are appropriate for detecting sharp object boundaries in a visual scene but which are not appropriate for segmenting abstract spatial data such as gene expression. Although many image segmentation algorithms can be expected to work well for segmenting other sorts of spatially arranged data, some of these algorithms are specialized for visual images.
1.111 +We have shown that aim 2 is a type of clustering task. In fact, it is a special type of clustering task because we have an additional constraint on clusters; voxels grouped together into a cluster must be spatially contiguous. In Preliminary Studies, we show that one can get reasonable results without enforcing this constraint; however, we plan to compare these results against other methods which guarantee contiguous clusters.
1.112 +
1.113 +Perhaps the biggest source of contiguous clustering algorithms is the field of computer vision, which has produced a variety of image segmentation algorithms. Image segmentation is the task of partitioning the pixels in a digital image into clusters, usually contiguous clusters. Aim 2 is similar to an image segmentation task. There are two main differences. First, in our task, there are thousands of color channels (one for each gene), rather than just three. However, there are imaging tasks which use more than three colors, for example multispectral imaging and hyperspectral imaging, which are often used to process satellite imagery. A more crucial difference is that there are various cues which are appropriate for detecting sharp object boundaries in a visual scene but which are not appropriate for segmenting abstract spatial data such as gene expression. Although many image segmentation algorithms can be expected to work well for segmenting other sorts of spatially arranged data, some of these algorithms are specialized for visual images.
1.114
1.115
1.116 \vspace{0.3cm}**Dimensionality reduction**
1.117 @@ -139,9 +143,9 @@
1.118
1.119 Unlike aim 1, there is no externally-imposed need to select only a handful of informative genes for inclusion in the instances. However, some clustering algorithms perform better on small numbers of features. There are techniques which "summarize" a larger number of features using a smaller number of features; these techniques go by the name of feature extraction or dimensionality reduction. The small set of features that such a technique yields is called the __reduced feature set__. After the reduced feature set is created, the instances may be replaced by __reduced instances__, which have as their features the reduced feature set rather than the original feature set of all gene expression levels. Note that the features in the reduced feature set do not necessarily correspond to genes; each feature in the reduced set may be any function of the set of gene expression levels.
1.120
1.121 -Dimensionality reduction before clustering is useful on large datasets. First, because the number of features in the reduced data set is less than in the original data set, the running time of clustering algorithms may be much less. Second, it is thought that some clustering algorithms may give better results on reduced data.
1.122 -
1.123 -Another use for dimensionality reduction is to visualize the relationships between regions after clustering. For example, one might want to make a 2-D plot upon which each region is represented by a single point, and with the property that regions with similar gene expression profiles should be nearby on the plot (that is, the property that distance between pairs of points in the plot should be proportional to some measure of dissimilarity in gene expression). It is likely that no arrangement of the points on a 2-D plan will exactly satisfy this property -- however, dimensionality reduction techniques allow one to find arrangements of points that approximately satisfy that property. Note that in this application, dimensionality reduction is being applied after clustering; whereas in the previous paragraph, we were talking about using dimensionality reduction before clustering.
1.124 +Dimensionality reduction before clustering is useful on large datasets. First, because the number of features in the reduced dataset is less than in the original dataset, the running time of clustering algorithms may be much less. Second, it is thought that some clustering algorithms may give better results on reduced data.
1.125 +
1.126 +Another use for dimensionality reduction is to visualize the relationships between regions after clustering. For example, one might want to make a 2-D plot upon which each region is represented by a single point, and with the property that regions with similar gene expression profiles should be nearby on the plot (that is, the property that distance between pairs of points in the plot should be proportional to some measure of dissimilarity in gene expression). It is likely that no arrangement of the points on a 2-D plane will exactly satisfy this property; however, dimensionality reduction techniques allow one to find arrangements of points that approximately satisfy that property. Note that in this application, dimensionality reduction is being applied after clustering; whereas in the previous paragraph, we were talking about using dimensionality reduction before clustering.
1.127
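One common way to formalize "approximately satisfy that property" is the stress objective of metric multidimensional scaling. The text above does not name a specific technique, so the formulation below is offered only as an illustrative assumption. The symbols are introduced here: $d_{ij}$ is the chosen dissimilarity in gene expression between regions $i$ and $j$, $y_i$ is the 2-D plot position of region $i$, and $c > 0$ is a fixed scale factor.

```latex
\[
  \min_{y_1, \dots, y_n \in \mathbb{R}^2}
  \; \sum_{i < j} \bigl( \lVert y_i - y_j \rVert - c \, d_{ij} \bigr)^2
\]
```

The objective is zero exactly when every plotted distance equals $c \, d_{ij}$. An exact solution need not exist: for instance, four regions that are all equally dissimilar from one another cannot be placed in a plane at equal mutual distances, so the best arrangement has nonzero stress.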
1.128
1.129 \vspace{0.3cm}**Clustering genes rather than voxels**
1.130 @@ -182,21 +186,21 @@
1.131
1.132 \vspace{0.3cm}**Background**
1.133
1.134 -The cortex is divided into areas and layers. To a first approximation, the parcellation of the cortex into areas can be drawn as a 2-D map on the surface of the cortex. In the third dimension, the boundaries between the areas continue downwards into the cortical depth, perpendicular to the surface. The layer boundaries run parallel to the surface. One can picture an area of the cortex as a slice of many-layered cake.
1.135 -
1.136 -Although it is known that different cortical areas have distinct roles in both normal functioning and in disease processes, there are no known marker genes for many cortical areas. When it is necessary to divide a tissue sample into cortical areas, this is a manual process that requires a skilled human to combine multiple visual cues and interpret them in the context of their approximate location upon the cortical surface.
1.137 +The cortex is divided into areas and layers. Because of the cortical columnar organization, the parcellation of the cortex into areas can be drawn as a 2-D map on the surface of the cortex. In the third dimension, the boundaries between the areas continue downwards into the cortical depth, perpendicular to the surface. The layer boundaries run parallel to the surface. One can picture an area of the cortex as a slice of a six-layered cake\footnote{Outside of isocortex, the number of layers varies.}.
1.138 +
1.139 +Although it is known that different cortical areas have distinct roles in both normal functioning and in disease processes, there are no known marker genes for most cortical areas. When it is necessary to divide a tissue sample into cortical areas, this is a manual process that requires a skilled human to combine multiple visual cues and interpret them in the context of their approximate location upon the cortical surface.
1.140
1.141 Even the questions of how many areas should be recognized in cortex, and what their arrangement is, are still not completely settled. A proposed division of the cortex into areas is called a cortical map. In the rodent, the lack of a single agreed-upon map can be seen by contrasting the recent maps given by Swanson\cite{swanson_brain_2003} on the one hand, and Paxinos and Franklin\cite{paxinos_mouse_2001} on the other. While the maps are certainly very similar in their general arrangement, significant differences remain in the details.
1.142
1.143 \vspace{0.3cm}**The Allen Mouse Brain Atlas dataset**
1.144
1.145 -The Allen Mouse Brain Atlas (ABA) data was produced by doing in-situ hybridization on slices of male, 56-day-old C57BL/6J mouse brains. Pictures were taken of the processed slice, and these pictures were semi-automatically analyzed in order to create a digital measurement of gene expression levels at each location in each slice. Per slice, cellular spatial resolution is achieved. Using this method, a single physical slice can only be used to measure one single gene; many different mouse brains were needed in order to measure the expression of many genes.
1.146 +The Allen Mouse Brain Atlas (ABA) data were produced by doing in-situ hybridization on slices of male, 56-day-old C57BL/6J mouse brains. Pictures were taken of the processed slice, and these pictures were semi-automatically analyzed in order to create a digital measurement of gene expression levels at each location in each slice. Per slice, cellular spatial resolution is achieved. Using this method, a single physical slice can only be used to measure one single gene; many different mouse brains were needed in order to measure the expression of many genes.
1.147
1.148 Next, an automated nonlinear alignment procedure located the 2D data from the various slices in a single 3D coordinate system. In the final 3D coordinate system, voxels are cubes 200 microns on a side. There are 67x41x58 \= 159,326 voxels in the 3D coordinate system, of which 51,533 are in the brain\cite{ng_anatomic_2009}.
1.149
1.150 -Mus musculus, the common house mouse, is thought to contain about 22,000 protein-coding genes\cite{waterston_initial_2002}. The ABA contains data on about 20,000 genes in sagittal sections, out of which over 4,000 genes are also measured in coronal sections. Our dataset is derived from only the coronal subset of the ABA, because the sagittal data does not cover the entire cortex, and also has greater registration error\cite{ng_anatomic_2009}. Genes were selected by the Allen Institute for coronal sectioning based on, "classes of known neuroscientific interest... or through post hoc identification of a marked non-ubiquitous expression pattern"\cite{ng_anatomic_2009}.
1.151 -
1.152 -The ABA is not the only large public spatial gene expression dataset. Other such resources include GENSAT\cite{gong_gene_2003}, GenePaint\cite{visel_genepaint.org:atlas_2004}, its sister project GeneAtlas\cite{carson_digital_2005}, BGEM\cite{magdaleno_bgem:in_2006}, EMAGE\cite{venkataraman_emage_2008}, EurExpress\footnote{http://www.eurexpress.org/ee/; EurExpress data is also entered into EMAGE}, EADHB\footnote{http://www.ncl.ac.uk/ihg/EADHB/database/EADHB_database.html}, MAMEP\footnote{http://mamep.molgen.mpg.de/index.php}, Xenbase\footnote{http://xenbase.org/}, ZFIN\cite{sprague_zebrafish_2006}, Aniseed\footnote{http://aniseed-ibdm.univ-mrs.fr/}, VisiGene\footnote{http://genome.ucsc.edu/cgi-bin/hgVisiGene ; includes data from some the other listed data sources}, GEISHA\cite{bell_geishawhole-mount_2004}, Fruitfly.org\cite{tomancak_systematic_2002}, COMPARE\footnote{http://compare.ibdml.univ-mrs.fr/} GXD\cite{smith_mouse_2007}, GEO\cite{barrett_ncbi_2007}\footnote{GXD and GEO contain spatial data but also non-spatial data. All GXD spatial data are also in EMAGE.}. With the exception of the ABA, GenePaint, and EMAGE, most of these resources have not (yet) extracted the expression intensity from the ISH images and registered the results into a single 3-D space, and to our knowledge only ABA and EMAGE make this form of data available for public download from the website\footnote{without prior offline registration}. Many of these resources focus on developmental gene expression.
1.153 +Mus musculus, the common house mouse, is thought to contain about 22,000 protein-coding genes\cite{waterston_initial_2002}. The ABA contains data on about 20,000 genes in sagittal sections, out of which over 4,000 genes are also measured in coronal sections. Our dataset is derived from only the coronal subset of the ABA, because the sagittal data do not cover the entire cortex, and also have greater registration error\cite{ng_anatomic_2009}. Genes were selected by the Allen Institute for coronal sectioning based on, "classes of known neuroscientific interest... or through post hoc identification of a marked non-ubiquitous expression pattern"\cite{ng_anatomic_2009}.
1.154 +
1.155 +The ABA is not the only large public spatial gene expression dataset. Other such resources include GENSAT\cite{gong_gene_2003}, GenePaint\cite{visel_genepaint.org:atlas_2004}, its sister project GeneAtlas\cite{carson_digital_2005}, BGEM\cite{magdaleno_bgem:in_2006}, EMAGE\cite{venkataraman_emage_2008}, EurExpress\footnote{http://www.eurexpress.org/ee/; EurExpress data are also entered into EMAGE}, EADHB\footnote{http://www.ncl.ac.uk/ihg/EADHB/database/EADHB_database.html}, MAMEP\footnote{http://mamep.molgen.mpg.de/index.php}, Xenbase\footnote{http://xenbase.org/}, ZFIN\cite{sprague_zebrafish_2006}, Aniseed\footnote{http://aniseed-ibdm.univ-mrs.fr/}, VisiGene\footnote{http://genome.ucsc.edu/cgi-bin/hgVisiGene ; includes data from some of the other listed data sources}, GEISHA\cite{bell_geishawhole-mount_2004}, Fruitfly.org\cite{tomancak_systematic_2002}, COMPARE\footnote{http://compare.ibdml.univ-mrs.fr/}, GXD\cite{smith_mouse_2007}, and GEO\cite{barrett_ncbi_2007}\footnote{GXD and GEO contain spatial data but also non-spatial data. All GXD spatial data are also in EMAGE.}. With the exception of the ABA, GenePaint, and EMAGE, most of these resources have not (yet) extracted the expression intensity from the ISH images and registered the results into a single 3-D space, and to our knowledge only ABA and EMAGE make this form of data available for public download from the website\footnote{without prior offline registration}. Many of these resources focus on developmental gene expression.
1.156
1.157
1.158
1.159 @@ -214,7 +218,7 @@
1.160
1.161 === Related work ===
1.162
1.163 -\cite{ng_anatomic_2009} describes the application of AGEA to the cortex. The paper describes interesting results on the structure of correlations between voxel gene expression profiles within a handful of cortical areas. However, this sort of analysis is not related to either of our aims, as it neither finds marker genes, nor does it suggest a cortical map based on gene expression data. Neither of the other components of AGEA can be applied to cortical areas; AGEA's Gene Finder cannot be used to find marker genes for the cortical areas; and AGEA's hierarchial clustering does not produce clusters corresponding to the cortical areas\footnote{In both cases, the root cause is that pairwise correlations between the gene expression of voxels in different areas but the same layer are often stronger than pairwise correlations between the gene expression of voxels in different layers but the same area. Therefore, a pairwise voxel correlation clustering algorithm will tend to create clusters representing cortical layers, not areas. This is why the hierarchial clustering does not find most cortical areas (there are clusters which presumably correspond to the intersection of a layer and an area, but since one area will have many layer-area intersection clusters, further work is needed to make sense of these). The reason that Gene Finder cannot find marker genes for most cortical areas is that in Gene Finder, although the user chooses a seed voxel, Gene Finder chooses the ROI for which genes will be found, and it creates that ROI by (pairwise voxel correlation) clustering around the seed.}.
1.164 +\cite{ng_anatomic_2009} describes the application of AGEA to the cortex. The paper describes interesting results on the structure of correlations between voxel gene expression profiles within a handful of cortical areas. However, this sort of analysis is not related to either of our aims, as it neither finds marker genes, nor does it suggest a cortical map based on gene expression data. Neither of the other components of AGEA can be applied to cortical areas: AGEA's Gene Finder cannot be used to find marker genes for the cortical areas, and AGEA's hierarchical clustering does not produce clusters corresponding to the cortical areas\footnote{In both cases, the root cause is that pairwise correlations between the gene expression of voxels in different areas but the same layer are often stronger than pairwise correlations between the gene expression of voxels in different layers but the same area. Therefore, a pairwise voxel correlation clustering algorithm will tend to create clusters representing cortical layers, not areas. This is why the hierarchical clustering does not find cortical areas (there are clusters which presumably correspond to the intersection of a layer and an area, but since one area will have many layer-area intersection clusters, further work is needed to make sense of these). The reason that Gene Finder cannot find marker genes for cortical areas is that in Gene Finder, although the user chooses a seed voxel, Gene Finder chooses the ROI for which genes will be found, and it creates that ROI by (pairwise voxel correlation) clustering around the seed.}.
1.165
1.166
1.167 %% Most of the projects which have been discussed have been done by the same groups that develop the public datasets. Although these projects make their algorithms available for use on their own website, none of them have released an open-source software toolkit; instead, users are restricted to using the provided algorithms only on their own dataset.
1.168 @@ -245,7 +249,7 @@
1.169
1.170
1.171 === Format conversion between SEV, MATLAB, NIFTI ===
1.172 -We have created software to (politely) download all of the SEV files from the Allen Institute website. We have also created software to convert between the SEV, MATLAB, and NIFTI file formats, as well as some of Caret's file formats.
1.173 +We have created software to (politely) download all of the SEV files\footnote{SEV is a sparse format for spatial data. It is the format in which the ABA data is made available.} from the Allen Institute website. We have also created software to convert between the SEV, MATLAB, and NIFTI file formats, as well as some of Caret's file formats.
1.174
1.175
1.176 === Flatmap of cortex ===
1.177 @@ -259,7 +263,7 @@
1.178
1.179 We manually traced the boundaries of each of 49 cortical areas from the ABA coronal reference atlas slides. We then converted these manual traces into Caret-format regional boundary data on the mesh surface. We projected the regions onto the 2-d mesh, and then onto the grid, and then we converted the region data into MATLAB format.
1.180
1.181 -At this point, the data is in the form of a number of 2-D matrices, all in registration, with the matrix entries representing a grid of points (pixels) over the cortical surface:
1.182 +At this point, the data are in the form of a number of 2-D matrices, all in registration, with the matrix entries representing a grid of points (pixels) over the cortical surface:
1.183
1.184 * A 2-D matrix whose entries represent the regional label associated with each surface pixel
1.185 * For each gene, a 2-D matrix whose entries represent the average expression level underneath each surface pixel
1.186 @@ -271,7 +275,7 @@
1.187
1.188
1.189
1.190 -We created a normalized version of the gene expression data by subtracting each gene's mean expression level (over all surface pixels) and dividing each gene by its standard deviation.
1.191 +We created a normalized version of the gene expression data by subtracting each gene's mean expression level (over all surface pixels) and dividing the expression level of each gene by its standard deviation.
1.192
1.193 The features and the target area are both functions on the surface pixels. They can be referred to as scalar fields over the space of surface pixels; alternately, they can be thought of as images which can be displayed on the flatmapped surface.
1.194
1.195 @@ -297,7 +301,7 @@
1.196 \vspace{0.3cm}**Correlation**
1.197 Recall that the instances are surface pixels, and consider the problem of attempting to classify each instance as either a member of a particular anatomical area, or not. The target area can be represented as a boolean mask over the surface pixels.
1.198
1.199 -One class of feature selection scoring method are those which calculate some sort of "match" between each gene image and the target image. Those genes which match the best are good candidates for features.
1.200 +One class of feature selection scoring methods contains methods which calculate some sort of "match" between each gene image and the target image. Those genes which match the best are good candidates for features.
1.201
1.202 One of the simplest methods in this class is to use correlation as the match score. We calculated the correlation between each gene and each cortical area. The top row of Figure \ref{SScorrLr} shows the three genes most correlated with area SS.
1.203
1.204 @@ -372,14 +376,14 @@
1.205
1.206 \vspace{0.3cm}**Combinations of multiple genes are useful and necessary for some areas**
1.207
1.208 -In Figure \ref{MOcombo}, we give an example of a cortical area which is not marked by any single gene, but which can be identified combinatorially. Acccording to logistic regression, gene wwc1 is the best fit single gene for predicting whether or not a pixel on the cortical surface belongs to the motor area (area MO). The upper-left picture in Figure \ref{MOcombo} shows wwc1's spatial expression pattern over the cortex. The lower-right boundary of MO is represented reasonably well by this gene, however the gene overshoots the upper-left boundary. This flattened 2-D representation does not show it, but the area corresponding to the overshoot is the medial surface of the cortex. MO is only found on the lateral surface. Gene mtif2 is shown in the upper-right. Mtif2 captures MO's upper-left boundary, but not its lower-right boundary. Mtif2 does not express very much on the medial surface. By adding together the values at each pixel in these two figures, we get the lower-left image. This combination captures area MO much better than any single gene.
1.209 +In Figure \ref{MOcombo}, we give an example of a cortical area which is not marked by any single gene, but which can be identified combinatorially. According to logistic regression, gene wwc1 is the best fit single gene for predicting whether or not a pixel on the cortical surface belongs to the motor area (area MO). The upper-left picture in Figure \ref{MOcombo} shows wwc1's spatial expression pattern over the cortex. The lower-right boundary of MO is represented reasonably well by this gene, but the gene overshoots the upper-left boundary. This flattened 2-D representation does not show it, but the area corresponding to the overshoot is the medial surface of the cortex. MO is only found on the dorsal surface. Gene mtif2 is shown in the upper-right. Mtif2 captures MO's upper-left boundary, but not its lower-right boundary. Mtif2 does not express very much on the medial surface. By adding together the values at each pixel in these two figures, we get the lower-left image. This combination captures area MO much better than any single gene.
1.210
1.211 This shows that our proposal to develop a method to find combinations of marker genes is both possible and necessary.
1.212
1.213 %% wwc1\footnote{"WW, C2 and coiled-coil domain containing 1"; EntrezGene ID 211652}
1.214 %% mtif2\footnote{"mitochondrial translational initiation factor 2"; EntrezGene ID 76784}
1.215
1.216 -%%Acccording to logistic regression, gene wwc1\footnote{"WW, C2 and coiled-coil domain containing 1"; EntrezGene ID 211652} is the best fit single gene for predicting whether or not a pixel on the cortical surface belongs to the motor area (area MO). The upper-left picture in Figure \ref{MOcombo} shows wwc1's spatial expression pattern over the cortex. The lower-right boundary of MO is represented reasonably well by this gene, however the gene overshoots the upper-left boundary. This flattened 2-D representation does not show it, but the area corresponding to the overshoot is the medial surface of the cortex. MO is only found on the lateral surface.
1.217 +%%According to logistic regression, gene wwc1\footnote{"WW, C2 and coiled-coil domain containing 1"; EntrezGene ID 211652} is the best fit single gene for predicting whether or not a pixel on the cortical surface belongs to the motor area (area MO). The upper-left picture in Figure \ref{MOcombo} shows wwc1's spatial expression pattern over the cortex. The lower-right boundary of MO is represented reasonably well by this gene, but the gene overshoots the upper-left boundary. This flattened 2-D representation does not show it, but the area corresponding to the overshoot is the medial surface of the cortex. MO is only found on the lateral surface.
1.218
1.219 %%Gene mtif2\footnote{"mitochondrial translational initiation factor 2"; EntrezGene ID 76784} is shown in the upper-right of Fig. \ref{MOcombo}. Mtif2 captures MO's upper-left boundary, but not its lower-right boundary. Mtif2 does not express very much on the medial surface. By adding together the values at each pixel in these two figures, we get the lower-left of Figure \ref{MOcombo}. This combination captures area MO much better than any single gene.
1.220
1.221 @@ -409,7 +413,7 @@
1.222
1.223 \vspace{0.3cm}**SVM on all genes at once**
1.224
1.225 -In order to see how well one can do when looking at all genes at once, we ran a support vector machine to classify cortical surface pixels based on their gene expression profiles. We achieved classification accuracy of about 81%\footnote{5-fold cross-validation.}. This shows that the genes included in the ABA dataset are sufficient to define much of cortical anatomy. As noted above, however, a classifier that looks at all the genes at once isn't as practically useful as a classifier that uses only a few genes.
1.226 +In order to see how well one can do when looking at all genes at once, we ran a support vector machine to classify cortical surface pixels based on their gene expression profiles. We achieved classification accuracy of about 81%\footnote{5-fold cross-validation.}. This shows that the genes included in the ABA dataset are sufficient to define much of cortical anatomy. However, as noted above, a classifier that looks at all the genes at once isn't as practically useful as a classifier that uses only a few genes.
1.227
1.228
1.229
1.230 @@ -537,11 +541,3 @@
1.231 two hemis
1.232
1.233
1.234 -%%"genomic anatomy" is a name found in the titles of one of the cited papers which seems good; maybe "computational genomic anatomy"
1.235 -
1.236 -%% todo: actually i'm pretty sure AGEA doesn't find ANY areas, but i said "most" and "often" to be cautious.
1.237 -
1.238 -%% todo: MO is only found on the lateral surface (todo).
1.239 -%% todo: predicted genes like Riken
1.240 -
1.241 -%% todo: should we disclose genes?