cg

changeset 108:a38cc9a46200

.
author bshanks@bshanks-salk.dyndns.org
date Wed Apr 22 22:24:24 2009 -0700
parents f26370dc719b
children a6b99bc50476
files abstract.pdf abstract.txt dataSharing.pdf dataSharing.txt equipment.pdf equipment.txt facilities.pdf facilities.txt grant-oldtext.txt grant.pdf grant.txt narrative.pdf narrative.txt nih-blank.cls
Binary file abstract.pdf has changed
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/abstract.txt Wed Apr 22 22:24:24 2009 -0700
@@ -0,0 +1,23 @@
+\documentclass[11pt]{nih-blank}
+
+\usepackage[small,compact]{titlesec}
+
+\begin{document}
+
+== Abstract ==
+
+Massive new datasets obtained with techniques such as in situ hybridization (ISH), immunohistochemistry, in situ transgenic reporter, allow the expression levels of many genes at many locations to be compared. Our goal is to develop automated methods to relate spatial variation in gene expression to anatomy. We want to find marker genes for specific anatomical regions, and also to draw new anatomical maps based on gene expression patterns. We will validate these methods by applying them to 46 anatomical areas within the cerebral cortex, by using the Allen Mouse Brain Atlas coronal dataset (ABA). This gene expression dataset was generated using ISH, and contains over 4,000 genes. For each gene, a digitized 3-D raster of the expression pattern is available: for each gene, the level of expression at each of 51,533 voxels is recorded. Specifically, we will:\\
+
+(1) develop an algorithm to screen spatial gene expression data for combinations of marker genes which selectively target anatomical regions\\
+
+(2) develop an algorithm to suggest new ways of carving up a structure into anatomically distinct regions, based on spatial patterns in gene expression\\
+
+(3) create a 2-D "flat map" dataset of the mouse cerebral cortex that contains a flattened version of the Allen Mouse Brain Atlas ISH data, as well as the boundaries of cortical anatomical areas.\\
+
+In addition to validating the usefulness of the algorithms, the application of these methods to cortex will produce immediate benefits, because there are currently no known genetic markers for most cortical areas. The results of the project will support the development of new ways to selectively target cortical areas, and it will support the development of a method for identifying the cortical areal boundaries present in small tissue samples.
+
+All algorithms that we develop will be implemented in a GPL open-source software toolkit. The toolkit, as well as the machine-readable datasets developed in aim (3), will be published and freely available for others to use.
+
+
+
+\end{document}
Binary file dataSharing.pdf has changed
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/dataSharing.txt Wed Apr 22 22:24:24 2009 -0700
@@ -0,0 +1,14 @@
+\documentclass[11pt]{nih-blank}
+
+\usepackage[small,compact]{titlesec}
+
+\begin{document}
+
+== Resource sharing plan ==
+
+We are enthusiastic about the sharing of methods, data, and results, and at the conclusion of the project, we will make all of our data and computer source code publically available, either in supplemental attachments to publications, or on a website. The source code will be released under the GNU Public License. Our goal is that replicating our results, or applying the methods we develop to other targets, will be quick and easy for other investigators. In order to aid in understanding and replicating our results, we intend to include a software program which, when run, will take as input the Allen Brain Atlas raw data, and produce as output all numbers and charts found in publications resulting from the project.
+
+
+
+
+\end{document}
Binary file equipment.pdf has changed
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/equipment.txt Wed Apr 22 22:24:24 2009 -0700
@@ -0,0 +1,12 @@
+\documentclass[11pt]{nih-blank}
+
+\usepackage[small,compact]{titlesec}
+
+\begin{document}
+
+== Equipment ==
+
+This project concerns the development and application of methods for analyzing gene expression data. As such, the facilities needed are principally computers. We have a Dell Precision 4700 computer with two 3 GHz Intel Xeon processors and 8 gigabytes of RAM which will be fully dedicated to the project, and we will acquire more computers upon the commencement of the project. We can also access a shared 32-node cluster of dual Athlon computers with larger amounts of memory, running the Sun Grid Engine.
+
+
+\end{document}
Binary file facilities.pdf has changed
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/facilities.txt Wed Apr 22 22:24:24 2009 -0700
@@ -0,0 +1,14 @@
+\documentclass[11pt]{nih-blank}
+
+\usepackage[small,compact]{titlesec}
+
+\begin{document}
+
+== Facilities ==
+
+The Salk Institute is a world-class biological research institute with excellent facilities available for molecular genetics. The laboratory space available for this project is approximately 1000 square feet that includes computer facilities, equipment necessary for histology, and computer-driven microscopes.
+
+Across the street, the University of California at San Diego contains state-of-the-art computing facilities, and an excellent computer science department with some of the world's experts on machine learning and machine vision.
+
+
+\end{document}
--- a/grant-oldtext.txt Wed Apr 22 14:53:19 2009 -0700
+++ b/grant-oldtext.txt Wed Apr 22 22:24:24 2009 -0700
@@ -73,10 +73,7 @@
 
 
 
-We are enthusiastic about the sharing of methods, data, and results, and at the conclusion of the project, we will make all of our data and computer source code publically available. Our goal is that replicating our results, or applying the methods we develop to other targets, will be quick and easy for other investigators. In order to aid in understanding and replicating our results, we intend to include a software program which, when run, will take as input the Allen Brain Atlas raw data, and produce as output all numbers and charts found in publications resulting from the project.
-
-
-To aid in the replication of our results, we will include a script which takes as input the dataset in aim (3) and provides as output all of the tables in figures in our publications .
+We are enthusiastic about the sharing of methods, data, and results, and at the conclusion of the project, we will make all of our data and computer source code publically available, either in supplemental attachments to publications, or on a website. Our goal is that replicating our results, or applying the methods we develop to other targets, will be quick and easy for other investigators. In order to aid in understanding and replicating our results, we intend to include a software program which, when run, will take as input the Allen Brain Atlas raw data, and produce as output all numbers and charts found in publications resulting from the project.
 
 
 
Binary file grant.pdf has changed
--- a/grant.txt Wed Apr 22 14:53:19 2009 -0700
+++ b/grant.txt Wed Apr 22 22:24:24 2009 -0700
@@ -47,7 +47,7 @@
 
 \newpage
 
-== The challenge topic ==
+== Analysis of high dimensional data for genomic anatomy in the brain ==
 
 This proposal addresses challenge topic 06-HG-101. Massive new datasets obtained with techniques such as in situ hybridization (ISH), immunohistochemistry, in situ transgenic reporter, microarray voxelation, and others, allow the expression levels of many genes at many locations to be compared. Our goal is to develop automated methods to relate spatial variation in gene expression to anatomy. We want to find marker genes for specific anatomical regions, and also to draw new anatomical maps based on gene expression patterns.
 
@@ -252,9 +252,15 @@
 
 \vspace{0.3cm}**The Allen Mouse Brain Atlas dataset**
 
-The Allen Mouse Brain Atlas (ABA) data\cite{lein_genome-wide_2007} were produced by doing in-situ hybridization on slices of male, 56-day-old C57BL/6J mouse brains. Pictures were taken of the processed slice, and these pictures were semi-automatically analyzed to create a digital measurement of gene expression levels at each location in each slice. Per slice, cellular spatial resolution is achieved. Using this method, a single physical slice can only be used to measure one single gene; many different mouse brains were needed in order to measure the expression of many genes.
-
-Mus musculus is thought to contain about 22,000 protein-coding genes\cite{waterston_initial_2002}. The ABA contains data on about 20,000 genes in sagittal sections, out of which over 4,000 genes are also measured in coronal sections. Our dataset is derived from only the coronal subset of the ABA\footnote{The sagittal data do not cover the entire cortex, and also have greater registration error\cite{ng_anatomic_2009}. Genes were selected by the Allen Institute for coronal sectioning based on, "classes of known neuroscientific interest... or through post hoc identification of a marked non-ubiquitous expression pattern"\cite{ng_anatomic_2009}.}. An automated nonlinear alignment procedure located the 2D data from the various slices in a single 3D coordinate system. In the final 3D coordinate system, voxels are cubes with 200 microns on a side. There are 67x41x58 \= 159,326 voxels, of which 51,533 are in the brain\cite{ng_anatomic_2009}. For each voxel and each gene, the expression energy\cite{lein_genome-wide_2007} within that voxel is made available.
+%%The Allen Mouse Brain Atlas (ABA) data\cite{lein_genome-wide_2007}
+
+The Allen Mouse Brain Atlas (ABA) data were produced by doing in-situ hybridization on slices of male, 56-day-old C57BL/6J mouse brains. Pictures were taken of the processed slice, and these pictures were semi-automatically analyzed to create a digital measurement of gene expression levels at each location in each slice. Per slice, cellular spatial resolution is achieved. Using this method, a single physical slice can only be used to measure one single gene; many different mouse brains were needed in order to measure the expression of many genes.
+
+%%Mus musculus is thought to contain about 22,000 protein-coding genes\cite{waterston_initial_2002}.
+
+Mus musculus is thought to contain about 22,000 protein-coding genes. The ABA contains data on about 20,000 genes in sagittal sections, out of which over 4,000 genes are also measured in coronal sections. Our dataset is derived from only the coronal subset of the ABA\footnote{The sagittal data do not cover the entire cortex, and also have greater registration error\cite{ng_anatomic_2009}. Genes were selected by the Allen Institute for coronal sectioning based on, "classes of known neuroscientific interest... or through post hoc identification of a marked non-ubiquitous expression pattern"\cite{ng_anatomic_2009}.}. An automated nonlinear alignment procedure located the 2D data from the various slices in a single 3D coordinate system. In the final 3D coordinate system, voxels are cubes with 200 microns on a side. There are 67x41x58 \= 159,326 voxels, of which 51,533 are in the brain\cite{ng_anatomic_2009}. For each voxel and each gene, the expression energy within that voxel is made available.
+
+%% For each voxel and each gene, the expression energy\cite{lein_genome-wide_2007} within that voxel is made available.
 
 
 
@@ -537,8 +543,10 @@
 
 A future publication on the method that we develop in Aim 1 will review the scoring measures and quantitatively compare their performance in order to provide a foundation for future research of methods of marker gene finding. We will measure the robustness of the scoring measures as well as their absolute performance on our dataset.
 
+%% (including spatial models\cite{paciorek_computational_2007})
+
 \vspace{0.3cm}**Classifiers**
-We will explore and compare different classifiers. As noted above, this activity is not separate from the previous one, because some supervised learning algorithms include feature selection, and any classifier can be combined with a stepwise wrapper for use as a feature selection method. We will explore logistic regression (including spatial models\cite{paciorek_computational_2007}), decision trees\footnote{Actually, we have already begun to explore decision trees. For each cortical area, we have used the C4.5 algorithm to find a decision tree for that area. We achieved good classification accuracy on our training set, but the number of genes that appeared in each tree was too large. We plan to implement a pruning procedure to generate trees that use fewer genes.}, sparse SVMs, generative mixture models (including naive bayes), kernel density estimation, instance-based learning methods (such as k-nearest neighbor), genetic algorithms, and artificial neural networks.
+We will explore and compare different classifiers. As noted above, this activity is not separate from the previous one, because some supervised learning algorithms include feature selection, and any classifier can be combined with a stepwise wrapper for use as a feature selection method. We will explore logistic regression (including spatial models), decision trees\footnote{Actually, we have already begun to explore decision trees. For each cortical area, we have used the C4.5 algorithm to find a decision tree for that area. We achieved good classification accuracy on our training set, but the number of genes that appeared in each tree was too large. We plan to implement a pruning procedure to generate trees that use fewer genes.}, sparse SVMs, generative mixture models (including naive bayes), kernel density estimation, instance-based learning methods (such as k-nearest neighbor), genetic algorithms, and artificial neural networks.
 
 
 
@@ -567,7 +575,9 @@
 In addition to using the cluster expression prototypes directly to identify spatial regions, this might be useful as a component of dimensionality reduction. For example, one could imagine clustering similar genes and then replacing their expression levels with a single average expression level, thereby removing some redundancy from the gene expression profiles. One could then perform clustering on pixels (possibly after a second dimensionality reduction step) in order to identify spatial regions. It remains to be seen whether removal of redundancy would help or hurt the ultimate goal of identifying interesting spatial regions.
 
 \vspace{0.3cm}**Co-clustering**
-There are some algorithms which simultaneously incorporate clustering on instances and on features (in our case, genes and pixels), for example, IRM\cite{kemp_learning_2006}. These are called co-clustering or biclustering algorithms.
+There are some algorithms which simultaneously incorporate clustering on instances and on features (in our case, genes and pixels), for example, IRM. These are called co-clustering or biclustering algorithms.
+
+%%IRM\cite{kemp_learning_2006}.
 
 \vspace{0.3cm}**Radial profiles**
 We wil explore the use of the radial profile of gene expression under each pixel.
@@ -583,7 +593,9 @@
 === Apply the new methods to the cortex ===
 Using the methods developed in Aim 1, we will present, for each cortical area, a short list of markers to identify that area; and we will also present lists of "panels" of genes that can be used to delineate many areas at once.
 
-Because in most cases the ABA coronal dataset only contains one ISH per gene, it is possible for an unrelated combination of genes to seem to identify an area when in fact it is only coincidence. There are two ways we will validate our marker genes to guard against this. First, we will confirm that putative combinations of marker genes express the same pattern in both hemispheres. Second, we will manually validate our final results on other gene expression datasets such as EMAGE, GeneAtlas, and GENSAT\cite{gong_gene_2003}.
+%% GENSAT\cite{gong_gene_2003}
+
+Because in most cases the ABA coronal dataset only contains one ISH per gene, it is possible for an unrelated combination of genes to seem to identify an area when in fact it is only coincidence. There are two ways we will validate our marker genes to guard against this. First, we will confirm that putative combinations of marker genes express the same pattern in both hemispheres. Second, we will manually validate our final results on other gene expression datasets such as EMAGE, GeneAtlas, and GENSAT.
 
 Using the methods developed in Aim 2, we will present one or more hierarchical cortical maps. We will identify and explain how the statistical structure in the gene expression data led to any unexpected or interesting features of these maps, and we will provide biological hypotheses to interpret any new cortical areas, or groupings of areas, which are discovered.
 
Binary file narrative.pdf has changed
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/narrative.txt Wed Apr 22 22:24:24 2009 -0700
@@ -0,0 +1,12 @@
+\documentclass[11pt]{nih-blank}
+
+\usepackage[small,compact]{titlesec}
+
+\begin{document}
+== Public Health Relevance Statement (narrative) ==
+
+This project will lead to the discovery of marker genes that will allow drugs and medical interventions to be targeted at specific anatomical structures, opening the door to the new treatments for many diseases. These marker genes will also be useful for histological diagnosis of patients. In addition, by providing a new method to divide organs into regions, this project will significantly advance our understanding of our bodies.
+
+
+
+\end{document}
--- a/nih-blank.cls Wed Apr 22 14:53:19 2009 -0700
+++ b/nih-blank.cls Wed Apr 22 22:24:24 2009 -0700
@@ -100,5 +100,7 @@
 
 % rename the bibliography section
 %\AtBeginDocument{\renewcommand{\refname}{Literature~Cited}}
-\AtBeginDocument{\renewcommand{\refname}{Bibliography \& References~Cited}}
+%%\AtBeginDocument{\renewcommand{\refname}{Bibliography \& References~Cited}}
+%% changed by bayle shanks to literature cited
+\AtBeginDocument{\renewcommand{\refname}{Bibliography \& Literature~Cited}}
 %FIXME: something is going on with the bibliography style. Dunno what.