Specific aims

Massive new datasets obtained with techniques such as in situ hybridization (ISH) and BAC-transgenics allow the expression levels of many genes at many locations to be compared. Our goal is to develop automated methods to relate spatial variation in gene expression to anatomy. We want to find marker genes for specific anatomical regions, and also to draw new anatomical maps based on gene expression patterns. We have three specific aims:

(1) develop an algorithm to screen spatial gene expression data for combinations of marker genes which selectively target anatomical regions

(2) develop an algorithm to suggest new ways of carving up a structure into anatomical subregions, based on spatial patterns in gene expression

(3) create a 2-D “flat map” dataset of the mouse cerebral cortex that contains a flattened version of the Allen Mouse Brain Atlas ISH data, as well as the boundaries of cortical anatomical areas. Use this dataset to validate the methods developed in (1) and (2).

In addition to validating the usefulness of the algorithms, the application of these methods to cerebral cortex will produce immediate benefits, because there are currently no known genetic markers for many cortical areas. The results of the project will support the development of new ways to selectively target cortical areas, and will support the development of a method for identifying the cortical areal boundaries present in small tissue samples.

All algorithms that we develop will be implemented in an open-source software toolkit. The toolkit, as well as the machine-readable datasets developed in aim (3), will be published and freely available for others to use.
Background and significance

Aim 1
Machine learning terminology: supervised learning

The task of looking for marker genes for anatomical subregions means that one is looking for a set of genes such that, if the expression level of those genes is known, then the locations of the subregions can be inferred.

If we define the subregions so that they cover the entire anatomical structure to be divided, then instead of saying that we are using gene expression to find the locations of the subregions, we may say that we are using gene expression to determine to which subregion each voxel within the structure belongs. We call this a classification task, because each voxel is being assigned to a class (namely, its subregion).

Therefore, an understanding of the relationship between combinations of gene expression levels and the locations of the subregions may be expressed as a function. The input to this function is a voxel, along with the gene expression levels within that voxel; the output is the subregional identity of the target voxel, that is, the subregion to which the target voxel belongs. We call this function a classifier. In general, the input to a classifier is called an instance, and the output is called a label (or a class label).

The object of aim 1 is not to produce a single classifier, but rather to develop an automated method for determining a classifier for any known anatomical structure. Therefore, we seek a procedure by which a gene expression dataset may be analyzed in concert with an anatomical atlas in order to produce a classifier. Such a procedure is a type of machine learning procedure. The construction of the classifier is called training (also learning), and the initial gene expression dataset used in the construction of the classifier is called training data.

In the machine learning literature, this sort of procedure may be thought of as a supervised learning task, defined as a task in which the goal is to learn a mapping from instances to labels, and the training data consists of a set of instances (voxels) for which the labels (subregions) are known.
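To make this terminology concrete, the following minimal sketch (Python with scikit-learn; the arrays are synthetic stand-ins rather than real expression data) trains a classifier whose instances are voxels, whose features are gene expression levels, and whose labels are subregions:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Stand-in training data: 'expression' is a (num_voxels, num_genes) matrix of
    # expression levels, and 'subregion' is a length-num_voxels vector of integer
    # subregion labels taken from an anatomical atlas.
    rng = np.random.default_rng(0)
    expression = rng.random((1000, 50))        # instances (voxels) x features (genes)
    subregion = rng.integers(0, 4, size=1000)  # label: which subregion each voxel belongs to

    # "Training" (learning) constructs the classifier from the training data.
    classifier = LogisticRegression(max_iter=1000).fit(expression, subregion)

    # The trained classifier maps a new instance (a voxel's expression levels)
    # to a predicted label (the subregion it probably belongs to).
    new_voxel = rng.random((1, 50))
    predicted_subregion = classifier.predict(new_voxel)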
Each gene expression level is called a feature, and the selection of which genes to include is called feature selection. Feature selection is one component of the task of learning a classifier. Some methods for learning classifiers start out with a separate feature selection phase, whereas other methods combine feature selection with other aspects of training.

One class of feature selection methods assigns some sort of score to each candidate gene. The top-ranked genes are then chosen. Some scoring measures can assign a score to a set of selected genes, not just to a single gene; in this case, a dynamic procedure may be used in which features are added and subtracted from the selected set depending on how much they raise the score. Such procedures are called “stepwise” or “greedy”.
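A greedy (forward stepwise) selection procedure of this kind might be sketched as follows; the set-level score used here, cross-validated classification accuracy, is only one possible choice, and the expression and label arrays are assumed to be given:

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    def greedy_forward_selection(expression, labels, max_genes=4):
        """Greedily add the gene that most raises the score of the selected set."""
        selected = []
        candidates = list(range(expression.shape[1]))

        def score(gene_set):
            # Set-level score: cross-validated accuracy of a classifier
            # restricted to the chosen genes (one possible scoring measure).
            clf = LogisticRegression(max_iter=1000)
            return cross_val_score(clf, expression[:, gene_set], labels, cv=5).mean()

        while candidates and len(selected) < max_genes:
            best_gene = max(candidates, key=lambda g: score(selected + [g]))
            selected.append(best_gene)
            candidates.remove(best_gene)
        return selected

    # Example use (with 'expression' and 'labels' as in the earlier sketch):
    # panel = greedy_forward_selection(expression, labels, max_genes=4)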
Although the classifier itself may only look at the gene expression data within each voxel before classifying that voxel, the learning algorithm which constructs the classifier may look over the entire dataset. We can categorize score-based feature selection methods depending on how the score is calculated. Often the score calculation consists of assigning a sub-score to each voxel, and then aggregating these sub-scores into a final score (the aggregation is often a sum or a sum of squares). If only information from nearby voxels is used to calculate a voxel’s sub-score, then we say it is a local scoring method. If only information from the voxel itself is used to calculate a voxel’s sub-score, then we say it is a pointwise scoring method.
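The distinction can be illustrated with a small sketch; the particular sub-scores used here are simple illustrative choices, not the scoring measures we propose to develop:

    import numpy as np
    from scipy.ndimage import uniform_filter

    def pointwise_score(expr_map, region_mask):
        """Sub-score per location uses only that location: here, agreement between
        thresholded expression and region membership (one simple choice)."""
        agrees = (expr_map > expr_map.mean()) == region_mask
        return agrees.sum()                      # aggregate sub-scores by summing

    def local_score(expr_map, region_mask, radius=1):
        """Sub-score per location also uses its neighbors: agreement between the
        neighborhood-averaged expression and region membership (one simple choice)."""
        smoothed = uniform_filter(expr_map.astype(float), size=2 * radius + 1)
        agrees = (smoothed > smoothed.mean()) == region_mask
        return agrees.sum()

    # expr_map: 2-D array holding one gene's expression; region_mask: boolean
    # array of the same shape marking the target subregion (both assumed given).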
Key questions when choosing a learning method are: What are the instances? What are the features? How are the features chosen? Here are four principles that outline our answers to these questions.
Principle 1: Combinatorial gene expression

Above, we defined an “instance” as the combination of a voxel with the “associated gene expression data”. In our case this refers to the expression levels of genes within the voxel, but should we include the expression levels of all genes, or only a few of them?

It is too much to hope that every anatomical region of interest will be identified by a single gene. For example, in the cortex, there are some areas which are not clearly delineated by any gene included in the Allen Brain Atlas (ABA) dataset. However, at least some of these areas can be delineated by looking at combinations of genes (an example of an area for which multiple genes are necessary and sufficient is provided in Preliminary Results).
Principle 2: Only look at combinations of small numbers of genes

When the classifier classifies a voxel, it is only allowed to look at the expression of the genes which have been selected as features. The more data that is available to a classifier, the better it can do. For example, perhaps there are weak correlations over many genes that add up to a strong signal. So why not include every gene as a feature? The reason is that we wish to employ the classifier in situations in which it is not feasible to gather data about every gene. For example, if we want to use the expression of marker genes as a trigger for some regionally-targeted intervention, then our intervention must contain a molecular mechanism to check the expression level of each marker gene before it triggers. It is currently infeasible to design a molecular trigger that checks the levels of more than a handful of genes. Similarly, if the goal is to develop a procedure to do ISH on tissue samples in order to label their anatomy, then it is infeasible to label more than a few genes. Therefore, we must select only a few genes as features.
Principle 3: Use geometry in feature selection

When doing feature selection with score-based methods, the simplest thing to do would be to score the performance of each voxel by itself and then combine these scores (pointwise scoring). A more powerful approach is to also use information about the geometric relations between each voxel and its neighbors; this requires non-pointwise, local scoring methods. See Preliminary Results for evidence of the complementary nature of pointwise and local scoring methods.
Principle 4: Work in 2-D whenever possible

There are many anatomical structures which are commonly characterized in terms of a two-dimensional manifold. When it is known that the structure that one is looking for is two-dimensional, the results may be improved by allowing the analysis algorithm to take advantage of this prior knowledge. In addition, it is easier for humans to visualize and work with 2-D data.

Therefore, when possible, the instances should represent pixels, not voxels.
Aim 2

Machine learning terminology: clustering

If one is given a dataset consisting merely of instances, with no class labels, then analysis of the dataset is referred to as unsupervised learning in the jargon of machine learning. One thing that you can do with such a dataset is to group instances together. A set of similar instances is called a cluster, and the activity of grouping the data into clusters is called clustering or cluster analysis.

The task of deciding how to carve up a structure into anatomical subregions can be put into these terms. The instances are once again voxels (or pixels) along with their associated gene expression profiles. We make the assumption that voxels from the same subregion have similar gene expression profiles, at least compared to the other subregions. This means that clustering voxels is the same as finding potential subregions; we seek a partitioning of the voxels into subregions, that is, into clusters of voxels with similar gene expression.

It is desirable to determine not just one set of subregions, but also how these subregions relate to each other, if at all; perhaps some of the subregions are more similar to each other than to the rest, suggesting that, although at a fine spatial scale they could be considered separate, on a coarser spatial scale they could be grouped together into one large subregion. This suggests that the outcome of clustering may be a hierarchical tree of clusters, rather than a single set of clusters which partition the voxels. This is called hierarchical clustering.
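For example, hierarchical (agglomerative) clustering of voxels by their expression profiles could be sketched as follows (Python with scipy; the profile matrix is a synthetic stand-in):

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    # Stand-in data: each row is a voxel's gene expression profile.
    rng = np.random.default_rng(0)
    profiles = rng.random((500, 200))   # 500 voxels x 200 genes

    # Agglomerative clustering builds a hierarchical tree (dendrogram) of clusters.
    tree = linkage(profiles, method="ward")

    # Cutting the tree at different levels yields coarser or finer partitions,
    # e.g. a coarse map with 5 subregions or a finer map with 20.
    coarse_labels = fcluster(tree, t=5, criterion="maxclust")
    fine_labels = fcluster(tree, t=20, criterion="maxclust")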
Similarity scores

A crucial choice when designing a clustering method is how to measure similarity, across either pairs of instances, or clusters, or both. There is much overlap between scoring methods for feature selection (discussed above under Aim 1) and scoring methods for similarity.
Spatially contiguous clusters; image segmentation

We have shown that aim 2 is a type of clustering task. In fact, it is a special type of clustering task, because we have an additional constraint on clusters: voxels grouped together into a cluster must be spatially contiguous. In Preliminary Results, we show that one can get reasonable results without enforcing this constraint; however, we plan to compare these results against other methods which guarantee contiguous clusters.

Perhaps the biggest source of contiguous clustering algorithms is the field of computer vision, which has produced a variety of image segmentation algorithms. Image segmentation is the task of partitioning the pixels in a digital image into clusters, usually contiguous clusters. Aim 2 is similar to an image segmentation task. There are two main differences. First, in our task there are thousands of color channels (one for each gene), rather than just three; there are, however, imaging tasks which use more than three colors, for example multispectral imaging and hyperspectral imaging, which are often used to process satellite imagery. A more crucial difference is that there are various cues which are appropriate for detecting sharp object boundaries in a visual scene but which are not appropriate for segmenting abstract spatial data such as gene expression. Although many image segmentation algorithms can be expected to work well for segmenting other sorts of spatially arranged data, some of these algorithms are specialized for visual images.
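As one concrete way to obtain spatially contiguous clusters, an agglomerative clustering can be constrained by a pixel-grid connectivity graph, as in this sketch (scikit-learn; the multi-channel "image" is a synthetic stand-in for flattened expression data):

    import numpy as np
    from sklearn.cluster import AgglomerativeClustering
    from sklearn.feature_extraction.image import grid_to_graph

    rng = np.random.default_rng(0)
    height, width, n_genes = 40, 30, 100
    image = rng.random((height, width, n_genes))    # stand-in: one channel per gene

    # Connectivity graph linking each pixel to its grid neighbors; constraining
    # the clustering with it guarantees spatially contiguous clusters.
    connectivity = grid_to_graph(height, width)

    X = image.reshape(height * width, n_genes)
    ward = AgglomerativeClustering(n_clusters=10, linkage="ward",
                                   connectivity=connectivity)
    segment_labels = ward.fit_predict(X).reshape(height, width)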
Dimensionality reduction

Unlike in aim 1, there is no externally imposed need to select only a handful of informative genes for inclusion in the instances. However, some clustering algorithms perform better on small numbers of features. There are techniques which “summarize” a larger number of features using a smaller number of features; these techniques go by the name of feature extraction or dimensionality reduction. The small set of features that such a technique yields is called the reduced feature set. After the reduced feature set is created, the instances may be replaced by reduced instances, which have as their features the reduced feature set rather than the original feature set of all gene expression levels. Note that the features in the reduced feature set do not necessarily correspond to genes; each feature in the reduced set may be any function of the set of gene expression levels.

Another use for dimensionality reduction is to visualize the relationships between subregions. For example, one might want to make a 2-D plot upon which each subregion is represented by a single point, with the property that subregions with similar gene expression profiles are nearby on the plot (that is, the property that the distance between pairs of points in the plot is proportional to some measure of dissimilarity in gene expression). It is likely that no arrangement of the points on a 2-D plane will exactly satisfy this property; however, dimensionality reduction techniques allow one to find arrangements of points that approximately satisfy it. Note that in this application, dimensionality reduction is being applied after clustering, whereas in the previous paragraph we were talking about using dimensionality reduction before clustering.
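Both uses of dimensionality reduction can be sketched briefly; NMF, K-means, and multidimensional scaling are used here only as representative technique choices, and the data are synthetic stand-ins:

    import numpy as np
    from sklearn.decomposition import NMF
    from sklearn.cluster import KMeans
    from sklearn.manifold import MDS

    rng = np.random.default_rng(0)
    profiles = rng.random((500, 200))            # voxels x genes (stand-in data)

    # Use 1: dimensionality reduction before clustering.
    # Each reduced feature is a function (here, a non-negative combination) of many genes.
    reduced = NMF(n_components=10, max_iter=500).fit_transform(profiles)
    labels = KMeans(n_clusters=8, n_init=10).fit_predict(reduced)

    # Use 2: dimensionality reduction after clustering, for visualization.
    # Embed the cluster centroids in 2-D so that similar clusters lie near each other.
    centroids = np.array([reduced[labels == k].mean(axis=0) for k in range(8)])
    plot_coords = MDS(n_components=2, random_state=0).fit_transform(centroids)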
Clustering genes rather than voxels

Although the ultimate goal is to cluster the instances (voxels or pixels), one strategy to achieve this goal is to first cluster the features (genes). There are two ways that clusters of genes could be used.

Gene clusters could be used as part of dimensionality reduction: rather than have one feature for each gene, we could have one reduced feature for each gene cluster.

Gene clusters could also be used to directly yield a clustering on instances. This is because many genes have an expression pattern which seems to pick out a single, spatially contiguous subregion. Therefore, it seems likely that an anatomically interesting subregion will have multiple genes which each individually pick it out. (This would seem to contradict our finding in aim 1 that some cortical areas are combinatorially coded by multiple genes. However, it is possible that the currently accepted cortical maps divide the cortex into subregions which are unnatural from the point of view of gene expression; perhaps there is some other way to map the cortex for which each subregion can be identified by single genes.) This suggests the following procedure: cluster together genes which pick out similar subregions, and then use the subregions picked out by the most genes as the final clusters. In the Preliminary Data we show that a number of anatomically recognized cortical regions, as well as some “superregions” formed by lumping together a few regions, are associated with gene clusters in this fashion.
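A sketch of this gene-clustering strategy, under the simplifying assumptions that gene similarity is measured by correlation of spatial expression patterns and that a candidate subregion is the set of voxels where a gene cluster expresses most strongly, might look like this:

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    rng = np.random.default_rng(0)
    profiles = rng.random((500, 200))   # voxels x genes (stand-in data)

    # Cluster the genes: two genes are "similar" if their spatial expression
    # patterns are correlated across voxels (correlation distance).
    gene_tree = linkage(profiles.T, method="average", metric="correlation")
    gene_cluster = fcluster(gene_tree, t=20, criterion="maxclust")

    # Each gene cluster suggests a candidate subregion: the set of voxels in
    # which its member genes are, on average, strongly expressed.
    candidate_regions = []
    for k in np.unique(gene_cluster):
        mean_map = profiles[:, gene_cluster == k].mean(axis=1)
        candidate_regions.append(mean_map > np.percentile(mean_map, 80))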
Aim 3

Background

The cortex is divided into areas and layers. To a first approximation, the parcellation of the cortex into areas can be drawn as a 2-D map on the surface of the cortex. In the third dimension, the boundaries between the areas continue downwards into the cortical depth, perpendicular to the surface. The layer boundaries run parallel to the surface. One can picture an area of the cortex as a slice of a many-layered cake.
Although it is known that different cortical areas have distinct roles in both normal functioning and in disease processes, there are no known marker genes for many cortical areas. When it is necessary to divide a tissue sample into cortical areas, this is a manual process that requires a skilled human to combine multiple visual cues and interpret them in the context of their approximate location upon the cortical surface.

Even the questions of how many areas should be recognized in cortex, and what their arrangement is, are still not completely settled. A proposed division of the cortex into areas is called a cortical map. In the rodent, the lack of a single agreed-upon map can be seen by contrasting the recent maps given by Swanson [?] on the one hand, and Paxinos and Franklin [?] on the other. While the maps are certainly very similar in their general arrangement, significant differences remain in the details.
Significance

The method developed in aim (1) will be applied to each cortical area to find a set of marker genes such that the combinatorial expression pattern of those genes uniquely picks out the target area. Finding marker genes will be useful for drug discovery as well as for experimentation, because marker genes can be used to design interventions which selectively target individual cortical areas.

The application of the marker gene finding algorithm to the cortex will also support the development of new neuroanatomical methods. In addition to finding markers for each individual cortical area, we will find a small panel of genes that can find many of the areal boundaries at once. This panel of marker genes will allow the development of an ISH protocol that will enable experimenters to more easily identify which anatomical areas are present in small samples of cortex.
The method developed in aim (2) will provide a genoarchitectonic viewpoint that will contribute to the creation of a better map. The development of present-day cortical maps was driven by the application of histological stains. It is conceivable that if a different set of stains had been available which identified a different set of features, then today’s cortical maps would have come out differently. Since the number of classes of stains is small compared to the number of genes, it is likely that there are many repeated, salient spatial patterns in the gene expression which have not yet been captured by any stain. Therefore, current ideas about cortical anatomy need to incorporate what we can learn from looking at the patterns of gene expression.

While we do not here propose to analyze human gene expression data, it is conceivable that the methods we propose to develop could be used to suggest modifications to the human cortical map as well.
Related work

There does not appear to be much work on the automated analysis of spatial gene expression data. There is a substantial body of work on the analysis of gene expression data; however, most of it concerns gene expression data which is not fundamentally spatial.

As noted above, there has been much work on both supervised learning and clustering, and there are many available algorithms for each. However, the completion of Aims 1 and 2 involves more than just choosing among existing algorithms, and will constitute a substantial contribution to biology. The algorithms require the scientist to provide a framework for representing the problem domain, and the way that this framework is set up has a large impact on performance. Creating a good framework can require creatively reconceptualizing the problem domain, and is not merely a mechanical “fine-tuning” of numerical parameters. For example, we believe that domain-specific scoring measures (such as gradient similarity, which is discussed in Preliminary Work) may be necessary in order to achieve the best results in this application.
We are aware of two existing efforts to relate spatial gene expression data to anatomy through computational methods.

[?] describes an analysis of the anatomy of the hippocampus using the ABA dataset. In addition to manual analysis, two clustering methods were employed: a modified Non-negative Matrix Factorization (NNMF), and a hierarchical bifurcation clustering scheme based on correlation as the similarity score. The paper yielded impressive results, demonstrating the usefulness of such research. We have run NNMF on the cortical dataset and, while the results are promising (see Preliminary Data), we think that it will be possible to find a better method; we also think that more automation of the parts that this paper’s authors did manually will be possible. (We ran “vanilla” NNMF, whereas the paper under discussion used a modified method. Their main modification consisted of adding a soft spatial contiguity constraint; however, on our dataset, NNMF naturally produced spatially contiguous clusters, so no additional constraint was needed. The paper under discussion mentions that they also tried a hierarchical variant of NNMF, but since they didn’t report its results, we assume that those results were not any more impressive than the results of the non-hierarchical variant.)

[?] describes AGEA. todo
Preliminary work

Format conversion between SEV, MATLAB, NIFTI

todo

Flatmap of cortex

todo
Using combinations of multiple genes is necessary and sufficient to delineate some cortical areas

Here we give an example of a cortical area which is not marked by any single gene, but which can be identified combinatorially. According to logistic regression, the gene wwc1 (“WW, C2 and coiled-coil domain containing 1”; EntrezGene ID 211652) is the best-fit single gene for predicting whether or not a pixel on the cortical surface belongs to the motor area (area MO). The upper-left picture in Figure 1 shows wwc1’s spatial expression pattern over the cortex. The lower-right boundary of MO is represented reasonably well by this gene; however, the gene overshoots the upper-left boundary. This flattened 2-D representation does not show it, but the area corresponding to the overshoot is the medial surface of the cortex. MO is only found on the lateral surface (todo).

The gene mtif2 (“mitochondrial translational initiation factor 2”; EntrezGene ID 76784) is shown in the upper right of Figure 1. Mtif2 captures MO’s upper-left boundary, but not its lower-right boundary. Mtif2 does not express very much on the medial surface. By adding together the values at each pixel in these two figures, we get the lower left of Figure 1. This combination captures area MO much better than any single gene.
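The single-gene logistic regression fits and the pixelwise combination described above could be computed along the following lines (a sketch only; the expression maps and the MO mask are assumed to be loaded elsewhere, and the scoring choice shown is illustrative):

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import log_loss

    # Assumed inputs (not our real pipeline): 'expr' maps a gene name to its
    # flattened cortical-surface expression map, and 'in_MO' is a boolean
    # vector saying whether each surface pixel lies inside area MO.
    def single_gene_fit(gene_map, in_region):
        """Score one gene by how well a logistic regression on that gene alone
        predicts region membership (lower loss = better fit)."""
        x = gene_map.reshape(-1, 1)
        model = LogisticRegression(max_iter=1000).fit(x, in_region)
        return log_loss(in_region, model.predict_proba(x))

    def rank_genes(expr, in_region):
        return sorted(expr, key=lambda g: single_gene_fit(expr[g], in_region))

    # Combining two complementary markers, as in Figure 1 (lower left):
    # combined = expr["wwc1"] + expr["mtif2"]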
Correlation

todo

Conditional entropy

todo

Gradient similarity

todo
Geometric and pointwise scoring methods provide complementary information

To show that local geometry can provide useful information that cannot be detected via pointwise analyses, consider Figure 2. The top row of Figure 2 displays the 3 genes which most match area AUD according to a pointwise method: for each gene, we ran a logistic regression in which the response variable was whether or not a surface pixel was within area AUD, and the predictor variable was the value of the expression of the gene underneath that pixel; the resulting scores were used to rank the genes in terms of how well they predict area AUD. The bottom row displays the 3 genes which most match AUD according to a method which considers local geometry: for each gene, the gradient similarity (see section ??) between (a) a map of the expression of that gene on the cortical surface and (b) the shape of area AUD was calculated, and this was used to rank the genes. The pointwise method in the top row identifies genes which express more strongly in AUD than outside of it; its weakness is that this includes many genes which don’t have a salient border matching the areal border. The geometric method identifies genes whose salient expression border seems to partially line up with the border of AUD; its weakness is that this includes genes which don’t express over the entire area.
Figure 1: Upper left: wwc1. Upper right: mtif2. Lower left: wwc1 + mtif2 (each pixel’s value on the lower left is the sum of the corresponding pixels in the upper row). Within each picture, the vertical axis roughly corresponds to anterior at the top and posterior at the bottom, and the horizontal axis roughly corresponds to medial at the left and lateral at the right. The red outline is the boundary of region MO. Pixels are colored approximately according to the density of expressing cells underneath each pixel, with red meaning a lot of expression and blue meaning little.
Figure 2: The top row shows the three genes which (individually) best predict area AUD, according to logistic regression. The bottom row shows the three genes which (individually) best match area AUD, according to gradient similarity. From left to right and top to bottom, the genes are Ssr1, Efcbp1, Aph1a, Ptk7, Aph1a again, and Lepr.

Genes which have high rankings using both pointwise and border criteria, such as Aph1a in the example, may be particularly good markers. None of these genes are, individually, a perfect marker for AUD; we deliberately chose a “difficult” area in order to better contrast pointwise with geometric methods.
Areas which can be identified by single genes

todo
Specific to Aim 1 (and Aim 3)

Forward stepwise logistic regression

todo

SVM on all genes at once
In order to see how well one can do when looking at all genes at once, we ran a support vector machine to classify cortical surface pixels based on their gene expression profiles. We achieved a classification accuracy of about 81% (5-fold cross-validation). As noted above, however, a classifier that looks at all the genes at once isn’t practically useful.
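The computation referred to here can be sketched as follows (scikit-learn; the pixel-by-gene matrix is a synthetic stand-in for the flatmapped ABA data):

    import numpy as np
    from sklearn.svm import SVC
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)
    pixels = rng.random((1000, 200))          # surface pixels x genes (stand-in data)
    area = rng.integers(0, 10, size=1000)     # cortical area label for each pixel

    # Support vector machine trained on all genes at once, scored by
    # 5-fold cross-validated classification accuracy.
    accuracy = cross_val_score(SVC(kernel="linear"), pixels, area, cv=5).mean()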
The requirement to find combinations of only a small number of genes prevents us from straightforwardly applying many of the simplest techniques from the field of supervised machine learning. In the parlance of machine learning, our task combines feature selection with supervised learning.
Decision trees

todo
Specific to Aim 2 (and Aim 3)

Raw dimensionality reduction results

todo (might want to include NNMF here, since it is mentioned above)

Dimensionality reduction plus K-means or spectral clustering

todo

Many areas are captured by clusters of genes

todo
Research plan

todo amongst other things:

Develop algorithms that find genetic markers for anatomical regions
1. Develop scoring measures for evaluating how good individual genes are at marking areas: we will compare pointwise, geometric, and information-theoretic measures.

2. Develop a procedure to find single marker genes for anatomical regions: for each cortical area, by using or combining the scoring measures developed, we will rank the genes by their ability to delineate each area.

3. Extend the procedure to handle difficult areas by using combinatorial coding: for areas that cannot be identified by any single gene, identify them with a handful of genes. We will consider both (a) algorithms that incrementally/greedily combine single gene markers into sets, such as forward stepwise regression and decision trees, and also (b) supervised learning techniques which use soft constraints to minimize the number of features, such as sparse support vector machines (a sketch of this approach follows this list).

4. Extend the procedure to handle difficult areas by combining or redrawing the boundaries: an area may be difficult to identify because the boundaries are misdrawn, or because it does not “really” exist as a single area, at least on the genetic level. We will develop extensions to our procedure which (a) detect when a difficult area could be fit if its boundary were redrawn slightly, and (b) detect when a difficult area could be combined with adjacent areas to create a larger area which can be fit.
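As an illustration of approach (b) in item 3 above, an L1 penalty can be used as a soft constraint that drives most gene weights to zero; the following sketch uses scikit-learn's L1-penalized linear SVM, with synthetic stand-in data:

    import numpy as np
    from sklearn.svm import LinearSVC

    rng = np.random.default_rng(0)
    pixels = rng.random((1000, 200))          # surface pixels x genes (stand-in data)
    in_area = rng.integers(0, 2, size=1000)   # 1 if the pixel is inside the target area

    # The L1 penalty drives most gene weights to exactly zero, so the trained
    # classifier effectively selects a small panel of marker genes.
    sparse_svm = LinearSVC(penalty="l1", dual=False, C=0.1, max_iter=5000)
    sparse_svm.fit(pixels, in_area)
    selected_genes = np.flatnonzero(sparse_svm.coef_[0])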
Apply these algorithms to the cortex

1. Create open source format conversion tools: we will create tools to bulk download the ABA dataset and to convert between SEV, NIFTI and MATLAB formats.

2. Flatmap the ABA cortex data: map the ABA data onto a plane and draw the cortical area boundaries onto it.

3. Find layer boundaries: cluster similar voxels together in order to automatically find the cortical layer boundaries.

4. Run the procedures that we developed on the cortex: we will present, for each area, a short list of markers to identify that area; and we will also present lists of “panels” of genes that can be used to delineate many areas at once.

Develop algorithms to suggest a division of a structure into anatomical parts
1. Explore dimensionality reduction algorithms applied to pixels: including TODO

2. Explore dimensionality reduction algorithms applied to genes: including TODO

3. Explore clustering algorithms applied to pixels: including TODO

4. Explore clustering algorithms applied to genes: including gene shaving, TODO

5. Develop an algorithm to use dimensionality reduction and/or hierarchical clustering to create anatomical maps

6. Run this algorithm on the cortex: present a hierarchical, genoarchitectonic map of the cortex
______________________________________________

stuff I don't know where to put yet (there is more scattered through grantoldtext):
Principle 4: Work in 2-D whenever possible

In anatomy, the manifold of interest is usually either defined by a combination of two relevant anatomical axes (todo), or by the surface of the structure (as is the case with the cortex). In the former case, the manifold of interest is a plane, but in the latter case it is curved. If the manifold is curved, there are various methods for mapping the manifold into a plane.

The method that we will develop will begin by mapping the data into a 2-D plane. Although the manifold that characterizes cortical areas is known to be the cortical surface, it remains to be seen which method of mapping the manifold into a plane is optimal for this application. We will compare mappings which attempt to preserve size (such as the one used by Caret [?]) with mappings which preserve angle (conformal maps).

Although there is much 2-D organization in anatomy, there are also structures whose shape is fundamentally 3-dimensional. If possible, we would like the method we develop to include a statistical test that warns the user if the assumption of 2-D structure seems to be wrong.
if we need citations for aim 3 significance, http://www.sciencedirect.com/science?_ob=ArticleURL&_udi=B6WSS-4V70FHY-9&_user=4429&_coverDate=12%2F26%2F2008&_rdoc=1&_fmt=full&_orig=na&_cdi=7054&_docanchor=&_acct=C000059602&_version=1&_urlVersion=0&_userid=4429&md5=551eccc743a2bfe6e992eee0c3194203#app2 has examples of genetic targeting to specific anatomical regions

—

note:
|