cg: grant.html annotate

annotate grant.html @ 29:5e2e4732b647

author	bshanks@bshanks.dyndns.org
date	Mon Apr 13 03:43:51 2009 -0700 (16 years ago)
parents	01c118d1074b
children	6ec3230fe1dc

rev	line source
bshanks@0	1 Specific aims
bshanks@15	2 Massive new datasets obtained with techniques such as in situ hybridization
bshanks@0	3 (ISH) and BAC-transgenics allow the expression levels of many genes at many
bshanks@0	4 locations to be compared. Our goal is to develop automated methods to relate
bshanks@0	5 spatial variation in gene expression to anatomy. We want to find marker genes
bshanks@0	6 for specific anatomical regions, and also to draw new anatomical maps based on
bshanks@0	7 gene expression patterns. We have three specific aims:
bshanks@17	8 (1) develop an algorithm to screen spatial gene expression data for combi-
bshanks@17	9 nations of marker genes which selectively target anatomical regions
bshanks@17	10 (2) develop an algorithm to suggest new ways of carving up a structure into
bshanks@17	11 anatomical subregions, based on spatial patterns in gene expression
bshanks@17	12 (3) create a 2-D “flat map” dataset of the mouse cerebral cortex that con-
bshanks@17	13 tains a flattened version of the Allen Mouse Brain Atlas ISH data, as well as
bshanks@17	14 the boundaries of cortical anatomical areas. Use this dataset to validate the
bshanks@17	15 methods developed in (1) and (2).
bshanks@0	16 In addition to validating the usefulness of the algorithms, the application of
bshanks@0	17 these methods to cerebral cortex will produce immediate benefits, because there
bshanks@0	18 are currently no known genetic markers for many cortical areas. The results
bshanks@0	19 of the project will support the development of new ways to selectively target
bshanks@0	20 cortical areas, and it will support the development of a method for identifying
bshanks@0	21 the cortical areal boundaries present in small tissue samples.
bshanks@0	22 All algorithms that we develop will be implemented in an open-source soft-
bshanks@0	23 ware toolkit. The toolkit, as well as the machine-readable datasets developed
bshanks@0	24 in aim (3), will be published and freely available for others to use.
bshanks@26	25 1
bshanks@26	26
bshanks@0	27 Background and significance
bshanks@0	28 Aim 1
bshanks@16	29 Machine learning terminology: supervised learning
bshanks@16	30 The task of looking for marker genes for anatomical subregions means that
bshanks@16	31 one is looking for a set of genes such that, if the expression level of those genes
bshanks@16	32 is known, then the locations of the subregions can be inferred.
bshanks@0	33 If we define the subregions so that they cover the entire anatomical structure
bshanks@0	34 to be divided, then instead of saying that we are using gene expression to find
bshanks@0	35 the locations of the subregions, we may say that we are using gene expression to
bshanks@0	36 determine to which subregion each voxel within the structure belongs. We call
bshanks@0	37 this a classification task, because each voxel is being assigned to a class (namely,
bshanks@0	38 its subregion).
bshanks@0	39 Therefore, an understanding of the relationship between the combination of
bshanks@0	40 gene expression levels and the locations of the subregions may be expressed as
bshanks@16	41 a function. The input to this function is a voxel, along with the gene expression
bshanks@0	42 levels within that voxel; the output is the subregional identity of the target
bshanks@0	43 voxel, that is, the subregion to which the target voxel belongs. We call this
bshanks@0	44 function a classifier. In general, the input to a classifier is called an instance,
bshanks@15	45 and the output is called a label (or a class label).
bshanks@0	46 The object of aim 1 is not to produce a single classifier, but rather to develop
bshanks@0	47 an automated method for determining a classifier for any known anatomical
bshanks@0	48 structure. Therefore, we seek a procedure by which a gene expression dataset
bshanks@0	49 may be analyzed in concert with an anatomical atlas in order to produce a
bshanks@0	50 classifier. Such a procedure is a type of machine learning procedure. The
bshanks@0	51 construction of the classifier is called training (also learning), and the initial
bshanks@0	52 gene expression dataset used in the construction of the classifier is called training
bshanks@0	53 data.
bshanks@0	54 In the machine learning literature, this sort of procedure may be thought
bshanks@28	55 of as a supervised learning task, defined as a task in which the goal is to learn
bshanks@0	56 a mapping from instances to labels, and the training data consists of a set of
bshanks@0	57 instances (voxels) for which the labels (subregions) are known.
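The terminology above (instances, labels, training, classifier) can be made concrete with a small sketch. This is an illustration only, not the method proposed here: it uses a nearest-centroid rule, chosen purely for brevity, and the expression values, the subregion names "A" and "B", and the function name train_nearest_centroid are all hypothetical.

```python
def train_nearest_centroid(instances, labels):
    """Training: learn one mean expression profile (centroid) per subregion
    and return a classifier, i.e. a function from an instance to a label."""
    sums, counts = {}, {}
    for x, y in zip(instances, labels):
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    centroids = {y: [s / counts[y] for s in acc] for y, acc in sums.items()}

    def classify(x):
        # Assign the voxel to the subregion whose centroid is closest.
        return min(
            centroids,
            key=lambda y: sum((a - b) ** 2 for a, b in zip(x, centroids[y])),
        )

    return classify


# Hypothetical training data: each instance is a voxel described by the
# expression levels of two genes; each label is that voxel's subregion.
training_instances = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.8]]
training_labels = ["A", "A", "B", "B"]

classifier = train_nearest_centroid(training_instances, training_labels)
label_1 = classifier([0.8, 0.2])  # a new voxel resembling subregion A
label_2 = classifier([0.2, 0.9])  # a new voxel resembling subregion B
```

The point of the sketch is the division of labor: the training procedure sees the whole labeled dataset, whereas the resulting classifier sees only the expression levels within the single voxel it is asked to label.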
bshanks@0	58 Each gene expression level is called a feature, and the selection of which
bshanks@29	59 genes1 to include is called feature selection. Feature selection is one component
bshanks@0	60 of the task of learning a classifier. Some methods for learning classifiers start
bshanks@0	61 out with a separate feature selection phase, whereas other methods combine
bshanks@0	62 feature selection with other aspects of training.
bshanks@0	63 One class of feature selection methods assigns some sort of score to each
bshanks@0	64 candidate gene. The top-ranked genes are then chosen. Some scoring measures
bshanks@0	65 can assign a score to a set of selected genes, not just to a single gene; in this
bshanks@0	66 case, a dynamic procedure may be used in which features are added and sub-
bshanks@0	67 tracted from the selected set depending on how much they raise the score. Such
bshanks@0	68 procedures are called “stepwise” or “greedy”.
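The stepwise procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions: it only adds features (a full stepwise method may also remove them), score_set stands for any scoring measure that can evaluate a set of genes, and the toy score and expression values are hypothetical.

```python
def forward_select(genes, score_set, max_genes):
    """Greedy forward selection: repeatedly add the gene that raises the
    score of the selected set the most, and stop when nothing helps."""
    selected = []
    best = score_set(selected)
    while len(selected) < max_genes:
        candidates = [g for g in genes if g not in selected]
        if not candidates:
            break
        top_score, top_gene = max((score_set(selected + [g]), g) for g in candidates)
        if top_score <= best:
            break  # no remaining gene improves the score
        selected.append(top_gene)
        best = top_score
    return selected, best


# Hypothetical toy data: binary expression of three genes at six voxels,
# and whether each voxel lies inside the target region.
expression = {
    "geneA": [1, 1, 0, 0, 0, 0],
    "geneB": [0, 0, 1, 0, 0, 0],
    "geneC": [0, 1, 0, 0, 1, 0],
}
in_region = [1, 1, 1, 0, 0, 0]


def toy_score(gene_set):
    """Fraction of voxels labeled correctly when a voxel is called
    'in region' if any selected gene is expressed there."""
    correct = 0
    for v, label in enumerate(in_region):
        predicted = int(any(expression[g][v] for g in gene_set))
        correct += predicted == label
    return correct / len(in_region)


selected, score = forward_select(list(expression), toy_score, max_genes=2)
```

In this toy case neither geneA nor geneB marks the region alone, but the greedy search finds that the pair does.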
bshanks@29	69 __________________________
bshanks@29	70 1Strictly speaking, the features are gene expression levels, but we’ll call them genes.
bshanks@29	71 2
bshanks@29	72
bshanks@0	73 Although the classifier itself may only look at the gene expression data within
bshanks@0	74 each voxel before classifying that voxel, the learning algorithm which constructs
bshanks@0	75 the classifier may look over the entire dataset. We can categorize score-based
bshanks@0	76 feature selection methods depending on how the score is calculated. Often
bshanks@0	77 the score calculation consists of assigning a sub-score to each voxel, and then
bshanks@0	78 aggregating these sub-scores into a final score (the aggregation is often a sum or
bshanks@0	79 a sum of squares). If only information from nearby voxels is used to calculate a
bshanks@0	80 voxel’s sub-score, then we say it is a local scoring method. If only information
bshanks@0	81 from the voxel itself is used to calculate a voxel’s sub-score, then we say it is a
bshanks@0	82 pointwise scoring method.
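The distinction between pointwise and local scoring can be sketched on a single row of voxels. Both functions below are illustrative assumptions rather than the scoring measures proposed in this document: each sums a per-voxel sub-score (lower is better), but the pointwise version uses only the voxel itself while the local version also uses its right-hand neighbor. The expression values are hypothetical.

```python
def pointwise_score(expr, mask):
    """Sum of sub-scores that use only each voxel's own values:
    squared mismatch between expression and region membership."""
    return sum((e - m) ** 2 for e, m in zip(expr, mask))


def local_score(expr, mask):
    """Sum of sub-scores that also use a neighboring voxel: squared
    mismatch between the local change in expression and the local
    change in region membership."""
    total = 0.0
    for i in range(len(expr) - 1):
        d_expr = expr[i + 1] - expr[i]
        d_mask = mask[i + 1] - mask[i]
        total += (d_expr - d_mask) ** 2
    return total


# Hypothetical data: the region occupies the middle two of six voxels.
mask = [0, 0, 1, 1, 0, 0]
# gene_x has the right shape but a constant background offset.
gene_x = [0.5, 0.5, 1.5, 1.5, 0.5, 0.5]
# gene_y has no background but covers only half of the region.
gene_y = [0, 0, 1, 0, 0, 0]

scores = {
    "gene_x": (pointwise_score(gene_x, mask), local_score(gene_x, mask)),
    "gene_y": (pointwise_score(gene_y, mask), local_score(gene_y, mask)),
}
```

Here the pointwise score prefers gene_y while the local score prefers gene_x, which is one simple way in which the two kinds of method can disagree.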
bshanks@0	83 Key questions when choosing a learning method are: What are the instances?
bshanks@0	84 What are the features? How are the features chosen? Here are four principles
bshanks@0	85 that outline our answers to these questions.
bshanks@29	86 Principle 1: Combinatorial gene expression. It is too much to hope
bshanks@29	87 that every anatomical region of interest will be identified by a single gene. For
bshanks@29	88 example, in the cortex, there are some areas which are not clearly delineated
bshanks@29	89 by any gene included in the Allen Brain Atlas (ABA) dataset. However, at
bshanks@29	90 least some of these areas can be delineated by looking at combinations of genes
bshanks@29	91 (an example of an area for which multiple genes are necessary and sufficient
bshanks@29	92 is provided in Preliminary Results). Therefore, each instance should contain
bshanks@29	93 multiple features (genes).
bshanks@16	94 Principle 2: Only look at combinations of small numbers of genes
bshanks@29	95 When the classifier classifies a voxel, it is only allowed to look at the expression of
bshanks@29	96 the genes which have been selected as features. The more data that is available
bshanks@29	97 to a classifier, the better it can do. For example, perhaps there are weak
bshanks@29	98 correlations over many genes that add up to a strong signal. So, why not include
bshanks@29	99 every gene as a feature? The reason is that we wish to employ the classifier in
bshanks@29	100 situations in which it is not feasible to gather data about every gene. For
bshanks@29	101 example, if we want to use the expression of marker genes as a trigger for some
bshanks@29	102 regionally-targeted intervention, then our intervention must contain a molecular
bshanks@29	103 mechanism to check the expression level of each marker gene before it triggers.
bshanks@29	104 It is currently infeasible to design a molecular trigger that checks the level of
bshanks@29	105 more than a handful of genes. Similarly, if the goal is to develop a procedure to
bshanks@29	106 do ISH on tissue samples in order to label their anatomy, then it is infeasible
bshanks@29	107 to label more than a few genes. Therefore, we must select only a few genes as
bshanks@29	108 features.
bshanks@16	109 Principle 3: Use geometry in feature selection
bshanks@16	110 When doing feature selection with score-based methods, the simplest thing
bshanks@16	111 to do would be to score the performance of each voxel by itself and then com-
bshanks@16	112 bine these scores (pointwise scoring). A more powerful approach is to also use
bshanks@16	113 information about the geometric relations between each voxel and its neighbors;
bshanks@16	114 this requires non-pointwise, local scoring methods. See Preliminary Results for
bshanks@16	115 evidence of the complementary nature of pointwise and local scoring methods.
bshanks@29	116 3
bshanks@29	117
bshanks@16	118 Principle 4: Work in 2-D whenever possible
bshanks@16	119 There are many anatomical structures which are commonly characterized in
bshanks@0	120 terms of a two-dimensional manifold. When it is known that the structure that
bshanks@0	121 one is looking for is two-dimensional, the results may be improved by allowing
bshanks@0	122 the analysis algorithm to take advantage of this prior knowledge. In addition,
bshanks@0	123 it is easier for humans to visualize and work with 2-D data.
bshanks@0	124 Therefore, when possible, the instances should represent pixels, not voxels.
bshanks@1	125 Aim 2
bshanks@16	126 Machine learning terminology: clustering
bshanks@16	127 If one is given a dataset consisting merely of instances, with no class labels,
bshanks@16	128 then analysis of the dataset is referred to as unsupervised learning in the jargon
bshanks@16	129 of machine learning. One thing that you can do with such a dataset is to group
bshanks@15	130 instances together. A set of similar instances is called a cluster, and the activity
bshanks@15	131 of grouping the data into clusters is called clustering or cluster analysis.
bshanks@15	132 The task of deciding how to carve up a structure into anatomical subregions
bshanks@15	133 can be put into these terms. The instances are once again voxels (or pixels)
bshanks@15	134 along with their associated gene expression profiles. We make the assumption
bshanks@15	135 that voxels from the same subregion have similar gene expression profiles, at
bshanks@15	136 least compared to the other subregions. This means that clustering voxels is
bshanks@15	137 the same as finding potential subregions; we seek a partitioning of the voxels
bshanks@15	138 into subregions, that is, into clusters of voxels with similar gene expression.
bshanks@15	139 It is desirable to determine not just one set of subregions, but also how
bshanks@15	140 these subregions relate to each other, if at all; perhaps some of the subregions
bshanks@15	141 are more similar to each other than to the rest, suggesting that, although at a
bshanks@15	142 fine spatial scale they could be considered separate, on a coarser spatial scale
bshanks@15	143 they could be grouped together into one large subregion. This suggests the
bshanks@15	144 outcome of clustering may be a hierarchical tree of clusters, rather than a single
bshanks@15	145 set of clusters which partition the voxels. This is called hierarchical clustering.
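A hierarchical tree of clusters can be built by repeatedly merging the two closest clusters. The sketch below is a minimal illustration, assuming single linkage and Euclidean distance (both are arbitrary choices made for brevity) and hypothetical two-gene expression profiles.

```python
def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def agglomerate(profiles):
    """Single-linkage agglomerative clustering. Returns a nested tuple
    tree whose leaves are instance indices."""
    # Each cluster is a pair: (tree, list of member indices).
    clusters = [(i, [i]) for i in range(len(profiles))]
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Distance between clusters = distance of their closest members.
                d = min(
                    euclidean(profiles[i], profiles[j])
                    for i in clusters[a][1]
                    for j in clusters[b][1]
                )
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        merged = (
            (clusters[a][0], clusters[b][0]),
            clusters[a][1] + clusters[b][1],
        )
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return clusters[0][0]


# Hypothetical expression profiles (two genes) for five voxels.
profiles = [[0, 0], [0, 1], [5, 5], [5, 6], [20, 20]]
tree = agglomerate(profiles)
```

Cutting the resulting tree near the leaves gives fine subregions such as (0, 1) and (2, 3); cutting it nearer the root groups them into one coarser subregion.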
bshanks@16	146 Similarity scores
bshanks@18	147 A crucial choice when designing a clustering method is how to measure
bshanks@18	148 similarity, across either pairs of instances, or clusters, or both. There is much
bshanks@18	149 overlap between scoring methods for feature selection (discussed above under
bshanks@18	150 Aim 1) and scoring methods for similarity.
bshanks@16	151 Spatially contiguous clusters; image segmentation
bshanks@16	152 We have shown that aim 2 is a type of clustering task. In fact, it is a
bshanks@16	153 special type of clustering task because we have an additional constraint on
bshanks@16	154 clusters: voxels grouped together into a cluster must be spatially contiguous.
bshanks@16	155 In Preliminary Results, we show that one can get reasonable results without
bshanks@16	156 enforcing this constraint; however, we plan to compare these results against
bshanks@16	157 other methods which guarantee contiguous clusters.
bshanks@15	158 Perhaps the biggest source of contiguous clustering algorithms is the field
bshanks@15	159 of computer vision, which has produced a variety of image segmentation algo-
bshanks@29	160 4
bshanks@29	161
bshanks@15	162 rithms. Image segmentation is the task of partitioning the pixels in a digital
bshanks@15	163 image into clusters, usually contiguous clusters. Aim 2 is similar to an image
bshanks@15	164 segmentation task. There are two main differences. First, in our task there are thou-
bshanks@15	165 sands of color channels (one for each gene), rather than just three. However, there
bshanks@15	166 are imaging tasks which use more than three colors, for example multispec-
bshanks@15	167 tral imaging and hyperspectral imaging, which are often used to process satellite
bshanks@15	168 imagery. A more crucial difference is that there are various cues which are ap-
bshanks@15	169 propriate for detecting sharp object boundaries in a visual scene but which are
bshanks@15	170 not appropriate for segmenting abstract spatial data such as gene expression.
bshanks@15	171 Although many image segmentation algorithms can be expected to work well
bshanks@15	172 for segmenting other sorts of spatially arranged data, some of these algorithms
bshanks@15	173 are specialized for visual images.
bshanks@16	174 Dimensionality reduction
bshanks@16	175 Unlike aim 1, there is no externally-imposed need to select only a handful
bshanks@16	176 of informative genes for inclusion in the instances. However, some clustering
bshanks@16	177 algorithms perform better on small numbers of features. There are techniques
bshanks@15	178 which “summarize” a larger number of features using a smaller number of fea-
bshanks@15	179 tures; these techniques go by the name of feature extraction or dimensionality
bshanks@15	180 reduction. The small set of features that such a technique yields is called the
bshanks@15	181 reduced feature set. After the reduced feature set is created, the instances may
bshanks@15	182 be replaced by reduced instances, which have as their features the reduced fea-
bshanks@15	183 ture set rather than the original feature set of all gene expression levels. Note
bshanks@15	184 that the features in the reduced feature set do not necessarily correspond to
bshanks@15	185 genes; each feature in the reduced set may be any function of the set of gene
bshanks@15	186 expression levels.
bshanks@15	187 Another use for dimensionality reduction is to visualize the relationships
bshanks@15	188 between subregions. For example, one might want to make a 2-D plot upon
bshanks@15	189 which each subregion is represented by a single point, and with the property
bshanks@15	190 that subregions with similar gene expression profiles should be nearby on the
bshanks@15	191 plot (that is, the property that distance between pairs of points in the plot
bshanks@15	192 should be proportional to some measure of dissimilarity in gene expression). It
bshanks@15	193 is likely that no arrangement of the points on a 2-D plane will exactly satisfy
bshanks@15	194 this property – however, dimensionality reduction techniques allow one to find
bshanks@15	195 arrangements of points that approximately satisfy that property. Note that
bshanks@15	196 in this application, dimensionality reduction is being applied after clustering;
bshanks@15	197 whereas in the previous paragraph, we were talking about using dimensionality
bshanks@15	198 reduction before clustering.
bshanks@16	199 Clustering genes rather than voxels
bshanks@16	200 Although the ultimate goal is to cluster the instances (voxels or pixels), one
bshanks@15	201 strategy to achieve this goal is to first cluster the features (genes). There are
bshanks@15	202 two ways that clusters of genes could be used.
bshanks@15	203 Gene clusters could be used as part of dimensionality reduction: rather than
bshanks@15	204 have one feature for each gene, we could have one reduced feature for each gene
bshanks@15	205 cluster.
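As a minimal sketch of this first use, an instance with one feature per gene can be replaced by a reduced instance with one feature per gene cluster. Taking the mean expression of each cluster is an assumption made for illustration (any function of the expression levels could serve), and the gene names and values are hypothetical.

```python
def reduce_by_gene_cluster(instance, gene_clusters):
    """Replace per-gene features with one reduced feature per gene cluster
    (here, the mean expression of the genes in that cluster)."""
    return [
        sum(instance[g] for g in cluster) / len(cluster)
        for cluster in gene_clusters
    ]


# Hypothetical instance: expression levels of five genes at one voxel.
instance = {"g1": 2.0, "g2": 4.0, "g3": 1.0, "g4": 0.0, "g5": 5.0}
# Hypothetical gene clusters, e.g. produced by an earlier clustering of genes.
gene_clusters = [["g1", "g2"], ["g3", "g4", "g5"]]

reduced_instance = reduce_by_gene_cluster(instance, gene_clusters)
```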
bshanks@29	206 5
bshanks@29	207
bshanks@15	208 Gene clusters could also be used to directly yield a clustering on instances.
bshanks@15	209 This is because many genes have an expression pattern which seems to pick
bshanks@15	210 out a single, spatially contiguous subregion. Therefore, it seems likely that an
bshanks@15	211 anatomically interesting subregion will have multiple genes which each individ-
bshanks@29	212 ually pick it out2. This suggests the following procedure: cluster together genes
bshanks@15	213 which pick out similar subregions, and then use the more popular common
bshanks@15	214 subregions as the final clusters. In the Preliminary Data we show that a num-
bshanks@15	215 ber of anatomically recognized cortical regions, as well as some “superregions”
bshanks@15	216 formed by lumping together a few regions, are associated with gene clusters in
bshanks@15	217 this fashion.
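The procedure suggested above can be sketched as follows. This is an illustration under stated assumptions, not the analysis reported in the Preliminary Data: each gene is represented by the set of pixels it picks out, similarity is measured by Jaccard overlap, grouping is a simple greedy pass, and the gene names, pixel sets, and the min_overlap value are hypothetical.

```python
def jaccard(a, b):
    """Overlap between two non-empty pixel sets, from 0 (disjoint) to 1 (identical)."""
    return len(a & b) / len(a | b)


def group_genes(picked, min_overlap):
    """Greedy grouping: put each gene into the first group whose
    representative region overlaps its own by at least min_overlap."""
    groups = []  # list of (representative_region, [gene names])
    for gene, region in picked.items():
        for rep, members in groups:
            if jaccard(rep, region) >= min_overlap:
                members.append(gene)
                break
        else:
            groups.append((region, [gene]))
    # The most "popular" regions (picked out by the most genes) come first.
    return sorted(groups, key=lambda g: len(g[1]), reverse=True)


# Hypothetical data: the set of pixel indices that each gene picks out.
picked = {
    "g1": {0, 1, 2},
    "g2": {0, 1, 2, 3},
    "g3": {7, 8},
    "g4": {0, 1, 2},
    "g5": {7, 8, 9},
}

groups = group_genes(picked, min_overlap=0.6)
popular_regions = [region for region, _ in groups]
gene_groups = [members for _, members in groups]
```

The regions at the head of the list are the candidates for final clusters of pixels.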
bshanks@0	218 Aim 3
bshanks@16	219 Background
bshanks@18	220 The cortex is divided into areas and layers. To a first approximation, the
bshanks@18	221 parcellation of the cortex into areas can be drawn as a 2-D map on the surface of
bshanks@18	222 the cortex. In the third dimension, the boundaries between the areas continue
bshanks@18	223 downwards into the cortical depth, perpendicular to the surface. The layer
bshanks@17	224 boundaries run parallel to the surface. One can picture an area of the cortex as
bshanks@17	225 a slice of many-layered cake.
bshanks@0	226 Although it is known that different cortical areas have distinct roles in both
bshanks@0	227 normal functioning and in disease processes, there are no known marker genes
bshanks@0	228 for many cortical areas. When it is necessary to divide a tissue sample into
bshanks@0	229 cortical areas, this is a manual process that requires a skilled human to combine
bshanks@0	230 multiple visual cues and interpret them in the context of their approximate
bshanks@0	231 location upon the cortical surface.
bshanks@0	232 Even the questions of how many areas should be recognized in cortex, and
bshanks@0	233 what their arrangement is, are still not completely settled. A proposed division
bshanks@0	234 of the cortex into areas is called a cortical map. In the rodent, the lack of a
bshanks@0	235 single agreed-upon map can be seen by contrasting the recent maps given by
bshanks@0	236 Swanson?? on the one hand, and Paxinos and Franklin?? on the other. While
bshanks@0	237 the maps are certainly very similar in their general arrangement, significant
bshanks@0	238 differences remain in the details.
bshanks@16	239 Significance
bshanks@16	240 The method developed in aim (1) will be applied to each cortical area to find
bshanks@0	241 a set of marker genes such that the combinatorial expression pattern of those
bshanks@29	242 genes uniquely picks out the target area. Finding marker genes will be useful
bshanks@29	243 for drug discovery as well as for experimentation because marker genes can be
bshanks@29	244 used to design interventions which selectively target individual cortical areas.
bshanks@27	245 __________________________
bshanks@29	246 2This would seem to contradict our finding in aim 1 that some cortical areas are combina-
bshanks@27	247 torially coded by multiple genes. However, it is possible that the currently accepted cortical
bshanks@27	248 maps divide the cortex into subregions which are unnatural from the point of view of gene
bshanks@27	249 expression; perhaps there is some other way to map the cortex for which each subregion can
bshanks@27	250 be identified by single genes.
bshanks@27	251 6
bshanks@27	252
bshanks@0	253 The application of the marker gene finding algorithm to the cortex will
bshanks@0	254 also support the development of new neuroanatomical methods. In addition to
bshanks@0	255 finding markers for each individual cortical area, we will find a small panel
bshanks@0	256 of genes that can find many of the areal boundaries at once. This panel of
bshanks@0	257 marker genes will allow the development of an ISH protocol that will allow
bshanks@0	258 experimenters to more easily identify which anatomical areas are present in
bshanks@0	259 small samples of cortex.
bshanks@0	260 The method developed in aim (3) will provide a genoarchitectonic viewpoint
bshanks@0	261 that will contribute to the creation of a better map. The development of present-
bshanks@0	262 day cortical maps was driven by the application of histological stains. It is
bshanks@0	263 conceivable that if a different set of stains had been available which identified
bshanks@0	264 a different set of features, then today’s cortical maps would have come out
bshanks@0	265 differently. Since the number of classes of stains is small compared to the number
bshanks@0	266 of genes, it is likely that there are many repeated, salient spatial patterns in
bshanks@0	267 the gene expression which have not yet been captured by any stain. Therefore,
bshanks@0	268 current ideas about cortical anatomy need to incorporate what we can learn
bshanks@0	269 from looking at the patterns of gene expression.
bshanks@0	270 While we do not here propose to analyze human gene expression data, it is
bshanks@0	271 conceivable that the methods we propose to develop could be used to suggest
bshanks@0	272 modifications to the human cortical map as well.
bshanks@0	273 Related work
bshanks@18	274 There does not appear to be much work on the automated analysis of spatial
bshanks@18	275 gene expression data.
bshanks@18	276 There is a substantial body of work on the analysis of gene expression data;
bshanks@18	277 however, most of this concerns gene expression data which is not fundamentally
bshanks@23	278 spatial.
bshanks@18	279 As noted above, there has been much work on both supervised learning and
bshanks@22	280 clustering, and there are many available algorithms for each. However, the
bshanks@22	281 completion of Aims 1 and 2 involves more than just choosing between a set of
bshanks@22	282 existing algorithms, and will constitute a substantial contribution to biology.
bshanks@22	283 The algorithms require the scientist to provide a framework for representing the
bshanks@22	284 problem domain, and the way that this framework is set up has a large impact
bshanks@22	285 on performance. Creating a good framework can require creatively reconcep-
bshanks@22	286 tualizing the problem domain, and is not merely a mechanical “fine-tuning”
bshanks@22	287 of numerical parameters. For example, we believe that domain-specific scoring
bshanks@22	288 measures (such as gradient similarity, which is discussed in Preliminary Work)
bshanks@22	289 may be necessary in order to achieve the best results in this application.
bshanks@20	290 We are aware of two existing efforts to relate spatial gene expression data to
bshanks@20	291 anatomy through computational methods.
bshanks@20	292 [?] describes an analysis of the anatomy of the hippocampus using the ABA
bshanks@20	293 dataset. In addition to manual analysis, two clustering methods were employed,
bshanks@20	294 a modified Non-negative Matrix Factorization (NNMF), and a hierarchical bifur-
bshanks@20	295 cation clustering scheme based on correlation as the similarity score. The paper
bshanks@20	296 yielded impressive results, proving the usefulness of such research. We have run
bshanks@29	297 7
bshanks@29	298
bshanks@20	299 NNMF on the cortical dataset and while the results are promising (see Prelim-
bshanks@29	300 inary Data), we think that it will be possible to find a better method3 (we also
bshanks@27	301 think that more automation of the parts that this paper’s authors did manually
bshanks@27	302 will be possible).
bshanks@27	303 and [?] describes AGEA. todo
bshanks@26	304 __________________________
bshanks@29	305 3We ran “vanilla” NNMF, whereas the paper under discussion used a modified method.
bshanks@26	306 Their main modification consisted of adding a soft spatial contiguity constraint. However,
bshanks@26	307 on our dataset, NNMF naturally produced spatially contiguous clusters, so no additional
bshanks@26	308 constraint was needed. The paper under discussion mentions that they also tried a hierarchical
bshanks@26	309 variant of NNMF, but since they didn’t report its results, we assume that those results were
bshanks@26	310 not any more impressive than the results of the non-hierarchical variant.
bshanks@26	311 8
bshanks@26	312
bshanks@25	313 Preliminary work
bshanks@25	314 Format conversion between SEV, MATLAB, NIFTI
bshanks@25	315 todo
bshanks@25	316 Flatmap of cortex
bshanks@25	317 todo
bshanks@16	318 Using combinations of multiple genes is necessary and sufficient to
bshanks@15	319 delineate some cortical areas
bshanks@16	320 Here we give an example of a cortical area which is not marked by any
bshanks@16	321 single gene, but which can be identified combinatorially. According to logistic
bshanks@29	322 regression, gene wwc14 is the best fit single gene for predicting whether or not a
bshanks@16	323 pixel on the cortical surface belongs to the motor area (area MO). The upper-left
bshanks@0	324 picture in Figure 1 shows wwc1’s spatial expression pattern over the cortex. The
bshanks@0	325 lower-right boundary of MO is represented reasonably well by this gene; however,
bshanks@0	326 the gene overshoots the upper-left boundary. This flattened 2-D representation
bshanks@0	327 does not show it, but the area corresponding to the overshoot is the medial
bshanks@0	328 surface of the cortex. MO is only found on the lateral surface (todo).
bshanks@29	329 Gene mtif25 is shown in the upper-right of Fig. 1. Mtif2 captures MO’s
bshanks@0	330 upper-left boundary, but not its lower-right boundary. Mtif2 does not express
bshanks@0	331 very much on the medial surface. By adding together the values at each pixel
bshanks@16	332 in these two figures, we get the lower-left of Figure 1. This combination captures
bshanks@16	333 area MO much better than any single gene.
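The effect of adding two expression maps pixel by pixel can be sketched on a single row of pixels. All values below are hypothetical stand-ins, not the wwc1 or mtif2 data, and the thresholding step is an assumption added only so that the maps can be compared with the area numerically.

```python
def add_maps(a, b):
    """Pixelwise sum of two expression maps."""
    return [x + y for x, y in zip(a, b)]


def threshold(values, cutoff):
    """Call a pixel 'inside the area' when its value exceeds the cutoff."""
    return [int(v > cutoff) for v in values]


def accuracy(predicted, mask):
    """Fraction of pixels whose predicted membership matches the area."""
    return sum(p == m for p, m in zip(predicted, mask)) / len(mask)


# Hypothetical area occupying the middle two of six pixels.
area = [0, 0, 1, 1, 0, 0]
# gene_p matches one boundary of the area but overshoots the other side.
gene_p = [1, 1, 1, 1, 0, 0]
# gene_q matches the opposite boundary and overshoots the other way.
gene_q = [0, 0, 1, 1, 1, 1]

acc_p = accuracy(threshold(gene_p, 0.5), area)
acc_q = accuracy(threshold(gene_q, 0.5), area)
acc_sum = accuracy(threshold(add_maps(gene_p, gene_q), 1.5), area)
```

Each gene alone mislabels the pixels on its overshooting side, whereas the sum is high only where both genes are expressed.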
bshanks@17	334 Correlation todo
bshanks@17	335 Conditional entropy todo
bshanks@17	336 Gradient similarity todo
bshanks@16	337 Geometric and pointwise scoring methods provide complementary
bshanks@16	338 information
bshanks@16	339 To show that local geometry can provide useful information that cannot be
bshanks@16	340 detected via pointwise analyses, consider Fig. 2. The top row of Fig. 2 displays the
bshanks@29	341 3 genes which most match area AUD, according to a pointwise method6. The
bshanks@21	342 bottom row displays the 3 genes which most match AUD according to a method
bshanks@29	343 which considers local geometry7. The pointwise method in the top row identifies
bshanks@26	344 __________________________
bshanks@29	345 4“WW, C2 and coiled-coil domain containing 1”; EntrezGene ID 211652
bshanks@29	346 5“mitochondrial translational initiation factor 2”; EntrezGene ID 76784
bshanks@29	347 6For each gene, a logistic regression in which the response variable was whether or not a
bshanks@21	348 surface pixel was within area AUD, and the predictor variable was the value of the expression
bshanks@21	349 of the gene underneath that pixel. The resulting scores were used to rank the genes in terms
bshanks@21	350 of how well they predict area AUD.
bshanks@29	351 7For each gene the gradient similarity (see section ??) between (a) a map of the expression
bshanks@22	352 of each gene on the cortical surface and (b) the shape of area AUD, was calculated, and this
bshanks@22	353 was used to rank the genes.
bshanks@26	354 9
bshanks@0	355
bshanks@0	356
bshanks@0	357
bshanks@0	358 Figure 1: Upper left: wwc1. Upper right: mtif2. Lower left: wwc1 + mtif2
bshanks@0	359 (each pixel’s value on the lower left is the sum of the corresponding pixels in
bshanks@0	360 the upper row). Within each picture, the vertical axis roughly corresponds to
bshanks@0	361 anterior at the top and posterior at the bottom, and the horizontal axis roughly
bshanks@0	362 corresponds to medial at the left and lateral at the right. The red outline is
bshanks@0	363 the boundary of region MO. Pixels are colored approximately according to the
bshanks@0	364 density of expressing cells underneath each pixel, with red meaning a lot of
bshanks@0	365 expression and blue meaning little.
bshanks@26	366 10
bshanks@26	367
bshanks@15	368
bshanks@15	369
bshanks@15	370 Figure 2: The top row shows the three genes which (individually) best predict
bshanks@15	371 area AUD, according to logistic regression. The bottom row shows the three
bshanks@15	372 genes which (individually) best match area AUD, according to gradient similar-
bshanks@15	373 ity. From left to right and top to bottom, the genes are Ssr1, Efcbp1, Aph1a,
bshanks@15	374 Ptk7, Aph1a again, and Lepr.
bshanks@27	375 genes which express more strongly in AUD than outside of it; its weakness is that
bshanks@27	376 this includes many genes which don’t have a salient border matching the areal
bshanks@27	377 border. The geometric method identifies genes whose salient expression border
bshanks@26	378 seems to partially line up with the border of AUD; its weakness is that this
bshanks@26	379 includes genes which don’t express over the entire area. Genes which have high
bshanks@26	380 rankings using both pointwise and border criteria, such as Aph1a in the example,
bshanks@26	381 may be particularly good markers. None of these genes are, individually, a
bshanks@26	382 perfect marker for AUD; we deliberately chose a “difficult” area in order to
bshanks@26	383 better contrast pointwise with geometric methods.
bshanks@26	384 Areas which can be identified by single genes
bshanks@26	385 todo
bshanks@18	386 Specific to Aim 1 (and Aim 3)
bshanks@17	387 Forward stepwise logistic regression todo
bshanks@17	388 SVM on all genes at once
bshanks@16	389 In order to see how well one can do when looking at all genes at once, we
bshanks@16	390 ran a support vector machine to classify cortical surface pixels based on their
bshanks@29	391 gene expression profiles. We achieved classification accuracy of about 81%8.
bshanks@16	392 As noted above, however, a classifier that looks at all the genes at once isn’t
bshanks@16	393 practically useful.
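The accuracy above was estimated by k-fold cross-validation, in which every instance is held out for testing exactly once. The sketch below shows only the splitting step and is an assumption about the mechanics: it uses contiguous, unshuffled folds for simplicity, and it does not describe how the folds were actually formed for the result reported here.

```python
def k_fold_indices(n, k):
    """Yield (train, test) index lists for k folds of nearly equal size.
    Each of the n instances appears in exactly one test fold."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size


# Example: ten instances (e.g. surface pixels) split into five folds.
folds = list(k_fold_indices(10, 5))
```

A classifier is trained on each train list and scored on the matching test list, and the reported accuracy is the average over the folds.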
bshanks@27	394 ____________
bshanks@29	395 8 5-fold cross-validation.
bshanks@27	396 11
bshanks@27	397
bshanks@16	398 The requirement to find combinations of only a small number of genes prevents
bshanks@16	399 us from straightforwardly applying many of the simplest techniques from
bshanks@17	400 the field of supervised machine learning. In the parlance of machine learning,
bshanks@17	401 our task combines feature selection with supervised learning.
bshanks@17	402 Decision trees
bshanks@17	403 todo
bshanks@18	404 Specific to Aim 2 (and Aim 3)
bshanks@18	405 Raw dimensionality reduction results
bshanks@20	406 todo
bshanks@20	407 (might want to include NNMF, since it is mentioned above)
bshanks@18	408 Dimensionality reduction plus K-means or spectral clustering
bshanks@18	409 Many areas are captured by clusters of genes
bshanks@16	410 todo
bshanks@15	411 todo
bshanks@15	414 Research plan
bshanks@18	415 todo amongst other things:
bshanks@16	416 Develop algorithms that find genetic markers for anatomical re-
bshanks@16	417 gions
bshanks@0	418 1. Develop scoring measures for evaluating how good individual genes are at
bshanks@0	419 marking areas: we will compare pointwise, geometric, and information-
bshanks@0	420 theoretic measures.
bshanks@0	421 2. Develop a procedure to find single marker genes for anatomical regions: for
bshanks@0	422 each cortical area, by using or combining the scoring measures developed,
bshanks@0	423 we will rank the genes by their ability to delineate each area.
bshanks@0	424 3. Extend the procedure to handle difficult areas by using combinatorial cod-
bshanks@0	425 ing: for areas that cannot be identified by any single gene, identify them
bshanks@0	426 with a handful of genes. We will consider both (a) algorithms that incre-
bshanks@0	427 mentally/greedily combine single gene markers into sets, such as forward
bshanks@0	428 stepwise regression and decision trees, and also (b) supervised learning
bshanks@0	429 techniques which use soft constraints to minimize the number of features,
bshanks@0	430 such as sparse support vector machines.
bshanks@0	431 4. Extend the procedure to handle difficult areas by combining or redrawing
bshanks@0	432 the boundaries: An area may be difficult to identify because the bound-
bshanks@0	433 aries are misdrawn, or because it does not “really” exist as a single area,
bshanks@0	434 at least on the genetic level. We will develop extensions to our procedure
bshanks@0	435 which (a) detect when a difficult area could be fit if its boundary were
bshanks@0	436 redrawn slightly, and (b) detect when a difficult area could be combined
bshanks@0	437 with adjacent areas to create a larger area which can be fit.
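To make steps 1 and 2 above concrete, here is a minimal sketch of one possible pointwise scoring measure: the Pearson correlation between a gene's expression over the cortical surface pixels and a binary mask of the target area, followed by a ranking of genes by that score. The choice of correlation as the score, the function names `pearson` and `rank_genes`, and the toy data are assumptions made for illustration only; geometric and information-theoretic measures are not shown.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def rank_genes(expression, mask):
    """Return gene names sorted from best to worst pointwise score for one area.

    expression -- dict mapping gene name to its expression level at each pixel
    mask       -- 1 for pixels inside the target area, 0 for pixels outside
    """
    scores = {gene: pearson(values, mask) for gene, values in expression.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Toy data (hypothetical): 6 pixels, the first three lie inside the target area.
mask = [1, 1, 1, 0, 0, 0]
expression = {
    "geneA": [0.9, 0.8, 0.9, 0.1, 0.2, 0.1],  # high inside, low outside
    "geneB": [0.5, 0.5, 0.5, 0.5, 0.5, 0.5],  # constant, uninformative
    "geneC": [0.1, 0.2, 0.1, 0.9, 0.8, 0.9],  # high outside only
}
ranking = rank_genes(expression, mask)
```

A score of this kind rewards genes expressed throughout the area and penalizes expression elsewhere, but it ignores where the expression border falls, which is why it would be compared against geometric measures.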
bshanks@16	438 Apply these algorithms to the cortex
bshanks@0	439 1. Create open-source format conversion tools: we will create tools to bulk
bshanks@0	440 download the ABA dataset and to convert between SEV, NIFTI and MATLAB
bshanks@0	441 formats.
bshanks@0	442 2. Flatmap the ABA cortex data: map the ABA data onto a plane and draw
bshanks@0	443 the cortical area boundaries onto it.
bshanks@0	444 3. Find layer boundaries: cluster similar voxels together in order to auto-
bshanks@0	445 matically find the cortical layer boundaries.
bshanks@0	446 4. Run the procedures that we developed on the cortex: we will present, for
bshanks@0	447 each area, a short list of markers to identify that area; and we will also
bshanks@0	448 present lists of “panels” of genes that can be used to delineate many areas
bshanks@0	449 at once.
bshanks@16	452 Develop algorithms to suggest a division of a structure into anatom-
bshanks@0	453 ical parts
bshanks@0	454 1. Explore dimensionality reduction algorithms applied to pixels: including
bshanks@0	455 TODO
bshanks@0	456 2. Explore dimensionality reduction algorithms applied to genes: including
bshanks@0	457 TODO
bshanks@0	458 3. Explore clustering algorithms applied to pixels: including TODO
bshanks@0	459 4. Explore clustering algorithms applied to genes: including gene shaving,
bshanks@0	460 TODO
bshanks@0	461 5. Develop an algorithm to use dimensionality reduction and/or hierarchical
bshanks@0	462 clustering to create anatomical maps
bshanks@0	463 6. Run this algorithm on the cortex: present a hierarchical, genoarchitectonic
bshanks@0	464 map of the cortex
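As an illustration of step 3 in the list above (clustering applied to pixels), the following is a minimal K-means sketch in which each pixel is represented by its gene expression vector. It is only a sketch under stated assumptions: the initial centroids are supplied by hand, no dimensionality reduction is applied first, and the function names `assign` and `kmeans` and the toy data are hypothetical.

```python
def assign(points, centroids):
    """Label each point with the index of its nearest centroid (squared Euclidean)."""
    return [
        min(
            range(len(centroids)),
            key=lambda k: sum((p - c) ** 2 for p, c in zip(pt, centroids[k])),
        )
        for pt in points
    ]


def kmeans(points, centroids, iterations=10):
    """Lloyd's algorithm: alternate assignment and centroid update until stable."""
    for _ in range(iterations):
        labels = assign(points, centroids)
        updated = []
        for k in range(len(centroids)):
            members = [pt for pt, lab in zip(points, labels) if lab == k]
            if members:
                # New centroid is the per-gene mean of the member pixels.
                updated.append([sum(col) / len(members) for col in zip(*members)])
            else:
                # Keep the old centroid if a cluster ends up empty.
                updated.append(centroids[k])
        if updated == centroids:
            break
        centroids = updated
    return assign(points, centroids), centroids


# Toy data (hypothetical): 4 pixels, each with expression levels of 2 genes.
pixels = [[0.0, 0.1], [0.1, 0.0], [0.9, 1.0], [1.0, 0.9]]
labels, centroids = kmeans(pixels, centroids=[pixels[0], pixels[2]])
```

Pixels that receive the same label would form one candidate subregion; on real data the number of clusters and the initialization would themselves need to be explored.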
bshanks@26	465 ______________________________________________
bshanks@26	466 stuff i dunno where to put yet (there is more scattered through grant-
bshanks@15	467 oldtext):
bshanks@16	468 Principle 4: Work in 2-D whenever possible
bshanks@21	469 In anatomy, the manifold of interest is usually either defined by a combina-
bshanks@21	470 tion of two relevant anatomical axes (todo), or by the surface of the structure
bshanks@21	471 (as is the case with the cortex). In the former case, the manifold of interest is
bshanks@21	472 a plane, but in the latter case it is curved. If the manifold is curved, there are
bshanks@21	473 various methods for mapping the manifold into a plane.
bshanks@22	474 The method that we will develop will begin by mapping the data into a
bshanks@22	475 2-D plane. Although the manifold that characterizes cortical areas is known
bshanks@22	476 to be the cortical surface, it remains to be seen which method of mapping the
bshanks@22	477 manifold into a plane is optimal for this application. We will compare mappings
bshanks@22	478 which attempt to preserve size (such as the one used by Caret??) with mappings
bshanks@22	479 which preserve angle (conformal maps).
bshanks@22	480 Although there is much 2-D organization in anatomy, there are also struc-
bshanks@22	481 tures whose shape is fundamentally 3-dimensional. If possible, we would like
bshanks@22	482 the method we develop to include a statistical test that warns the user if the
bshanks@22	483 assumption of 2-D structure seems to be wrong.
bshanks@22	484 if we need citations for aim 3 significance,
bshanks@22	485 http://www.sciencedirect.com/science?_ob=ArticleURL&_udi=B6WSS-4V70FHY-9&_user=4429&_coverDate=12%2F26%2F2008&_rdoc=1&_fmt=full&_orig=na&_cdi=7054&_docanchor=&_acct=C000059602&_version=1&_urlVersion=0&_userid=4429&md5=551eccc743a2bfe6e992eee0c3194203#app2
bshanks@25	488 has examples of genetic targeting to specific anatomical regions
bshanks@25	489 —
bshanks@25	490 note:
bshanks@29	491 do we need to cite: no known markers, impressive results?