cg: grant.html annotate

cg

annotate grant.html @ 0:29eee29f9bc1

initial commit to hg version control repository

author	bshanks@bshanks-salk.dyndns.org
date	Sat Apr 11 19:12:32 2009 -0700 (16 years ago)
parents
children	7487ad7f5d8f

rev	line source
bshanks@0	1 Specific aims
bshanks@0	2 Massive new datasets obtained with techniques such as in situ hybridization
bshanks@0	3 (ISH) and BAC-transgenics allow the expression levels of many genes at many
bshanks@0	4 locations to be compared. Our goal is to develop automated methods to relate
bshanks@0	5 spatial variation in gene expression to anatomy. We want to find marker genes
bshanks@0	6 for specific anatomical regions, and also to draw new anatomical maps based on
bshanks@0	7 gene expression patterns. We have three specific aims:
bshanks@0	8 (1) develop an algorithm to screen spatial gene expression data for combina-
bshanks@0	9 tions of marker genes which selectively target anatomical regions
bshanks@0	10 (2) develop an algorithm to suggest new ways of carving up a structure into
bshanks@0	11 anatomical subregions, based on spatial patterns in gene expression
bshanks@0	12 (3) create a 2-D “flat map” dataset of the mouse cerebral cortex that contains
bshanks@0	13 a flattened version of the Allen Mouse Brain Atlas ISH data, as well as
bshanks@0	14 the boundaries of cortical anatomical areas. Use this dataset to validate
bshanks@0	15 the methods developed in (1) and (2).
bshanks@0	16 In addition to validating the usefulness of the algorithms, the application of
bshanks@0	17 these methods to cerebral cortex will produce immediate benefits, because there
bshanks@0	18 are currently no known genetic markers for many cortical areas. The results
bshanks@0	19 of the project will support the development of new ways to selectively target
bshanks@0	20 cortical areas, and it will support the development of a method for identifying
bshanks@0	21 the cortical areal boundaries present in small tissue samples.
bshanks@0	22 All algorithms that we develop will be implemented in an open-source soft-
bshanks@0	23 ware toolkit. The toolkit, as well as the machine-readable datasets developed
bshanks@0	24 in aim (3), will be published and freely available for others to use.
bshanks@0	25 Background and significance
bshanks@0	26 Aim 1
bshanks@0	27 Machine learning terminology
bshanks@0	28 The task of looking for marker genes for anatomical subregions means that one
bshanks@0	29 is looking for a set of genes such that, if the expression level of those genes is
bshanks@0	30 known, then the locations of the subregions can be inferred.
bshanks@0	31 If we define the subregions so that they cover the entire anatomical structure
bshanks@0	32 to be divided, then instead of saying that we are using gene expression to find
bshanks@0	33 the locations of the subregions, we may say that we are using gene expression to
bshanks@0	34 determine to which subregion each voxel within the structure belongs. We call
bshanks@0	35 this a classification task, because each voxel is being assigned to a class (namely,
bshanks@0	36 its subregion).
bshanks@0	37 Therefore, an understanding of the relationship between the combination of
bshanks@0	38 their expression levels and the locations of the subregions may be expressed as
bshanks@0	41 a function. The input to this function is a voxel, along with the gene expression
bshanks@0	42 levels within that voxel; the output is the subregional identity of the target
bshanks@0	43 voxel, that is, the subregion to which the target voxel belongs. We call this
bshanks@0	44 function a classifier. In general, the input to a classifier is called an instance,
bshanks@0	45 and the output is called a label.
bshanks@0	46 The object of aim 1 is not to produce a single classifier, but rather to develop
bshanks@0	47 an automated method for determining a classifier for any known anatomical
bshanks@0	48 structure. Therefore, we seek a procedure by which a gene expression dataset
bshanks@0	49 may be analyzed in concert with an anatomical atlas in order to produce a
bshanks@0	50 classifier. Such a procedure is a type of machine learning procedure. The
bshanks@0	51 construction of the classifier is called training (also learning), and the initial
bshanks@0	52 gene expression dataset used in the construction of the classifier is called training
bshanks@0	53 data.
bshanks@0	54 In the machine learning literature, this sort of procedure may be thought
bshanks@0	55 of as a supervised learning task, defined as a task in which the goal is to learn
bshanks@0	56 a mapping from instances to labels, and the training data consists of a set of
bshanks@0	57 instances (voxels) for which the labels (subregions) are known.
bshanks@0	58 Each gene expression level is called a feature, and the selection of which
bshanks@0	59 genes to include is called feature selection. Feature selection is one component
bshanks@0	60 of the task of learning a classifier. Some methods for learning classifiers start
bshanks@0	61 out with a separate feature selection phase, whereas other methods combine
bshanks@0	62 feature selection with other aspects of training.
bshanks@0	63 One class of feature selection methods assigns some sort of score to each
bshanks@0	64 candidate gene. The top-ranked genes are then chosen. Some scoring measures
bshanks@0	65 can assign a score to a set of selected genes, not just to a single gene; in this
bshanks@0	66 case, a dynamic procedure may be used in which features are added and sub-
bshanks@0	67 tracted from the selected set depending on how much they raise the score. Such
bshanks@0	68 procedures are called “stepwise” or “greedy”.
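
The stepwise idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name stepwise_select, the toy expression values, and the toy scoring measure (negative squared error between summed expression and a 0/1 region mask) are all hypothetical and are not taken from the proposal.

```python
def stepwise_select(candidates, score, max_features):
    """Greedy stepwise search: repeatedly apply the single add/remove move
    that most improves score(), stopping when no move improves it."""
    selected = frozenset()
    best_score = score(selected)
    while True:
        moves = []
        # Candidate moves: add one unselected gene (if there is room) ...
        if len(selected) < max_features:
            moves += [selected | {g} for g in candidates if g not in selected]
        # ... or remove one already-selected gene.
        moves += [selected - {g} for g in sorted(selected)]
        if not moves:
            break
        best_move = max(moves, key=score)
        move_score = score(best_move)
        if move_score <= best_score:
            break  # no single move raises the score
        selected, best_score = best_move, move_score
    return selected, best_score


# Hypothetical toy data: expression of three genes in four voxels, and a
# 0/1 target marking which voxels lie in the region of interest.
TARGET = [1, 1, 0, 0]
EXPRESSION = {
    "geneA": [1, 0, 0, 0],
    "geneB": [0, 1, 0, 0],
    "geneC": [1, 1, 1, 1],
}


def toy_score(genes):
    """Set-level score: negative squared error between the summed
    expression of the selected genes and the target mask."""
    error = 0
    for v, t in enumerate(TARGET):
        predicted = sum(EXPRESSION[g][v] for g in genes)
        error += (predicted - t) ** 2
    return -error


chosen, final_score = stepwise_select(sorted(EXPRESSION), toy_score, max_features=2)
print(sorted(chosen), final_score)
```

Because the score is assigned to a set rather than to a single gene, the search can pick two genes that are each poor alone but complementary together. The max_features cap is one simple way to keep the selected set small.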
bshanks@0	69 Although the classifier itself may only look at the gene expression data within
bshanks@0	70 each voxel before classifying that voxel, the learning algorithm which constructs
bshanks@0	71 the classifier may look over the entire dataset. We can categorize score-based
bshanks@0	72 feature selection methods depending on how the score is calculated. Often
bshanks@0	73 the score calculation consists of assigning a sub-score to each voxel, and then
bshanks@0	74 aggregating these sub-scores into a final score (the aggregation is often a sum or
bshanks@0	75 a sum of squares). If only information from nearby voxels is used to calculate a
bshanks@0	76 voxel’s sub-score, then we say it is a local scoring method. If only information
bshanks@0	77 from the voxel itself is used to calculate a voxel’s sub-score, then we say it is a
bshanks@0	78 pointwise scoring method.
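
To make the distinction concrete, here is a minimal Python sketch of one pointwise score and one local score on a tiny 2-D grid. Both scoring formulas are assumptions chosen for illustration (squared error per pixel, and squared error between differences to the right and lower neighbours); the proposal does not commit to these particular measures.

```python
def pointwise_score(expr, label):
    """Each pixel's sub-score uses only that pixel: -(expression - label)^2.
    The sub-scores are aggregated by summation."""
    return -sum(
        (e - l) ** 2
        for row_e, row_l in zip(expr, label)
        for e, l in zip(row_e, row_l)
    )


def local_score(expr, label):
    """Each pixel's sub-score also uses its right and lower neighbours: it
    compares the change in expression with the change in the label, so it
    rewards borders that line up rather than matching absolute levels."""
    rows, cols = len(expr), len(expr[0])
    total = 0
    for r in range(rows):
        for c in range(cols):
            for dr, dc in ((0, 1), (1, 0)):
                rr, cc = r + dr, c + dc
                if rr < rows and cc < cols:
                    d_expr = expr[rr][cc] - expr[r][c]
                    d_label = label[rr][cc] - label[r][c]
                    total -= (d_expr - d_label) ** 2
    return total


# Hypothetical data: a region mask, and a gene whose expression follows the
# region's border exactly but sits on a constant background of 0.5.
LABEL = [[1, 1, 0],
         [1, 1, 0]]
OFFSET_GENE = [[1.5, 1.5, 0.5],
               [1.5, 1.5, 0.5]]

print(pointwise_score(OFFSET_GENE, LABEL))
print(local_score(OFFSET_GENE, LABEL))
```

In this toy case the pointwise score penalizes the constant background at every pixel, while the local score is unaffected by it because the border of the expression pattern coincides with the border of the region.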
bshanks@0	79 Key questions when choosing a learning method are: What are the instances?
bshanks@0	80 What are the features? How are the features chosen? Here are four principles
bshanks@0	81 that outline our answers to these questions.
bshanks@0	82 Principle 1: Combinatorial gene expression
bshanks@0	83 Above, we defined an “instance” as the combination of a voxel with the “asso-
bshanks@0	84 ciated gene expression data”. In our case this refers to the expression level of
bshanks@0	87 genes within the voxel, but should we include the expression levels of all genes,
bshanks@0	88 or only a few of them?
bshanks@0	89 It is too much to hope that every anatomical region of interest will be iden-
bshanks@0	90 tified by a single gene. For example, in the cortex, there are some areas which
bshanks@0	91 are not clearly delineated by any gene included in the Allen Brain Atlas (ABA)
bshanks@0	92 dataset. However, at least some of these areas can be delineated by looking
bshanks@0	93 at combinations of genes (an example of an area for which multiple genes are
bshanks@0	94 necessary and sufficient is provided in Preliminary Results).
bshanks@0	95 Principle 2: Only look at combinations of small numbers of genes
bshanks@0	96 When the classifier classifies a voxel, it is only allowed to look at the expression of
bshanks@0	97 the genes which have been selected as features. The more data that is available
bshanks@0	98 to a classifier, the better that it can do. For example, perhaps there are weak
bshanks@0	99 correlations over many genes that add up to a strong signal. So, why not include
bshanks@0	100 every gene as a feature? The reason is that we wish to employ the classifier in
bshanks@0	101 situations in which it is not feasible to gather data about every gene. For
bshanks@0	102 example, if we want to use the expression of marker genes as a trigger for some
bshanks@0	103 regionally-targeted intervention, then our intervention must contain a molecular
bshanks@0	104 mechanism to check the expression level of each marker gene before it triggers.
bshanks@0	105 It is currently infeasible to design a molecular trigger that checks the level of
bshanks@0	106 more than a handful of genes. Similarly, if the goal is to develop a procedure to
bshanks@0	107 do ISH on tissue samples in order to label their anatomy, then it is infeasible
bshanks@0	108 to label more than a few genes. Therefore, we must select only a few genes as
bshanks@0	109 features.
bshanks@0	110 Principle 3: Use geometry in feature selection
bshanks@0	111 When doing feature selection with score-based methods, the simplest thing to
bshanks@0	112 do would be to score the performance of each voxel by itself and then combine
bshanks@0	113 these scores; this is pointwise scoring. A more powerful approach is to also use
bshanks@0	114 information about the geometric relations between each voxel and its neighbors;
bshanks@0	115 this requires non-pointwise, local scoring methods. See Preliminary Results for
bshanks@0	116 evidence of the complementary nature of pointwise and local scoring methods.
bshanks@0	117 Principle 4: Work in 2-D whenever possible
bshanks@0	118 There are many anatomical structures which are commonly characterized in
bshanks@0	119 terms of a two-dimensional manifold. When it is known that the structure that
bshanks@0	120 one is looking for is two-dimensional, the results may be improved by allowing
bshanks@0	121 the analysis algorithm to take advantage of this prior knowledge. In addition,
bshanks@0	122 it is easier for humans to visualize and work with 2-D data.
bshanks@0	123 Therefore, when possible, the instances should represent pixels, not voxels.
bshanks@0	126 Aim 3
bshanks@0	127 Background
bshanks@0	128 The cortex is divided into areas and layers. To a first approximation, the par-
bshanks@0	129 cellation of the cortex into areas can be drawn as a 2-D map on the surface
bshanks@0	130 of the cortex. In the third dimension, the boundaries between the areas con-
bshanks@0	131 tinue downwards into the cortical depth, perpendicular to the surface. The layer
bshanks@0	132 boundaries run parallel to the surface. One can picture an area of the cortex as
bshanks@0	133 a slice of many-layered cake.
bshanks@0	134 Although it is known that different cortical areas have distinct roles in both
bshanks@0	135 normal functioning and in disease processes, there are no known marker genes
bshanks@0	136 for many cortical areas. When it is necessary to divide a tissue sample into
bshanks@0	137 cortical areas, this is a manual process that requires a skilled human to combine
bshanks@0	138 multiple visual cues and interpret them in the context of their approximate
bshanks@0	139 location upon the cortical surface.
bshanks@0	140 Even the questions of how many areas should be recognized in cortex, and
bshanks@0	141 what their arrangement is, are still not completely settled. A proposed division
bshanks@0	142 of the cortex into areas is called a cortical map. In the rodent, the lack of a
bshanks@0	143 single agreed-upon map can be seen by contrasting the recent maps given by
bshanks@0	144 Swanson?? on the one hand, and Paxinos and Franklin?? on the other. While
bshanks@0	145 the maps are certainly very similar in their general arrangement, significant
bshanks@0	146 differences remain in the details.
bshanks@0	147 Significance
bshanks@0	148 The method developed in aim (1) will be applied to each cortical area to find
bshanks@0	149 a set of marker genes such that the combinatorial expression pattern of those
bshanks@0	150 genes uniquely picks out the target area. Finding marker genes will be useful
bshanks@0	151 for drug discovery as well as for experimentation because marker genes can be
bshanks@0	152 used to design interventions which selectively target individual cortical areas.
bshanks@0	153 The application of the marker gene finding algorithm to the cortex will
bshanks@0	154 also support the development of new neuroanatomical methods. In addition to
bshanks@0	155 finding markers for each individual cortical area, we will find a small panel
bshanks@0	156 of genes that can find many of the areal boundaries at once. This panel of
bshanks@0	157 marker genes will allow the development of an ISH protocol that will allow
bshanks@0	158 experimenters to more easily identify which anatomical areas are present in
bshanks@0	159 small samples of cortex.
bshanks@0	160 The method developed in aim (3) will provide a genoarchitectonic viewpoint
bshanks@0	161 that will contribute to the creation of a better map. The development of present-
bshanks@0	162 day cortical maps was driven by the application of histological stains. It is
bshanks@0	163 conceivable that if a different set of stains had been available which identified
bshanks@0	164 a different set of features, then today’s cortical maps would have come out
bshanks@0	165 differently. Since the number of classes of stains is small compared to the number
bshanks@0	166 of genes, it is likely that there are many repeated, salient spatial patterns in
bshanks@0	167 the gene expression which have not yet been captured by any stain. Therefore,
bshanks@0	170 current ideas about cortical anatomy need to incorporate what we can learn
bshanks@0	171 from looking at the patterns of gene expression.
bshanks@0	172 While we do not here propose to analyze human gene expression data, it is
bshanks@0	173 conceivable that the methods we propose to develop could be used to suggest
bshanks@0	174 modifications to the human cortical map as well.
bshanks@0	175 Related work
bshanks@0	176 Preliminary work
bshanks@0	177 Justification of principles 1 through 3
bshanks@0	178 Principle 1: Combinatorial gene expression
bshanks@0	179 Here we give an example of a cortical area which is not marked by any single
bshanks@0	180 gene, but which can be identified combinatorially. According to logistic
bshanks@0	181 regression, gene wwc1 (footnote 1) is the best-fit single gene for predicting whether or not a pixel
bshanks@0	182 on the cortical surface belongs to the motor area (area MO). The upper-left
bshanks@0	183 picture in Figure 1 shows wwc1’s spatial expression pattern over the cortex. The
bshanks@0	184 lower-right boundary of MO is represented reasonably well by this gene; however,
bshanks@0	185 the gene overshoots the upper-left boundary. This flattened 2-D representation
bshanks@0	186 does not show it, but the area corresponding to the overshoot is the medial
bshanks@0	187 surface of the cortex. MO is only found on the lateral surface (todo).
bshanks@0	188 Gene mtif2 (footnote 2) is shown in the upper-right of Figure 1. Mtif2 captures MO’s
bshanks@0	189 upper-left boundary, but not its lower-right boundary. Mtif2 does not express
bshanks@0	190 very much on the medial surface. By adding together the values at each pixel
bshanks@0	191 in these two figures, we get the lower-left of Figure 1. This combination captures
bshanks@0	192 area MO much better than any single gene.
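
The following Python sketch illustrates the same effect on made-up numbers. The expression values, the thresholds, and the names GENE_X and GENE_Y are hypothetical stand-ins, not the measured wwc1 and mtif2 data; the point is only that two genes which each overshoot a different boundary can sum to a pattern that matches the area.

```python
# Hypothetical expression along one row of cortical-surface pixels.
IN_AREA = [0, 1, 1, 0]   # 1 where the pixel lies inside the target area
GENE_X = [1, 1, 1, 0]    # overshoots the left boundary
GENE_Y = [0, 1, 1, 1]    # overshoots the right boundary


def accuracy(values, threshold, truth):
    """Fraction of pixels classified correctly when a pixel is called
    'inside' whenever its value exceeds the threshold."""
    predicted = [1 if v > threshold else 0 for v in values]
    return sum(p == t for p, t in zip(predicted, truth)) / len(truth)


# Pixelwise sum of the two genes, as in the lower-left panel described above.
combined = [x + y for x, y in zip(GENE_X, GENE_Y)]

acc_x = accuracy(GENE_X, 0.5, IN_AREA)
acc_y = accuracy(GENE_Y, 0.5, IN_AREA)
acc_sum = accuracy(combined, 1.5, IN_AREA)
print(acc_x, acc_y, acc_sum)
```

Thresholding the sum is an assumption added here to turn the combined image into a yes/no prediction; the text above only describes adding the two images.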
bshanks@0	193 Principle 2: Only look at combinations of small numbers of genes
bshanks@0	194 In order to see how well one can do when looking at all genes at once, we ran
bshanks@0	195 a support vector machine to classify cortical surface pixels based on their gene
bshanks@0	196 expression profiles. We achieved classification accuracy of about 81% (footnote 3). As noted
bshanks@0	197 above, however, a classifier that looks at all the genes at once isn’t practically
bshanks@0	198 useful.
bshanks@0	199 The requirement to find combinations of only a small number of genes limits
bshanks@0	200 us from straightforwardly applying many of the most simple techniques from
bshanks@0	201 the field of supervised machine learning. In the parlance of machine learning,
bshanks@0	202 our task combines feature selection with supervised learning.
bshanks@0	203 __________________________
bshanks@0	204 1“WW, C2 and coiled-coil domain containing 1”; EntrezGene ID 211652
bshanks@0	205 2“mitochondrial translational initiation factor 2”; EntrezGene ID 76784
bshanks@0	206 3Using the Shogun SVM package (todo:cite), with parameters type=GMNPSVM
bshanks@0	207 (multi-class b-SVM), kernel = gaussian with sigma = 0.1, c = 10, epsilon = 1e-1 – these are the
bshanks@0	208 first parameters we tried, so presumably performance would improve with different choices of
bshanks@0	209 parameters. 5-fold cross-validation.
bshanks@0	212
bshanks@0	213
bshanks@0	214 Figure 1: Upper left: wwc1. Upper right: mtif2. Lower left: wwc1 + mtif2
bshanks@0	215 (each pixel’s value on the lower left is the sum of the corresponding pixels in
bshanks@0	216 the upper row). Within each picture, the vertical axis roughly corresponds to
bshanks@0	217 anterior at the top and posterior at the bottom, and the horizontal axis roughly
bshanks@0	218 corresponds to medial at the left and lateral at the right. The red outline is
bshanks@0	219 the boundary of region MO. Pixels are colored approximately according to the
bshanks@0	220 density of expressing cells underneath each pixel, with red meaning a lot of
bshanks@0	221 expression and blue meaning little.
bshanks@0	223
bshanks@0	224
bshanks@0	225
bshanks@0	226 Figure 2: The top row shows the three genes which (individually) best predict
bshanks@0	227 area AUD, according to logistic regression. The bottom row shows the three
bshanks@0	228 genes which (individually) best match area AUD, according to gradient similar-
bshanks@0	229 ity. From left to right and top to bottom, the genes are Ssr1, Efcbp1, Aph1a,
bshanks@0	230 Ptk7, Aph1a again, and Lepr
bshanks@0	231 Principle 3: Use geometry
bshanks@0	232 To show that local geometry can provide useful information that cannot be
bshanks@0	233 detected via pointwise analyses, consider Figure 2. The top row of Figure 2 displays
bshanks@0	234 the 3 genes which most match area AUD, according to a pointwise method (footnote 4). The
bshanks@0	235 bottom row displays the 3 genes which most match AUD according to a method
bshanks@0	236 which considers local geometry (footnote 5). The pointwise method in the top row identifies
bshanks@0	237 genes which express more strongly in AUD than outside of it; its weakness is that
bshanks@0	238 this includes many areas which don’t have a salient border matching the areal
bshanks@0	239 border. The geometric method identifies genes whose salient expression border
bshanks@0	240 seems to partially line up with the border of AUD; its weakness is that this
bshanks@0	241 includes genes which don’t express over the entire area. Genes which have high
bshanks@0	242 rankings using both pointwise and border criteria, such as Aph1a in the example,
bshanks@0	243 may be particularly good markers. None of these genes are, individually, a
bshanks@0	244 perfect marker for AUD; we deliberately chose a “difficult” area in order to
bshanks@0	245 better contrast pointwise with geometric methods.
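
As a rough illustration of the pointwise ranking used for the top row, the Python sketch below fits a one-variable logistic regression per gene and ranks genes by fit. It is a sketch under assumptions: the toy expression values are invented, the fit uses plain gradient ascent, and "score" is taken to be the mean log-likelihood, which the text does not specify.

```python
import math


def fit_logistic(x, y, steps=2000, lr=0.5):
    """Fit p(y=1 | x) = sigmoid(w*x + b) by gradient ascent on the
    mean log-likelihood. Returns (w, b)."""
    w = b = 0.0
    n = len(x)
    for _ in range(steps):
        grad_w = grad_b = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(w * xi + b)))
            grad_w += (yi - p) * xi
            grad_b += yi - p
        w += lr * grad_w / n
        b += lr * grad_b / n
    return w, b


def mean_log_likelihood(x, y, w, b):
    """Mean of y*z - log(1 + exp(z)) with z = w*x + b; higher is better."""
    total = 0.0
    for xi, yi in zip(x, y):
        z = w * xi + b
        total += yi * z - math.log1p(math.exp(z))
    return total / len(x)


# Hypothetical data: whether each surface pixel is inside the area, and the
# expression of two made-up genes underneath each pixel.
IN_AREA = [1, 1, 1, 0, 0, 0]
GENES = {
    "gene_good": [0.9, 0.8, 0.7, 0.2, 0.1, 0.3],
    "gene_poor": [0.5, 0.1, 0.9, 0.6, 0.2, 0.8],
}

scores = {}
for name, expr in GENES.items():
    w, b = fit_logistic(expr, IN_AREA)
    scores[name] = mean_log_likelihood(expr, IN_AREA, w, b)

# Rank genes from best to worst pointwise fit.
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

Each pixel contributes to the score using only its own expression value and label, which is what makes this a pointwise method in the sense defined earlier.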
bshanks@0	246 __________________________
bshanks@0	247 4For each gene, a logistic regression in which the response variable was whether or not a
bshanks@0	248 surface pixel was within area AUD, and the predictor variable was the value of the expression
bshanks@0	249 of the gene underneath that pixel. The resulting scores were used to rank the genes in terms
bshanks@0	250 of how well they predict area AUD.
bshanks@0	251 5For each gene the gradient similarity (see section ??) between (a) a map of the expression
bshanks@0	252 of each gene on the cortical surface and (b) the shape of area AUD, was calculated, and this
bshanks@0	253 was used to rank the genes.
bshanks@0	256 Principle 4: Work in 2-D whenever possible
bshanks@0	257 In anatomy, the manifold of interest is usually either defined by a combination
bshanks@0	258 of two relevant anatomical axes (todo), or by the surface of the structure (as is
bshanks@0	259 the case with the cortex). In the former case, the manifold of interest is a plane,
bshanks@0	260 but in the latter case it is curved. If the manifold is curved, there are various
bshanks@0	261 methods for mapping the manifold into a plane.
bshanks@0	262 The method that we will develop will begin by mapping the data into a
bshanks@0	263 2-D plane. Although the manifold that characterizes cortical areas is known
bshanks@0	264 to be the cortical surface, it remains to be seen which method of mapping the
bshanks@0	265 manifold into a plane is optimal for this application. We will compare mappings
bshanks@0	266 which attempt to preserve size (such as the one used by Caret??) with mappings
bshanks@0	267 which preserve angle (conformal maps).
bshanks@0	268 Although there is much 2-D organization in anatomy, there are also struc-
bshanks@0	269 tures whose shape is fundamentally 3-dimensional. If possible, we would like
bshanks@0	270 the method we develop to include a statistical test that warns the user if the
bshanks@0	271 assumption of 2-D structure seems to be wrong.
bshanks@0	272 ——
bshanks@0	273 Massive new datasets obtained with techniques such as in situ hybridization
bshanks@0	274 (ISH) and BAC-transgenics allow the expression levels of many genes at many
bshanks@0	275 locations to be compared. This can be used to find marker genes for specific
bshanks@0	276 anatomical structures, as well as to draw new anatomical maps. Our goal is
bshanks@0	277 to develop automated methods to relate spatial variation in gene expression to
bshanks@0	278 anatomy. We have five specific aims:
bshanks@0	279 (1) develop an algorithm to screen spatial gene expression data for combi-
bshanks@0	280 nations of marker genes which selectively target individual anatomical
bshanks@0	281 structures
bshanks@0	282 (2) develop an algorithm to screen spatial gene expression data for combina-
bshanks@0	283 tions of marker genes which can be used to delineate most of the bound-
bshanks@0	284 aries between a number of anatomical structures at once
bshanks@0	285 (3) develop an algorithm to suggest new ways of dividing a structure up into
bshanks@0	286 anatomical subregions, based on spatial patterns in gene expression
bshanks@0	287 (4) create a flat (2-D) map of the mouse cerebral cortex that contains a flat-
bshanks@0	288 tened version of the Allen Mouse Brain Atlas ISH dataset, as well as the
bshanks@0	289 boundaries of anatomical areas within the cortex. For each cortical layer,
bshanks@0	290 a layer-specific flat dataset will be created. A single combined flat dataset
bshanks@0	291 will be created which averages information from all of the layers. These
bshanks@0	292 datasets will be made available in both MATLAB and Caret formats.
bshanks@0	293 (5) validate the methods developed in (1), (2) and (3) by applying them to
bshanks@0	294 the cerebral cortex datasets created in (4)
bshanks@0	295 All algorithms that we develop will be implemented in an open-source soft-
bshanks@0	296 ware toolkit. The toolkit, as well as the machine-readable datasets developed in
bshanks@0	299 aim (4) and any other intermediate dataset we produce, will be published and
bshanks@0	300 freely available for others to use.
bshanks@0	301 In addition to developing generally useful methods, the application of these
bshanks@0	302 methods to cerebral cortex will produce immediate benefits that are only one
bshanks@0	303 step removed from clinical application, while also supporting the development
bshanks@0	304 of new neuroanatomical techniques. The method developed in aim (1) will be
bshanks@0	305 applied to each cortical area to find a set of marker genes. Currently, despite
bshanks@0	306 the distinct roles of different cortical areas in both normal functioning and
bshanks@0	307 disease processes, there are no known marker genes for many cortical areas.
bshanks@0	308 Finding marker genes will be immediately useful for drug discovery as well as for
bshanks@0	309 experimentation because once marker genes for an area are known, interventions
bshanks@0	310 can be designed which selectively target that area.
bshanks@0	311 The method developed in aim (2) will be used to find a small panel of genes
bshanks@0	312 that can find most of the boundaries between areas in the cortex. Today, finding
bshanks@0	313 cortical areal boundaries in a tissue sample is a manual process that requires a
bshanks@0	314 skilled human to combine multiple visual cues over a large area of the cortical
bshanks@0	315 surface. A panel of marker genes will allow the development of an ISH protocol
bshanks@0	316 that will allow experimenters to more easily identify which anatomical areas are
bshanks@0	317 present in small samples of cortex.
bshanks@0	318 For each cortical layer, a layer-specific flat dataset will be created. A single
bshanks@0	319 combined flat dataset will be created which averages information from all of
bshanks@0	320 the layers. These datasets will be made available in both MATLAB and Caret
bshanks@0	321 formats.
bshanks@0	322 —-
bshanks@0	323 New techniques allow the expression levels of many genes at many locations
bshanks@0	324 to be compared. It is thought that even neighboring anatomical structures have
bshanks@0	325 different gene expression profiles. We propose to develop automated methods
bshanks@0	326 to relate the spatial variation in gene expression to anatomy. We will develop
bshanks@0	327 two kinds of techniques:
bshanks@0	328 (a) techniques to screen for combinations of marker genes which selectively
bshanks@0	329 target anatomical structures
bshanks@0	330 (b) techniques to suggest new ways of dividing a structure up into anatomical
bshanks@0	331 subregions, based on the shapes of contours in the gene expression
bshanks@0	332 The first kind of technique will be helpful for finding marker genes associated
bshanks@0	333 with known anatomical features. The second kind of technique will be helpful in
bshanks@0	334 creating new anatomical maps, maps which reflect differences in gene expression
bshanks@0	335 the same way that existing maps reflect differences in histology.
bshanks@0	336 We intend to develop our techniques using the adult mouse cerebral cortex
bshanks@0	337 as a testbed. The Allen Brain Atlas has collected a dataset containing the
bshanks@0	338 expression level of about 4000 genes* over a set of over 150000 voxels, with a
bshanks@0	339 spatial resolution of approximately 200 microns[?].
bshanks@0	340 We expect to discover sets of marker genes that pick out specific cortical
bshanks@0	341 areas. This will allow the development of drugs and other interventions that
bshanks@0	342 selectively target individual cortical areas. Therefore our research will lead
bshanks@0	345 to application in drug discovery, in the development of other targeted clinical
bshanks@0	346 interventions, and in the development of new experimental techniques.
bshanks@0	347 The best way to divide up rodent cortex into areas has not been completely
bshanks@0	348 determined, as can be seen by the differences in the recent maps given by Swan-
bshanks@0	349 son on the one hand, and Paxinos and Franklin on the other. It is likely that our
bshanks@0	350 study, by showing which areal divisions naturally follow from gene expression
bshanks@0	351 data, as opposed to traditional histological data, will contribute to the creation
bshanks@0	352 of a better map. While we do not here propose to analyze human gene expres-
bshanks@0	353 sion data, it is conceivable that the methods we propose to develop could be
bshanks@0	354 used to suggest modifications to the human cortical map as well.
bshanks@0	355 In the following, we will only be talking about coronal data.
bshanks@0	356 The Allen Brain Atlas provides “Smoothed Energy Volumes”, which are
bshanks@0	357 One type of artifact in the Allen Brain Atlas data is what we call a “slice
bshanks@0	358 artifact”. We have noticed two types of slice artifacts in the dataset. The first
bshanks@0	359 type, a “missing slice artifact”, occurs when the ISH procedure on a slice did
bshanks@0	360 not come out well. In this case, the Allen Brain investigators excluded the slice
bshanks@0	361 at issue from the dataset. This means that no gene expression information is
bshanks@0	362 available for that gene for the region of space covered by that slice. This results
bshanks@0	363 in an expression level of zero being assigned to voxels covered by the slice. This
bshanks@0	364 is partially but not completely ameliorated by the smoothing that is applied to
bshanks@0	365 create the Smoothed Energy Volumes. The usual end result is that a region of
bshanks@0	366 space which is shaped and oriented like a coronal slice is marked as having less
bshanks@0	367 gene expression than surrounding regions.
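
One simple way to flag candidate missing-slice artifacts is sketched below in Python. This is a hypothetical heuristic, not a method stated in the proposal: it assumes the volume for one gene is stored as a list of coronal slices (each a 2-D list of expression values) and flags any interior slice whose mean expression is well below the average of its two neighbours. The function names and the ratio parameter are illustrative.

```python
def slice_means(volume):
    """Mean expression of each coronal slice in a [slice][row][col] volume."""
    means = []
    for coronal in volume:
        values = [v for row in coronal for v in row]
        means.append(sum(values) / len(values))
    return means


def suspect_slices(volume, ratio=0.5):
    """Indices of interior slices whose mean is below `ratio` times the
    average mean of the two adjacent slices."""
    means = slice_means(volume)
    flagged = []
    for i in range(1, len(means) - 1):
        neighbours = (means[i - 1] + means[i + 1]) / 2
        if neighbours > 0 and means[i] < ratio * neighbours:
            flagged.append(i)
    return flagged


def uniform_slice(value):
    """Helper for the toy example: a 2x2 slice filled with one value."""
    return [[value, value], [value, value]]


# Hypothetical volume of five slices; the middle one is dimmed, as a
# smoothed missing slice might appear.
toy_volume = [uniform_slice(v) for v in (4, 4, 1, 4, 4)]
print(suspect_slices(toy_volume))
```

A real region of low expression that happens to be slice-shaped would also be flagged, so the output is best treated as a list of slices to inspect rather than a definitive diagnosis.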
bshanks@0	368 The second type of slice artifact is caused by the fact that all of the slices
bshanks@0	369 have a consistent orientation. Since there may be artifacts (such as how well
bshanks@0	370 the ISH worked) which are constant within each slice but which vary between
bshanks@0	371 different slices, the result is that ceteris paribus, when one compares the genetic
bshanks@0	372 data of a voxel to another voxel within the same coronal plane, one would expect
bshanks@0	373 to find more similarity than if one compared a voxel to another voxel displaced
bshanks@0	374 along the rostrocaudal axis.
bshanks@0	375 We are enthusiastic about the sharing of methods, data, and results, and
bshanks@0	376 at the conclusion of the project, we will make all of our data and computer
bshanks@0	377 source code publicly available. Our goal is that replicating our results, or
bshanks@0	378 applying the methods we develop to other targets, will be quick and easy for
bshanks@0	379 other investigators. In order to aid in understanding and replicating our results,
bshanks@0	380 we intend to include a software program which, when run, will take as input
bshanks@0	381 the Allen Brain Atlas raw data, and produce as output all numbers and charts
bshanks@0	382 found in publications resulting from the project.
bshanks@0	383 To aid in the replication of our results, we will include a script which takes
bshanks@0	384 as input the dataset in aim (3) and provides as output all of the tables and figures
bshanks@0	385 in our publications.
bshanks@0	386 We also expect to weigh in on the debate about how to best partition rodent
bshanks@0	387 cortex
bshanks@0	388 be useful for drug discovery as well
bshanks@0	389 * Another 16000 genes are available, but they do not cover the entire cerebral
bshanks@0	390 cortex with high spatial resolution.
bshanks@0	393 User-definable ROIs; Combinatorial gene expression; Negative as well as
bshanks@0	394 positive signal; Use geometry; Search for local boundaries if necessary; Flatmapped
bshanks@0	395 Specific aims
bshanks@0	396 Develop algorithms that find genetic markers for anatomical regions
bshanks@0	397 1. Develop scoring measures for evaluating how good individual genes are at
bshanks@0	398 marking areas: we will compare pointwise, geometric, and information-
bshanks@0	399 theoretic measures.
bshanks@0	400 2. Develop a procedure to find single marker genes for anatomical regions: for
bshanks@0	401 each cortical area, by using or combining the scoring measures developed,
bshanks@0	402 we will rank the genes by their ability to delineate each area.
bshanks@0	403 3. Extend the procedure to handle difficult areas by using combinatorial cod-
bshanks@0	404 ing: for areas that cannot be identified by any single gene, identify them
bshanks@0	405 with a handful of genes. We will consider both (a) algorithms that incre-
bshanks@0	406 mentally/greedily combine single gene markers into sets, such as forward
bshanks@0	407 stepwise regression and decision trees, and also (b) supervised learning
bshanks@0	408 techniques which use soft constraints to minimize the number of features,
bshanks@0	409 such as sparse support vector machines.
bshanks@0	410 4. Extend the procedure to handle difficult areas by combining or redrawing
bshanks@0	411 the boundaries: An area may be difficult to identify because the bound-
bshanks@0	412 aries are misdrawn, or because it does not “really” exist as a single area,
bshanks@0	413 at least on the genetic level. We will develop extensions to our procedure
bshanks@0	414 which (a) detect when a difficult area could be fit if its boundary were
bshanks@0	415 redrawn slightly, and (b) detect when a difficult area could be combined
bshanks@0	416 with adjacent areas to create a larger area which can be fit.
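The single-gene ranking in step 2 and the greedy combination in step 3(a) can be sketched as follows. This is only a minimal illustration under stated assumptions, not the procedure the project will adopt: the pointwise score (mean expression inside the area minus mean expression outside), the fixed threshold, the "any gene in the panel" union rule, and the names `pointwise_score`, `rank_genes`, `panel_accuracy` and `greedy_panel` are all hypothetical choices made for the example.

```python
def pointwise_score(expr, in_area):
    """Mean expression inside the area minus mean expression outside it."""
    inside = [e for e, m in zip(expr, in_area) if m]
    outside = [e for e, m in zip(expr, in_area) if not m]
    return sum(inside) / len(inside) - sum(outside) / len(outside)


def rank_genes(expression, in_area):
    """Return gene names sorted from best to worst single-gene marker."""
    return sorted(expression,
                  key=lambda g: pointwise_score(expression[g], in_area),
                  reverse=True)


def panel_accuracy(expression, panel, in_area, threshold=0.5):
    """Fraction of voxels labelled correctly when a voxel is called
    'in the area' if any gene in the panel exceeds the threshold."""
    correct = 0
    for i, truth in enumerate(in_area):
        predicted = any(expression[g][i] > threshold for g in panel)
        correct += predicted == truth
    return correct / len(in_area)


def greedy_panel(expression, in_area, max_genes=3):
    """Forward selection: keep adding the gene that most improves accuracy,
    and stop as soon as no remaining gene improves it."""
    panel, best = [], 0.0
    while len(panel) < max_genes:
        candidates = [g for g in expression if g not in panel]
        if not candidates:
            break
        gene = max(candidates,
                   key=lambda g: panel_accuracy(expression, panel + [g], in_area))
        score = panel_accuracy(expression, panel + [gene], in_area)
        if score <= best:
            break
        panel.append(gene)
        best = score
    return panel, best


# Toy data (made up): 6 voxels, the first 4 belong to the target area.
in_area = [True, True, True, True, False, False]
expression = {
    "geneA": [0.9, 0.8, 0.1, 0.0, 0.1, 0.0],  # marks one half of the area
    "geneB": [0.0, 0.1, 0.9, 0.7, 0.0, 0.1],  # marks the other half
    "geneC": [0.6, 0.6, 0.6, 0.6, 0.6, 0.6],  # uninformative
}
ranking = rank_genes(expression, in_area)
panel, accuracy = greedy_panel(expression, in_area)
```

In this toy example neither geneA nor geneB identifies the area alone, but the greedy step combines them into a two-gene panel that labels every voxel correctly. A real implementation would also need held-out data to guard against overfitting.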
bshanks@0	417 Apply these algorithms to the cortex
bshanks@0	418 1. Create open source format conversion tools: we will create tools to bulk
bshanks@0	419 download the ABA dataset and to convert between SEV, NIFTI and MAT-
bshanks@0	420 LAB formats.
bshanks@0	421 2. Flatmap the ABA cortex data: map the ABA data onto a plane and draw
bshanks@0	422 the cortical area boundaries onto it.
bshanks@0	423 3. Find layer boundaries: cluster similar voxels together in order to auto-
bshanks@0	424 matically find the cortical layer boundaries.
bshanks@0	425 4. Run the procedures that we developed on the cortex: we will present, for
bshanks@0	426 each area, a short list of markers to identify that area; and we will also
bshanks@0	427 present lists of “panels” of genes that can be used to delineate many areas
bshanks@0	428 at once.
bshanks@0	431 Develop algorithms to suggest a division of a structure into anatom-
bshanks@0	432 ical parts
bshanks@0	433 1. Explore dimensionality reduction algorithms applied to pixels: including
bshanks@0	434 TODO
bshanks@0	435 2. Explore dimensionality reduction algorithms applied to genes: including
bshanks@0	436 TODO
bshanks@0	437 3. Explore clustering algorithms applied to pixels: including TODO
bshanks@0	438 4. Explore clustering algorithms applied to genes: including gene shaving,
bshanks@0	439 TODO
bshanks@0	440 5. Develop an algorithm to use dimensionality reduction and/or hierarchical
bshanks@0	441 clustering to create anatomical maps
bshanks@0	442 6. Run this algorithm on the cortex: present a hierarchical, genoarchitectonic
bshanks@0	443 map of the cortex
bshanks@0	444 gradient similarity is calculated as:
bshanks@0	445 $$\sum_{\text{pixels}} \cos\left(\left|\angle\nabla_1 - \angle\nabla_2\right|\right) \cdot \frac{|\nabla_1| + |\nabla_2|}{2} \cdot \frac{\text{pixel\_value}_1 + \text{pixel\_value}_2}{2}$$
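The gradient similarity expression above can be transcribed into code roughly as follows. This is a sketch under assumptions the text does not state: the two expression images are 2-D lists of equal size, gradients are estimated by central differences, and border pixels are skipped. The function names are hypothetical.

```python
import math


def gradient(image, row, col):
    """Central-difference gradient (d/dx, d/dy) at an interior pixel."""
    dx = (image[row][col + 1] - image[row][col - 1]) / 2.0
    dy = (image[row + 1][col] - image[row - 1][col]) / 2.0
    return dx, dy


def gradient_similarity(image1, image2):
    """Sum over interior pixels of
    cos(|angle1 - angle2|) * mean gradient magnitude * mean pixel value."""
    total = 0.0
    for row in range(1, len(image1) - 1):
        for col in range(1, len(image1[0]) - 1):
            dx1, dy1 = gradient(image1, row, col)
            dx2, dy2 = gradient(image2, row, col)
            angle_term = math.cos(abs(math.atan2(dy1, dx1) - math.atan2(dy2, dx2)))
            magnitude_term = (math.hypot(dx1, dy1) + math.hypot(dx2, dy2)) / 2.0
            value_term = (image1[row][col] + image2[row][col]) / 2.0
            total += angle_term * magnitude_term * value_term
    return total


# Two made-up 4x4 images: one rises left to right, the other falls.
ramp = [[float(col) for col in range(4)] for _ in range(4)]
reversed_ramp = [[3.0 - col for col in range(4)] for _ in range(4)]

same = gradient_similarity(ramp, ramp)
opposite = gradient_similarity(ramp, reversed_ramp)
```

Images whose gradients point the same way give a positive score, and images whose gradients point in opposite directions give a negative one.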
bshanks@0	449 (todo) Technically, we say that an anatomical structure has a fundamen-
bshanks@0	450 tally 2-D organization when there exists a commonly used, generic, anatomical
bshanks@0	451 structure-preserving map from 3-D space to a 2-D manifold.
bshanks@0	452 Related work:
bshanks@0	453 The Allen Brain Institute has developed an interactive web interface called
bshanks@0	454 AGEA which allows an investigator to (1) calculate lists of genes which are
bshanks@0	455 selectively overexpressed in certain anatomical regions (ABA calls this the “Gene
bshanks@0	456 Finder” function), (2) visualize the correlation between the genetic profiles of
bshanks@0	457 voxels in the dataset, and (3) visualize a hierarchical clustering of voxels in
bshanks@0	458 the dataset [?]. AGEA is an impressive and useful tool; however, it does not
bshanks@0	459 solve the same problems that we propose to solve with this project.
bshanks@0	460 First we describe AGEA’s “Gene Finder”, and then compare it to our pro-
bshanks@0	461 posed method for finding marker genes. AGEA’s Gene Finder first asks the
bshanks@0	462 investigator to select a single “seed voxel” of interest. It then uses a clustering
bshanks@0	463 method, combined with built-in knowledge of major anatomical structures, to
bshanks@0	464 select two sets of voxels: an “ROI” and a “comparator region”*. The seed voxel
bshanks@0	465 is always contained within the ROI, and the ROI is always contained within the
bshanks@0	466 comparator region. The comparator region is similar but not identical to the
bshanks@0	467 set of voxels making up the major anatomical region containing the ROI. Gene
bshanks@0	468 Finder then looks for genes which can distinguish the ROI from the comparator
bshanks@0	469 region. Specifically, it finds genes for which the ratio (expression energy in the
bshanks@0	470 ROI) / (expression energy in the comparator region) is high.
bshanks@0	471 Informally, the Gene Finder first infers an ROI based on clustering the seed
bshanks@0	472 voxel with other voxels. Then, the Gene Finder finds genes which overexpress
bshanks@0	473 in the ROI as compared to other voxels in the major anatomical region.
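The ratio that Gene Finder is described as using can be illustrated with a short sketch. The details here are assumptions made for the example, not a description of AGEA's implementation: expression energy is averaged over each voxel set, voxels are identified by list index, and the names `mean_energy`, `energy_ratio` and `rank_by_energy_ratio` are hypothetical.

```python
def mean_energy(energy, voxels):
    """Average expression energy over a set of voxel indices."""
    return sum(energy[v] for v in voxels) / len(voxels)


def energy_ratio(energy, roi, comparator):
    """(expression energy in the ROI) / (expression energy in the comparator).
    Assumes the comparator has nonzero mean energy."""
    return mean_energy(energy, roi) / mean_energy(energy, comparator)


def rank_by_energy_ratio(energies, roi, comparator):
    """Sort gene names from highest to lowest ROI/comparator ratio."""
    return sorted(energies,
                  key=lambda g: energy_ratio(energies[g], roi, comparator),
                  reverse=True)


# Made-up data: 6 voxels; the ROI is contained in the comparator region.
roi = [0, 1]
comparator = [0, 1, 2, 3, 4, 5]
energies = {
    "geneX": [4.0, 4.0, 1.0, 1.0, 1.0, 1.0],  # overexpressed in the ROI
    "geneY": [1.0, 1.0, 1.0, 1.0, 1.0, 1.0],  # uniform
    "geneZ": [0.0, 0.0, 3.0, 3.0, 3.0, 3.0],  # absent only in the ROI
}
ranked = rank_by_energy_ratio(energies, roi, comparator)
ratio_x = energy_ratio(energies["geneX"], roi, comparator)
```

In this toy data geneZ delineates the ROI perfectly by its absence, yet it ranks last under the ratio, because the ratio rewards only overexpression.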
bshanks@0	474 There are three major differences between our approach and Gene Finder.
bshanks@0	477 First, Gene Finder focuses on individual genes and individual ROIs in
bshanks@0	478 isolation. This works well for regions which can be picked out from all other regions by a
bshanks@0	479 single gene, but not all regions can (todo). There are at least two ways this can
bshanks@0	480 miss useful genes. First, a gene might be expressed in only part of a region,
bshanks@0	481 while another gene is expressed
bshanks@0	482 in the rest of the region**. Second, a gene might be expressed in a region and not in
bshanks@0	483 any of its neighbors, yet also be expressed in other, non-neighboring regions.
bshanks@0	484 To take advantage of these types of genes, we propose to find combinations of
bshanks@0	485 genes which, together, can identify the boundaries of all subregions within the
bshanks@0	486 containing region.
bshanks@0	487 Second, Gene Finder uses a pointwise metric, namely expression energy ratio,
bshanks@0	488 to decide whether a gene is good for picking out a region. We have found better
bshanks@0	489 results by using metrics which take into account not just single voxels, but also
bshanks@0	490 the local geometry of neighboring voxels, such as the local gradient (todo). In
bshanks@0	491 addition, we have found that often the absence of gene expression can be used
bshanks@0	492 as a marker, which will not be caught by Gene Finder’s expression energy ratio
bshanks@0	493 (todo).
bshanks@0	494 Third, Gene Finder chooses the ROI based only on the seed voxel. This
bshanks@0	495 often does not permit the user to query the ROI that they are interested in. For
bshanks@0	496 example, in all of our tests of Gene Finder in cortex, the ROIs chosen tend to
bshanks@0	497 be cortical layers, rather than cortical areas.
bshanks@0	498 In summary, when Gene Finder picks the ROI that you want, and when this
bshanks@0	499 ROI can be easily picked out from neighboring regions by single genes which
bshanks@0	500 selectively overexpress in the ROI compared to the entire major anatomical re-
bshanks@0	501 gion, Gene Finder will work. However, Gene Finder will not pick cortical areas
bshanks@0	502 as ROIs, and even if it could, many cortical areas cannot be uniquely picked out
bshanks@0	503 by the overexpression of any single gene. By contrast, we will target cortical
bshanks@0	504 areas, we will explore a variety of metrics which can compensate for the
bshanks@0	505 shortcomings of expression energy ratio, and we will use the combinatorial expression of
bshanks@0	506 genes to pick out cortical areas even when no individual gene will do.
bshanks@0	507 * The terms “ROI” and “comparator region” are our own; the ABI calls
bshanks@0	508 them the “local region” and the “larger anatomical context”. The ABI uses the
bshanks@0	509 term “specificity comparator” to mean the major anatomic region containing
bshanks@0	510 the ROI, which is not exactly identical to the comparator region.
bshanks@0	511 ** In this case, the union of the area of expression of the two genes would
bshanks@0	512 suffice; one could also imagine that there could be situations in which the in-
bshanks@0	513 tersection of multiple genes would be needed, or a combination of unions and
bshanks@0	514 intersections.
bshanks@0	515 Now we describe AGEA’s hierarchical clustering, and compare it to our
bshanks@0	516 proposal. The goal of AGEA’s hierarchical clustering is to generate a binary tree of
bshanks@0	517 clusters, where a cluster is a collection of voxels. AGEA begins by computing
bshanks@0	518 the Pearson correlation between each pair of voxels. It then employs a recursive
bshanks@0	519 divisive (top-down) hierarchical clustering procedure on the voxels: it
bshanks@0	520 starts with all of the voxels, divides them into
bshanks@0	521 clusters, and then divides each cluster into smaller clusters,
bshanks@0	522 and so on***. At each step, the collection of voxels is partitioned into two smaller
bshanks@0	525 clusters in a way that maximizes the following quantity: average correlation
bshanks@0	526 between all possible pairs of voxels containing one voxel from each cluster.
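The quantity named above, the average correlation over all pairs of voxels that take one voxel from each cluster, can be computed as in the following sketch. It only evaluates that quantity for a given two-way split; it does not reproduce AGEA's search over splits or its subsampling. The data layout (one list of gene expression values per voxel) and the function names are assumptions made for the example.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def between_cluster_correlation(profiles, cluster_a, cluster_b):
    """Average Pearson correlation over all pairs of voxels that contain
    one voxel from each cluster."""
    pairs = [(i, j) for i in cluster_a for j in cluster_b]
    return sum(pearson(profiles[i], profiles[j]) for i, j in pairs) / len(pairs)


# Made-up profiles: 4 voxels, 3 genes each.
profiles = [
    [1.0, 2.0, 3.0],  # voxel 0
    [2.0, 4.0, 6.0],  # voxel 1, same pattern as voxel 0
    [3.0, 2.0, 1.0],  # voxel 2, reversed pattern
    [6.0, 4.0, 2.0],  # voxel 3, same pattern as voxel 2
]
split_by_pattern = between_cluster_correlation(profiles, [0, 1], [2, 3])
split_mixed = between_cluster_correlation(profiles, [0, 2], [1, 3])
```

A full divisive procedure would evaluate this quantity for candidate splits at each level of the tree and recurse on the two resulting clusters.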
bshanks@0	527 There are three major differences between our approach and AGEA’s
bshanks@0	528 hierarchical clustering. First, AGEA’s clustering method separates cortical layers
bshanks@0	529 before it separates cortical areas.
bshanks@0	530 following procedure is used for the purpose of dividing a collection of voxels
bshanks@0	531 into smaller clusters: partition the voxels into two sets, such that the following
bshanks@0	532 quantity is maximized:
bshanks@0	533 *** depending on which level of the tree is being created, the voxels are
bshanks@0	534 subsampled in order to save time
bshanks@0	535 does not allow the user to input anything other than a seed voxel; this means
bshanks@0	536 that for each seed voxel, there is only one
bshanks@0	537 The role of the “local region” is to serve as a region of interest for which
bshanks@0	538 marker genes are desired; the role of the “larger anatomical context” is to be
bshanks@0	539 the structure
bshanks@0	540 There are two kinds of differences between AGEA and our project; differ-
bshanks@0	541 ences that relate to the treatment of the cortex, and differences in the type of
bshanks@0	542 generalizable methods being developed. As relates
bshanks@0	543 indicate an ROI
bshanks@0	554 —
bshanks@0	555 The Allen Brain Institute has developed interactive tools called AGEA which
bshanks@0	556 allow an investigator to explore simple correlation-based relationships between
bshanks@0	557 voxels, genes, and clusters of voxels. There have not yet been any studies
bshanks@0	558 which describe the results of applying AGEA to the cerebral cortex; however,
bshanks@0	559 we suspect that the AGEA metrics are not optimal for the task of relating
bshanks@0	560 genes to cortical areas. A voxel’s gene expression profile depends upon both
bshanks@0	561 its cortical area and its cortical layer; however, AGEA has no mechanism to
bshanks@0	562 distinguish these two. As a result, voxels in the same layer but different areas
bshanks@0	563 are often clustered together by AGEA. As part of the project, we will compare
bshanks@0	564 the performance of our techniques against AGEA’s.
bshanks@0	565 Another difference between our techniques and AGEA’s is that AGEA allows
bshanks@0	566 the user to enter only a voxel location, and then to either explore the rest of
bshanks@0	567 the brain’s relationship to that particular voxel, or explore a partitioning of
bshanks@0	568 the brain based on pairwise voxel correlation. If the user is interested not in a
bshanks@0	569 single voxel, but rather an entire anatomical structure, AGEA will only succeed
bshanks@0	570 to the extent that the selected voxel is a typical representative of the structure.
bshanks@0	573 As discussed in the previous paragraph, this poses problems for structures like
bshanks@0	574 cortical areas, which (because of their division into cortical layers) do not have
bshanks@0	575 a single “typical representative”.
bshanks@0	576 By contrast, in our system, the user will start by selecting, not a single voxel,
bshanks@0	577 but rather, an anatomical superstructure to be divided into pieces (for example,
bshanks@0	578 the cerebral cortex). We expect that our methods will take into account not
bshanks@0	579 just pairwise statistics between voxels, but also large-scale geometric features
bshanks@0	580 (for example, the rapidity of change in gene expression as regional boundaries
bshanks@0	581 are crossed) which optimize the discriminability of regions within the selected
bshanks@0	582 superstructure.
bshanks@0	583 —–
bshanks@0	584 screen for combinations of marker genes which selectively target anatom-
bshanks@0	585 ical structures pick delineate the boundaries between neighboring anatomical
bshanks@0	586 structures. (b) techniques to screen for marker genes which pick out anatomical
bshanks@0	587 structures of interest
bshanks@0	588 , techniques which: (a) screen for marker genes , and (b) suggest new
bshanks@0	589 anatomical maps based on
bshanks@0	590 whose expression partitions the region of interest into its anatomical sub-
bshanks@0	591 structures, and (b) use the natural contours of gene expression to suggest new
bshanks@0	592 ways of dividing an organ into
bshanks@0	593 The Allen Brain Atlas
bshanks@0	594 –
bshanks@0	595 to: brooksl@mail.nih.gov
bshanks@0	596 Hi, I’m writing to confirm the applicability of a potential research project to
bshanks@0	597 the challenge grant topic “New computational and statistical methods for the
bshanks@0	598 analysis of large data sets from next-generation sequencing technologies”.
bshanks@0	599 We want to develop methods for the analysis of gene expression datasets that
bshanks@0	600 can be used to uncover the relationships between gene expression and anatomical
bshanks@0	601 regions. Specifically, we want to develop techniques to (a) given a set of known
bshanks@0	602 anatomical areas, identify genetic markers for each of these areas, and (b) given
bshanks@0	603 an anatomical structure whose substructure is unknown, suggest a map, that
bshanks@0	604 is, a division of the space into anatomical sub-structures, that represents the
bshanks@0	605 boundaries inherent in the gene expression data.
bshanks@0	606 We propose to develop our techniques on the Allen Brain Atlas mouse brain
bshanks@0	607 gene expression dataset by finding genetic markers for anatomical areas within
bshanks@0	608 the cerebral cortex. The Allen Brain Atlas contains a registered 3-D map of
bshanks@0	609 gene expression data with 200-micron voxel resolution which was created from
bshanks@0	610 in situ hybridization data. The dataset contains about 4000 genes which are
bshanks@0	611 available at this resolution across the entire cerebral cortex.
bshanks@0	612 Despite the distinct roles of different cortical areas in both normal function-
bshanks@0	613 ing and disease processes, there are no known marker genes for many cortical
bshanks@0	614 areas. This project will be immediately useful for both drug discovery and clini-
bshanks@0	615 cal research because once the markers are known, interventions can be designed
bshanks@0	616 which selectively target specific cortical areas.
bshanks@0	617 The techniques we develop will be useful because they will be applicable to
bshanks@0	618 the analysis of other anatomical areas, both in terms of finding marker genes
bshanks@0	621 for known areas, and in terms of suggesting new anatomical subdivisions that
bshanks@0	622 are based upon the gene expression data.
bshanks@0	623 —-
bshanks@0	624 It is likely that our study, by showing which areal divisions naturally fol-
bshanks@0	625 low from gene expression data, as opposed to traditional histological data, will
bshanks@0	626 contribute to the creation of
bshanks@0	627 there are clear genetic or chemical markers known for only a few cortical
bshanks@0	628 areas. This makes it difficult to target drugs to specific
bshanks@0	629 As part of aims (1) and (5), we will discover sets of marker genes that pick
bshanks@0	630 out specific cortical areas. This will allow the development of drugs and other
bshanks@0	631 interventions that selectively target individual cortical areas. As part of aims
bshanks@0	632 (2) and (5), we will also discover small panels of marker genes that can be used
bshanks@0	633 to delineate most of the cortical areal map.
bshanks@0	634 With aims (2) and (4), we
bshanks@0	635 There are five principles
bshanks@0	636 In addition to validating the usefulness of the algorithms, the application of
bshanks@0	637 these methods to cerebral cortex will produce immediate benefits that are only
bshanks@0	638 one step removed from clinical application.
bshanks@0	639 todo: remember to check gensat, etc for validation (mention bias/variance)
bshanks@0	640 Why it is useful to apply these methods to cortex
bshanks@0	641 There is still room for debate as to exactly how the cortex should be parcellated
bshanks@0	642 into areas.
bshanks@0	643 The best way to divide up rodent cortex into areas has not been completely
bshanks@0	644 determined,
bshanks@0	645 not yet been accounted for in
bshanks@0	646 that the expression of some genes will contain novel spatial patterns which
bshanks@0	647 are not account
bshanks@0	648 that a genoarchitectonic map
bshanks@0	649 This principle is only applicable to aim 1 (marker genes). For aim 2 (partition
bshanks@0	650 a structure into anatomical subregions), we plan to work with many genes at
bshanks@0	651 once.
bshanks@0	652 todo: aim 2 b+s?
bshanks@0	653 Principle 5: Interoperate with existing tools
bshanks@0	654 In order for our software to be as useful as possible for our users, it will be
bshanks@0	655 able to import and export data to standard formats so that users can use our
bshanks@0	656 software in tandem with other software tools created by other teams. We will
bshanks@0	657 support the following formats: NIFTI (Neuroimaging Informatics Technology
bshanks@0	658 Initiative), SEV (Allen Brain Institute Smoothed Energy Volume), and MAT-
bshanks@0	659 LAB. This ensures that our users will not have to exclusively rely on our tools
bshanks@0	660 when analyzing data. For example, users will be able to use the data visualiza-
bshanks@0	661 tion and analysis capabilities of MATLAB and Caret alongside our software.
bshanks@0	664 To our knowledge, there is no currently available software to convert between
bshanks@0	665 these formats, so we will also provide a format conversion tool. This may be
bshanks@0	666 useful even for groups that don’t use any of our other software.
bshanks@0	667 todo: is “marker gene” even a phrase that we should use at all?
bshanks@0	668 note for aim 1 apps: combo of genes is for voxel, not within any single cell
bshanks@0	669 , as when genetic markers allow the development of selective interventions;
bshanks@0	670 the reason that one can be confident that the intervention is selective is that it
bshanks@0	671 is only turned on when a certain combination of genes is turned on and off. The
bshanks@0	672 result procedure is what assures us that when that combination is present, the
bshanks@0	673 local tissue is probably part of a certain subregion.
bshanks@0	674 The basic idea is that we want to find a procedure by
bshanks@0	675 The task of finding genes that mark anatomical areas can be phrased in
bshanks@0	676 terms of what the field of machine learning calls a “supervised learning” task.
bshanks@0	677 The goal of this task is to learn a function (the “classifier”) which
bshanks@0	678 If a person knows a combination of genes that marks an area, that implies
bshanks@0	679 that the person can be told how strongly those genes are expressed in any voxel, and
bshanks@0	680 the person can use this information to determine how
bshanks@0	681 finding how to infer the areal identity of a voxel if given the gene expression
bshanks@0	682 profile of that voxel.
bshanks@0	683 For each voxel in the cortex, we want to start with data about the gene
bshanks@0	684 expression
bshanks@0	685 There are various ways to look for marker genes. We will define some terms,
bshanks@0	686 and along the way we will describe a few design choices encountered in the
bshanks@0	687 process of creating a marker gene finding method, and then we will present four
bshanks@0	688 principles that describe which options we have chosen.
bshanks@0	689 In developing a procedure for finding marker genes, we are developing a
bshanks@0	690 procedure that takes a dataset of experimental observations and produces a
bshanks@0	691 result. One can think of the result as merely a list of genes, but really the result
bshanks@0	692 is an understanding of a predictive relationship between, on the one hand, the
bshanks@0	693 expression levels of genes, and, on the other hand, anatomical subregions.
bshanks@0	694 One way to more formally define this understanding is to look at it as a
bshanks@0	695 procedure. In this view, the result of the learning procedure is itself a procedure.
bshanks@0	696 The result procedure provides a way to use the gene expression profiles of voxels
bshanks@0	697 in a tissue sample in order to determine where the subregions are.
bshanks@0	698 This result procedure can be used directly, as when an experimenter has
bshanks@0	699 a tissue sample and needs to know what subregions are present in it, and,
bshanks@0	700 if multiple subregions are present, where they each are. Or it can be used
bshanks@0	701 indirectly; imagine that the result procedure tells us that whenever a certain
bshanks@0	702 combination of genes is expressed, the local tissue is probably part of a certain
bshanks@0	703 subregion. This means that we can then confidently develop an intervention
bshanks@0	704 which is triggered only when that combination of genes is expressed; and to
bshanks@0	705 the extent that the result procedure is reliable, we know that the intervention
bshanks@0	706 will only be triggered in the target subregion.
bshanks@0	707 We said that the result procedure provides “a way to use the gene expression
bshanks@0	708 profiles of voxels in a tissue sample” in order to “determine where the subregions
bshanks@0	709 are”.
bshanks@0	712 Does the result procedure get as input all of the gene expression profiles
bshanks@0	713 of each voxel in the entire tissue sample, and produce as output all of the
bshanks@0	714 subregional boundaries all at once?
bshanks@0	715 it is helpful for the classifier to look at the global “shape” of gene expression
bshanks@0	716 patterns over the whole structure, rather than just nearby voxels.
bshanks@0	717 there is some small bit of additional information that can be gleaned from
bshanks@0	718 knowing the
bshanks@0	719 Design choices for a supervised learning procedure
bshanks@0	720 After all,
bshanks@0	721 there is a small correlation between the gene expression levels from distant
bshanks@0	722 voxels and
bshanks@0	723 Depending on how we intend to use the classifier, we may want to design it
bshanks@0	724 so that
bshanks@0	725 It is possible for many things to
bshanks@0	726 The choice of which data is made part of an instance
bshanks@0	727 what we seek is a procedure
bshanks@0	728 partition the tissue sample into subregions.
bshanks@0	729 each part of the anatomical structure
bshanks@0	730 must be One way to rephrase this task is to say that, instead of searching
bshanks@0	731 for the location of the subregions, we are looking to partition the tissue sample
bshanks@0	732 into subregions.
bshanks@0	757 We said that the result procedure provides “a way to use the gene expression
bshanks@0	758 profiles of voxels in a tissue sample” in order to “determine where the subregions
bshanks@0	759 are”.
bshanks@0	760 Does the result procedure get as input all of the gene expression profiles
bshanks@0	761 of each voxel in the entire tissue sample, and produce as output all of the
bshanks@0	762 subregional boundaries all at once?
bshanks@0	763 Or are we given one voxel at a time,
bshanks@0	764 In the jargon of the field of machine learning, the result procedure is called
bshanks@0	765 a classifier.
bshanks@0	776 single voxels, but rather groups of voxels, such that the groups can be placed
bshanks@0	777 in some 2-D space. We will call such instances “pixels”.
bshanks@0	778 We have been speaking as if instances necessarily correspond to single voxels.
bshanks@0	779 But it is possible for instances to be groupings of many voxels, in which case
bshanks@0	780 each grouping must be assigned the same label (that is, each voxel grouping
bshanks@0	781 must stay inside a single anatomical subregion).
bshanks@0	782 In some but not all cases, the groups are either rows or columns of voxels.
bshanks@0	783 This is the case with the cerebral cortex, in which one may assume that columns
bshanks@0	784 of voxels which run perpendicular to the cortical surface all share the same areal
bshanks@0	785 identity. In the cortex, we call such an instance a “surface pixel”, because such
bshanks@0	786 an instance represents the data associated with all voxels underneath a specific
bshanks@0	787 patch of the cortical surface.
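The idea of a “surface pixel” can be illustrated with a small sketch. It rests on assumptions not stated in the text: that the cortical data have already been resampled so that the first index of the volume runs perpendicular to the cortical surface, and that the voxels in a column are combined by taking their mean. The function name `surface_pixels` is hypothetical.

```python
def surface_pixels(volume):
    """Collapse volume[depth][row][col] into a 2-D grid in which each entry
    is the mean of the column of voxels beneath one patch of surface."""
    depth = len(volume)
    rows, cols = len(volume[0]), len(volume[0][0])
    return [[sum(volume[d][r][c] for d in range(depth)) / depth
             for c in range(cols)]
            for r in range(rows)]


# Made-up expression values for one gene: 2 depths, 2x2 patch of surface.
volume = [
    [[1.0, 2.0], [3.0, 4.0]],  # depth 0 (superficial)
    [[3.0, 2.0], [5.0, 0.0]],  # depth 1 (deep)
]
flat = surface_pixels(volume)
```

Each entry of `flat` is then one instance carrying a single areal label. Averaging discards the layer-specific signal, so a real pipeline might instead keep one value per layer for each surface pixel.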