cg

changeset 38:82076af297cd

author bshanks@bshanks.dyndns.org
date Tue Apr 14 02:23:38 2009 -0700
parents af3389b432e9
children 9365a696c0b8
files grant.html grant.odt grant.pdf grant.txt
--- a/grant.html Mon Apr 13 23:17:40 2009 -0700
+++ b/grant.html Tue Apr 14 02:23:38 2009 -0700
@@ -251,7 +251,7 @@
 Preliminary work
 Format conversion between SEV, MATLAB, NIFTI
 We have created software to (politely) download all of the SEV files from the Allen Institute website. We have also created
-software to convert between the SEV, MATLAB, and NIFTI file formats, as well as some of Caret’s formats.
+software to convert between the SEV, MATLAB, and NIFTI file formats, as well as some of Caret’s file formats.
 Flatmap of cortex
 We downloaded the ABA data and applied a mask to select only those voxels which belong to cerebral cortex. We divided
 the cortex into hemispheres.
@@ -267,12 +267,45 @@
 a grid of points (pixels) over the cortical surface:
 ∙A 2-D matrix whose entries represent the regional label associated with each surface pixel
 ∙For each gene, a 2-D matrix whose entries represent the average expression level underneath each surface pixel
+We created a normalized version of the gene expression data by subtracting each gene’s mean expression level (over all
+surface pixels) and dividing each gene by its standard deviation.
 To move beyond a single average expression level for each surface pixel, we plan to create a separate matrix for each
 cortical layer to represent the average expression level within that layer. Cortical layers are found at different depths in
 different parts of the cortex. In preparation for extracting the layer-specific datasets, we have extended Caret with routines
 that allow the depth of the ROI for volume-to-surface projection to vary.
 In the Research Plan, we describe how we will automatically locate the layer depths. For validation, we have manually
 demarcated the depth of the outer boundary of cortical layer 5 throughout the cortex.
+Feature selection and scoring methods
+Correlation Recall that the instances are surface pixels, and consider the problem of attempting to classify each instance
+as either a member of a particular anatomical area, or not. The target area can be represented as a binary mask over the
+surface pixels.
+The features and the target area are both functions on the surface pixels; alternately, they can be thought of as images
+which can be displayed on the flatmapped surface. One class of feature selection scoring method are those which calculate
+some sort of “match” between each gene image and the target image. Those genes which match the best are good candidates
+for features.
+One of the simplest methods in this class is to use correlation as the match score. We calculated the correlation between
+each gene and each cortical area.
+Conditional entropy An information-theoretic scoring method is to find features such that, if the features (gene
+expression levels) are known, uncertainty about the target (the regional identity) is reduced. Entropy measures uncertainty,
+so what we want is to find features such that the conditional distribution of the target has minimal entropy. The distribution
+to which we are referring is the probability distribution over the population of surface pixels.
+The simplest way to use information theory is on discrete data, so we discretized our gene expression data by creating,
+for each gene, five thresholded binary masks of the gene data. For each gene, we created a binary mask of its expression
+levels over pixels using each of these thresholds: the mean of that gene, the mean minus one standard deviation, the mean
+minus two standard deviations, the mean plus one standard deviation, the mean plus two standard deviations.
+Now, for each region, we ran a forward stepwise procedure which attempted to find pairs of gene expression binary masks
+such that the conditional entropy of the target area’s binary mask, conditioned upon the pair of gene expression binary
+masks, is minimized.
+This finds pairs of genes which are most informative, at least at these discretization thresholds.
+Gradient similarity todo
+
+
+
+Figure 1: Upper left: wwc1. Upper right: mtif2. Lower left: wwc1 + mtif2 (each pixel’s value on the lower left is the sum
+of the corresponding pixels in the upper row). Within each picture, the vertical axis roughly corresponds to anterior at the
+top and posterior at the bottom, and the horizontal axis roughly corresponds to medial at the left and lateral at the right.
+The red outline is the boundary of region MO. Pixels are colored approximately according to the density of expressing cells
+underneath each pixel, with red meaning a lot of expression and blue meaning little.
 Using combinations of multiple genes is necessary and sufficient to delineate some cortical areas
 Here we give an example of a cortical area which is not marked by any single gene, but which can be identified combi-
 natorially. according to logistic regression, gene wwc19 is the best fit single gene for predicting whether or not a pixel on
@@ -280,19 +313,26 @@
 pattern over the cortex. The lower-right boundary of MO is represented reasonably well by this gene, however the gene
 overshoots the upper-left boundary. This flattened 2-D representation does not show it, but the area corresponding to the
 overshoot is the medial surface of the cortex. MO is only found on the lateral surface (todo).
-Gnee mtif210 is shown in figure the upper-right of Fig. . Mtif2 captures MO’s upper-left boundary, but not its lower-right
+Gene mtif210 is shown in figure the upper-right of Fig. . Mtif2 captures MO’s upper-left boundary, but not its lower-right
 boundary. Mtif2 does not express very much on the medial surface. By adding together the values at each pixel in these
 two figures, we get the lower-left of Figure . This combination captures area MO much better than any single gene.
-Correlation todo
-Conditional entropy todo
-Gradient similarity todo
 Geometric and pointwise scoring methods provide complementary information
 To show that local geometry can provide useful information that cannot be detected via pointwise analyses, consider Fig.
 . The top row of Fig. displays the 3 genes which most match area AUD, according to a pointwise method11. The bottom
 row displays the 3 genes which most match AUD according to a method which considers local geometry12 The pointwise
 method in the top row identifies genes which express more strongly in AUD than outside of it; its weakness is that this
 includes many areas which don’t have a salient border matching the areal border. The geometric method identifies genes
-_________________________________________
+whose salient expression border seems to partially line up with the border of AUD; its weakness is that this includes genes
+which don’t express over the entire area. Genes which have high rankings using both pointwise and border criteria, such as
+Aph1a in the example, may be particularly good markers. None of these genes are, individually, a perfect marker for AUD;
+we deliberately chose a “difficult” area in order to better contrast pointwise with geometric methods.
+Areas which can be identified by single genes
+todo
+Areas can sometimes be marked by underexpression
+todo
+Specific to Aim 1 (and Aim 3)
+Forward stepwise logistic regression todo
+__
 9“WW, C2 and coiled-coil domain containing 1”; EntrezGene ID 211652
 10“mitochondrial translational initiation factor 2”; EntrezGene ID 76784
 11For each gene, a logistic regression in which the response variable was whether or not a surface pixel was within area AUD, and the predictor
@@ -301,28 +341,11 @@
 12For each gene the gradient similarity (see section ??) between (a) a map of the expression of each gene on the cortical surface and (b) the
 shape of area AUD, was calculated, and this was used to rank the genes.
 
-
-
-Figure 1: Upper left: wwc1. Upper right: mtif2. Lower left: wwc1 + mtif2 (each pixel’s value on the lower left is the sum
-of the corresponding pixels in the upper row). Within each picture, the vertical axis roughly corresponds to anterior at the
-top and posterior at the bottom, and the horizontal axis roughly corresponds to medial at the left and lateral at the right.
-The red outline is the boundary of region MO. Pixels are colored approximately according to the density of expressing cells
-underneath each pixel, with red meaning a lot of expression and blue meaning little.
 
 
 Figure 2: The top row shows the three genes which (individually) best predict area AUD, according to logistic regression.
 The bottom row shows the three genes which (individually) best match area AUD, according to gradient similarity. From
 left to right and top to bottom, the genes are Ssr1, Efcbp1, Aph1a, Ptk7, Aph1a again, and Lepr
-whose salient expression border seems to partially line up with the border of AUD; its weakness is that this includes genes
-which don’t express over the entire area. Genes which have high rankings using both pointwise and border criteria, such as
-Aph1a in the example, may be particularly good markers. None of these genes are, individually, a perfect marker for AUD;
-we deliberately chose a “difficult” area in order to better contrast pointwise with geometric methods.
-Areas which can be identified by single genes
-todo
-Areas can sometimes be marked by underexpression
-todo
-Specific to Aim 1 (and Aim 3)
-Forward stepwise logistic regression todo
 SVM on all genes at once
 In order to see how well one can do when looking at all genes at once, we ran a support vector machine to classify cortical
 surface pixels based on their gene expression profiles. We achieved classification accuracy of about 81%13. As noted above,
Binary file grant.odt has changed
Binary file grant.pdf has changed
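Reviewer note on the text added in this changeset: the per-gene normalization and the correlation match score it describes can be illustrated with a minimal Python sketch. This is not the authors' code. The gene names and pixel values are hypothetical toy data, the surface is flattened to a 1-D list of pixels, and the use of the population standard deviation (dividing by n) is an assumption, since the grant text does not say which estimator was used.

```python
import math

def zscore(values):
    """Subtract the mean and divide by the population standard deviation."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    if std == 0:
        raise ValueError("a constant map has no defined z-score")
    return [(v - mean) / std for v in values]

def correlation(gene, mask):
    """Pearson correlation between a gene image and a 0/1 target-area mask."""
    g = zscore(gene)
    m = zscore(mask)
    return sum(a * b for a, b in zip(g, m)) / len(g)

# Hypothetical toy data: six surface pixels, the first three inside the area.
area_mask = [1, 1, 1, 0, 0, 0]
genes = {
    "geneA": [5.0, 6.0, 5.5, 1.0, 0.5, 1.5],  # expressed mostly inside the area
    "geneB": [1.0, 3.0, 2.0, 2.0, 1.0, 3.0],  # unrelated to the area
}

# Score every gene against the area and rank best match first.
scores = {name: correlation(expr, area_mask) for name, expr in genes.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
```

Ranking by raw correlation favors genes that are overexpressed in the area; ranking by absolute value instead would also surface genes that mark an area by underexpression.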
--- a/grant.txt Mon Apr 13 23:17:40 2009 -0700
+++ b/grant.txt Tue Apr 14 02:23:38 2009 -0700
@@ -183,7 +183,7 @@
 == Preliminary work ==
 
 === Format conversion between SEV, MATLAB, NIFTI ===
-We have created software to (politely) download all of the SEV files from the Allen Institute website. We have also created software to convert between the SEV, MATLAB, and NIFTI file formats, as well as some of Caret's formats.
+We have created software to (politely) download all of the SEV files from the Allen Institute website. We have also created software to convert between the SEV, MATLAB, and NIFTI file formats, as well as some of Caret's file formats.
 
 
 === Flatmap of cortex ===
@@ -200,6 +200,8 @@
 * A 2-D matrix whose entries represent the regional label associated with each surface pixel
 * For each gene, a 2-D matrix whose entries represent the average expression level underneath each surface pixel
 
+We created a normalized version of the gene expression data by subtracting each gene's mean expression level (over all surface pixels) and dividing each gene by its standard deviation.
+
 To move beyond a single average expression level for each surface pixel, we plan to create a separate matrix for each cortical layer to represent the average expression level within that layer. Cortical layers are found at different depths in different parts of the cortex. In preparation for extracting the layer-specific datasets, we have extended Caret with routines that allow the depth of the ROI for volume-to-surface projection to vary.
 
 In the Research Plan, we describe how we will automatically locate the layer depths. For validation, we have manually demarcated the depth of the outer boundary of cortical layer 5 throughout the cortex.
@@ -210,6 +212,32 @@
 
 
 
+=== Feature selection and scoring methods ===
+
+
+\vspace{0.3cm}**Correlation**
+Recall that the instances are surface pixels, and consider the problem of attempting to classify each instance as either a member of a particular anatomical area, or not. The target area can be represented as a binary mask over the surface pixels.
+
+The features and the target area are both functions on the surface pixels; alternately, they can be thought of as images which can be displayed on the flatmapped surface. One class of feature selection scoring method are those which calculate some sort of "match" between each gene image and the target image. Those genes which match the best are good candidates for features.
+
+One of the simplest methods in this class is to use correlation as the match score. We calculated the correlation between each gene and each cortical area.
+
+todo: fig
+
+\vspace{0.3cm}**Conditional entropy**
+An information-theoretic scoring method is to find features such that, if the features (gene expression levels) are known, uncertainty about the target (the regional identity) is reduced. Entropy measures uncertainty, so what we want is to find features such that the conditional distribution of the target has minimal entropy. The distribution to which we are referring is the probability distribution over the population of surface pixels.
+
+The simplest way to use information theory is on discrete data, so we discretized our gene expression data by creating, for each gene, five thresholded binary masks of the gene data. For each gene, we created a binary mask of its expression levels over pixels using each of these thresholds: the mean of that gene, the mean minus one standard deviation, the mean minus two standard deviations, the mean plus one standard deviation, the mean plus two standard deviations.
+
+Now, for each region, we ran a forward stepwise procedure which attempted to find pairs of gene expression binary masks such that the conditional entropy of the target area's binary mask, conditioned upon the pair of gene expression binary masks, is minimized.
+
+This finds pairs of genes which are most informative, at least at these discretization thresholds.
+
+todo: fig
+
+\vspace{0.3cm}**Gradient similarity**
+todo
+
 
 
 
@@ -227,14 +255,7 @@
 \caption{Upper left: $wwc1$. Upper right: $mtif2$. Lower left: wwc1 + mtif2 (each pixel's value on the lower left is the sum of the corresponding pixels in the upper row). Within each picture, the vertical axis roughly corresponds to anterior at the top and posterior at the bottom, and the horizontal axis roughly corresponds to medial at the left and lateral at the right. The red outline is the boundary of region MO. Pixels are colored approximately according to the density of expressing cells underneath each pixel, with red meaning a lot of expression and blue meaning little.}
 \end{figure}
 
-\vspace{0.3cm}**Correlation**
-todo
-
-\vspace{0.3cm}**Conditional entropy**
-todo
-
-\vspace{0.3cm}**Gradient similarity**
-todo
+
 
 \vspace{0.3cm}**Geometric and pointwise scoring methods provide complementary information**
 
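Reviewer note on the conditional-entropy paragraphs added in this changeset: the discretization into five thresholded masks and the forward stepwise search for a pair of masks can be sketched in Python as below. This is an illustrative sketch under stated assumptions, not the authors' implementation. The gene names and values are hypothetical toy data, a pixel is assumed to be "on" when its value is strictly greater than the threshold, the population standard deviation is assumed, and entropy is measured in bits.

```python
import math
from collections import Counter

def threshold_masks(values):
    """Five binary masks at mean + k*std for k in -2..2 (strict '>' is an assumption)."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return {k: [int(v > mean + k * std) for v in values] for k in (-2, -1, 0, 1, 2)}

def conditional_entropy(target, masks):
    """H(target | masks) in bits, estimated from counts over surface pixels."""
    n = len(target)
    keys = list(zip(*masks))          # joint mask state at each pixel
    joint = Counter(zip(keys, target))
    cond = Counter(keys)
    return -sum(c / n * math.log2(c / cond[k]) for (k, _), c in joint.items())

def forward_stepwise_pair(target, candidates):
    """Greedily pick one mask, then the mask that helps most given the first."""
    chosen = []
    for _ in range(2):
        remaining = [name for name in candidates if name not in chosen]
        best = min(remaining, key=lambda name: conditional_entropy(
            target, [candidates[c] for c in chosen + [name]]))
        chosen.append(best)
    return chosen, conditional_entropy(target, [candidates[c] for c in chosen])

# Hypothetical toy data: eight pixels; the area is where geneA and geneB are both high.
target = [1, 1, 0, 0, 0, 0, 0, 0]
expression = {
    "geneA": [6, 6, 6, 6, 4, 4, 4, 4],
    "geneB": [6, 6, 4, 4, 6, 6, 4, 4],
    "geneC": [6, 4, 6, 4, 6, 4, 6, 4],
}
candidates = {(gene, k): mask
              for gene, values in expression.items()
              for k, mask in threshold_masks(values).items()}

pair, entropy = forward_stepwise_pair(target, candidates)
```

In this toy case neither gene alone removes all uncertainty about the area, but the selected pair does, which mirrors the combinatorial-marker argument made elsewhere in the grant text. A greedy forward search is not guaranteed to find the globally best pair; an exhaustive search over all pairs would be, at higher cost.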