Whole transcriptome RNA-Seq allelic expression in human brain

Background Measuring allelic RNA expression ratios is a powerful approach for detecting cis-acting regulatory variants, RNA editing, loss of heterozygosity in cancer, copy number variation, and allele-specific epigenetic gene silencing. Whole transcriptome RNA sequencing (RNA-Seq) has emerged as a genome-wide tool for identifying allelic expression imbalance (AEI), but numerous factors bias allelic RNA ratio measurements. Here, we compare RNA-Seq allelic ratios measured in nine different human brain regions with a highly sensitive and accurate SNaPshot measure of allelic RNA ratios, identifying factors affecting reliable allelic ratio measurement. Accounting for these factors, we subsequently surveyed the variability of RNA editing across brain regions and across individuals. Results We find that RNA-Seq allelic ratios from standard alignment methods correlate poorly with SNaPshot, but applying alternative alignment strategies and correcting for observed biases significantly improves correlations. Deploying these methods on a transcriptome-wide basis in nine brain regions from a single individual, we identified genes with AEI across all regions (SLC1A3, NHP2L1) and many others with region-specific AEI. In dorsolateral prefrontal cortex (DLPFC) tissues from 14 individuals, we found evidence for frequent regulatory variants affecting RNA expression in tens to hundreds of genes, depending on stringency for assigning AEI. Further, we find that the extent and variability of RNA editing is similar across brain regions and across individuals. Conclusions These results identify critical factors affecting allelic ratios measured by RNA-Seq and provide a foundation for using this technology to screen allelic RNA expression on a transcriptome-wide basis. Using this technology as a screening tool reveals tens to hundreds of genes harboring frequent functional variants affecting RNA expression in the human brain. With respect to RNA editing, the similarities within and between individuals leads us to conclude that this post-transcriptional process is under heavy regulatory influence to maintain an optimal degree of editing for normal biological function.


Background
Identifying cis-acting functional genetic and epigenetic factors affecting RNA expression from trans-acting influence remains challenging. Two approaches have emerged to offset the influence of trans-acting factors, in search of causative cis-acting factors. The first approach uses large sample sizes (hundreds to thousands of tissues) to dissect the influence of trans-versus cisacting factors influencing expression by correlating RNA transcript expression, as a quantitative trait, with single-nucleotide polymorphisms (SNP) genotyped with genome-wide arrays. Correlations between RNA expression levels and SNPs yield expression quantitative trait loci (eQTLs) located in cis (cis-eQTLs) or trans (trans-eQTLs) and have been examined across a variety of tissues [1][2][3][4][5][6][7]. eQTL analysis still leaves some ambiguity regarding the cis-or trans-acting nature of a polymorphism, as cis-acting factors can be hundreds of kilobases away from the genes they regulate [7]. As an alternative to eQTL analyses, our group and others have utilized allelic RNA expression imbalance (AEI), which compares the relative expression of two alleles in the same individual as a phenotype influenced only by cisacting genetic variants [8][9][10]. Because AEI is an accurate and sensitive phenotype proximal to the functional genetic variant, this approach facilitates the detection of cis-acting regulatory polymorphisms affecting any mechanism that measurably changes RNA expression, even when those polymorphisms reside at a distance from the affected gene or in regions of high linkage disequilibrium [11]. Allelic RNA expression ratios, when measured specifically in splice variants or alternatively expressed untranslated regions, can identify genetic variants affecting RNA processing [12][13][14]. In addition to identifying cis-acting regulatory variants, AEI is a powerful phenotype for assessing the extent of RNA editing [15,16], loss-of-heterozygosity or monoallelic expression in cancer [17], and allele-specific epigenetic programming [18]. For example, directly measuring allelic-specific RNA expression in brain tumors revealed a dramatic increase in monoallelic expression of multiple oncogenes, the extent of which correlated with tumor progression and prognosis [19].
Genome-wide allelic RNA expression ratio measurements are possible by adapting genotyping array technology for quantitative measurement, demonstrating high sensitivity for detecting AEI in human cell lines and peripheral blood cells [20,21]. Multiple researchers have since used this genome-wide approach to uncover cis-acting regulatory variants in a variety of tissues [22][23][24]. However, array-based quantitative allelic analyses lacks the ability to measure AEI at rare or de novo SNPs and yields limited information about transcript isoform expression. The advent of massively parallel DNA sequencing technologies presents an opportunity to collect qualitative and quantitative aspects of gene expression in a single experiment, including splice isoform expression, genetic variants, cis-eQTLs, RNA editing, and allelic ratios [10,14,16,[25][26][27][28]. However, significant experimental and analytic challenges need to be addressed and results compared to traditional methods before RNA-Seq is deemed a reliable complement (or alternative) to existing allelic measurement techniques.
Previous characterizations of allelic ratios using RNA-Seq are subject to a number of caveats, most notably high read depth requirements [29][30][31] and underrepresentation of variant versus reference alleles [32,33] resulting from alignment algorithms penalizing variant alleles as mismatching errors when compared to the reference genome. Bioinformatic attempts to correct variant allele underrepresentation bring allelic ratios closer to unity and result in a higher number of mapped reads [33], but may not increase reliability of allelic ratio estimates [32]. Incorporating genomic information into allelic ratio measurements, for example by normalizing allelic RNA expression ratios to matched DNA ratios or by constructing personalized reference genomes for mapping, greatly improves allelic RNA ratio estimates [14,26,27]. Still, allelic expression ratios measured by RNA-Seq are yet to be systematically compared against targeted allelic expression methods to determine the reliability of RNA-Seq to measure allelic ratios. Here, we measured allelic RNA expression ratios in 9 autopsied brain regions from a single individual, using multiple alignment strategies and comparing RNA-Seq derived allelic ratios with a highly sensitive allelic quantitation method (SNaPshot). After identifying factors affecting allelic ratio estimates by RNA-Seq, we extrapolated our methods with varying stringency to a new set of wholetranscriptome RNA-Seq samples from the dorsolateral prefrontal cortex (DLPFC) of 14 different individuals, identifying tens to hundreds of genes displaying AEI in more than one individual, indicative of frequent cisacting regulatory variants. A number of these genes have evidence for harboring functional variants from cis-eQTL studies. In addition to identifying genes harboring likely cis-acting functional polymorphisms, we also surveyed sites of known RNA editing, asking whether we observed greater variability across brain regions in a single individual or across multiple individuals in the same brain region, shedding light on the degree to which RNA editing is regulated in the brain.

Results and discussion
Allelic RNA expression ratios across different alignment methods Tissue characteristics and mapping statistics from 5500 SOLiD Sequencing (Life Technologies, Grand Island, NY) in the 9 brain tissues are presented in Table 1. We used the Ovation RNA-Seq System v2 (NuGen) for cDNA synthesis, which provides coverage at non-coding (ncRNAs) and non-polyadenylated RNA transcripts in addition to protein-coding mRNAs while reducing ribosomal RNA conversion to cDNA. Given the known alignment biases in allelic RNA expression ratios in RNA-Seq [14,26,27,32,33], we compared allelic ratios at heterozygous exonic SNPs following three different alignment methods: 1) alignment to the standard NCBI Build 37/hg19 reference genome, after which reference allele counts were directly compared to variant allele counts, 2) a targeted allele-switching method requiring construction of a new "reference" genome whereby the wild-type nucleotide at 187 SNP locations in 58 genes in NCBI Build 37/hg19 were replaced with the variant nucleotide (Additional file 1: Table S1), remapped, and corresponding "reference" allele counts from both alignments used to calculate allelic ratios, and 3) a single genome-wide alignment using a modified hg19 reference genome that incorporates International Union of Pure and Applied Chemistry (IUPAC) ambiguity codes at SNP locations catalogued in dbSNP Build 135, after which reference allele counts were directly compared to variant allele counts. Construction of an "enhanced reference genome" by adding additional loci incorporating polymorphic sites is also a viable alternative [33], as is the use of personal genomes where polymorphic sites are known [14,26], or directly sequencing the genomic DNA [27], although these were not explicitly tested here.
Using either alignment method that was applied in a genome-wide manner (standard or IUPAC), the number of SNPs (or genes) available for allelic ratio analysis diminishes exponentially, as higher allelic depth is required (Additional file 2: Figure S1). We limited comparisons across the three methods to those heterozygous SNPs where allelic ratio measurements were present in all three methods, for a total of 800 independent measures across 183 SNPs in 57 genes, with depth ranging from 24 to 409 reads per SNP (Additional file 3: Table S2). To estimate the magnitude of allelic ratio correction, log-transformed allelic ratios from both correction methods (IUPAC or allele-switched) were regressed against the logtransformed allelic ratios from standard alignment ( Figure 1). IUPAC and allele-switch corrected data performed similarly, each reducing allelic ratio estimates compared to standard alignment (i.e. a 3-fold standard alignment allelic ratio corresponds to a 1.96fold IUPAC allelic ratio and a 2.03-fold alleleswitched allelic ratio).

Allelic RNA expression ratios across cDNA synthesis methods measured with SNaPshot
The two cDNA synthesis approaches used here are methodologically different. Gene-specific priming (GSP) is a strand-specific strategy, while NuGen is strandindependent and more similar to random hexamer priming. Divergent allelic ratios between these two methods can result from an admixture of plus and minus strand-encoded RNA transcripts in the NuGen cDNA, while GSP cDNA enriches for only one strand, compelling a direct comparison between the two methods. For this comparison, we individually measured allelic RNA expression ratios at 36 different SNPs in 21 genes using SNaPshot, for a total of 186 comparisons across the 9 tissues (Additional file 4: Table S3). Overall, log-transformed allelic ratios using the two cDNA syntheses were highly correlated (r 2 = 0.68, Additional file 2: Figure S2), although NuGen cDNA tended to yield higher allelic ratios, on average (i.e. a 3-fold allelic ratio in NuGen cDNA corresponds to a 2.1-fold allelic ratio in GSP cDNA). Importantly, when AEI was indicated in the GSP cDNA (>1.5-fold difference in expression between two alleles), NuGen cDNA also indicated AEI >1.5 for 17 of 21 SNPs. Similarly, when allelic ratios were <1.5 in GSP cDNA, NuGen cDNA AEI were also <1.5 at 155 of 165 SNPs. The general agreement in allelic ratios between the two cDNA synthesis methods indicates that the NuGen cDNA synthesis method used to produce the RNA-Seq libraries yields allelic ratios similar to those obtained with gene-specific priming.

RNA-Seq allelic RNA expression ratios compared to SNaPshot
Next, we compared allelic ratios resulting from any of the three alignment methods (standard, IUPAC, Allele-switching only conducted on a limited number of SNPs listed in Additional file 1: Table S1. allele-switch) to the single-gene SNaPshot measures of allelic expression (NuGen or GSP). Because we do not know the haplotype phasing of our samples, all allelic ratios were transformed to positive allelic ratio values using the formula |Log 10 (ratio)|, which ensures uniformity of SNPs with ambiguous strand alignment (C/G and A/T) and allows multiple allelic ratios in the same gene to be combined. For example, 2-fold and 0.5-fold allelic ratios at different SNPs in the same gene both represent a 2-fold relative difference between alleles and yield a 2-fold allelic ratio when combined within that gene. Allelic ratios, when compared at single SNPs, were similarly correlated between IUPAC-or allele-switch ratios versus either NuGen or GSP SNaPshot measures, while standard allelic ratios from RNA-Seq were much less correlated with either SNaPshot measure (Additional file 2: Figure S3). These correlations dramatically improved across all alignment methods, when allelic ratios were averaged at multiple SNPs in the same gene, although allelic ratios from the standard alignment were still much less correlated with SNaPshot measures (Figure 2).
Attempts at linear modeling, performed as a metaanalysis comparing ratios from the different methods using the metafor R package [34], did not return the theoretically expected level of agreement between observed allelic ratios from any alignment methods when compared against SNaPshot in either cDNA synthesis method (data not shown). Therefore, we considered pairwise logistic models (Additional file 2: Table S4) as an empirical meta-analytic approach for predicting whether RNA-Seq allelic ratios by any alignment method would meet a ≥1.5-fold threshold by our SNaPshot method in either NuGen or GSP libraries. Comparisons were only performed where corresponding data were present for both methods. RNA-Seq allelic ratios were converted into a logit score: log (greater number of reads / smaller number of reads), which was used as a predictor in each model [logit(AllelicRatio)]. Overall, the IUPAC alignment produced the best predictions by Akaike Information Criterion (AIC), while standard alignment performed the worst (lower scores corresponding to better model fit; Additional file 2: Table S4). For IUPAC ratios compared to SNaPshot ratios in the GSP library, the inclusion of two covariates with the logit(AllelicRatio) gave the best AIC value: the number of additional SNPs in in the gene times the logit (AllelicRatio) (as an "interaction term") and the heterogeneity among the allelic ratios as reported by the R meta-analysis. For IUPAC ratios compared to SNaPshot ratios in the NuGen library, these covariates did not decrease the AIC, consistent with overfitting, although they also do not result in a significantly different AIC score when included in the model. Alternative approaches, including incorporation of error estimates or derivatives as covariates, did not improve the performance of the predictor. Therefore, based on logistic regression, IUPAC allelic ratios resulted in a better model fit for allelic ratios measured by SNaPshot in either cDNA synthesis method.

Alignment correction applied across the whole transcriptome
Given similar ratios between the two alternative alignment methods and the results of our meta-analytic approach, we employed the IUPAC method on a transcriptome-wide basis in the 9 different brain regions (mapping statistics in Table 1). After transcriptomewide IUPAC alignment, we compared any annotated SNP in which both alleles had a depth of at least 3 reads (6 total reads) and the lower expressed allele constituted at least 5% of total reads at the SNP, for a total of 23,085 SNPs in 3247 genes across the 9 tissues. While 6 total reads is well below the number necessary for determining statistically significant AEI at any one SNP [30], at this point we are testing the implementation of the IUPAC alignment across the entire dataset for obvious bias. The magnitude of allelic ratio correction by IUPAC alignment versus the standard alignment was similar to that observed in the smaller dataset above (i.e. a 3-fold uncorrected allelic ratio corresponds to a 1.96-fold IUPAC-corrected ratio; Figure 3). In the IUPAC-aligned dataset, fewer SNPs were above a given allelic ratio threshold as compared to standard alignment (Figure 3 inset), decreasing in both datasets in an exponential fashion as allelic ratios increase. This alone has significant implications for examining allelic RNA expression ratios following standard alignment. When comparing the number of SNPs displaying allelic ratios >2 between alignment methods, we see a 30-50% reduction in the number of SNPs in IUPAC versus standard alignment. Therefore, the number of SNPs displaying allelic ratios with potential biological consequence is greatly overestimated just as a consequence of standard alignment methods.
As before, we combined allelic ratios at multiple SNPs in the same gene to attenuate error in single SNP allelic ratios. Similar to the single SNP analysis, the overall number of genes displaying allelic ratios above 2 is considerably higher in the standard versus IUPAC dataset. When only requiring one SNP, 13,786 gene × tissue combinations were represented in this analysis, with allelic ratios ranging from 1-to 13.3-fold. Requiring more than one SNP per gene reduces the number of gene x tissue combinations for analysis to 4,667 and also reduces the number of genes displaying major allelic ratios greater than 5-fold (<1%). We observe 641 gene × tissue combinations with ≥2-fold allelic ratios for IUPAC alignment, as compared to 1255 combinations for standard alignment. Of the 641 IUPAC ratios ≥2, 422 (66%) were represented in the standard alignment dataset. Restricting the analysis to SNPs with at least 10 counts per allele and genes with at least 2 measurements only marginally increases coincidence of genes with allelic RNA expression ratios ≥2 between the two alignment methods (69%). So, not only do the two methods give different allelic ratio estimates, but they also produce gene pools that are only 66% similar for allelic ratios ≥2. Increasing stringency in this manner also does not appear to improve the accuracy of estimated allelic ratios when compared to SNaPshot. Five of the 26 genes that overlap between IUPAC and standard alignment (DAD1, KCNQ3, NHP2L1, SCN1A, SCN4B) with allelic ratios ≥2 in the IUPAC dataset were measured with SNaPshot and only one had allelic ratios ≥2 (NHP2L1).
The variability of allelic RNA expression ratios across multiple SNPs in a single gene is another metric that can guide our search for allelic expression imbalance. To further eliminate likely false-positives indicated by high within-gene allelic ratio variability, we can ask whether any gene-wise allelic ratio remains above a certain threshold after adjusting by the standard deviation for all SNPs within that gene. For example, 109 gene × tissue combinations (98 genes) have an allelic RNA expression ratio ≥1.5 after subtracting two standard deviations from the original allelic ratio. Of those 109 genes × tissue combinations, we measured allelic ratios in 8 using SNaPshot, 7 of which displayed allelic ratios >2. These strict requirements do exclude a number of samples where SNaPshot allelic ratios are ≥2, demonstrating the tradeoff between capturing AEI with greater probability and allowing too many false positive allelic ratios.
Allelic RNA expression ratios across brain regions Now that we have characterized the sensitivity of RNA-Seq in detecting allelic ratios and have estimates of the false discovery rate as compared to our single-gene methods, we can begin to provide a meaningful interpretation of the allelic ratios observed across the 9 tissues. Taking into account the factors that best improved concordance between RNA-Seq allelic ratios and those measured by SNaPshot, we used IUPAC aligned gene- Figure 3 Standard versus IUPAC-aligned allelic RNA expression ratios in all 9 brain regions. IUPAC alignment systematically reduces allelic ratios by approximately 0.3-fold per fold change in Standard alignment allelic ratios (i.e. a 2-fold Standard allelic ratio corresponds to a 1.4-fold IUPAC allelic ratio). Inset: Following IUPAC alignment, fewer SNPs are above any given allelic ratio threshold compared to Standard alignment, suggesting uniform reduction of allelic ratios across the spectrum of allelic values.
wise allelic ratios averaged at multiple annotated SNPs, leaving us with the 4,667 gene x tissue allelic RNA expression ratios, 641 of which were ≥2, as noted above. We have chosen allelic ratios ≥2 in these IUPAC-aligned samples as being indicative of AEI. This allelic ratio value from IUPAC-aligned gene estimates should approximately correspond to a 1.5-fold allelic ratio by SNaPshot, according to our analysis above.
One hundred forty-nine genes had allelic ratios measurable in all 9 brain regions by our methods, none of which displayed AEI in all 9 tissues. Two genes, NHP2L1 and SLC1A3, displayed AEI in 8 regions and were consistent with SNaPshot allelic ratios. High allelic differences were also observed in the ninth region for each gene, but were not included in the overall analysis as the excluded tissues had only one informative SNP for AEI measurement. NHP2L1 encodes a protein that is a highly conserved component of the spliceosome, but the biological significance of altered mRNA expression for this gene is unknown. SLC1A3 is the high-affinity glial glutamate transporter (also known as Eaat1 or GLAST in rodents). Clinical phenotypes are evident for altered SLC1A3 function, including ataxia or epilepsy, but this may be a consequence strictly resulting from proteincoding mutations, as known disease-linked mutations are presumed to act in a dominant negative fashion in the assembled homotrimeric transporter [35]. Changes only to mRNA expression, as seen in heterozygous Slc1a3 knockout mice, do result in some behavioral abnormalities [36], but the applicability of these findings to humans is unclear. The other genes exhibiting ubiquitous AEI where measured included ANK1, EDEM3, FAM164A, LOC338651, PTK2B, SC5DL, SEC22B, TUBA1C, and ZNF675.

Identifying common cis-acting regulatory variants in DLPFC
A primary purpose for measuring allelic ratios is to identify common cis-acting regulatory polymorphisms. Extrapolating our findings to a set of 14 DLPFC RNA-Seq samples from different donors, which includes the DLPFC of the donor of the other 9 brain regions, we can begin to ask which genes exhibit evidence for harboring common cis-acting regulatory variants and further ask which genes show the strongest evidence. Our approach includes not only protein-coding mRNAs, but also ncRNAs, which are gaining widespread appreciation for their cis-regulatory roles in gene expression and other important biological actions [37]. Considering the high probability that SNPs alter the conformational properties of RNA [38] and the already known importance of structure-function relationships in large classes of ncRNAs (transfer RNAs, ribosomal RNAs, etc.), the inclusion of ncRNAs here presents interrogation of an additional layer of cis-acting regulation absent when only selecting protein-coding mRNAs for RNA-Seq or subsequent analysis.
Out of 14 samples, we expect to detect functional variants with a minimum heterozygosity of~15%, assuming that 2 samples displaying AEI in the same gene suggests a shared cis-acting functional variant. This analysis carries many caveats, including the assumption that only a single functional variant per gene is driving AEI, that we are able to measure allelic ratios in the same gene in all 14 samples, and that these samples do not display batch effects or artifacts associated with the library preparations or sequencing methods. Demographics and sequencing statistics are listed in Table 2.
With a cutoff of at least 3 reads per allele (6 total reads) in annotated exonic SNPs following alignment with the IUPAC reference, we calculated allelic ratios at 25,837 polymorphic sites across 7524 genes and ncRNAs in the 14 samples, for a total of 53,107 SNP x gene combinations. Using permissive parameters, we asked which genes had ≥2-fold AEI when averaged across multiple SNPs in the same gene and therefore show evidence for harboring a cis-acting regulatory variant. Over half (4083 of 7542) of all genes were excluded from further analysis because none of the 14 samples had more than one informative allelic ratio. In the remaining 3441 genes, we observe AEI in more than one sample for 500 genes (Additional file 5: Table S5). Specifically with respect to ncRNAs, we calculated 838 allelic ratios in 285 unique transcripts, of which, 49 exhibited allelic ratios ≥2 in two or more samples. We expect that in this analysis, we are likely overestimating the number of genes with AEI, especially given the lack of power to detect statistically significant AEI at low coverage [30]. In some cases, it is possible that AEI observed in many samples for the same gene is an artifact driven by the presence of a pseudogene or another family member with high sequence homology. For example, the gene/pseudogene SEC22B exhibits AEI in all 14 tissues and also displayed AEI where measured in each of the 9 brain regions. Of the 9 genes with AEI in 6 or more samples, only 2 (ANK3 and LMO7) can be excluded from obvious interference by pseudogenes or highly homologous related family members. Interference from homologous family members assumes both genes are expressed in the same tissue to a level detectable by RNA-Seq and does not necessarily disqualify putative AEI without further study.
Given our permissive parameters for designating AEI and the possibility that gene homology is contributing to overrepresentation of AEI, we increased the stringency for designating AEI based on the variability of allelic ratios between SNPs in the same gene, to ask which genes have strong evidence for harboring cis-acting regulatory variants. As above, we subtracted two standard deviations of the within gene allelic ratios from the total gene allelic ratio for each sample and used a 1.5-fold allelic ratio as a cutoff for designating AEI. This yielded only 52 genes in which AEI was observed in multiple samples ( Table 3). Three of the 52 genes were the same as those identified by the permissive analysis as having associated pseudogenes or homologous family members. Three of the 52 genes are identified as ncRNAs, but we can only exclude one ncRNA (LINC00461) from interference by pseudogenes or RNA editing (see below). Crossreferencing these 52 genes with cis-eQTLs identified by another study [1], 14 of the 44 genes where data is available exhibit evidence for harboring a common functional SNP affecting RNA expression ( Table 3). As another approach, we excluded genes in which the standard deviation between SNPs in the same gene was greater than one-third of the total allelic ratio for that gene, keeping a ≥2-fold threshold for AEI. This analysis yielded 71 genes with putative AEI, 46 of which had no significant homology with the rest of the transcriptome (Additional file 6: Table S6). With respect to RNA editing described below, only PAR-SN and PDIA3P exhibited significant AEI (Table 3) and evidence for RNA editing.
Surveying RNA editing within and across brain regions From these single base allelic ratios we can readily detect instances of RNA editing, as one type of posttranscriptional modification, and survey the variability of editing at single sites and the extent to which they are edited across different brain regions and across different individuals in the DLPFC (Table 4). In this analysis, DLPFC3 is included with the other 9 brain regions, as it originates from the same donor brain (see Table 2). We are specifically interested in determining whether RNA editing is more variable across brain regions in a single individual or within a single brain region across many individuals. Answering this question yields insight into the regulatory factors guiding this process. For example, greater variability across regions suggests that each region has a unique complement of trans-acting proteins guiding this process that is stable across individuals, while greater variability across individuals suggests that each region contains a common complement of trans-acting factors that can vary across individuals, among possible interpretations.
Cross-referencing our RNA-Seq data with known RNA editing sites from the DAtabase of RNa EDiting in humans (DARNED) [39], we find 2,358 and 3,249 sites noted in DARNED where we observe expression of the variant allele in the 10 different brain regions and 13 DLPFC, respectively (1,271 overlapping). We applied stringent criteria (see Methods for more information) to maximize the likelihood we are capturing true instances of RNA editing and also required at least 5 of 10 regions and 9 of 13 DLPFC to exhibit RNA editing at any one site for comparisons, leaving only 12 RNA editing sites in 8 genes ( Table 4). The magnitude of editing at each of the 12 sites is highly correlated (r 2 = 0.88) across the 10 regions and 13 DLPFC. In addition, the variability of  RNA editing is not significantly different within the 10 regions as compared to the 13 DLPFC, suggesting RNA editing is tightly regulated at these 12 sites. These findings could be driven by our stringent criteria for designating RNA editing sites, but other studies have found similarly consistent levels of intra-versus interindividual adenosine-to-inosine editing [40] and at greater depth in the brain [41]. RNA editing does impact some single-gene allelic ratio estimates. A total of 43 genes had annotated polymorphisms (with assigned rs numbers) where RNA editing is also reported in DARNED, including 13 of the 574 genes exhibiting ≥2-fold AEI in our less stringent analysis (Additional file 5: Table S5 Notes). Four of these 13 genes would not meet criteria for AEI analysis in one or more samples if the putative editing site was excluded, lacking the minimum 2 sites we required for allelic ratio estimates. Four of the 9 remaining genes have at least one sample previously exhibiting ≥2-fold AEI which now exhibits allele ratios <2-fold after excluding the putative RNA editing sites. Most significantly-associated SNP with any probe for the corresponding gene reported in BrainCloud [1]. N/A = no data available for gene. * = significant cis-eQTL association. RNA-Seq to detect AEI at single SNPs will likely continue to improve as read length increases, sequencing error rate decreases, and additional strategies are developed to account for variant alleles. Single SNP analysis is necessary to evaluate posttranscriptional transcript modifications, such as RNA editing. Our survey of RNA editing sites in the brain is consistent with previous reports of RNA editing conducted at much greater depth [41], suggesting that single position resolution of allelic RNA expression ratios can be reliably measured at lower depth using RNA-Seq. Further, the lack of variable editing across brain tissues and across individuals argues that RNA editing in the brain is critically maintained at an optimal level, supported by observations of dysregulated RNA editing in cancer [52]. The most obvious candidates for regulating this process globally are the adenosine deaminase enzymes, ADAR and ADARB1. Although we find some correlation between mRNA expression of ADAR or ADARB1 and RNA editing at these 12 sites, a more comprehensive analysis is necessary to further speculate on this relationship across the brain. While our study helps establish intra-and inter-individual differences in RNA editing in the human brain, the speed and breadth of genomic sequencing technologies is driving studies of RNA editing beyond simple quantitative levels, even revealing differences in subcellular editing events [53]. Our studies and others make it evident that RNA-Seq advances our ability to interrogate multiple aspects of the transcriptome in a single experiment, including allelic RNA expression ratios, as compared to singlegene approaches.

Tissue preparation and library construction
The 9 brain regions for this study were collected 12 hours post-mortem (pH 6.9) from a 20-year old African-American male smoker with no known neurological or neuropsychiatric disorders. Individual brain regions were dissected by a trained neuropathologist. The additional 14 DLPFC samples were collected in a similar manner (demographics and tissue characteristics in Table 2). RNA from each of the brain regions was TRIzol-chloroform extracted and purified with RNeasy Mini Kit spin columns (Qiagen, Germantown, MD), following standard protocol for on-column DNase treatment. DNA was isolated from the tissues using a 'salting out' method [54] supplemented with additional sodium dodecyl sulfate for lipid-rich brain tissue. Following nucleic acid isolation, 10 ng of total RNA was converted to cDNA using the Ovation RNA-Seq System V2 (Nugen Technologies, Inc., San Carlos, CA). This cDNA was used to construct libraries for massively parallel sequencing using the NEBNext DNA Library Prep Set for SOLiD (New England Biolabs, NEB, Ipswich, MA) and also for AEI measurements using SNaPshot, described below. Gene-specific primed cDNA used for SNaPshot was reverse transcribed from 500 ng of total RNA, using SuperScript III Reverse Transcriptase (Life Technologies).

RNA-seq alignments and allelic counting
For the nine brain regions from the single individual, paired-end sequenced reads from a 5500 SOLiD System (LifeTechnologies) were mapped to the human genome with LifeScope Genome Analysis Software v2.5.1 (Life Technologies) using three different methods. First, all reads from each region were mapped to the NCBI Build 37/hg19 genome using the default LifeScope RNA-Seq parameters. Single nucleotide variants were identified with Samtools v0.1.16 [55], which provides a count of the aligned reads containing the reference or variant allele. Identified SNP locations were annotated based on UCSC annotation databases and dbSNP using annovar annotation software [56]. Based on annotation, each SNP was assigned to a location within a gene locus-whether exonic, intronic, intergenic, UTR, or upstream/downstream (within 1 kb of the coding region). Exonic allelic counts, including UTRs, for each polymorphic site were used to calculate allelic expression for this first alignment, yielding standard alignment AEI values. Allelic ratios for multiple polymorphisms residing within a 100 basepair window were averaged and treated as a single observation, since they likely do not represent independent observations they likely reside on the same sequenced library fragment. Second, at 187 heterozygous polymorphisms from 53 genes expressed in at least one brain tissue (800 total instances), we built a custom reference genome by replacing the reference allele with the variant SNP allele in NCBI Build 37/hg19 using the GATK FastaAlternateReference tool (http://www.broadinstitute.org/gatk/gatkdocs/org_ broadinstitute_sting_gatk_walkers_fasta_FastaAlternate ReferenceMaker.html) and remapped all reads using the same parameters as used for the standard reference genome. AEI was then calculated as the ratio of the reference allele count from the standard alignment versus the reference allele count at the switched alleles in the modified genome alignment. Finally, all reads from each brain region were mapped to a third genome containing IUPAC ambiguous nucleotide characters for each annotated SNP in dbSNP 135, downloaded from the UCSC Genome Browser (http://hgdownload.cse.ucsc.edu/goldenPath/hg19/ snp135Mask/). IUPAC-corrected AEI was calculated as a ratio of the reference versus variant alleles. The 14 DLPFC libraries were made from cDNA synthesized by the NuGen Ovation RNA-Seq System, sequenced with the SOLiD 4 System or SOLiD 5500 System (Life Technologies), and aligned to the IUPAC references, as described above.

AEI measurement by SNaPshot
Allelic mRNA expression ratios were measured by SNaPshot in NuGen or GSP cDNA libraries in duplicate by first PCR-amplifying a region surrounding the heterozygous SNP in both cDNA (5 ng of NuGen or 12.5 ng of GSP) and 25 ng genomic DNA (gDNA) with 2× Taq Master Mix (NEB) for 30 cycles in a 10 μl total reaction. Primers (0.3 μM) used for AEI are within a single exon to allow amplification of equivalent cDNA and gDNA molecules. Following amplification, excess single-stranded primers in the PCR reaction are digested by simultaneous Exonuclease I and Antarctic phosphatase (NEB) treatment. Subsequently, 2 μl of the amplified product is added to a 5 μl total SNaPshot reaction, consisting of 1.5 μl of SNaPshot Multiple Kit reagent (Life Technologies), 1 μl of 2 μM extension primer, and 0.5 μl of water. Extension primers for the SNaPshot reaction are immediately adjacent to the SNPs, which direct incorporation of a single fluorescent dideoxynucleotide at the SNP position in the PCR amplicons, with each nucleotide represented by a different fluorophore. Following SNaPshot, unincorporated fluorescent nucleotides are digested by Calf Intestinal Phosphatase (NEB) and the resultant fluorescent SNaPshot product is separated and detected by capillary electrophoresis on an ABI3730 DNA Analyzer. Peak heights for the different fluorescent products calculated using GeneMapper 4.0 software (Life Technologies) in cDNA and gDNA are used to calculate allelic ratios (reference/ variant allele). Finally, cDNA ratios are normalized to gDNA ratios (representing a 1:1 relationship), yielding estimated allelic mRNA ratios, which indicate AEI if ratios significantly deviate from unity.

Survey of RNA editing
Following IUPAC alignment in all tissues, sites deviating from the reference allele (i.e. SNPs) were cross-referenced with known RNA editing sites from DARNED [39]. To be included in the RNA editing analysis, we required at least 5 of the 10 regions and 8 of the 13 DLPFC to exhibit variant allele reads and for the average depth across the regions and individuals to be greater than 10 reads. We also excluded locations with ambiguity in mapping due to pseudogenes. After applying these filters, we observed some instances where the reference allele used for mapping was likely incorrect relative to our population (i.e. all samples demonstrate expression of only the variant alleles). In these instances, locations where a 95% confidence interval constructed from read count distribution across samples encompassed complete (100%) mapping to the variant allele were further excluded.