The systematic bias in measures of ASE correlates with the density of differentiating sites
As described above, Degner et al. [14] found that allele-specific reads mapped preferentially to the reference allele when using a single reference genome to quantify ASE. The alignment parameters they used allowed two or fewer bases within each read to differ from the reference genome. Reads perfectly matching the reference genome were assigned to the reference allele, while reads with at least one difference from the reference genome were assigned to an alternative allele. We hypothesized that the inability to map reads with more differences from the reference genome than mismatches allowed underestimated the abundance of the alternative allele and caused measures of ASE to be biased toward the reference allele.
To test this hypothesis, we generated an equal number of reads from two genotypes in silico, combined them, and measured the relative abundance of allele-specific reads. These sequences were derived from 52,370 non-overlapping constitutively-expressed exons in Drosophila melanogaster (Additional file 1; [15]). The annotated D. melanogaster genome (dm3) was used as the “reference” allele, and an edited version of this genome with 93,781 coding sites altered to match alleles in a line of D. melanogaster from the Drosophila Genetic Reference Panel [23, 24] was used as an “alternative” allele. We generated 36-base reads from each allele starting at every possible position in each exon and repeated this process for both strands of DNA because RNA-seq is usually performed using double-stranded cDNA (Figure 1). This process generated 93,395,272 reads, representing ~3.4 Gb of sequencing data. Importantly, this approach guaranteed that reads from each allele were present in equal amounts. To quantify relative allelic abundance as a proxy for relative ASE, we aligned each read to the reference genome using Bowtie [10], excluding reads that mapped to multiple locations, and evaluated the number of reads assigned to the reference and alternative alleles at each differentiating site using SAMtools [13].
Initially, we allowed one mismatch to the reference genome during the alignment step, which is the minimum number required to align a read from the alternative allele. We found that 50.9% of differentiating sites had unequal measures of allelic abundance, 99.3% of which were biased toward the reference allele. To determine whether this bias was influenced by the density of differentiating sites, we calculated the maximum number of sites that differed between the two alleles among all possible 36-base reads overlapping each differentiating site (Figure 1). Of all sites considered, 49.8% had at least one neighboring differentiating site (i.e., at least one other differentiating site within an overlapping read). Of these sites, 99.8% showed more reads assigned to the reference allele than to the alternative allele. Furthermore, the extent of bias toward the reference allele increased with the number of neighboring differentiating sites (Figure 2A). This bias was caused by the failure of reads simulated from the alternative allele to align to the reference genome more often than those simulated from the reference allele. Aligning reads to only the alternative allele produced complementary results (Additional file 2). These findings are consistent with our hypothesis that the density of differentiating sites complicates the mapping of reads and leads to biased measures of relative ASE.
To decrease the impact of neighboring differentiating sites on allelic assignment, we allowed two or three mismatches when aligning our simulated reads to the reference genome. We found that increasing the number of mismatches improved measures of allelic abundance: 80.2% and 91.9% of differentiating sites were inferred to be equally abundant when two and three mismatches, respectively, were allowed. A bias toward the reference allele was still observed, but only for sites where the number of neighboring differentiating sites was greater than or equal to the number of mismatches allowed during the alignment step (Figure 2B,C). Increasing the number of mismatches allowed reduced the bias toward the reference allele, but increased the percentage of reads that failed to map uniquely: allowing one, two, and three mismatches, 2.2%, 2.5%, and 2.9% of all reads failed to map uniquely, respectively.
For comparison, we aligned the simulated reads independently to the reference and alternative genomes with the same parameters used when aligning reads to the single reference genome except that zero mismatches were allowed. This is analogous to aligning reads to the maternal and paternal genomes, which is a strategy that has previously been shown to produce unbiased measures of relative ASE [15, 18–20, 25]. We found that 99.0% of differentiating sites showed equal representation of the two alleles, with the rest showing no systematic bias toward either allele (Figure 2D). Only 1.9% of all reads were excluded because they failed to map uniquely to at least one genome.
Read length and the amount of sequence divergence can also affect allelic bias
Given the observed impact of neighboring differentiating sites on allelic assignments, we hypothesized that longer reads might produce less accurate measurements of allele-specific abundance because they should overlap more neighboring differentiating sites. To test this hypothesis, we repeated our simulation with 50-base reads, determining the maximum number of sites that differed between the two alleles among all possible 50-base reads overlapping each differentiating site. We found that 40.6%, 73.0%, and 88.9% of differentiating sites showed equal representation of the two alleles when aligned to a single reference genome with one, two or three mismatches allowed (Figure 2E-G). Increasing the number of mismatches allowed when aligning the 50-base sequence reads to be more similar to the ratio of mismatches allowed for the 36-base sequence reads eliminated this difference, however. 91.9% and 92.1% of differentiating sites showed equal allelic abundance for 36- and 50-base reads when three and four mismatches, respectively, were allowed (Additional file 3). By contrast, 98.8% of differentiating sites showed equal representation when reads were aligned to the maternal and paternal genomes with zero mismatches allowed (Figure 2H).
Increased sequence divergence is also expected to affect measures of relative allelic abundance because it should increase the average number of neighboring differentiating sites within each read. To test this hypothesis, we simulated 36-base reads from two different Drosophila species (D. melanogaster and D. simulans;[16]) and analyzed them as described above, using the D. melanogaster exome as the single reference genome. Sequences from 60,040 orthologous exons with 1,130,435 differentiating sites were used for this simulation, which is an order of magnitude more differentiating sites than between the two strains of D. melanogaster analyzed. As predicted, we found that the bias toward the reference allele was higher for the interspecific comparison than for the intraspecific comparison when reads were aligned to a single reference genome (Figure 2, compare I-K with A-C). When aligning reads to both parental genomes, however, sequence divergence had a negligible impact: the intra- and interspecific datasets produced nearly identical results (Figure 2, compare L with D).
Allele-specific differences in mappability and insertions/deletions affect measurements of ASE
Differences between alleles in sequences that appear more than once in the genome can also cause reads to be excluded for one allele but not the other [14]. Assuming the number of such differentiating sites is similar between alleles, differences in allele-specific mappability should not systematically favor one allele or the other, but will still cause errors in relative ASE. To examine the impact of mappability on measures of relative allelic abundance derived from our simulated data, we used software from the GEM library [26] to calculate a mappability score for each differentiating site by averaging the mappability scores of all possible reads that included that site. In each case, mappability scores were calculated using the same number of mismatches allowed during read alignment. Differentiating sites with an average mappability score < 1 were considered to have imperfect mappability when using a single reference genome. When using parental genomes, we summed the average mappability scores for each allele, and mappability scores < 2 were considered to have imperfect mappability.
We then compared relative allelic abundance for sites with perfect and imperfect mappability in all three simulated datasets (Figure 3), excluding sites with more neighboring differentiating sites than the number of mismatches allowed when aligning to a single reference genome. For both the 36- and 50-base reads simulated from the two D. melanogaster genotypes, > 97.9% of sites with perfect mappability showed the expected equal abundance of the reference and alternative alleles under all mapping conditions (Figure 3A-H). For the 36-base reads simulated from the D. melanogaster and D. simulans genomes, 99.9% of sites with perfect mappability showed equal abundance when reads were aligned to both parental genomes (Figure 3L), but only ~94% of sites with perfect mappability showed such equal abundance when reads were aligned to a single (D. melanogaster) reference genome (Figure 3I-K).
We hypothesized that this decrease in accuracy after aligning D. melanogaster and D. simulans reads to a single reference genome might be caused by the presence of insertions or deletions (indels) between D. melanogaster and D. simulans that are located near differentiating sites (i.e., within the length of a read from the differentiating site). Such indels can prevent the alignment of D. simulans reads to the D. melanogaster genome. Consistent with this hypothesis, we found that sites with perfect mappability that had an indel nearby showed more reads assigned to D. melanogaster than D. simulans allele when reads were aligned to only the D. melanogaster genome, whereas sites with perfect mappability that lacked such an indel did not (Figure 4A-C). When reads were aligned to both parental genomes, sites with perfect mappability showed equal representation of the two alleles regardless of the presence or absence of nearby indels (Figure 4D). Indels were not a factor in our comparisons of the two D. melanogaster strains because the alternative allele was constructed by changing only single nucleotides in the reference allele.
Aligning real sequencing data to a single genome can produce reliable measures of relative ASE
Assessing the accuracy of relative ASE measurements derived from RNA-seq data is challenging because the true value of relative ASE is rarely known. Independent empirical methods for measuring relative ASE such as Pyrosequencing and qPCR can be used to validate RNA-seq data for individual genes, but they are not suitable for quantifying relative ASE on a genomic scale. Therefore, instead of using real RNA-seq data to evaluate factors affecting measures of relative ASE, we used sequence data that was collected in a comparable manner from genomic DNA extracted from F1 hybrids, in which all maternal and paternal alleles are expected to be present in equal amounts.
Specifically, we used 36-base reads from genomic DNA extracted from female F1 hybrids that were produced by crossing inbred strains of D. melanogaster and D. simulans[16]. These strains had the same genotypes as the D. melanogaster and D. simulans sequences used for the interspecific simulation described above. Reads were aligned to the D. melanogaster exons allowing one, two, or three mismatches, as well as to both the D. melanogaster and D. simulans exons allowing zero mismatches. Because real sequencing data involves stochastic sampling, the proportion of the reference allele observed was not always expected to be 0.5. Therefore, after aligning reads, we excluded differentiating sites with fewer than 20 overlapping reads and used binomial exact tests with a false discovery rate threshold of 0.05 to test each differentiating site for a statistically significant difference in relative allelic abundance [15, 27].
As described above, our simulated datasets showed that reads containing (1) as many or more neighboring differentiating sites as mismatches allowed during alignment, (2) imperfect mappability, and/or (3) an indel(s) between alleles can cause inaccurate measures of relative allelic abundance. Differentiating sites with an excess of neighboring differentiating sites were the most common of these three types of problematic sites in both intra- and interspecific simulations (Figure 5A). To determine the relative impact of each of these factors on measures of allele-specific abundance derived from real sequencing data, we filtered the differentiating sites based on each factor sequentially and determined the percentage of differentiating sites retained that had no statistically significant difference in abundance between alleles (hereafter referred to as “equal allelic abundance”) for each alignment strategy.
Prior to excluding any sites, 70.4%, 88.9%, and 93.3%, respectively, of all differentiating sites showed equal allelic abundance when reads were aligned to a single genome with one, two, or three mismatches allowed. After aligning reads to both parental genomes, 96.9% showed evidence of equal allelic abundance. Excluding differentiating sites with at least as many neighboring differentiating sites as the number of mismatches allowed increased this percentage to 96.3%-96.6% when aligning to a single reference genome (Figure 5B). Further restricting the set of differentiating sites to those with perfect mappability increased these percentages ~0.1%, and subsequently excluding differentiating sites with indels nearby increased the percentage of genes with equal allelic abundance an additional ~0.1% (Figure 5B). After filtering out these problematic sites, measures of relative allelic abundance derived from aligning reads to a single reference genome were similar to those produced by aligning sequence reads separately to the maternal and paternal genomes (Figure 5C-E).
Excluding selected differentiating sites maintains ability to measure relative ASE for most exons
We focused on measures of relative ASE for individual sites in this study, but most researchers are more interested in relative ASE for individual exons and/or genes. The major consequence of excluding sites based on the density of differentiating sites, mappability, and/or indels is that fewer allele-specific reads will be successfully mapped for each exon and for each gene. After filtering based on the number of neighboring differentiating sites, we found that 46.6%-86.9% and 8.3%-50.5% of differentiating sites were retained in the 36-base intra- and interspecific simulations, respectively, when the reads were aligned to a single reference genome and one, two, or three mismatches were allowed (Figure 6). By comparison, 81.8%-91.8% and 66.3%-95.2% of exons contained at least one of these reliable differentiating sites when the same alignment conditions were used in the intra- and interspecific simulations, respectively. Excluding additional differentiating sites with imperfect mappability in both datasets, as well as sites with one or more nearby indels in the intraspecific dataset, had little effect on the proportion of differentiating sites and exons retained (Figure 6). The retention of more differentiating sites and exons in the intraspecific simulation than in the interspecific simulation (Figure 6) is consistent with the lower sequence divergence within than between species. Analyses using real and simulated reads to compare the same sets of alleles retain the same sites and exons when aligned to the same reference genome because differentiating sites are excluded based only on the genome sequence(s).