Measuring isoform diversity in single cells
We selected six single cells for which cDNA was available from an earlier experiment [22]. Two cells were vascular and leptomeningeal cells (VLMCs), and four cells represented stages of oligodendrocyte maturation: Oligo1 (immature oligodendrocyte) and Oligo5 (mature myelinating oligodendrocyte). The cDNA was previously prepared using STRT/C1 [23], which resulted in full-length cDNA normally sequenced from the 5′ end, to indicate only the transcription start site. Here, we instead sequenced each cDNA sample using Pacific Biosciences Single Molecule Real Time (PacBio SMRT) technology [24], which generated long reads often comprising the entire length of each cDNA molecule. Known adapter sequences were trimmed off each end and their presence was used to confirm the full-length nature of each read. Two PacBio runs were performed, the second of which used an enrichment step for long molecules and an improved sample preparation method and yielded longer reads; reads from both experiments were pooled. We found that a large number of reads in the long data set consisted of concatemers of shorter molecules (33% of all molecules in the long data set contained three or more detectable cDNA ends, as shown by the presence of adapter sequences). This phenomenon was also present in the short data set, albeit much less frequently. Since samples were pooled after PCR amplification but before circularization, and since fragments were always found ligated end to end, we conclude that the concatemerization must have happened during the circularization reaction. We therefore split such reads into individual subreads using the adapter sequences. In order to ensure read length was not limiting, we removed all reads that did not include the polyadenylation tail as well as the first exon (Additional file 1: Figure S2A-B). Due to this, the relatively low read depth and the generally low transcript capture efficiency of single cell RNA sequencing protocols, the results below are lower-bound estimates of isoform diversity. A summary of the six sequenced cDNA libraries is given in Additional file 1: Figure S1, Additional file 2: Table S1 Reads per cell, Additional file 3: Table S2 Summary Conservative Isoforms, Additional file 4: All isoforms and Additional file 5: Isoforms and coding isoforms per gene.
We next analyzed the technical performance of our methods, to determine their quantitative accuracy. PacBio sequencing of single-cell RNA requires extensive amplification so there was a concern that the amplification would cause bias in the data. We used unique molecular identifiers (UMIs) to label individual cDNA molecules before PCR, and hence to identify and merge redundant PacBio reads originating from the same original cDNA molecule. Due to the low read depth from PacBio sequencing a large number of transcripts constituted singletons, i.e., UMIs observed only once (61% of the total). We opted to not remove such molecules since due to the low sequencing depth, they are likely to represent true isoforms. However, we made use of those cases where UMIs were sampled more than once to assess the technical reproducibility of our methods. In this way, we were able to correct for unequal amplification, as well as correct sequencing errors that otherwise would have resulted in spurious false-positive isoforms. Since each read with the same UMI came from the same molecule it was also possible to assess these technical artefacts by analyzing how much variability there was between reads with the same UMI.
We found very few errors at exon junctions (<1% of reads offset by 1 bp; Additional file 1: Figure S3A-C), however the variability at the 5′ and 3′ ends was higher, but mostly restricted to an offset of 1 bp (<15% of reads offset by 1 bp for ERCC reads and slightly higher for endogenous genes). The variability at the 5′ end was slightly lower than at the 3′ end, perhaps reflecting the presence of a polyguanine stretch at the 5′ end of these cDNAs (introduced during cDNA synthesis). Overall however, we found that we were confidently able to measure 5′ and 3′ ends of cDNAs, as well as exon-exon junctions with an accuracy of about ±1 bp. Note that the errors reported here are raw errors before UMI correction. We merged all reads with the same UMI by taking the consensus start and end position of each exon, thus reducing the error.
Another possible source of error is reverse transcription. When two identical mRNA molecules are reverse transcribed, it is possible that one of them does not result in a full-length cDNA, which could be mistaken for a true mRNA isoform. UMIs cannot correct for such errors, since UMIs are introduced during cDNA synthesis (and will thus label the two cDNA molecules with two distinct UMIs). To measure this source of technical error, we examined the ERCC (External RNA Controls Consortium) spike-in control RNA, which comprise 92 commercially available in vitro transcribed synthetic mRNAs, ranging from 255 to 2007 bp in length. ERCC transcripts have known start and end positions and can therefore be used as a benchmark for how frequently cDNAs include the proper 5′ and 3′ ends. We calculated the offset from the expected 5′ and 3′ positions for each ERCC transcript. As expected, it was greater than the sequencing error alone, with most reads falling within ±5 bp (Fig. 1a and b) both at the 5′ and the 3′ ends. The 3′ end was slightly more accurate than the 5′ end, probably reflecting premature termination of reverse transcription resulting in a shorter molecule with correct 3′ but truncated 5′ end. Ninety-two percent of all reads were aligned to within 5 bp of the 5′ end, and 98% within 5 bp of the 3′ end (Fig. 1b and c).
In previous work using PacBio sequencing [15], a marked drop in read length was noticed for ERCC molecules longer than 1.5 kb, where a median of 377 base pairs was missing. However, in our present dataset the median number of missed nucleotides for all ERCC transcripts, for both 5′ and 3′ positions, was zero, and the number of transcripts that deviated from the correct ERCC starting position increased only slightly with increased transcript length, as shown in Additional file 1: Figure S4. In conclusion, technical sources of error introduced an uncertainty of around ±1 bp at each exon boundary and less than ±5 bp transcript 5′ and 3′ ends. Conservatively, we therefore considered all variability within these boundaries as technical artifacts, which were omitted from all analyses below, and we restricted our analysis to transcripts that both covered an annotated first exon and contained a poly-A tail to ensure full length isoforms are studied.
Isoform structure in single cells
First, we examined the transcription start and termination sites. As expected, ERCC control RNAs were nearly all full-length and both 5′ and 3′ ends mapped to the extremes of each transcript (Fig. 1c). In contrast, for endogenous genes only around 30% of 5′ ends of transcripts were located near the annotated 5′ end, with a large number of truncated transcripts aligned to the 3′ UTR. This was in agreement with our previous finding that endogenous genes tend to be truncated at their 5′ ends, probably partly representing ongoing mRNA degradation [25], partly the presence of unannotated alternative transcription start sites, and partly due to strand invasion during reverse transcription, which has been shown to contribute with template switching artifacts [26].
Similarly, only around 70% of 3′ ends were located at the annotated 3′ end of genes, with the remainder distributed mostly in the 3′ UTR but away from the annotated transcription termination site. Thus, most truncated 3′ ends could be attributed to alternative polyadenylation sites in the 3′ UTR (or to degradation from the 3′ end). However, interestingly almost 5% of all transcripts ended close to the annotated 5′ end of the gene (within the first 15% of the total gene length, Fig. 1c), thus likely representing short prematurely terminated transcripts.
To illustrate the extent of isoform diversity, and the structure of common isoforms, we visualized isoforms of the Mbp gene (Myelin basic protein; Fig. 1d). Because this gene is highly expressed in oligodendrocytes, it clearly shows a number of commonly occurring features. It showed multiple different TSSs in the first exon, as well as probably degradation from the 5′ end (although it cannot be excluded that some of those events are alternative TSSs). The heterogeneity in the 3′ UTR was even greater, in the form of 3′ truncations as well as exclusion of internal segments of the 3′ UTR. As seen in Additional file 1: Figure S5, this phenomenon appears also in other highly expressed genes like Plp1 and Cnp. There were a number of different exon cassette inclusion events. Interestingly, Mbp showed evidence of exon connectivity [11], where exons 4 and 5 were almost always either both included or both excluded (P < 0.001 by Fisher’s exact test, two-sided). This was in contrast to exons 2 and 3, which were independently excluded or included (P = 0.31). Overall, nearly every Mbp transcript was different (for example 61 transcripts distributed over 33 isoforms for Oligo 1.1, as seen in Additional file 5), and this diversity existed within individual oligodendrocyte cells.
We validated that the identified exon isoforms weren’t an artefact of the sequencing process by Sanger sequencing a total of 26 isoforms (represented by more than UMI) from 5 genes. To see if the isoform results were reproducible in a cell that hadn’t been sequenced an “independent” cell was added to the validation experiments. Additionally, to verify that the isoforms identified weren’t an artefact of the amplification process, an amplification-free library was created of bulk material from oligodendrocyte rich areas of the brain. The amplification-free sequencing could verify full-length isoforms, whereas Sanger sequencing verified inclusion/exclusion of specific exon. The results are shown in Additional file 1: Figure S5A-J. As an example, Additional file 1: Figure S5E shows the Mbp gene, the primer pairs (PP) used in the PCR, the length of the PCR product for two cells and the mapping of the Sanger sequenced PCR products. A number of conclusions can be drawn from these validation experiments. First they confirm the results from PacBio sequencing: Exon cassette 2 can either be included or excluded (PP 1, 2 and 6), exon cassettes 3, 4 and 5 seem to be included or excluded in combination (PP 2), exon cassette 6 can either be included or excluded (PP 4), and extensive heterogeneity is seen in the 3′ UTR, where some isoforms excludes a large part of exon 7 (PP 5). Interestingly, the two cells used for validation gave very similar results, the most obvious difference being for PP 5 where oligo 1.1 has two bands and oligo 1.2 has three. One of the shorter products (PP5-C in Additional file 1: Figure S5E (A and D)) for oligo 1.2 was verified by Sanger sequencing, as well as the shorter product for oligo 1.1, and those two products were different. Unsurprisingly this heterogeneity was in the UTR region. Thus, the differences in isoforms observed by PacBio sequencing between cells oligo 1.1 and oligo 1.2 are probably partly due to incomplete sequencing of the latter cell, although some isoforms were clearly cell-specific. This also shows that the number of isoforms identified in this study is a lower-bound estimate, especially for cells with fewer sequencing reads.
Unfortunately, the amplification-free library for PacBio sequencing had markedly fewer accepted Zero-mode waveguides (ZMW) compared to the libraries from single cells (in average 7.500 compared to 22.500 for the long read long read and 28.500 for the short read library), using standard quality filtering. The library was sequenced two times and the loading concentration was increased the second time, but that didn’t increase the number of accepted ZMWs. We therefore lowered the calling stringency from one pass (reads with adapter sequence at both ends) and a quality score of 90, to zero pass and a quality score of 75 for the amplification-free library, since large structures like exons still would be identified even with lower quality sequences. This increased the number of accepted ZMWs to in total 45.100. Still some highly expressed genes in the single-cell data set had no or very few transcripts in the amplification-free data set, like Mobp, were not a single isoform could be confirmed.
Generally, exon structures with many transcripts could be verified with both Sanger sequencing and amplification-free Pacbio sequencing. Sanger sequencing could verify more isoforms than amplification-free sequencing probably due to the low read depth of the amplification-free library, in combination with that the library was made of bulk material, including many non-oligodendrocyte cells, and was prepared from the brain stem and striatum (due to high oligodendrocyte content) compared with the single cells that came from hippocampus and the cortex. Of 26 exon isoforms identified in the 5 genes that were Sanger sequenced (single transcript exon isoforms not counted), 8 could be identifies by both methods, 9 more could be identified by Sanger sequencing and 9 couldn’t be verified by any method. Interestingly the existence of 3′UTR introns for some genes could be verified by both Sanger sequencing and amplification-free Pacbio sequencing.
Not all genes were as highly expressed as Mbp. Average number of transcripts per gene were just three, and average number of UMI per isoform two (Fig. 1e). To quantify the sources of isoform diversity in single cells, we counted cases of alternative “TSS” (5′ end of the cDNA), “TTS” (3′ end of the cDNA), “position” (differences in the start or end position of an exon) and “cassette” (exon cassette inclusion/exclusion). We occasionally observed intron retention, but they were very rare events and were therefore excluded from further analysis. As shown in Additional file 1: Figure S3, the variability at exon 5′ and 3′ borders was similar, and we therefore combined these into a single “position” category. Alternative TSS and TTS variation represented more than 70% of all isoform-generating events (Fig. 1f). In contrast, exon cassette exclusion and exon position events affected only about 10–20% of all events. Like Mbp, many genes were affected by all sources of isoform diversity, which led to a very large number of distinct isoforms in single cells. There was a high variation in number of events per cell, which reflected differences in the total number of mRNA molecules. However, the ratio between the events was stable among the cells.
Thus it is clear that the combination of several sources of diversity leads to a great heterogeneity of mRNA isoforms, even in single cells. Intriguingly, we found that for many genes nearly every single transcript represented a distinct isoform (e.g., Mbp). As gene expression levels increased, the number of isoforms increased almost as rapidly. This was true for pooled data as well as within individual single cells (Fig. 2a), however it was not true when considering exon cassette isoforms only, where no such trend could be discerned.
Few isoforms were shared between cells as shown in Fig. 2b. Only 23% of all detected isoforms were shared between any two cell-types, and only 7% of all detected isoforms were shared between all cell-types (considering only isoforms belonging to genes represented by at least one transcript in all cell types). These are lower-bound estimates, because of the limited depth of sequencing, and (as noted above) validation by Sanger sequencing showed a greater proportion of shared splicing events.
For exon cassette isoforms, almost 60% of all detected isoforms were shared. The expression level was generally higher for shared isoforms (Fig. 2c). However lowly expressed isoforms have a higher probability to be missed due to the low mRNA capture rate, so it is possible that lowly expressed isoforms are shared among the different cell types too. The major isoform constituted in average around 50% of total gene expression (considering only genes with more than 10 transcripts, Additional file 1: Figure S6A), suggesting that some isoforms were preferred. Neither the number of annotated gene exons nor the overall gene expression had a major impact on the percentage of major isoform expression (Additional file 1: Figure S6 B-C).
In order to estimate the true number of isoforms in each single cell, we made use of a recently published Bayesian method to accurately extrapolate the complexity of DNA libraries (Preseq, [27]). We found that when a cell over time has transcribed 600,000 molecules of mRNA, it will have generated between 5 and 15 conservative isoforms per gene, and between 2 and 4 exon cassette isoforms (Fig. 2d). Both VLMC cells showed a low estimated number of isoforms per gene, which is reasonable considering that VLMCs are small and express a smaller total number of mRNA molecules.
The observation of a great diversity of isoforms in single cells naturally leads to the question of how this may affect the repertoire of proteins expressed. Isoform diversity was not limited to non-coding regions, as can be seen in Fig. 3b, which shows isoform diversity considering only coding isoforms (Methods). Thus, even in single cells, each gene can be expected to give rise to multiple distinct protein isoforms, greatly expanding the coding repertoire.
We hypothesized that isoforms that would affect protein-coding sequence would be more tightly regulated, leading to a reduced diversity at these sites. To examine this, we repeated the analysis leading up to Fig. 1a, splitting the dataset into coding and non-coding events (Fig. 3a; see Methods). There was a clear difference between coding and noncoding splicing events (Fig. 3a and Additional file 1: Figure S7). The observed variation in coding exon junctions was limited to ±1 bp, in the same range as the technical variation due to amplification or sequencing errors (Additional file 1: Figure S3 and S7). In contrast, at non-coding exon junctions the variation was larger, extending well outside the ±1 bp region, and sometimes as far as hundreds of base pairs. The difference between coding and non-coding junctions was statistically significant (P < 0.001 Wilcoxon signed-rank test, two-sided) both at the start and end of internal exons. This suggests that coding exon splicing has evolved to be under stricter control than non-coding exon splicing, likely to prevent the generation of anomalous protein products.
Despite the stricter control of coding variants at each splice junction, since genes contain multiple exons, we found that a large number of coding isoforms were present in single cells (Fig. 3b). Although most genes had fewer than five isoforms, many had more and even under conservative estimates some genes showed up to 25 distinct coding isoforms. These results point to an underappreciated richness of alternative protein forms being simultaneously present in individual cells.