Teleost introns can be divided into long and short introns
Bimodal distributions across the vertebrates
We made use of the Ensembl database to obtain gene coordinates from 172 different vertebrate species and calculated the distributions of intron sizes for all canonical transcripts from each species (Fig. 1 and S2). Teleost distributions were very variable, but the majority of species had bimodal distributions with an antimode (trough) at approximately 256 (28) bp separating short introns peaking at around 76 bp from a more broadly distributed population of longer introns.
Notably, this included species both from the dominant Percomorphaceae clade (42 of 54 teleost species analysed) and others (eg. Danio rerio, Denticeps clupeoides). Interestingly, this typical distribution was not observed in Hucho hucho and its most closely related non-salmonid species Esox lucius, as well as some other species (eg. Clupea harengus, Electrophorus electricus). In smaller genomes the short intron peak (< 256bp) was almost completely dominant obscuring the much smaller peak of long introns (Fig. 1A, S2, additional file 1).
A preliminary analysis of the species in Ensembl version 104, that includes many additional salmonids appears to support the notion of a clade specific distribution for the salmonids and Esox lucius that doees not have the typical teleost antimode (additional file 2).
We also observed bimodality in the distributions of genomes belonging to both the Mammalia and Sauria clades with a minor peak (76 bp) of introns shorter than 150 bp (Fig. 1 B, C, S2 and additional file 1, 2) as previously reported [22].
In addition the intron size distributions in the Sauria could be clearly divided into two groups corresponding to Aves and non-Aves species. Aves genomes contained shorter introns consistent with their generally smaller genome sizes (Fig. 1C).
The bimodal distribution observed in the teleosts was not seen in other aquatic vertebrates. The most closely related non-teleost analysed here, Lepisosteus oculatus, had a distribution more similar to mammals and birds then to teleosts. We also observed a very clear bimodal distribution in the jawless vertebrate Eptatretus burgeri (hagfish), but this distribution was different to that observed in teleosts (additional file 1). We thus infer that the bimodality observed across the teleosts is teleost specific.
In general, the vast majority of intron sizes reported were larger than 75 bp and it is likely that this represents a lower bound for the size of a vertebrate intron. The proportion of very small (<32 bp) introns varied across the vertebrates in a manner unrelated to phylogeny. Genomes with higher proportions of such introns were usually from highly fragmented assemblies with short scaffold lengths (Fig. S3). Hence, intron sizes reported to be smaller than this lower-bound are likely to represent annotation errors.
The proportion of long introns reflects genome size
Since the short intron peak seemed to dominate the small teleost genomes, we plotted the proportion of introns shorter than 256 bp against genome size to determine whether this represents a general property (Fig. 1 D). Indeed a strong correlation between genome size and the fraction of short (< 256 bp) introns was readily observed across the teleosts. This suggests that the relationship observed between median intron and genome size [16] does not simply reflect differences in the size of all introns, but may also arise from a binary change in intron size status.
First introns are more likely to be long in teleosts
The first intron of a transcript is generally both longer and more likely to contain regulatory elements [23–25] and we asked whether first introns were more likely to belong to the longer class. Indeed, first introns were markedly more likely to be longer than 256 bp across all teleosts (Fig. 1E and S2). Although first introns were in general longer across the vertebrates we did not observe a major partition around 256 bp for non-teleost species and the differences in distributions were greater for teleost species (Fig. S4). Interestingly, the fractions of long first introns in teleosts were comparable to the fractions of long introns in Aves species with similar genome sizes (Fig. 1E).
Evolution of vertebrate intron sizes
An intron orthology
To investigate how intron size has evolved across the vertebrates we constructed an intron orthology by pairwise alignments of orthologous genes. To simplify the orthology we concentrated on gene families that mostly have only a single orthologue in each species. Since we are primarily interested in intron evolution in teleosts we made use of an orthology based on alignments to Danio rerio transcripts as these are derived from the most complete teleost genome annotation. This intron orthology contains a total of 63,068 introns from 5,752 genes. Although we did not find orthologues in all species for all Danio rerio introns, the majority of introns were represented by at least 156 species and 90% (154/172) of species had orthologues to at least 47,609 Danio rerio introns (Fig. S5).
Long teleost introns predict long mammalian introns
We first asked to what extent intron size in teleosts predicts intron size in other clades by comparing the median lengths of orthologue sets in mammals and teleosts (Fig. 2 A-C). The median (across species) intron size in teleosts and mammals were highly correlated for introns whose teleost median lengths were above 1000 (210) bp (Fig. 2 A-C). This represents 20% of the full intron set (Fig. 2 D); 73% of the mammalian orthologues of these introns had median intron lengths longer than the 50th percentile (i.e. median of median lengths) and 20% were longer than the 95th percentile (Fig. 2E). In general, the longer the teleost median length, the stronger the prediction (observed / expected) of long intron size in mammals (Fig. 2F).
These observations suggest that introns which have remained long across the majority of teleosts do not represent a random sub-set, and that these contain functional sequences which are conserved in mammalian species.
Intron size mutual information
To determine to what extent intron size correlates between species we calculated the mutual information in intron size between pairs of species. For this analysis we excluded genomes where the fraction of predicted introns shorter than 32 bp was larger than 2.5% (Fig. S3) since they obscured otherwise clear general trends. We observed clearly significant levels of mutual information for all pairs of species indicating some conservation of intron sizes across all vertebrates (significance judged by Monte-Carlo sampling of 10,000 random permutations). The amount of mutual information between species was clearly a function of taxonomic grouping with high mutual information observed within both the mammalian and Sauria clades (Fig. 3A).
Interestingly, the extent of mutual information between teleost species was much lower than that observed for mammal and sauria pairs. This is suggestive of an increased rate of intron size evolution in teleosts. To ensure that this was not merely a function of the evolutionary distances of the represented teleost species we also calculated pairwise Kimura-two factor distances for all species based on pairwise alignments of 472 (gene) orthologues. The mutual information for a given Kimura distance was lowest for pairs of teleost species and highest for pairs of mammalian species (Fig. 3B). This argues for an increased rate of intron size evolution in teleosts.
Intron size reduction in teleostei and aves
Teleost genomes and intron sizes are in general smaller than those of other vertebrates, suggesting a decrease in genome size that is specific to teleosts (i.e. a synapomorphy). However, it is also possible that the common vertebrate ancestor possessed a small genome with small introns and that this expanded in size in non-teleost clades. To evaluate the likelihood of these alternatives we used Sankoff maximum parsimony on a neighbour-joining tree (Fig. S6) created from pairwise Kimura two factor distances (as used in Fig. 3) to infer the sizes of introns in ancestors. This inference (Fig. 4) is consistent with intron sizes in the common vertebrate ancestor having been more similar to extant mammalian and Sauria species, and that intron sizes in the teleost clade decreased drastically both before and after divergence from Lepisosteus oculatus.
Three teleost species have unusually large genomes and corresponding introns; Danio rerio, Astyanax mexicanus and Pygocentrus nattereri. All three are derived from a branch diverging early in teleost evolution and A. mexicanus and P. nattereri have an exclusive common ancestor. Nevertheless, all these three species appear to have experienced independent increases in intron sizes since their common ancestors suggesting an ongoing process of genome size expansion presumably through the accumulation of transposable elements (TEs) [26].
Perhaps most striking is the apparent intron size reduction in avian species; avians are members of the Sauria clade, but compared to other Sauria species they have much smaller intron sizes. The majority of this intron size reduction appears to have occurred in the evolution of their common ancestor with smaller changes after divergence which is consistent with an inference of small genomes in the extinct pterosaur lineage [27] and in extant alligators [28]. Other Sauria species appear to have experienced intron size expansion rather than contraction with Sphenodon punctatus having one of the largest genomes in the data set.
We also asked to what extent intron size minimisation is an ongoing process by inspecting the difference in intron size between extant species and their most recent inferred ancestor. Most species appear to have experienced recent increases in intron size (Fig. S7). However, the teleost species with the smallest genomes had shorter introns than their most recent ancestors, suggesting that introns in these species are likely to become shorter in their descendants.
Importantly, there is relatively little conservation in intron size within the teleosts suggesting an ongoing process of intron size diversification. Although the majority of decrease in intron size seems to have occurred prior to the most recent common teleost ancestor, most teleost species have introns considerably smaller than this. This suggests a subsequent parallel decrease in intron size across the teleost clade.
Size dependent intron size evolution
We next considered how intron size has evolved in different vertebrate clades by comparing extant intron sizes with those inferred for the ancestor of the jawed vertebrates (JVA, node 259, Fig. S6). Introns inferred as shorter than ca. 2000 bp in the JVA were bimodally distributed in Danio rerio, with two peaks at around 76 and 2000 bp (Fig. 5A,B). The shorter of these corresponds to what is likely to be the minimal size of vertebrate introns. Introns longer than 2000 bp in the JVA had distributions in Danio rerio with peaks corresponding to their JVA size. In general, the shorter the JVA length the larger the proportion of minimised introns in Danio rerio.
In contrast, introns shorter than 2000 bp in the JVA were predominantly minimised in Takifugu rubripes, with longer introns having broad distributions peaking at less than half of the JVA size (Fig. 5C). The pattern in Mus musculus was more similar to that of Danio rerio with a bimodal distribution for short introns (Fig. 5D).
We calculated the distributions of intron size change between the JVA and the extant species as a function of JVA size and traced the mode lines (i.e. peaks of distributions discretised by their size in the JVA; Fig. 5E). Dual modes of size change were observed for short introns across the vertebrates, with the lower mode at the likely vertebrate minimal intron size (76 bp). The range of JVA sizes for which bimodality and minimisation was observed was much larger in the teleosts, with peaks of minimised introns observed for introns up to 2000 bp long in the JVA. The second mode line was biphasic with introns shorter than 2000 bp in the JVA having a mode associated with a constant extant size of ca. 2000 bp (upper solid lines in Fig. 5B-D). Introns longer than this in the JVA had extant modes correlating with JVA size; these were longest in the mammals and shortest in the teleosts.
Introns which were inferred to be short in the JVA were more likely to be minimised across the vertebrates. However, the rate of minimisation was much higher in teleosts, especially for introns inferred to be long in the JVA (Fig. 5F). The overall rate of minimisation was fairly constant in the birds and mammals, but varied extensively with genome size in the teleosts (Fig. 5F,G).
We note that the distribution of inferred intron size in the JVA does not have a peak of minimally sized introns (Fig. 5B-D). This is unlike all extant distributions and is likely to reflect the limitations of the ancestral state inference. The JVA is likely to have had a peak of minimal introns, but it may not be possible to infer these using maximum parsimony if their state has not been maintained in a sufficient number of extant species.
These observations suggest that introns have evolved differently depending on their size with longer introns being more conserved in size. If this is true, then we should see less conservation in extant intron length for short introns compared to long ones. To assess this we first plotted the intron size in Oryzias latipes against Danio rerio (Fig. 5H). Although Danio rerio has one of the larger genomes with a smaller proportion of minimised introns it nevertheless has a large number of minimal introns that are not shared with Oryzias latipes, suggesting that minimisation has occurred in parallel in the two lineages giving rise to these species. In general, there appears to be very little correlation in intron size for short introns whereas longer introns appear well correlated between the two species. This is consistent with a scenario where short introns evolve quickly, with relatively rapid shuffling between minimised and non-minimal states, whereas longer introns are less likely to change. This is also consistent with the relative lack of minimised introns inferred in the JVA.
To test whether this pattern is general we compared the correlation coefficients between introns short and long in Danio rerio and their orthologues in other vertebrates (Fig. 5I). The correlation coefficients for long introns were between two to six times larger than those for short introns. This difference increased with phylogenetic distance as would be expected if the evolution of long introns is more constrained than that of short ones (for identically sized intron sets the correlation will be the same; as evolution proceeds, the loss of correlation proceeds faster in the shorter set and the difference increases). Similar results were obtained when using a range of species to define the short and long intron sets.
Taken together these observations suggest that short and long introns have evolved in different manners and that long intron size is more likely to represent ancestral size than short introns. This difference in mode of evolution may have resulted from the presence of function in long introns or as a side-effect of the molecular mechanisms underlying intron size evolution.
Functional association
Genes with long introns are enriched for specific biological functions
Retention of intron sequence or length across the vertebrates suggests the presence of some regulatory function. If this is the case, then genes which do not contain any long introns should be less likely to be highly regulated and this ought to be reflected in their biological and molecular function. To test this we selected nested sets of genes containing at least one intron that is long (>Ti where T was a series of thresholds) across the teleosts and performed gene ontology (GO) enrichment analysis using human annotation. Sets of genes containing long introns were strongly enriched for specific biological process (BP), molecular function (MF) and cellular compartment (CC) terms. The enriched BP and MF terms were related to system development, cell signalling and transcriptional regulation (data not shown). This is consistent with genes containing long introns being more regulated than other genes and previous observations of short introns in highly expressed and house-keeping genes [15, 29].
Since this gene selection could have been biased towards genes containing large numbers of introns we also selected genes by randomly selecting introns from the intron orthology and performed the same analyses. Although such genes were strongly enriched for specific GO terms as would be expected (due to preferential selection of genes with many introns), these terms did not overlap with those found by selecting genes with long introns.
To further investigate the functional associations we ordered genes by their longest minimal teleost intron length, determined the number of genes belonging to enriched GO terms for all nested sets (from short to long) and calculated depletion statistics (Fig. 6, Fig. S8-10, additional file 3). For most terms the strongest depletion probability occurred between 28-29 bp (Fig. S8-10, additional file 3). This position roughly corresponds with the observed antimode observed in teleosts lengths and is consistent with distinct functional roles for short and long introns in teleosts.
Long introns are enriched for conserved sequences
If conservation of intron length is related to the retention of regulatory sequences then this should be reflected in an increase in sequence conservation. To test this we aligned sets of intron sequences from Danio rerio chosen by their lengths across the teleosts to their orthologues in other species. Because regulatory function may be present in several regions across the intron we performed local alignments and extracted all non-overlapping alignments to Danio rerio sequences. High scores were more easily found from alignments of introns that were consistently long across the teleosts (Fig. S11, additional file 4) than for variable length introns (Fig. S12, additional file 5). These included alignments across the full length of Danio rerio introns, as well as single and multiple windows of conservation. Although high scoring alignments were most common to other teleosts, a number of Danio rerio regions could be aligned across the vertebrate clade.
To compensate for the fact that alignments of long sequences to each other are intrinsically more likely to yield high scoring alignments, we made use of alignments to non-orthologous introns to create models representing operational null hypotheses (Fig. S13) relating the frequency of maximally scoring alignments to the total search space (see Fig. S15-S24 and additional files 6–15 for examples of alignments). Sequences from introns long in all teleosts were much more likely to diverge from these models (Fig. 7 and Fig. S14). This effect increased with minimum teleost intron length with significant local alignments to Danio rerio sequences against teleost and mammalian introns found in up to 60% and 20% of introns longer than 2000 bp.