Here, we show that large insert MP sequencing is a versatile tool for analysing genomes at the structural level and for providing long-range information for genome scaffolding. Our results show that the addition of MP sequencing can dramatically increase the contiguity of mammalian genome references. In all analyses, insert sizes of >8 kb proved essential because of their ability to bridge the longer and more abundant LINE and LTR elements. The analysis of the fraction of long repeats spanned by each MP library shows that large insert MPs are capable of spanning ~90% of the annotated long repeats. The remaining ~10% of elements that could not be bridged by any of the MP reads can likely be explained by a highly repetitive nucleotide context around the repeat elements themselves. When a repeat element is surrounded by other repeats (mostly in centromeric or telomeric regions), one or both reads of a pair that would span such a region cannot be mapped uniquely to the genome and thus cannot be included in the analysis. In agreement with this, our data show that even a combination of all libraries in this study fails to span 4–5% of repetitive elements larger than 3 kb in size (Figure 2c and d). Because rat and other vertebrate genomes contain tens of thousands of repeat elements that exceed the routinely used paired-end insert sizes (up to 500 bp), including the very common LINE elements, we conclude that the inclusion of mate-pair libraries with insert sizes of 8 kb and above is instrumental for comprehensive reconstruction of genome structures.
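The spanning criterion used in this argument can be illustrated with a minimal Python sketch. It is not the pipeline used in this study: the function name `spanned_fraction`, the tuple layouts, and the toy coordinates are assumptions made for illustration only. A repeat is counted as spanned when at least one uniquely mapped pair has its left read ending at or before the repeat start and its right read starting at or after the repeat end.

```python
from collections import defaultdict


def spanned_fraction(repeats, pairs, min_repeat_len=3000):
    """Fraction of long repeats bridged by at least one read pair.

    repeats: iterable of (chrom, start, end) repeat annotations.
    pairs:   iterable of (chrom, left_end, right_start), i.e. the inner
             gap of each uniquely mapped read pair (hypothetical layout).
    Only repeats longer than min_repeat_len are considered.
    """
    gaps_by_chrom = defaultdict(list)
    for chrom, left_end, right_start in pairs:
        gaps_by_chrom[chrom].append((left_end, right_start))

    total = 0
    spanned = 0
    for chrom, start, end in repeats:
        if end - start <= min_repeat_len:
            continue  # skip short repeats
        total += 1
        # Spanned if the whole repeat lies inside the gap of some pair.
        if any(le <= start and rs >= end for le, rs in gaps_by_chrom[chrom]):
            spanned += 1
    return spanned / total if total else 0.0


# Toy data (made up): two long repeats on chr1, one short repeat on chr2.
toy_repeats = [("chr1", 10000, 16000), ("chr1", 50000, 54000), ("chr2", 1000, 1500)]
toy_pairs = [("chr1", 9000, 17000), ("chr1", 51000, 60000)]
toy_fraction = spanned_fraction(toy_repeats, toy_pairs)
```

In this toy example only the first chr1 repeat lies entirely within the gap of a pair, so half of the long repeats are spanned. The nested scan is quadratic and is meant for clarity; a genome-scale analysis would need an interval index.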
The largest insert libraries (20–25 kb) were instrumental for increasing the N50 of scaffolds to megabase levels. Because the draft rat genome is already of relatively high quality, the improvements presented here have only mild effects. However, we anticipate that large insert MP sequencing will be very useful for finalizing low-pass capillary-sequenced or NGS-based genomes, such as those of most primates and many other vertebrates. Genomes with large fractions or large segments of repeats, like those of the zebrafish or certain plants, might benefit even more from large insert mate-pair data because they combine a very high repeat content with recently duplicated sequences. Furthermore, most ongoing genome sequencing projects employ next-generation sequencing techniques, and because de novo genome assembly based on short reads is still in its infancy, contig sizes for vertebrate genomes are typically in the kilobase range [19, 28, 29]. Although paired-end data with insert sizes up to 500 bp are now commonly included in these processes, our results demonstrate that longer-range information, as provided by the large insert MPs described here, is essential for comprehensive genome assembly. It should be stressed that the structure of every genome of interest is unique and variable in complexity. Therefore, the optimal combination of MP insert sizes will vary as well. A quick examination of the repeat size distribution could aid in determining which combination of MP insert sizes is expected to be optimal, but experimental optimization or a broad range of libraries such as that used here might be required.
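For readers less familiar with the N50 statistic referred to above, a minimal Python sketch of its usual definition follows: the length of the scaffold at which the cumulative length, counted from the longest scaffold downwards, first reaches half of the total assembly length. The function name `n50` and the example lengths are illustrative assumptions, not values from this study.

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold (or contig) lengths.

    Scaffolds are sorted from longest to shortest; the N50 is the length
    of the scaffold at which the running sum first reaches half of the
    total length. Returns 0 for an empty collection.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    return 0


# Made-up scaffold lengths (arbitrary units) for illustration.
example_n50 = n50([2, 2, 2, 3, 3, 4, 8, 8])
```

Here the total length is 32, and the two longest scaffolds (8 + 8) already reach half of it, so the N50 is 8. Joining scaffolds with long-range pairs shifts length into fewer, longer sequences, which is why the statistic rises when large insert libraries are added.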
In the analyses presented here, we focused on the application of large insert MPs for genome sequencing efforts, but the findings could be extrapolated to the detection of structural variation (SV). Previous analyses of whole human genomes have shown that SVs affect more base pairs than single point mutations, yet the field has struggled to find a suitable approach for comprehensive detection of such events. Hillmer et al. concluded that the optimal insert size for SV detection is approximately 10 kb, although a thorough examination of the value of insert sizes above 10 kb was not described. In unravelling the structure and organization of ultra-complex clustered mutation events, like the recently described chromothripsis, larger insert sizes (20–25 kb) may extend the detection limit and help to complete the overall picture [31–34]. It should be noted, however, that a “mate-pair only” approach also comes with disadvantages: small insertions, inversions, duplications, and deletions may be missed because of the broad insert size distribution and the relatively low coverage at the base level.
Large insert MP sequencing represents a good alternative to the more traditional bacterial artificial chromosome end sequencing because the sequencing libraries can be produced by relatively simple and scalable procedures without the need for laborious cloning and colony picking. Furthermore, the protocol can be fitted to all existing NGS platforms by changing the oligonucleotide adapters that are used. The mate-pair library construction protocol is relatively laborious compared to standard fragment library construction protocols, but with the latest improvements of the mate-pair protocol (SOLiD 5500 version), the procedure takes ~14 hours of hands-on work. More importantly, the robustness of the protocol has been increased and the required input genomic DNA has been reduced to only 1–5 μg for a standard ≤3-kb library, compared to 5–20 μg for the SOLiD V4 protocol (Additional file 9). The removal of column-based clean-up steps and the increased circularization efficiency (via the implementation of intra-molecular hybridization instead of circularization to an internal adaptor) are the main factors that allow for a reduced amount of input DNA. Nevertheless, our results show that limiting the amount of input DNA can strongly affect the complexity of the resulting library. For larger insert libraries, it is therefore recommended to start with as much DNA as possible (>20 μg).
Although large insert MP libraries must be sufficiently complex, high physical genome-wide coverage is readily obtained at a relatively low sequencing depth of tens of millions of read pairs. Alternative large insert approaches, like fosmid di-tag sequencing, have been documented to suffer from low library complexity, which may be overcome by using larger amounts of input material, but they have the additional disadvantage of being restricted to a fixed insert size of approximately 40 kb [16, 20, 35, 36]. Our data clearly demonstrate the added value of medium-sized insert libraries for genome structure analysis, a conclusion that is supported by Hampton et al., who had to use supporting 4–6 kb mate-pair data to obtain essential long-range information that could not be obtained by fosmid di-tags alone. Using the MP protocol presented here, small, medium and large insert MP libraries can be generated in one go. Nevertheless, we did not generate libraries equal in size to 40-kb fosmid clones, so we could not determine whether inserts of 25 kb are sufficient to fully replace 40-kb fosmid clones or whether 40-kb pairs would span the last 4–5% of repeats that could not be covered by any of the MPs used here.
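The distinction between physical coverage and base-level coverage made above can be illustrated with a small back-of-the-envelope calculation in Python. All numbers are assumptions chosen for illustration (20 million pairs, 10-kb inserts, 50-bp reads, a genome of about 2.75 Gb) and are not measurements from this study; the function names are likewise hypothetical.

```python
def physical_coverage(n_pairs, insert_size_bp, genome_size_bp):
    """Average number of inserts (pair spans) covering each genome position."""
    return n_pairs * insert_size_bp / genome_size_bp


def sequence_coverage(n_pairs, read_length_bp, genome_size_bp):
    """Average number of sequenced bases covering each genome position."""
    return n_pairs * 2 * read_length_bp / genome_size_bp


# Assumed, illustrative values only.
N_PAIRS = 20_000_000        # uniquely mapped read pairs
INSERT_SIZE = 10_000        # bp between the outer ends of a pair
READ_LENGTH = 50            # bp per read
GENOME_SIZE = 2_750_000_000  # bp, rough size of a mammalian genome

phys = physical_coverage(N_PAIRS, INSERT_SIZE, GENOME_SIZE)
seq = sequence_coverage(N_PAIRS, READ_LENGTH, GENOME_SIZE)
print(f"physical coverage: {phys:.1f}x, sequence coverage: {seq:.2f}x")
```

Under these assumptions the physical coverage is roughly 73-fold while the sequence coverage stays below 1-fold, which illustrates both why modest sequencing depth suffices for scaffolding and why small variants may be missed at the base level.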