PacBio, Illumina, and combined sequencing assembly statistics
DNA extracted from three soil samples collected from the Ürümqi Glacier No. 1 in the Tianshan Mountains at 2200, 2750 and 3700 m altitudes was sequenced using both PB and IL platforms. The general information for each sample, including metagenome size (Clean Data Length), number of contigs or scaffolds, GC content, and scaffold N90 values are listed in Table S1. The IL platform out-performed PB with respect to sensitivity, which was exemplified by the total number and length of the contigs (Fig. 2A-B). However, the PB read lengths were much longer, with N50 lengths ranging from 37,986 to 47,542 bp, and the longest single read was 607,831 bp. The PB reads were mostly 5000 to 100,000 bp long (92.63%). In contrast, the IL contigs from the three soil samples had N50 lengths of 709, 691, and 706 bp. The majority of the contigs (89.53%) were shorter than 1000 bp (Fig. 2C-D, Table S1).
The longer PB reads facilitated the profiling of longer genes. Thus, the libraries from the two platforms were combined and assembled, which achieved better sensitivity and integrity of the soil DNA (Fig. 2). Combining the two sequencing libraries increased the length distribution of the contigs with N50 lengths of 3913, 2702, and 2626 bp for the soil samples collected at high, medium, and low altitudes, respectively (Fig. 2C-D). Compared with IL, the number of contigs > 1000 bp in length increased by 17.14–25.67%, and compared with PB, PI generated more contig number and longer total contig length. The GC content of the PacBio reads (61.32–65.19%) was lower than that of the Illumina contigs (64.20–65.52%) (Table S1). This is because the short reads with higher GC content might be difficult to assemble, and the contigs contained many unassembled GC-rich reads. The GC contents of the PI ranged from 62.01–64.27%, harnessing thereby the unique advantages of the two sequencing methods.
In addition, the clean data in PB (6.62–10.90 Gb) are bigger than the one in IL (4.53–4.54 Gb) (Table S1). We normalized the contig number with clean data size and found that, using per unit of clean data (1 Gb), IL could generate 264,364 ± 9774 contigs/Gb and PB only generated 108 ± 24 contigs/Gb (Fig. S1A). This result again indicated that the IL platform is more sensitive than PB in the number of contigs. The PI has larger clean data, but the contig number of PI per unit of clean data assembled (21,512 ± 1645) is still less than IL performance (Fig. S1A).
Gene prediction from the assembled contigs
The assembly performance of the three different methods was further evaluated by examining the predicted genes. The Illumina platform performed better in terms of sensitivity compared to PB, as shown by gene numbers (Fig. 3A) and total gene lengths (Table S2), while PB sequencing gave a larger proportion of longer gene sequences (Fig. 3B, S2, Table S3), the proportion of gene sequences number longer than 1000 bp was 5.82% in PB, while IL was 1.86%. Only third-generation sequencing assemblies for the high-altitude sample resulted in a 3.18-fold (PB, medium altitude) increase in the number of genes (Fig. 3A). This showed that the PB method had relatively unstable sequencing results. According to the gene length distribution line graph (Fig. 3B, S2, Table S3), except the high-altitude sample, the PB and PI exceeded the IL assembly by 7.34- and 2.14-fold, respectively, in the number of genes with lengths ≥2000 bp. In addition, the number of genes ≥2 kb in the PI was 2142, compared to 2214 and 474 in the IL and PB assemblies, respectively.
The orthologous genes were analyzed to compare the common and unique genes predicted by the three assembly methods. The genes from the PB assembly accounted for only 43.60% of the total gene pool, while the genes from the IL and PI assemblies accounted for 99.62 and 92.27% of the total genes, respectively (Fig. 3C-D). The PI covered 82.06, 82.53 and 82.57% of the total gene set in the high, medium, and low altitude samples, respectively. The 151,923, 157,139, and 137,589 genes in the three samples were only revealed by the PI assembly (Fig. S3). However, the Venn diagram also pointed out genes that were lost from the PI assembly predicted by the contigs, resulting from assembly errors.
Annotation of the predicted genes
The predicted genes were annotated using the COG and KEGG [26,27,28] databases to evaluate the quality and quantity of the predicted proteins from the three sequencing/assembly strategies. As shown in Fig. 4A and Fig. S4, the functional gene distribution of the three assemblies were similar, i.e., they were all enriched in genes involved in energy production and conversion (C), transport and metabolism of amino acids (E), carbohydrates (G), coenzymes (H), Inorganics (P) and lipids (I), translation, ribosomal structure and biogenesis (J), transcription (K), and signal transduction (T) (Fig. 4A). However, the PI assembly showed stable functional gene numbers (PB: 31,772 ± 13,546 vs. IL: 975,330 ± 31,417 and PI: 171,836 ± 14,892), and in general, the number were higher in the IL and PI assemblies than in the PB assembly (Fig. S4).
Metabolic pathway analysis based on the KEGG database suggested that all three strategies covered most of the predicted Ko-ids (38.70%) (Fig. 4B, S5A-C). There were total of 537 pathways in the KEGG Pathways database, although PI assembly predicted 394 (366 ± 14) pathways, less than the 443 (438 ± 3) predicted pathways for IL, and 415 (373 ± 25) predicted pathways for PB assembly, PI assembly results was more stable than PB assembly (Fig. 4C). In addition, there were obvious differences in the number of genes that were mapped to the metabolic pathways. The PI assembly matched a higher number of genes than the PB assembly (PB: 16,687 ± 7090 vs. IL: 484,291 ± 20,103 and IL: 66,130 ± 6714) (Fig. 4D).
Evaluation of the different assembly techniques predicting natural product biosynthesis genes
The above observations demonstrate the advantage of including long-read data in retrieving long gene sequences from soil DNA samples, which is relatively difficult with the fragmented libraries resulting from the pyrosequencing platforms. The PI sequencing data obtained by correcting the wrong bases in the PB sequence frame and IL sequence is particularly advantageous for predicting the natural product biosynthesis core genes. Modular assembly enzymes, such as polyketide synthases (PKSs), nonribosomal peptide synthetases (NRPSs), and their hybrid enzymes catalyze the most important and diverse classes of natural products that can theoretically code for a nearly infinite diversity of unique structures [8]. The genes that encode PKSs and NRPSs are generally long and consequently pose a major challenge to metagenomic DNA sequencing.
A total of 707, 21,780, and 4758 predicted genes respectively, were found in the PB, IL, and PI assemblies, using HMM (protein profile Hidden Markov Model) searches of the conserved domains (ketosynthase domain for PKS; adenylation and condensation domains for NRPS) (Fig. 5A). The average value of the relative abundance of PKS and NRPS genes was (2.28 ± 2.09) × 10− 6, showing their low abundance and copy numbers. The small number of genes identified using PB data revealed its weakness in recovering low-abundance sequences from the metagenome. However, despite the fact that the total number was low, the quality of gene-carrying scaffolds generated by PB long reads was outstanding compared to IL, as exemplified by the average length (55,139 bp compared to 797 bp), and the number of BGCs (biosynthetic gene clusters) identified by antiSMASH (62 compared to 31). This also showed how difficult it is to obtain the full-length sequences of these long genes by assembling IL short reads into long contigs; instead, combining both PB reads and IL contigs is one solution to balance the sensitivity and gene integrity. As shown in Fig. 5B, the number and length of genes, as well as the gene-containing scaffolds, were significantly improved. The PI assembly generated 4351 gene-carrying scaffolds with an average length of 2.596 kb, within which 122 BGCs were identified by antiSMASH. This was much better than the IL-only contigs (21,777 contigs, average length 797 bp, 31 BGCs) and outcompete PB reads for the number of scaffolds (707 scaffolds).
Comparison of the microbial diversity
The Tianshan mountain range is a rich source of diverse bacterial and fungal communities. Comparing the three sequencing/assembly methods used in this study in general, the PI assembly was superior relative to the other two methods in both predicting the number of genera and the number of genes (Fig. 6). Based on the number of genera predicted by the three sequencing/assembly methods for the soil samples collected at the different altitudes, in all the assemblies, the high-altitude soil seemed to contain the most genera, followed by the low and medium altitude samples. The number of genera predicted from PI assembly showed a relatively stable trend (Fig. 6) that was higher than those observed in the other two assemblies. According to the Venn diagrams in Fig. 6, the numbers of genera that could only be predicted by PI assembly (433, 686, and 457) were larger than other two assembly methods (IL: 285, 186 and 332 and PB: 216, 87 and 178) indicating that the PI assembly was more sensitive detecting microbial diversity. In addition, in the abundance analysis (PI), we could find that the abundances of the lineages of “Bacteria; Proteobacteria; Alphaproteobacteria; Sphingomonadales (Order); Sphingomonadaceae (Family); Sphingomonas (Genus)” and “Bacteria; FCB group; Gemmatimonadetes; Gemmatimonadetes; Gemmatimonadales (Order); Gemmatimonadaceae (Family); Gemmatimonas (Genus)” are all in the top 7 (Fig. S8), indicated that they are widely distributed on Tianshan mountain area.
In order to know how IL sequencing contributed to PI assembly, we employed the tool MegaBLAST to analyze the differences in gene sequences between IL and PI. The top-score sequences were selected and 95% identity was applied as the threshold to clean the sequences before performing microbial annotation. As shown in Fig. S7, PI assembly could predict many unique genera, and most of the genera annotated by IL assembly were included in the PI genera. The unique genera were matched by few gene sequences, and majority of gene sequences belonged to the genera shared among the assemblies (90.50, 81.97 and 84.64% in high, medium and low altitudes, respectively). These results indicated that PI assembly included the sensitivity of IL assembly and became more sensitive to microbial annotation. Venn figures (Fig. 6, S7) suggested that some microbial information could be lost in both short- and long- metagenomic sequencing and this information will be restored by combining these data.
We also normalized the genera number with clean data size and found that the number of genera predicted by PI per unit clean data (1 Gb) are 106 in high altitude, 116 in medium altitude and 89 in low altitude, which are much smaller than IL (272, 166 and 271 in high, medium and low altitudes, respectively) (Fig. S1B). However, PI ultimately showed better performance and stability in species annotation than IL (Fig. 6). The phenomenon indicated that PI was obviously affected by the size of clean data in the process of standardization, but it is suitable for microbial diversity research.
The Unipept pipeline assigned the taxonomic identity on average of 39.8 ± 5.3%, of the NRPS, PKS, and hybrid genes. The majority belonged to Bacteria (93.9 ± 2.1%), and the remaining were from Eukaryota, Archaea, and Viruses (Fig.S8). The major taxonomic classifications in the PI with relative abundance of > 0.5% are represented in Fig. 7A and B shows the distribution at the phylum level in the three assemblies. The most abundant genera represented were Proteobacteria (32.6 ± 6.8%) and Actinobacteria (7.6 ± 3.8%). Indeed, soil-dwelling cultivable Actinobacteria and Proteobacteria, represented by Streptomyces and Pseudomonas spp., respectively, have been the most prolific sources of the bioactive natural products [8, 14, 30]. The taxonomic classification also highlighted the prokaryotic phyla Bacteroidetes (6.9 ± 4.7%), Acidobacteria (4.3 ± 0.8%), Chloroflexi (2.9 ± 1.4%), and Candidatus Rokubacteria (2.8 ± 1.2%) as well as the eukaryotic phylum Fungi (1.3 ± 0.4%) as the potential sources for natural product discovery.
The distribution of natural product (NP) biosynthesis genes at the phylum level varied slightly across the three assemblies (Fig. 7B). In the kingdom Bacteria, Bacteroidetes was enriched in the PB assembly (12.14%) compared to 3.0% (IL) and 5.7% (PI). In contrast, Actinobacteria and Proteobacteria were more abundant in the IL and PI assemblies than in the PB assembly.
Analysis of genes encoding polyketide synthases (PKSs) and non-ribosomal peptide synthases (NRPSs)
In general, PKSs and NRPSs share high levels of sequence identity among the conserved domains such as adenylation (AD) and condensation (C) domains in NRPSs and the ketosynthase (KS) domain in PKSs. In turn, a high degree of identity among these domains predicts that the corresponding biosynthetic gene clusters (BGCs) are involved in the biosynthesis of structurally-related small molecules. By extension, domain sequences with no close relatives might have arisen from BGCs that produce structurally novel classes of metabolites [31]. AD, C, and KS domains have been successfully used as sequence tags to identify and classify NRPS or PKS enzymes in previous studies [5, 13, 16, 18, 19, 32,33,34,35,36], and thus were used in this study to assess the novelty of NP biosynthesis in the sampled soils.
The orthoMCL analysis of the predicted proteins containing AD, C and/or KS domains in the three assemblies showed that most (86.6%) were covered by the PI assembly (Fig. 8A). To assign possible catalytic functions to these proteins, the predicted C and KS domain sequences were submitted to the web-based analysis platform NaPDoS (Natural Product Domain Seeker) [37]. Most of the C domains were classified as LCL domains (75.9%) that catalyze the formation of a peptide bond between two l-amino acids, followed by the DCL domains (13.6%) that are located immediately downstream of epimerization domains and thus catalyze the condensation reaction between a d- and an l- amino acid residue (Fig. 8B) [40]. Most KS domains belonged to the fatty acid synthase (FAS) class (54.7%), the modular class (18.2%), and the PKS-NRPS hybrid class (7.9%), while the iterative PKS class represented only 0.9% (Fig. 8C).
A similarity network was constructed using the amino acid sequences from the Tianshan metagenomes and 1651 reference AD, C, and KS domains, and MCL clustering was then used to identify clades of related nodes based on the sequence identity matrix. The resulting profiles of biosynthetic domain sequences from Tianshan mountain soil microbiomes revealed that most of the NRPSs and PKSs from the metagenomes (10,739 out of 12,567 domains) did not cluster with any known NP-encoding genes based on the network and MCL analyses, except for the well-known conserved fatty acid synthase (FAS) and acetyl-CoA ligase clades (Fig. 8D). This suggested the potential to discover novel classes of NP genes from soil DNA libraries. However, most of the predicted sequences were far from satisfactory to enable the recovery of the full biosynthetic cassettes in order to address novel product biosynthesis.
Two examples of the predicted biosynthetic gene clusters are shown in Fig. 9A for NRPSs and Fig. 9B for PKSs. They both originated from Pedobacter cryoconitis (Bacteroidetes). Although these clusters span 44.1 Kb and 29.6 Kb, respectively, they both had an incomplete border on one end (upstream in Fig. 9A and downstream in Fig. 9B). The gene cluster in Fig. 9A contains 16 open reading frames (ORFs) that encode the core enzymes (NRPS and NRPS-like) with nine complete modules of adenylation, condensation and peptidyl-carrier protein domains. There are seven modules that also include epimerization domains to convert l-amino acids to their d-isomers, which could then be linked to another l-amino acid residue by the corresponding condensation domains. Single genes for ornithine cyclodeaminase and cysteine synthase are located upstream in this gene cluster and may participate in the modification of the substrate amino acids or peptide products. The adenylation (AD) and condensation (C) domains in this gene cluster are similar to those in proteins that produce linear polypeptide intermediates such as gramicidin [41], surfactin [42], bacitracin [43] based on the similarity network in Fig. 8C. The Fig. 9B cluster (BGC2) is a hybrid of modular PKS and NRPK-like core enzyme and modification enzyme genes. Interestingly, none of the domains could be found in the similarity network in Fig. 8C, suggesting a high probability of a new product. This gene cluster has a heterocyst glycolipid synthase-like PKS (hglE-KS) [44, 45] that belongs to a group of assembly-line PKSs frequently found in heterocyst-forming cyanobacteria, and is involved in nitrogen fixation; however, the domain composition of this protein and the protein components of this gene cluster are very different than in the hgl cluster. Thus, BGC2 is likely to produce a glycolipid, starting with a reduced polyketide chain with hydroxyl groups catalyzed by a type I PKS, two ketoreductases, and a PKS-like protein. The two types of NRPS-like proteins that contain CDCL and CLCL domains can further link two amino acids to the product which may be processed by the two cysteine synthases. The intermediate product can be processed by the alpha/beta hydrolase targeting one of the carbonyl groups and two glycosyl transferases that likely attach glycosyl groups to the hydroxyl groups on the polyketide chain. A gene of a predicted regulator is located within the cluster, but it is transcribed in the opposite direction to most of the other genes in the cluster. Both BGC 1 and 2 were only covered in the PI (Fig. 9C). This further emphasized the advantage of combining data from both the IL and PB platforms.