Genome of extreme halophyte Puccinellia tenuiflora

Background Puccinellia tenuiflora, a forage grass, is considered a model halophyte given its strong tolerance for multiple stress conditions and its close genetic relationship with cereals. This halophyte has enormous values for improving our understanding of salinity tolerance mechanisms. The genetic information of P. tenuiflora also is a potential resource that can be used for improving the salinity tolerance of cereals. Results Here, we sequenced and assembled the P. tenuiflora genome (2n = 14) through the combined strategy of Illumina, PacBio, and 10× genomic technique. We generated 43.2× PacBio long reads, 123.87× 10× genomic reads, and 312.6× Illumina reads. Finally, we assembled 2638 scaffolds with a total size of 1.107 Gb, contig N50 of 117 kb, and scaffold N50 of 950 kb. We predicted 39,725 protein-coding genes, and identified 692 tRNAs, 68 rRNAs, 702 snRNAs, 1376 microRNAs, and 691 Mb transposable elements. Conclusions We deposited the genome sequence in NCBI and the Genome Warehouse in National Genomics Data Center. Our work may improve current understanding of plant salinity tolerance, and provides extensive genetic resources necessary for improving the salinity and drought tolerance of cereals.


Construction and content Evaluation of genome size
Taxonomy characteristics of Puccinellia tenuiflora are available at Flora of China (http://www.efloras.org/ florataxon.aspx?flora_id=2&taxon_id=200026128). We surveyed the chromosome number of P. tenuiflora according to Kato et al. [29]. Total genomic DNA was extracted from fresh leaves. We used the conventional method to estimate the P. tenuiflora genome size. Briefly, we generated 49 Gb of high-quality short-insert Illumina reads to analyze the K-mer frequency of distribution [30]. Genome size was calculated using the following formula: Genome size = total K-mer number /Kmer depth [30,31], in which K-mer depth is the peak value of K-mer distribution. The chromosome number of P. tenuiflora is 14 (Fig. 1). Our K-mer analysis showed that the genome size of extreme halophyte P. tenuiflora was 1.303 Gb (2n = 14) and the genome was complex, with 1.56% heterozygosity and 65.5% repeat content (Table 1).

Genome sequencing
Illumina paired-end (PE) libraries were constructed with short insert sizes of 250 and 450 bp. Illumina mate-pair (MP) libraries were constructed with insert sizes of 2, 5, and 10 k bp (Table 2). We generated 209.13 Gb of raw data by the PE libraries, and 197.38 Gb of raw data by the MP libraries. The Illumina libraries were sequenced on Illumina HiSeq XTen platform. We also sequenced 56.12 Gb of PacBio long reads and 161.03 Gb of 10× genomics barcoded reads (Table 2).

Genome assembly
Because the P. tenuiflora genome is highly complex and repeated, its genome was assembled by a combined strategy of PacBio (third-generation), 10× genomic technique, and Illumina Hiseq (second-generation). We generated 312.6× reads of Illumina, 43.2× read of PacBio and 123.87× reads of 10× genomic. First the PacBio sequences were corrected for errors. The accurate sequences of Pac-Bio were assembled into primary contigs based on FAL-CON (Branch 3.1) [32] and FALCON-Unzip software (https://github.com/PacificBiosciences/FALCON_unzip). After treatment with FALCON-Unzip software, we corrected errors of these contigs using PacBio sequences based on quiver software [33] and using Illumina data based on pilon software [34], and finally obtaining consensus sequences of high quality. Next, we used Illumina long reads of 2, 5, and 10 kb to elongate and combine the preassembled contigs into scaffolds based on SSPACE software [35], and then used 10× genomics linked-reads to further elongate and combine the scaffolds based on 10×  Table 1 Results of K-mer analysis. The K-mer was defined as 17 bp to assess P. tenuiflora genome size by the following formula: total K-mer number/K-mer depth. The heterozygous ratio was determined by the number of heterozygous K-mer/total K-mer number  (Table 3).

Annotation of replicate sequences
Transposable elements (TEs) of the P. tenuiflora genome were annotated. We used two methods to find the TEs. The first method was RepeatMasker (version 3.3.0) to discover TEs in an integrated known replicate sequence library (Repbase 15.02) and the de novo replicate sequence library constructed by RepeatModeler (Version 1.0.5) [36,37], RepeatScout [38], and LTR_FINDER [39]. The second method detected TEs in the P. tenuiflora genome using RepeatProteinMask by searching against the TE protein database [37]. We identified 691 Mb transposable elements (62.44% of the total sequence), including 580 Mb of LTR retrotransposons (52.43%) ( Table 4).

Annotation of protein-coding genes
A combined strategy (de novo-, homolog-, and RNAseq-based predictions) was used to annotate proteincoding genes in the P. tenuiflora genome using the following software: Augustus (version 3.0.2) [40,41], Genescan (version 1.0) [42], Geneid [43], GlimmerHMM (version 3.0.2) [44], and SNAP [45]. The homologous sequences of six species (Zea mays, Sorghum bicolor, Brachypodium distachyon, Setaria italica, Arabidopsis thaliana, and Oryza sativa) were aligned against the repeat-masked P. tenuiflora genome with TBLASTN (Evalue ≤10-5) [46], and then Genewise software 2.2.0 was used to predict the gene models [47]. Two strategies were used to assemble the RNA-seq reads to the unique transcripts. First, we mapped the RNA-seq reads to the P. tenuiflora genome with Tophat 2.0.8 [48] and Cufflinks 2.1.1 software [49] (http://cufflinks.cbcb.umd.edu/   ). Afterward, we used Trinity [50] to assemble the RNAseq reads, and then used PASA [51] (http://pasapipeline. github.io/) to improve the structure of the assembled genes. We generated non-redundant gene sets using EVidenceModeler (EVM) [52] via integrating gene prediction results of all methods. Finally, the predicted genes were filtered by three criteria: coding region length of ≤50 amino acids; FPKM < 5; and supported only by de novo strategy. Functions of the proteincoding genes were annotated by BLASTP program (best hit with E-value ≤1E-05) against three public protein databases: TrEMBL [53], Swiss-Prot, and NR. The protein domains were analyzed by InterProScan software (4.8) via searching against InterPro databases 29.0 [54], and the GO term information was collected from the Inter-Pro annotation results [55]. Moreover, we also conducted KEGG annotation for all genes [56]. On the basis of P. tenuiflora genomic sequences, we predicted 39,725 protein-coding genes (Tables 5). Of the 39,725 predicted protein-coding genes, the protein sequences of 39,470 genes (99.4%) were similar to sequences of known proteins and could be annotated ( Table 6). The average gene length was 2818.5 bp, and the average CDS length was 1082.0 bp. The average exon number per gene was 4.2, with an average exon length of 260.5 bp and average intron length of 550.8 bp ( Table 5).

Annotation of non-coding RNA
The tRNA genes were discovered with tRNAscan-SE software [57]. The rRNA, miRNA, and snRNA were predicted by INFERNAL software [58] against the Rfam database 9.1 [59]. We annotated non-coding RNA and identified 692 tRNAs, 68 rRNAs, 702 snRNAs, and 1376 microRNAs in the P. tenuiflora genome (Tables 4 and 7). The average lengths of microRNAs, tRNAs, rRNAs, and snRNAs were 124.89 bp, 75.27 bp, 207.79 bp, and 118.21 bp, respectively (Table 7). We deposited the genome sequence in the Genome Warehouse in National Genomics Data Center [60]. Table 5 General statistics for feature of predicted protein-coding genes of P. tenuiflora genome. Protein-coding genes were predicted through the annotation strategy of de novo prediction and evidence based on homology and transcriptome data. The gene model was integrated with EVM and corrected by PASA to obtain the final set of protein-coding genes

Assessment of genome quality
We assessed genome quality using the following methods: Burrow-Wheeler Aligner (BWA), Core Eukaryotic Genes Mapping Approach (CEGMA), and Benchmarking Universal Single-Copy Orthologs (BUSCO). First, in order to assess the quality of genome assembly, we aligned the high-quality Illumina short reads to the assembly using BWA (http://biobwa.sourceforge.net, parameters '-o 1 -i 15') [61]. According to BWA method, 87.41% of raw reads were mapped to the genome with 93.34% coverage (Table 8). Next, we used CEGMA and BUSCO to estimate completeness of the assembly. CEGMA is a set of conserved protein families for a wide range of eukaryotes, and is used to identify exon-intron structures of these conserved protein families in a new genomic sequence [62]. CEGMA analysis revealed 223 out of 248 ultraconserved eukaryotic genes (89.9%) in the P. tenuiflora genome indicating integrity for the core genes in the assembly (Table 9). Moreover, completeness of the assembly also was assessed using BUSCO [63] combined with TBLASTN [46], Augustus (version 3.0.2) [40,41], and HMMER (version 3.1b2) [64]. The BUSCO analysis showed that our assemblies contained 86.8% complete and 1.7% fragmented embryophyta orthologs, suggesting that the assembly quality was high (Table 10).

Description of database
The genome assembly of P. tenuiflora consisted of 14, 036 contigs with a total size of 1.095 Gb. Finally, we assembled 2638 scaffolds with a total size of 1.107 Gb, contig N50 of 117 kb, and scaffold N50 of 950 kb. On the basis of P. tenuiflora genomic sequences, we predicted 39,725 protein-coding genes, and identified 692 tRNAs, 68 rRNAs, 702 snRNAs, 1376 microRNAs, and 691 Mb transposable elements. We assessed the quality and completeness of the assembled genome through BWA, CEGMA mapping, and BUSCO mapping (Tables 8,9,10). The results showed that our assembly had high quality. All raw data for genome assembly are deposited at NCBI. The genome sequence is deposited in the Genome Warehouse in National Genomics Data Center (https://bigd.big.ac.cn/gwh) (accession number GWHABHL00000000).

Significance of database
Halophytes belong to several families and are distributed among multiple clades; this broad distribution pattern suggests that the salinity tolerance mechanisms of halophytes have evolved numerous times or have multiple origins [2]. As a result, halophytes not only exhibit a wide range of salinity tolerance but have also evolved diverse molecular and physiological mechanisms for salinity tolerance [2]. This diversity complicates discovery of the salinity tolerance mechanisms of halophytes. To date, almost all known molecular mechanisms of salinity tolerance were characterized in glycophytes such as rice,   wheat, and Arabidopsis [4][5][6]. Glycophytes only provide limited insights into mechanisms of salinity tolerance, and extreme halophytes may have enormous values for improving our understanding of salinity tolerance mechanisms. The genome sequence of extreme halophytes will unlock their molecular studies in salinity tolerance. The Gramineae is an important plant group because it includes many important food crops, such as rice, wheat, maize, and barley. P. tenuiflora, an extreme Gramineae halophyte, is closely related to barley and wheat. Zhang et al. (2013) reported that P. tenuiflora can grow normally for 6 days under 900 mM NaCl and survive at pH 11 [23]. Wang et al. (2006) found that P. tenuiflora survived 670 mmol/L NaCl [13]. A growing number of molecular biology studies have focused on this species owing to its strong salinity tolerance and high genetic value for cereal improvement [16][17][18][19][20][21][22][23][24][25][26][27][28]. In the present study, we sequenced and assembled the P. tenuiflora genome (2n = 14, size 1.107 Gb). Our work may improve current understanding of salinity tolerance and provides genetic resources for cereal improvement.