Plant genomes encode variable sizes of VQs
To better understand the general profile of VQs in the low and higher plant kingdom, we identified VQs in 50 completely sequenced genomes from algae, moss, gymnosperm and angiosperm by Hidden Markov Model (HMM) and BLASTP searches (Methods). All these genome sequencing data have been published and related information was listed in Additional file 1: Table S1. The searches showed that all 6 genomes from Chlorophyta encode no VQs, indicating the possible absence of this gene family in green algae. We then downloaded protein databases from 12 other algae genomes including Porphyridium purpureum (http://cyanophora.rutgers.edu/porphyridium/), Cyanidioschyzon merolae (http://merolae.biol.s.u-tokyo.ac.jp/), Bigelowiella natans, Ectocarpus siliculosus, Galdieria sulphuraria, Gracilaria chilensis, Gracilariopsis lemaneiformis, Guillardia theta, Phaeodactylum tricornutum, Porphyra pulchra, Thalassiosira oceanica, and Thalassiosira pseudonana from the NCBI database (http://www.ncbi.nlm.nih.gov/genome/). No VQs were detected by the HMM searches against these 12 genomes. Thus, the VQ family is not necessary for these Chlorophyta genomes. We then focused on the remaining 44 plant genomes. These genomes encode variable sizes of VQs ranging from 7 to 74 (Glycine max) (Fig. 1). The locus names, physical positions and protein sequences were listed in Additional file 2: Table S2. We also downloaded the Marchantia polymorpha protein sequences from the NCBI database (https://www.ncbi.nlm.nih.gov/genome/?term=Marchantia+polymorpha) and then carried out the HMM and BLASTP searches. We have detected 6 VQs in the liverwort genome. Thus, the VQ family is presented in liverwort, moss, gymnosperm and angiosperm but not in Chlorophyta.
VQ motif coding genes were also detected in non-plant species
Except for Chlorophyta, VQs could be detected in all tested plants. To our knowledge, no data were reported on the presence of VQs in non-plant genomes. We have carried out a genome-wide identification of VQs in 43 nematode genomes by the HMM searches. These genomes were from 43 species including 10 free-living, 10 human parasitic, 13 animal parasitic (2 for entomopathogenic) and 10 plant parasitic nematodes (Additional file 3: Table S3). No VQs were detected in 10 human parasitic nematodes. Totally, we have identified 14 VQs in 11 out of 42 genomes (Additional file 3: Table S3). Most of them contained only partial VQ motifs. These 11 genomes were from 5 free-living, 1 animal parasitic, and 5 plant parasitic species. All the 5 free-living and 1 animal parasitic nematodes encoded only one VQ in each species. For the plant parasitic nematodes, 2 of them encoded only one VQ and the remaining 3 species encoded two VQs in each genome. On the other hand, a total of 249 fungus species, whose genomes were completely sequences, were used for the genome-wide identification of the VQ family. Their annotated protein sequences were downloaded from the EnsembleGenomes database (http://ensemblgenomes.org/info/access/ftp) and were then submitted to the HMM searches. We have identified a total of 34 VQs from 29 fungus species and 5 of the species have 2 VQs in each genome. Similar to the nematode VQs, most of these VQs encoded only partial VQ motif sequences. VQ motif sequences from the nematodes and fungi were aligned by Clustal X2 (http://www.clustal.org/clustal2/) and we found that only two amino acid residues “VQ” were conserved among all VQ motifs (Fig. 2). The aligned sequences were then used for phylogenetic analysis and the constructed tree was shown in Fig. 2. In the VQ motif-based tree, not all VQs from the nematodes were clustered together and some of them were grouped with the VQs from fungi (Fig. 2). The results showed the inconsistence between the evolution of VQs and their species divergence. Besides nematodes and fungi, we have identified several putative VQs from some bacterial species but no VQs were identified from viruses. Protein database from a total of 50 bacterial genomes were submitted to the HMM searches. As a result, 8 VQs were identified from 8 bacterial species including Aeromonas diversa, Enterobacter sp. BWH64, Lactobacillus camelliae, Lactobacillus manihotivorans, Lactobacillus paracasei, Leptospira terpstrae, Phaeospirillum molischianum and Weissella oryzae. Their NCBI accession numbers, VQ motif positions and their sequences were listed in Additional file 4: Table S4. Thus, our data showed that VQs could be detected not only in plants but also in nematodes, fungi and bacteria.
The VQ family in plants could be clustered into 5 distinct groups
Previous studies showed that the VQ family was classified into 4–9 groups depending on different species and methods used [2, 4, 6, 19]. However, the classification was based on VQs from single or several species. Due to the difficulty in sequence alignment, we selected VQ motif sequences from 12 species including 5 monocotyledons (monocots), 5 eudicotyledons (dicots), 1 gymnosperm and 1 moss for phylogenetic analysis (see Methods). Based on the phylogenetic tree, the VQ family could be classified into 5 distinct groups and they were named as Group 1, 2, 3, 4 and 5 (Fig. 3a and Additional file 5: Figure S1). These 5 groups of VQs were presented in all detected species from angiosperm, gymnosperm and moss. Each group of VQs underwent different evolutionary histories and evolved into different subgroups, which were clustered in different species. While all groups showed the common and conserved motif structure FxxhVQxhTG (the conserved residues F, V, Q, T and G were labelled with green fonts in Fig. 3b), each group differentiates from the remaining groups with 1–4 additional conserved residences (Fig. 3b). For example, in Group 1, the residences T (cyan), V (black) and D (yellow) were conserved; however, they might not be conserved in other groups (Fig. 3b). Interestingly, not all VQs contain the conserved residues VQ. Some of VQs encode VH instead of VQ, which were detected only in Group 2 of VQs (Fig. 3b). These VQs with conserved VH were presented only in monocot plants and no further evidence is available for why they have evolved into different residence.
Both monocots and dicots exhibit difference in their expansion histories
As no VQs were detected in all tested Chlorophyta species, we surveyed the evolutionary history of the VQ family among moss, gymnosperm and angiosperm. The phylogenetic tree was broken down into ancestral units [21] to estimate the most recent common ancestor (MRCA). Due to possible inaccurateness in estimation of lost genes and pseudogenes, they were excluded in this analysis, which might under-estimate the MRCA members. A total of 5 ancestral units were estimated among moss, gymnosperm and angiosperm; thus, their MRCA might encode only 5 VQs (blue filled circles in Fig. 3a and c). No expansion occurred before the divergence of seed plants (red filled circles). After the origin of seed plants, nearly double VQs (9) were required in the MRCA of gymnosperm and angiosperm (pink filled stars). MRCAs of dicots and monocots required different numbers of VQs and they evolved into 14 and 20 VQs, respectively, from 9 VQs. For dicots, a large-scale of VQ expansion was detected mainly during species divergence. However, for monocots, large-scale of VQ expansion occurred during the divergence between dicots and monocots followed by species divergence.
Segmental duplication significantly contributes to the family expansion
To explore the possible mechanisms of VQ expansion, we first surveyed the contribution of tandem duplication to the family expansion. In Oryza sativa, VQs were localized on all 12 rice chromosomes with uneven distribution. We detected a total of three tandem clusters covering 7 VQs (Fig. 4a), which was also observed in most of other rice species belonging to the genus Oryza (Additional file 6: Figure S2). The fact suggested that these tandem duplication events occurred before the divergence of these rice species and some of them lost in some species during long evolutionary history. Interestingly, the tandem duplication could be detected only in the genus Oryza but was not in other four grass species. Thus, all these three tandem duplication events occurred after the divergence of rice genus from other grasses but before the presence of various rice species. Furthermore, we have also detected additional three tandem duplication events which occurred only in one species. One was observed on chromosome 1 in the species O. meridionalis and another was on chromosome 9 for the species O. punctate (Additional file 6: Figure S2). The fact suggests that these duplication events occurred after species divergence from the Oryza genus and were species specific and the duplicated VQs might take part in species divergence. We then analysed the contribution of mobile elements to the family expansion. We examined the contribution of various mobile elements to VQ expansion including LTR-retrotransposon, retrogene, MULE, CACTA, and hobo/Ac/Tam3 (hAT) and Helitron elements. We found that one VQ gene 04 g57030 (blue fonts in Fig. 4a) was related to one of rice MULEs and another VQ gene 09 g20020 (green fonts in Fig. 4a) is a retrogene. Thus, limited contribution of mobile elements was detected to the VQ family expansion.
As only 9 rice VQs were related to tandem duplication or mobile elements, we further investigated the contribution of segmental duplication to the family expansion. We found that a total of 20 VQs (50%) were located on segmental duplication fragments (locus names with pink fonts, Fig. 4a). The result suggested the significant contribution of segmental duplication to the family expansion. These VQs might be segmentally duplicated in multiple historic periods as some of their orthologs could be detected in not only dicots but also monocots including different rice species; however, some of the orthologs were observed only in the rice species. To survey the contribution of segmental duplication to VQ expansion in other species, we also identified segmentally duplicated VQs in all 12 species which were used for the phylogenetic tree construction (Fig. 3a). Our data showed that 44.1–100% of VQs were located on segmentally duplicated regions in these 12 species (Fig. 4b). The results suggested that segmental duplication should be regarded as the main mechanism for VQ gene expansion.
Limited contribution of gene conversion to the evolution of the VQ family
Duplicated genes in a family provide possible DNA fragments for gene conversion, which might contribute to the evolution of the gene family. We implemented the program GENECONV version 1.81 [22] to detect the possible gene conversion events in this gene family. In the rice genome, genome-wide gene conversion events have been reported [23] and no VQ motif coding genes were identified to be involved in the gene conversion events. We implemented the GENECONV program using aligned VQ motif coding sequences from the rice genome and confirmed that no gene conversion events were detected in the rice VQ gene family. Additionally, we further detected the gene conversion events in additional 11 species as listed in this study and found that a total of 13 gene conversion events were detected (Additional file 7: Table S5). These gene conversion events were from 5 species including Brachypodium distachyon (2 events), Brassica rapa (4 events), Glycine max (2 events), Selaginella moellendorffii (3 events) and Zea mays (2 events). The results suggested the limited contribution of gene conversion to the family evolution in these 5 species. For the remaining 6 species, no gene conversion events were detected in the VQ gene family.
Positive selection occurred during species divergence
Our data showed that segmental duplication was a main driver for the family expansion followed by tandem duplication (Fig. 4). Thus, we investigated their selection forces after segmental duplication. Both VQ motif coding region and the full-length VQs were separately subjected to nonsynonymous substitutions per site (Ka) and synonymous substitutions per site (Ks) and their ratio (Ka/Ks) evaluation (Fig. 5a). For each species, the Ka/Ks ratio at VQ motif coding region was significantly lower than that from the full-length VQ, suggesting the functional conservation of VQ motifs during long evolution. For both regions, no Ka/Ks ratio was larger than 1, suggesting a purifying selection for these segmentally duplicated genes in all species. We further examined whether these segmentally duplicated genes were under functional constraint by C-value test [24]. Generally, VQ motif coding regions for all segmentally duplicated genes were under functional constraints. However, functional divergence might have occurred when the full-length VQs were used for Ka/Ks estimation followed by C-value test although they were under purifying selection.
To survey the selection forces of this gene family among species, we calculated the Ka/Ks ratios between orthologous VQ pairs. We have identified a total of 339 orthologous VQs among 10 rice species (Additional file 8: Table S6) that belong to the Oryza genus. These form 1169 pairs of orthologous VQs. We submitted these pairs of VQs for Ka/Ks analysis. Our analysis showed that most of Ka/Ks ratios were less than 1, indicating the purifying selection during and after species divergence. However, we have detected at least 6 pairs of VQs (0.5%) showing Ka/Ks > 1 (Fig. 5b and c). These 6 pairs of VQs were from 2 orthologous loci as shown in Fig. 5b and c, respectively. One of them consists of 6 VQ members (Fig. 5b) and another locus contains only two VQs (Fig. 5c). For the 6 VQ members (Fig. 5b), we further investigated the selection force in each amino acid site using the SLR program (see Methods). Totally, we detected 8 amino acid sites (highlighted by blue colour) with Ka/Ks > > 1 with strong statistical supporting (Adj.Pval < 0.05) (Fig. 5d). Two of them were located on VQ motif region. The analysis reveals that strong positive selection occurred not only in non-VQ motif region but also in conserved VQ region. Thus, positive selection might significantly contribute to the family evolution and neofunctionalization as well as species divergence.
VQs were expressed in various tissues and frequently regulated by abiotic / biotic stresses and phytohormones
Among 42 rice VQs, 7 of them (16.7%) showed no detectable expression (Fig. 6a and Additional file 9: Figure S3a). The remaining 35 VQs were expressed in either one or multiple tissues. Among them 10 genes (23.8%) were expressed only in vegetative tissue (T1 to T4). Statistical analysis showed that other 7 genes out of the 35 rice VQs were preferentially expressed in vegetative tissues. Their average express abundance in four vegetative tissues (T1-T4) was at least two times’ higher than the expression level in T5 to T11. The remaining 18 VQs showed expression signals in multiple tissues. Generally, around half of the rice VQs were preferentially expressed in vegetative tissues. On the contrary, higher percentage of Arabidopsis VQs showed reproductive tissue preferred expression patterns (Additional file 10: Figure S4a).
Expression data also showed that rice VQs were frequently regulated by phytohormones, biotic and abiotic stresses (Fig. 6b, Additional file 9: Figure S3b-d). Under ABA treatment, 40.5% of rice VQs were up- or down-regulated. Similarly, 47.6% of VQs were regulated by JA. Under both bacterial and fungus pathogen treatments, 31–78.6 of VQs were up- or down regulated. Under drought and high salinity stresses, 59.5 and 31% of rice VQs were either down- or up-regulated, respectively. In Arabidopsis, VQs were frequently regulated under JA treatment. However, less VQs were down- or up-regulated either by other phytohormones or under various abiotic and biotic stresses in Arabidopsis (Additional file 10: Figure S4b).
Some of VQs were co-expressed with abiotic and biotic stress-related genes
We identified 10 out of 40 VQs with co-expression modules in at least 3 out 14 datasets. A total of 2526 genes were identified to co-express with these 10 VQs (Fig. 6c). The VQ gene 07 g48710 was co-expressed with only 29 genes while a total of 403 genes were found to co-express with 06 g41450. Co-expressed genes with each of these 10 VQs were separately submitted to rice Gene Ontology (GO) database for gene set enrichment analysis (GSEA; [25]). No over-represented genes were detected for two VQs 02 g07690 and 07 g48710. For the remaining 8 VQs, 3 VQs were detected with over-represented cellular component (C) (Additional file 11: Figure S5a). All 8 VQs were found with over-represented molecular function (F) (Additional file 11: Figure S5a). Interestingly, most of over-represented biological processes are responses to biotic or abiotic stress / stimulus, suggesting the roles of VQs in stress-related signalling pathways.
We are interested in co-expressed genes encoding transcription factors. We have detected 136 co-expressed genes encoding 20 families of transcription factors (Additional file 11: Figure S5b). For most of the families, less than 10 family members were shown to co-express with VQs. We detected a total of 5 transcription factor families, which consist of at least 15 co-expressed genes (Additional file 11: Figure S5b). These candidate transcription factors will be investigated in their interaction with VQs (see below).
Expression profiling of VQs from nematodes and fungi
Previously, no VQ motif coding genes was identified in nematodes and fungi. Here, we identified 14 nematode and 34 fungus VQs. These 14 nematode VQs were from 11 species and publicly available RNA-Seq expression data were collected from 5 species including Caenorhabditis brenneri, Caenorhabditis briggsae, Caenorhabditis elegans, Caenorhabditis japonica and Caenorhabditis remanei (Fig. 7a). RNA-Seq analysis showed that all these 5 VQs were expressed in either male or female with the lowest expression level in C. brenneri and the highest expression level in C. briggsae (Fig. 7a). In C. elegans, more expression data were available and we analysed the expression profiling of the VQ gene F23H12.2 among 17 different developmental stages of tissues (Fig. 7b). The analysis showed that the VQ gene was expressed in all tested tissues with varying expression abundance (Fig. 7b). The lowest expression level was detected the L1 larva and the highest abundance was detected in gastrulating embryo.
On the other hand, we have identified 34 VQs in fungi and all of them were annotated proteins. To verify the annotation, one of the fungus VQs were selected, which was from Gloeophyllum trabeum with the NCBI accession number XM_007869238.1. The genomic DNA was isolated from the species and was used as template to amplify the fragment by PCR. The PCR result (Fig. 7c) and sequencing data confirmed that the annotated gene was from the genome and its deduced protein sequence indeed contained the VQ motif structure. RT-PCR analysis showed that the full-length coding region was transcribed in the fungus strain (Fig. 7d). In addition, we further analysed the expression profiling by using RNA-Seq data. The specie was used to colonize wood wafer and 3 wood sections (0–5, 15–20, 30–35 mm) were sampled for RNA-Seq, representing early to late decay stages. The VQ gene was expressed in all 3 wood sections with similar expression abundance (Fig. 7e). All the experiments not only verified the annotated VQ gene but also provided its expression evidence, which further confirmed its presence in the genome.