The structure, functional evolution, and evolutionary trajectories of the H+-PPase gene family in plants

Background The H+-PPase (pyrophosphatase) gene family is an important class of proton transporters that play key roles in plant development and stress resistance. Although the physiological and biochemical functions of H+-PPases are well characterized, the structural evolution and functional differentiation of this gene family remain unclear. Results We identified 124 H+-PPase members from 27 plant species using complete genomic data obtained from algae to angiosperms. We found that all analyzed plants carried H+-PPase genes, and members were not limited to the two main types (type I and II). Differentiation of this gene family occurred early in evolutionary history, probably prior to the emergence of algae. The type I and II H+-PPase genes were retained during the subsequent evolution of higher plants, and their copy numbers increased rapidly in some angiosperms following whole-genome duplication (WGD) events, with obvious expression pattern differentiation among the new copies. We found significant functional divergence between type I and II H+-PPase genes, with both showing evidence for positive selection pressure. We classified angiosperm type I H+-PPases into subtypes Ia and non-Ia, which probably differentiated at an early stage of angiosperm evolution. Compared with non-Ia subtype, the Ia subtype appears to confer some advantage in angiosperms, as it is highly conserved and abundantly expressed, but shows no evidence for positive selection. Conclusions We hypothesized that there were many types of H+-PPase genes in the plant ancestral genome, and that different plant groups retained different types of these genes. In the early stages of angiosperm evolution, the type I H+-PPase genes differentiated into various subtypes. In addition, the expression pattern varied not only among genes of different types or subtypes, but also among copies of the same subtype. Based on the expression patterns and copy numbers of H+-PPase genes in higher plants, we propose two possible evolutionary trajectories for this gene family.

Most research on plant H + -PPases has focused on the type I H + -PPases, which are located on the vacuolar membrane and are thus known as vacuolar proton pyrophosphatases (V-PPase); e.g., Arabidopsis AVP1 (At1g15690) [6][7][8]. In plant cells, type I H + -PPases obtain energy through the hydrolysis of PPi to transport protons across the vacuole membrane, and adjust the pH in the vacuole and cytoplasm [9]. Type I proteins are widely involved in metabolic processes such as the enrichment of metal ions in the vacuole [10] and hormone and nutrient transfer [9]. Overexpression of type I H + -PPase genes can significantly enhance the ability of plants to cope with abiotic stresses, such as anoxia or chilling [11], lack of nutrition [12], drought, and high salt levels [13,14]. This can also promote plant vegetative growth and produce plants with large biomass [9,13]. There are differences in the copy number of H + -PPase genes in different plants. Different members of this gene family may have specific expression in different tissues, organs, or during different developmental stages, but there is currently no compelling evidence to support this [10]. Both type I and type II H + -PPases have the same active site but have significant differences in subcellular localization and expression levels. For example, Arabidopsis AVP2 (At1g78920, a type II H + -PPase) is located in the Golgi apparatus, and its expression level is much lower than that of type I H + -PPases [15].
The structural evolution and functional differentiation of this gene family have not been reported systematically. With the availability of increasing numbers of plant genomes and the continuous improvement of the available protein tertiary structure model [16,17], we currently have the ability to study the H + -PPase gene family from a wider perspective. In the present study, we selected 27 plant species with different taxonomic relationships to identify and study the structure of H + -PPase gene family members at the whole genome level. The evolutionary relationships and expression patterns of different members of this gene family were investigated. Further we performed functional diversity analysis and positive selection analysis to explore the evolution of their structure and function. Based on the research results, we provide a theoretical basis for further research on the function of H + -PPase genes in plants.

Cross-species distribution of H + -PPase genes in plants
Twenty-seven plants with relatively complete genome annotations were selected for the identification of H + -PPase gene family members. HMMER v 3.1 [18] was used to search for candidate genes in complete protein sequence data of different species (hidden Markov model number: PF03030). After identification and filtering, 124 H + -PPase gene family members were identified ( Table 1, Additional file 1). All plant species evaluated in the present study contained at least one member of the H + -PPase gene family. No algae contained more than three of these genes, and many contained only one H + -PPase gene (e.g., Cyanidioschyzon merolae, Dunaliella salina, Chlamydomonas reinhardtii, Volvox carteri) ( Table 1, Additional file 1). In contrast, the angiosperm species had several H + -PPase genes, with the eudicot upland cotton (Gossypium hirsutum), which reunited the A-and D-genomes in recent history [19], having as many as 16 H + -PPase genes. In the monocots, with ten genes, maize (Zea mays) had second highest number of H + -PPase genes. With just two H + -PPase genes, the magnoliid columbine (Aquilegia caerulea) was the angiosperm with the fewest H + -PPase genes. We observed that the expansion of the H + -PPase gene family was concentrated in the angiosperms.

Phylogenetic analysis of the plant H + -PPase gene family members
To map the phylogenetic relationships between 124 H + -PPase gene family members, two multiple alignment methods (ClustalW [20], MUSCLE [21,22]) and three phylogenetic inference methods (neighbor-joining, NJ; maximum likelihood, ML; minimum evolution, ME) were employed. In addition, the H + -PPase domain sequence and the full-length sequence were also analyzed separately. All resulting phylogenetic trees had similar topologies (Additional file 2). Considering the calculation time, the bootstrap value, and the subsequent analysis needs, the MUSCLE aligned full-length sequence and the NJ method were selected for further analysis. Among the plant H + -PPase gene family members identified in the present study ( Fig. 1a), only estExt_Genewise_ext.C_Chr_10614 in Ostreococcus lucimarinus was on an independent evolutionary branch. The other 123 members of the H + -PPase gene family belonged to type I or type II branches. The type I H + -PPase gene subgroup was the largest, and accounted for 69.4% of the genes observed, while type II genes accounted for the remaining 29.8%. This may be due to the greater demand for type I H + -PPase gene expression in plants, which contributed to the accumulation of these gene copies.
In the angiosperm branches of type I and type II genes, a large number of branch nodes had low bootstrap values. This phenomenon may be the result from few overall differences between the members on the related branches (Additional file 2, Fig. 1a). Among them, the type II H + -PPase protein members from the same species belonged to closely related branches, while in the type I group, the opposite was true. Among the type I H + -PPases, a large cluster from one branch of angiosperm type I genes with a high bootstrap value (the red background area in Fig. 1a) had structural differences with members on other branches, and the protein sequences of the internal members of this branch had very high similarity. In order to study the possible differences among type I H + -PPase genes, this group was classified as subtype Ia, and the remaining type I genes were classified as subtype non-Ia.
In addition, we explored the positions of 124 genes in the background tree (Including 323 seed sequences from database Pfam 32.0, Additional file 4). We found that Fig. 1 Phylogenetic evolutionary tree, protein motifs, and gene structures of H + -PPase gene family members. a A neighbor-joining (NJ) phylogenetic tree was constructed using the full-length sequence alignments of 124 H + -PPase genes identified using MUSCLE in MEGAX. Bootstrap supports are indicated by the color of the branches. OTUs are labeled as follows: red algae (red); Mamiellophyceae (dark blue); Chlorophyceae (light blue); Bryophytes (light green); Ferns (dark green); Angiosperm (black). Color blocks denote subtypes in angiosperms, with type Ia (red), type non-Ia (orange), and type II (green) denoted. b Motifs of the H + -PPase proteins. The rectangles indicate the length and positions of motifs. The different colors indicate 15 motifs (left panel). The sequence logo for each motif is shown in Additional file 3. c Gene structures of the H + -PPase genes. The lengths of rectangles and lines are scaled according to mRNA length. CDSs (green rectangles), UTRs (yellow rectangles), and introns (black line) are denoted type I and type II genes are located in the branches where eukaryotes are abundant, and only estExt_Gene-wise_ext.C_Chr_10614 in O. lucimarinus was located distantly from the eukaryotes. This indicates that estExt_ Genewise_ext.C_Chr_10614 may therefore represent an H + -PPase gene other than type I and type II.

Structural differentiation of the plant H + -PPase gene family members
The mRNA sequence length of the H + -PPase genes varied from 2246 bp (e_gwEuk.1.151.1 in O. lucimarinus) to 16, 779 bp (VIT_09s0054g00700 in grapes), and family members within the same cluster had similar genetic structures. Type I H + -PPase genes had relatively fewer exons that were nonetheless longer than those in type II H + -PPase genes. The full-length mRNA sequences of type II members were longer than those of type I, and their exons were more often interrupted by introns (Fig. 1c). This phenomenon not only confirms that the two types of members have experienced different evolutionary processes, but also may be one of the reasons for the low expression of type II members.
There were also significant differences in the amino acid sequences of the proteins encoded by the plant H + -PPase genes, the shortest of which containing 625 amino acids (Gohir.D10G001600 in cotton, with deletion of the first helix), and the longest containing 853 amino acids (AMTR_s00025p00194920 in Amborella trichopoda), with an average of 770 amino acids. We observed an average of 762 residues for type I proteins and 793 residues for type II (including Gohir.D10G001600). The average isoelectric point of type I proteins was 5.33 and that of type II was slightly higher at 5.71 (Additional file 1).
Protein sequence analysis revealed that all H + -PPase proteins shared motifs, including motif 1 located at core TM5-TM6, motif 2 located at core TM11-TM12, motif 3 located at TM13, and motif 6 located at TM9-TM10 (Fig. 1b). The K + ion-dependent determinant "GNxxAAIG" motif is located within motif 2 (Additional file 3). The difference between H + -PPase type I and type II proteins was mainly reflected in the TM1 helix position of the N-terminus, the motifs 9 (type I) / 15 (type II) in the TM7-TM8 region of the middle section, and the motif 4 (type I) / motif 10 (type II) in TM15-TM16 of the C-terminus (Fig. 1b).
By comparing the distribution of motifs and threedimensional models (SWISS-MODEL [23]), we found that the structure of gene estExt_Genewise_ext.C_Chr_10614 seems to be similar to that of type II genes (Fig. 1b, Additional file 5). Confoundingly, this gene also has a K + iondependent determining domain "GNTTAATG", which is similar to that of type I members (Additional file 1). These characteristics further confirm the uniqueness of gene estExt_Genewise_ext.C_Chr_10614 in O. lucimarinus.

Duplication events in H + -PPase genes in plants
After analyzing 27 plant species, only one pair of tandem repeats was found in moss (PHYPA_000091, PHYPA_000092; Table 1, genes highlighted by bold typeface), and no tandem repeats were identified in angiosperms with frequent duplication events. We searched the Plant Genome Duplication Database (PGDD, http://chibba.agtec.uga.edu/duplication/) for species with 7 or more copies of H + -PPase genes (i.e., corn, wheat, soybean, banana, and upland cotton). Although the genomic segmental duplication information of some species (wheat and upland cotton) has not been recorded in the PGDD database, eight pairs of segmental duplications were found in corn, banana, and soybean, two of which were type II H + -PPase genes, while the other genes were of the Ia subtype. The number of non-Ia subtype members is small, which may indicate that no fragments containing non-Ia subtype members underwent segmental duplication. The estimated separation time based on the effective synonymous substitution rate (Ks) value of fragment repetition was similar to the WGD date of the species (Table 3).
To further study the duplication events in the evolutionary history of the H + -PPase gene family, we used upland cotton as a model because it has the largest number of members from this gene family. According to species evolutionary relationships [19], we analyzed the genome collinearity among a primitive angiosperm (A. trichopoda), grape, cocoa, and a diploid cotton (Gossypium raimondii), and found that the number of H + -PPase genes increased following WGDs in these species, including three amplifications in branches A, B, and C (Fig. 2a). In upland cotton, the distribution of Ks values among different gene pairs for all H + -PPases formed four clusters. The first three clusters included the homomorphic gene pairs, while the fourth cluster was composed of heterogeneous H + -PPase gene family members ( Fig. 2b; Additional file 6). The Ks values between type Ia and non-Ia subtypes and between types I and II in the fourth cluster indicate a highly similar Ks distribution (Fig. 2b). The segregation period of H + -PPase members in upland cotton was estimated according to an average synonymous replacement rate of 2.6 bases per 10 9 years (λ = 2.6 × 10 9 ) [36] (Fig. 2c). This calculation indicated that the divergence events among homotypic members were similar to the predicted doubling times caused by three events that occurred in upland cotton as follows: A, the WGD shared by angiosperms (γ event) [37,38]; B, the WGD in the genus Gossypium, which occurred 57-70 Mya [30]; and C, the segregation of ancestors from chromosomes A and D in tetraploid cotton, which occurred between 5 and 10 Mya [36]. Regardless of the differences between type I and type II or between Ia and non-Ia subtypes within type I, similar divergence periods were estimated between heterogeneous H + -PPase genes, as indicated by their similar Ks distribution. This suggests that the Ks values between the Ia and non-Ia subtypes has reached saturation, and this phenomenon was also observed in species with a large number of H + -PPase genes (Additional file 6). Therefore, the Ks value between heterogeneous members can no longer be used to reliably estimate the separation time, indicating that the differentiation between type I and type II and between Ia and non-Ia subtypes within type I occurred at an early evolutionary stage.
In summary, PGDD data and analysis of H + -PPase genes in upland cotton suggest that WGDs have played the most important role in the accumulation of H + -PPases in higher plant species.

Expression patterns of H + -PPases in plants
To assess the possible functional differentiation among H + -PPases in plants, we compared the expression patterns observed in cotton and corn, which have multiple gene family members, and A. thaliana, which has fewer members (Fig. 3). In most tissues at different developmental stages, the highest gene expression of H + -PPases belonged to the Ia subtype. The expression levels of H + -PPases in type II were lower than those in type I, but we observed a smaller difference than what was previously reported [15].
Under typical conditions, compared to A. thaliana, the expression patterns in upland cotton and corn were more complex, and each subtype of H + -PPase genes had members with very low expression levels. However, most genes were highly expressed in at least one organ or at certain stages of development. Thus, differential expression patterns evolved among plant species with larger number of H + -PPase gene family members.
The differentiation trends illustrated in Fig. 3 can be more intuitively reflected by comparison of transcriptomes from five species with increasing numbers of gene family members: cucumber (GSM3048829-GSM30488 31, GSM1576573-GSM1576580), soybean (GSM170159 5-GSM1701597, GSM3714659-GSM3714661), poplar (GSM2565710, GSM2565711, GSM2565718, GSM2565 719), maize (PRJNA171684), and upland cotton (GSE7 0369) (Fig. 4). We found that the expression pattern varied not only among genes of different types or subtypes, but also among copies of the same subtype. Further, the more the number of copies, the more obvious was the extent of differentiation. We speculate that functions of new copies differentiated with unique roles in plant growth and development as the number of copies accumulated.

Functional divergence in the H + -PPases
The DIVERGE v3.0 program [39][40][41] was used to explore whether amino acid substitutions along different branches of the H + -PPase gene family led to the functional bifurcation in the two major branches. Since the estExt_Gene-wise_ext.C_Chr_10614 in O. lucimarinus is on a branch of its own, it was excluded from this analysis. There was significant functional divergence between type I and type II H + -PPases in our dataset, with seven Type-I and 80 Type-II functionally divergent sites identified ( Fig. 5; Additional file 7). This indicates that there were different selective constraints on the distribution of amino acid sequences between the two types of H + -PPase genes, and that a large number of conserved amino acid sites underwent radical substitution. In type I H + -PPases, 10 key amino acid residues were related to diphosphate hydrolysis and proton pump function [16,17], while five of these were replaced in the type II H + -PPases, two of which had undergone Type-II functional divergence (Fig.  5, AT1G15690.1 R246Q and E305A ). In terms of the gene structure, significant differences at key sites between these types are likely to cause functional differentiation and limit functional substitution among members. This would also partially explain why all higher plants have both type I and type II H + -PPase genes.

Positive selection in the H + -PPase gene family
We next investigated whether there was selective pressure for differentiation among members of the H + -PPase gene family on different phylogenetic branches. In the present study, 124 H + -PPase genes were analyzed by comparing the "free ratio" model, which assumes that each branch in the phylogenetic tree has different ω values, to the "one ratio" model, which assumes that the whole evolutionary tree has the same ω value. According to the likelihood ratio tests (LRT), the "free-ratio" model was significantly better than the "one-ratio" model, indicating that the different branches of the phylogenetic tree were affected by significantly different selection pressures (Table 4). Using type I and II H + -PPase branches as the foreground branches (Additional file 8), "Model A" and "Model A-null" models were compared using the branch site model. This analysis showed that the ω 2 values of the type I and type II H + -PPase branches were significantly higher than 1, and the "Model A" model of the two branches was significantly better than the "Model A-null" model in LRT detection (Table 4). These two major branches of the H + -PPase gene family could have been subjected to strong positive selection pressure. We also employed a Bayes Empirical Bayes (BEB) method to identify sites under positive selection with a posterior probability of more than 95%. One site was found amongst the type I genes, while 14 were found among the type II members with a posterior probability of more than 95%, one of which had a posterior probability of > 99% (396 N, AT1G15690.1 N284L ) ( Table 4, Additional file 9). These results suggest that plant type II H + -PPases were subjected to stronger positive selection pressure than type I genes.
In the interior of the angiosperm type I H + -PPase branch, we conducted branch site model analysis on the Ia subtype as the foreground branch (Additional file 10). We found that although "Model A" was better than "Model A-null", the ω 2 value of the Ia subtype branch was not higher than 1, suggesting that the branch was not positively selected for in the angiosperm type I H + -PPases. In addition, five positively selected sites were identified in the branches of the Ia subtype; however, the posterior probability was less than 95% (Table 4, Additional file 9), indicating that the conservation of Ia subtypes was stronger than that of the other plant type I branches.

Key structural sites in plant H + -PPase proteins
In order to further describe the structural characteristics of plant H + -PPases and screen for key sites, a multiple sequence alignment was performed ( Fig. 6) with five H + -PPase protein sequences from A. thaliana and the most primitive angiosperm, A. trichopoda.
We found that, of the 80 Type-II functional disproportionation sites, two were involved in proton pump function (Fig. 5,  ). In addition, these nine key sites (Fig.  6, amino acids highlighted by green triangle) are mostly located in the core functional helix (7/9), so amino acid substitutions at these sites may have a significant impact on the function of the proton pump. On the other hand, the key residues that regulate the K + requirement of H + -PPase protein are exactly at the positive selection sites for type II H + -PPase (Fig. 6, amino acids highlighted by black crosses,  . We therefore hypothesize that the ability to function independently of K + is an evolutionary advantage for the type II members. Using a newly published model [16,17], we found the key region that defines K + demand (GNxxAAIG) is located in TM12, which is slightly different from previous studies [4,5].
In summary, among different types, many key regions of H + -PPase, including proton transport and potassium iondependent determinants, are involved in functional divergence and positive selection to varying degrees. Therefore, different types of H + -PPase may play distinct roles.

Evolutionary processes of H + -PPase genes in plants
Although the atypical gene, estExt_Genewise_ext.C_Chr_ 10614 from O. lucimarinus, is an isolated observation in our analysis, it is effectively expressed in transcriptome data (GSM1134625) and can be mutually confirmed with previous results [42]. Therefore, we believe that the plant H + -PPase gene family contains at least three different types of protein that originated from the LUCA. Genetic segregation into these types occurred very early in evolutionary history, and each type has experienced a long period of independent evolution. This ancient genetic divergence is similar to that of the V-ATPases and their sibling homologous F-ATPases [2].
In the present study, H + -PPase gene family members have diversified ways of presentation in different species, such as: only type I in green algae of class Chlorophyceae; only type II in red algae; type I & type II in higher plants and some green algae. In addition, other type of H + -PPase genes have still been found in plants such as "estExt_Gene-wise_ext.C_Chr_10614". The genome assembly of the species involved in this study is reliable; however, the reference genome of any species could not be perfect, which also makes it unavoidable to eliminate the possibility of omission in the search results. Therefore, we are still not sure whether this gene family has undergone evolutionary events, such as horizontal gene transfer (HGT) and gene loss, as these events are not uncommon in the early stages of evolution [43].

Differentiation of angiosperm type I H + -PPase subtypes
In angiosperms, type I H + -PPase members may have undergone unique differentiation events. In the present study, angiosperm type I H + -PPases were divided into Ia and non-Ia subtypes. All angiosperms contained Ia subtype members, and 72% (13/18) of angiosperms had members from the non-Ia subtype. In the most primitive angiosperm-Amborella, only two type I H + -PPase genes are present, belonging to the Ia and non-Ia subtypes ( Table 2). In species that express more than seven H + -PPase members, such as upland cotton, the Ks values between the Ia and non-Ia subtypes were significantly higher than those in homotypic members, and reached saturation (Fig. 2b, c, and Additional file 6). There may have been different subtypes of the type I H + -PPase genes in the angiosperm ancestor, which gradually evolved to form the structural trunk made up of the present subtypes.
We also found that Ia subtype members had the highest sequence conservation, the highest copy numbers, and were distributed across all 18 angiosperms included in the present study (Table 2). Further, the members of this subtype had the most variable expression patterns, and the members with the highest expression levels were also from this subtype (Figs. 3 and 4). Based on these results, we hypothesize that the Ia subtype could be the dominant H + -PPase variant in angiosperms.

Two evolutionary trajectories of H + -PPase gene family
Among angiosperms, species with multiple H + -PPase genes and those with fewer than four H + -PPase genes follow different evolutionary trajectories. Over time, the new species separated from their ancestors and gradually formed two  trails. One trail could be characterized by new species evolved accompanying copy number inclement of H + -PPase gene (Fig. 7, light red arrow). Species on this trail (e.g., upland cotton) often experienced multiple WGD events (i.e., high numbers of duplicate genes), with multiple H + -PPase genes specifically expressed in different developmental stages and tissues. However, the differentiation of expression patterns was mainly concentrated among homomorphic gene types with similar sequences. The differentiation of expression patterns was more obvious as the number of gene family members in a species increased (Figs. 3 and 4). This describes an evolutionary trajectory in which genes have almost the same sequences but exhibit spatiotemporal differentiation, defined as the sub-functionalization.
In contrast, on the other trail, the copy numbers of H + -PPase genes in newly emerging species were low (Fig. 7, light blue arrow). Cucumber and A. thaliana can be found on this trail, and each of them have no more than three H + -PPase genes with stable relative expression levels in different tissues and developmental stages (Figs. 3 and 4). We describe this trajectory of responding to multiple transcriptional needs with one gene as multifunctionalization. Fig. 6 Multiple sequence alignment of H + -PPase protein sequences. AT1G15690.1, ERN14318, and ERN03082 belong to type I, among which ERN03082 is non-Ia and the rest are Ia. AT1G16780.1 and ERN12531 belong to type II. Transmembrane helices (TM) in the reference sequence (AT1G15690.1) are outlined and numbered, and the six core and ten outer TMs are indicated in black and white, respectively. Arrows indicate key sites in the reference sequence involved in proton transport. Dots indicate the amino acids responsible for functional divergence (Type-I: blue, Type-II: red). The red and blue outlined boxes indicate amino acids that might be responsible for positive selection of type I and type II H + -PPases, respectively. Functional disproportionation sites that are responsible for positive selection or involved in proton transport are indicated with green triangles. Black crosses represent key sites for K + demand Sub-and multi-functionalization have distinct characteristics during the evolution of a gene family. For example, sub-functionalization avoids the risk of mutation, while multi-functionalization carries a smaller genetic burden.

Conclusion
Among the 27 plant species examined in the present study, all possessed H + -PPase genes, with 124 different H + -PPase gene family members identified. The vast majority of these could be divided into two categories: type I and type II, with type I further differentiated into subtypes Ia and non-Ia. There were significant differences in the copy numbers of H + -PPase genes among different plant species, and the species with higher copy numbers were usually angiosperms. We also found that the accumulation of H + -PPase gene copies in angiosperms was mainly due to WGD events in each species. In lower plants (e.g., red algae and green algae), the different types of H + -PPase genes were unevenly distributed, while all higher plants (e.g., vascular plants) contained combinations of both type I and type II H + -PPase genes. Phylogenetic analysis, motif analysis, and the prediction of tertiary structures of different H + -PPase proteins indicated that "estExt_Genewise_ ext.C_Chr_10614" in O. lucimarinus is distinct from both type I and type II. We also confirmed significant differences in the expression patterns between type I and type II H + -PPase genes, and identified different expression patterns between homomorphic H + -PPase genes in species with multiple gene copies. We estimated the functional divergence between type I and type II H + -PPase proteins caused by amino acid substitution, and found that two of the ten functionally related key amino acid sites were related to Type-II functional divergence. We also found that both type I and type II H + -PPase branches were subjected to very strong positive selection pressures. However, there was no obvious positive selection among members of the Ia subtype in angiosperms. These results improve our understanding of the structural evolution and functional differentiation of the plant H + -PPase gene family, and provide a foundation for further exploration of the function and potential applications of this gene family.
Based on significant differences in the number of H + -PPase genes in angiosperms and the differentiation among homomorphic members, we propose two gene family evolutionary trajectories (sub-and multi-functionalization) that explain the observed evolutionary phenomena.

Data sources
Twenty-seven representative plants with relatively complete annotated genome data were selected as the research subjects from the APG taxonomy [44] and phylogenetic relationships. Taxonomic evolutionary relationships among species were visualized using the Timetree online tool (http://www.timetree.org/) [45,46] (Additional file 11). The genomic data were downloaded from the Ensembl Plants dataset (https://plants.ensembl. org) and the Plant JGI Database phytozome v12.1 (https://phytozome.jgi.doe.gov/pz/portal.html). For genome version information, see Additional file 12.
A curated seed alignment containing 323 representative H + -PPase proteins was downloaded from Pfam 32.0 [47] (http://pfam.xfam.org/). This seed alignment was used as a background to explore the genetic position of the plant H + -PPases.

Identification of H + -PPase gene family members
The hidden Markov model (HMM) (pfam number: PF03030) for the characteristic domain of H + -PPase proteins was downloaded from the Pfam database (http://pfam. xfam.org) [47]. HMMER v 3.1 [18] was used to search for candidate genes in the whole protein sequence data of each different species. Because the protein domain of the H + -PPase gene family is large (650 amino acids), the sequence coverage rate was more than 80%, and the e-value was less than 1 × 10 − 200 . Proteins with domain separation in intervals of no more than 50 amino acids, a sum of sequence coverage of more than 90%, and protein e-values of less than 1 × 10 − 200 for each section were used as candidates. The longest transcript of each gene was selected as the candidate member and submitted to SMART (Simple Modular Architecture Research Tool: http://smart.embl-heidelberg. de) [48] for verification. These results were used for downstream analysis of the H + -PPase gene family.

Analysis of H + -PPase gene and protein structure
The structural information of gene transcripts was extracted from GFF3 (Generic Feature Format Version 3) annotation files, and the protein motif structure was obtained using the online tool MEME [49,50] (http:// meme-suite.org/tools/meme 5.04). The main parameters were as follows: the search motif type was 15, the distribution number of each motif in the sequence was 0 to 1, the size range was 6 to 100, and the p value was less than 10 − 5 . The resulting data were compiled and submitted to the online tool iTol [51] (https://itol.embl.de / 4.3.2) to visualize protein structures.
The protein sequences were submitted to the SWISS-MODEL [23] website for tertiary protein structure prediction. Then, the predicted structure maps of different types of H + -PPase proteins were exported using PyMOL (http://www.pymol.org) software.

Establishment of phylogenetic relationships
We analyzed the results of the two multiple alignment methods (ClustalW and MUSCLE) and three phylogenetic inference methods (NJ, ML, and ME) in MEGAX [52] with 1000 bootstrap replicates to choose stable phylogenetic trees.
Because of the need for functional divergence analysis and positive selection analysis, whole protein sequences were used to construct phylogenetic relationships among H + -PPase gene family members. For more accurate subgroup division, we also performed phylogenetic analysis of specific functional domains as a reference.
According to the seed alignment presented in the Pfam dataset, MUSCLE aligned functional domain sequences was used to explore the position of the plant H + -PPases in the background tree.

Analysis of family member expansion
We focused on two replication mechanisms for H + -PPase gene family members: segmental duplication and tandem duplication [53]. We considered members of the H + -PPase gene family that were no more than 10 genes apart as tandem duplications [54]. Segmental duplication was determined by PGDD, and the corresponding repeat fragments and Ks values [55] were obtained. In addition, for upland cotton, analyses of the genomes of evolutionarily related species and a search for segmental duplications over different periods was conducted with MCscanX (v. python) [56] and blast2.7.1 software. The separation time between the corresponding genes was estimated according to the formula T = Ks/2 λ [55]. The Ks values among gene members in different species were calculated with CODEML [57].

Gene expression pattern analysis
Transcriptome data were mined from the Gene Expression Omnibus DataSets (GEO), the Sequence Read Archive (SRA) database, and the Bio-analytic Resource for Plant Biology (BAR) http://bar.utoronto.ca/. For species whose gene IDs were difficult to match to the ID in the high-throughput data, we used Salmon [58] (v 0.13.1) to analyze the expression levels of all genes in a selected species and extract the transcripts per million (TPM) value for the H + -PPase genes, after quality detection and filtering using FastQC (v 0.11.8) and Trimmomatic (v 0.38) [59].
In order to compare the expression of H + -PPase genes in specific tissues of different species, we calculated and compared the relative expression.
The following formula was used for calculating relative gene expression intensity: