Sub-grouping and sub-functionalization of the RIFIN multi-copy protein family

Background Parasitic protozoans possess many multicopy gene families which have central roles in parasite survival and virulence. The number and variability of members of these gene families often make it difficult to predict possible functions of the encoded proteins. The families of extra-cellular proteins that are exposed to a host immune response have been driven via immune selection to become antigenically variant, and thereby avoid immune recognition while maintaining protein function to establish a chronic infection. Results We have combined phylogenetic and function shift analyses to study the evolution of the RIFIN proteins, which are antigenically variant and are encoded by the largest multicopy gene family in Plasmodium falciparum. We show that this family can be subdivided into two major groups that we named A- and B-RIFIN proteins. This suggested sub-grouping is supported by a recently published study that showed that, despite the presence of the Plasmodium export (PEXEL) motif in all RIFIN variants, proteins from each group have different cellular localizations during the intraerythrocytic life cycle of the parasite. In the present study we show that function shift analysis, a novel technique to predict functional divergence between sub-groups of a protein family, indicates that RIFINs have undergone neo- or sub-functionalization. Conclusion These results question the general trend of clustering large antigenically variant protein groups into homogenous families. Assigning functions to protein families requires their subdivision into meaningful groups such as we have shown for the RIFIN protein family. Using phylogenetic and function shift analysis methods, we identify new directions for the investigation of this broad and complex group of proteins.


Background
Antigenic variants are proteins expressed by pathogenic organisms, which are usually exposed to immune pressure from a vertebrate host. The genes that encode these pro-teins can be single copy within the genome as is the case for viruses and the variability therefore exists between gene copies of individuals. This implies that the proteins they encode retain the same function. However, other organisms maintain several to many copies within the genomes of each individual [1,2]. Conversely to viral genes, these multicopy genes are not only under immune pressure but can also follow distinct evolutionary paths to differentiate into novel functional units.
The genomes of Plasmodium species contain numerous large multigene families that have been amplified via functional or immune pressures [2][3][4][5][6]. One important feature of these organisms is that they do not express the whole protein repertoire simultaneously [7-10]. These polymorphic families are predominantly situated in the sub-telomeric ends of chromosomes [2][3][4][5][6], where gene rearrangements are frequent [11,12]. They encode for proteins that presumably fulfill several functions and immune pressure has driven them to antigenically vary at the surface of the infected erythrocyte [13]. Empirical studies have shown that the Plasmodium falciparum Erythrocyte Membrane protein 1 (PfEMP1) can mediate cytoadhesion by interacting with various host receptors, resulting for example in sequestration of the infected erythrocytes in the host tissue or rosette formation with uninfected red blood cells [13]. The repertoire of PfEMP1 proteins is therefore shaped both by functional pressures for binding and by diversifying pressures to evade immunity [14]. Yet, such an accumulation of experimental data is missing for protein families in most parasite species.
We have studied the RIFIN protein family, a group suggested to be under immune diversifying selection. Their genes, repetitive interspersed family (rif), are the largest family in P. falciparum with 150 to 200 copies per haploid genome. They are small two-exon genes (≈1000 base pairs), with a conserved domain architecture [15,16]. Characteristically, RIFIN proteins are described as small polypeptides beginning with a putative signal sequence followed by a conserved domain, a variable region and a conserved C-terminal domain. Two transmembrane regions have been predicted on both sides of the variable region; with this stretch predicted to be exposed to immune pressure [9,15]. The proteins most closely related to RIFINs are of the Sub-Telomeric Variable Open Reading Frame (STEVOR) family [15], numbering 28 copies in the reference strain genome [2]. Although primary sequence similarity is limited [15], this relationship is emphasized by the existence of a RIFIN_STEVOR family (PF02009) in the PFAM database [17].
RIFIN proteins have been detected throughout the intrahuman life cycle of the parasite [8, [18][19][20][21]. Furthermore, RIFIN proteins are associated with a stable immune response over time and with rapid clearance of parasites from the circulation [22,23]. However, as for most protein families, little more is known and their function(s) remain(s) to be discovered. In this study, we propose a novel approach to understand complex protein families for which little data is available. We demonstrate the division of the RIFIN family into two groups, which we associate with published differential cellular localization. Finally, we correlate these differences with the prediction of a function shift between these sub-groups.

Phylogenetic classification of the RIFIN family
An alignment of 134 RIFIN protein sequences from the P. falciparum reference strain 3D7 (selection criteria detailed in Methods) was analyzed in order to detect divergences within the family. This revealed the existence of differences, prompting an initial division of RIFIN proteins into at least two major groups. The larger group, which we named the A-type RIFINs, represents ≈72% (97/134) of all analyzed proteins, while the second group, which we designated B-type RIFINs, makes up ≈28% (37/134). Although both groups have a common architectural structure [15,16], they differ by several features, as depicted in the alignment of representative A-and B-RIFIN sequences (Fig. 1A) and schematically (Fig. 1B). First, the A-type proteins are generally larger than the B-type variants (on average 350 and 330 amino acids, respectively). This difference is largely due to a 25 amino acid stretch present only in the conserved (C1) region of A-type RIFINs, as previously described [2]. It is located approximately 66 amino acids downstream of the Plasmodium export element (PEXEL motif) [24] and contains some highly conserved residues (Fig. 1). A second distinctive feature concerns the number of conserved cysteine residues (Fig.  1B arrows). A-type RIFINs are characterized by a total of 10 highly conserved cysteine residues, compared to 6 in Btype variants, 5 of which are common to both sub-types ( Fig. 1B grey arrows). Notably, two of the conserved cysteines typical for A-type RIFINs are found in the 25 amino acid stretch.
In order to substantiate this preliminary sub-grouping, we clustered rif sequences according to their similarities by constructing Neighbor Joining distance trees. The trees resulting from protein-derived cDNA alignments sorted the sequences into two major groups that were largely concordant with the above sub-grouping (Fig. 2). However, five sequences deviate from their predicted group (Fig. 2, stars): PFD0045c and PFI0050w, which are B-RIFINs, cluster with A-RIFINs; PFB0015c is an A-type which groups with B-RIFINs; and PFB0040c and PF10_0402 cluster together and separately from A-or B-RIFIN proteins. We find it noteworthy that the B-RIFIN group could be further subdivided into three subsets, namely B1, B2 and B3, whereas the A-RIFINs did not form any obvious clusters (Fig. 2). While B1 and B2 sub-clades formed a monophyletic group with a bootstrap value of 92%, the separation of the B clade from the A clade had a RIFIN proteins overview shading according to conservation. (B) Schematic RIFIN sub-group characteristics: Overall domain organization and classification into subtypes. The 25 amino acid stretch present only in the semi-conserved domain of A-type RIFINs is highlighted and depicted by a sequence logo. Grey arrows: common conserved cysteine residues; black arrows: sub-type specific cysteine residues; SP: signal peptide; PEXEL: Plasmodium export element; C1: semi-conserved domain, including the 25 AA insertion/deletion; C2: C-terminal conserved domain; TM1 and TM2: previously predicted transmembrane regions; V1: first variable domain; V2: second variable domain. Consevation   1  1  1  1  1  1   80  83  84  87  86  82   MK-VHYINILLFALPLNILI------YNQRNHYITRTPKATT---RTLCECELYAPSNYDNDPEMKAVMQDFDRQTSQRFEEYNERLLEN  MK-   Phylogenetic tree of rif cDNA Figure 2 Phylogenetic tree of rif cDNA. The tree shows the segregation of A-and B-rif genes (gaps considered as complete deletions). The B-rif group is further subdivided into B1, B2 and B3 clusters. Stars indicate sequences that group atypically. Bootstrap support, after 1000 replicates, is only shown for the branches separating the different groups, dots at nodes indicate bootstrap values above or equal to 60%.
weaker statistical support at 61%. This unexpectedly low bootstrap value together with the observation of relatively long branches in the B3 sub-group versus the shorter ones in the B1 and B2 sub-groups prompted us to examine the sequences more closely. Accordingly, we carried out independent phylogenetic analyses of the conserved C1 and the variable V2 domains (as shown in Fig. 1B). These trees show that the B3 sequences have an incongruent history (Fig. 3), which reveals probable recombination/gene conversion events. Specifically, the V2 domains of the B3 sub-set segregated with the A-RIFINs rather than with B-RIFINs, while the C1 domains of the same variants were of B-type (with the exception of PFE1630w). B3 sequences thus constitute hybrid variants composed of C1 domains of the B subtype and V2 domains of the A subtype. Overall, we observed long branches for sequences encoding Aand B3-RIFIN proteins, not seen for B1 and B2 sequences, clearly a direct consequence of the higher variability within the V2 region of these sequences.
Non-congruence of phylogenetic trees of RIFIN conserved (C1) versus variable (V2) domains The same tree construction method applied to the V2 domain showing that B3-RIFIN sequences do not cluster with B1-and B2-sequences. Bootstrap support, after 1000 replicates, is shown for values above 50%.
In addition to the analysis of the 3D7 strain, we have aligned the 3D7 sequences with 59 of the DD2 and 65 of the HB3 strain sequences (selection criteria detailed in Methods). The tree resulting from the protein alignment confirmed the results obtained with the reference genome analyses. The sequences sorted into the same two major clades with no strain specific grouping (see Additional file 1). The B-RIFIN clade is split into three groups; however the B1 and B2 clades contain few sequences from the DD2 and HB3 genomes.
It is noteworthy that the two B-RIFIN sequences, which cluster with A-RIFINs (PFD0045c and PFI0050w), have homologous sequences in both DD2 and HB3 genomes (see Additional file 1, stars).
Based on the knowledge that non-coding regions may contain motifs of significance in gene regulation and expression, we also analyzed 500 base pairs of non-coding upstream and downstream untranslated regions (UTRs) from the 3D7 rif genes. The phylogenetic analyses of these regions segregated the sequences into the same major Aand B-groups as the coding regions, which we have termed A-rif and B-rif UTRs (see Additional file 2). For both 5' and 3' UTR analyses, B-rif UTRs could be further divided into two groups, one of which included B1 and B2 variant UTRs, the other mostly B3 variant UTRs. As in the above analysis, some sequences did not segregate into their expected sub-group, for example a few B3 sequences were found in the B1/B2 subdivision and vice versa. Additionally, some A-rif UTRs clustered with B-rif UTRs and in this case, mostly with the B3 sub-group. In contrast to the coding sequences, the A-rif UTRs appear to cluster into sub-groups. Despite overall similarities in observations between both 5' and 3' UTR analyses, there was only partial congruence between these UTR clusters, in particular as far as A-rif UTRs are concerned.
A previous study has identified two transcriptional repression sites (TATGCAATGATT and CGCACAACAC) [25] upstream of 8 rif genes in a head to head orientation with UpsA var genes. An exhaustive search on all 14 chromosomes of the 3D7 strain shows that these two motifs are found in 20 and 19 copies, respectively. However, only 15 and 11 copies are upstream (either independently or in combination) of a total of 16 rif genes (see Additional file 2, indicated by #); the other copies are found up-or downstream, or sometimes in the coding region of other genes. Concordantly to this analysis, 13 of the 5' UTRs of these genes cluster together in our phylogenetic tree.
An analysis of chromosomal location reveals that only 6 of the 134 sequences (4.5%) used in this study are centrally located genes (data not shown). The other similarly positioned rif genes are annotated as pseudogenes or are truncated and none of these are grouped according to protein or UTR sequences (data not shown). The transcription of ≈70% of A-rif and all B-rif genes is telomere oriented. The A-rif genes with a centromeric transcription orientation (≈30%) do not cluster on the protein tree (data not shown), however they are mostly distributed within three sub-clades of the A-rif 5' UTR tree (see Additional file 2, crosses).

Function shift analysis of A-and B-RIFIN proteins
We sought for indications of functional differences between A-and B-RIFIN sub-groups by analyzing them for function shifts according to previously described methods [26]. Function shift analysis calculates the number of rate and conservation shifting sites (RSS and CSS, respectively) that exist between two given protein groups. RSS is measured by U-values, which indicate the likelihood that the mutation rate changes for each alignment position between the subfamilies under consideration. A site is considered rate-shifting (at 5% significance level) if its Uvalue is above a cut-off value of 4.0 [27]. CSS is measured by the Z-score, a normalized method to examine the similarity between two distributions of amino acids. Smaller Z-score values are associated with similar amino acid distributions in both subfamilies, while larger Z-score values are associated with very different distributions. The total numbers of positions are counted for both RSS and CSS calculations.
The results are compared to enzymatic protein families that have undergone a change in function, which belong to several functional categories including immunity related functions. The function shift model was benchmarked using organisms from all three kingdoms of life, namely Archea, Bacteria and Eukaryotes. This results in the estimation of the likelihood of sub-functionalization between the two groups. The function shift analysis of sub-group A against sub-group B (using standard cut-offs of 4 for RSS and 0.5 for CSS) resulted in the prediction of 81 rate shifting sites (RSS) (22% of all positions) and 60 conservation shifting sites (CSS) (17%) between them (see Additional file 3, rifins.html, for the full alignment). We computed the probability of the prediction as 83% based on RSS alone and 52% based on CSS alone. Considering comparable knowledge empirically gathered on the classification of shifts in function of known protein families, which combine the two measures [26], A-and B-subgroups are predicted to have functionally diverged from each other.
Listed in Table 1 and 2 are the top positions sorted according to their U-values for RSS (stringent cut-off of 15) and Z-scores for CSS (stringent cut-off of 2), respectively. Both RSS and CSS are mostly found in the conserved regions of RIFIN proteins (see Additional file 4, rifins_high.html, for the full alignment with stringent cut-offs). In Figure 4A we show these shifts in a portion of the N-terminus of a random selection of A-and B-RIFIN sequences. Figure 4B correlates CSS and RSS plots, along the alignment, with the predicted conservation of secondary structure of RIFIN proteins. The high stringency cut-offs used in this figure highlight the most significantly shifted sites (Fig. 4B  arrows). Notably, most of these shifts involve a change in the biochemical properties of the amino acid. We will specifically emphasize the shifts in positions Q31K, R32N, N33K and H34P, in a predicted loop region about 15 AA upstream of the PEXEL motif; positions C62S and Y67X approximately at the PEXEL motif; and positions C62S, C108R, C112R, and G167C which all involve cysteine residues, commonly engaged in disulfide bonds.
Limitations of function shift analyses lie in regions for which one group has amino-acid stretches that the other group lacks. In this case, RSS and CSS calculations give a null value; however this does not equate to an absence of impact on functional divergence of the two groups. One particular way of viewing such a site is to acknowledge it as a shifted site from a conserved motif to an absence of residues. The 25 AA stretch present in A-RIFIN sequences and absent from B-RIFINs can be viewed in this way, specifically due to the conservation of many of its residues as seen in Fig. 1B. Additionally, most of this motif is predicted to be a loop region, which could be involved in a functional site.

Discussion
Protein families with known functions have successfully been sorted into functionally different sub-groups using phylogenetic techniques [28,29]. However, which approach should be used with proteins of unknown function? We have combined phylogenetic and function shift analyses to study the Plasmodium falciparum RIFIN protein family. Our results demonstrated that these proteins could be subdivided into two major groups that we named A-and B-RIFIN proteins. We correlate these groups with different localization studies [19,21,30] based on proteins from each of these groups. Moreover, our function shift analysis points to the probability that these two groups of proteins have undergone neo-or sub-functionalization.
The 3D7 rif cDNA tree we constructed by the Neighbor Joining method distinguished A-and B-type RIFIN variants, the latter being subdivided into three groups (B1, B2 and B3). The additional analysis of combined rif sequences from three different strains (3D7, DD2 and HB3) confirms this grouping (see Additional file 1). However, most DD2 and HB3 sequences clustered in the A and B3 groups, with only four sequences in the B1/B2 group. Our strict inclusion criteria have resulted in the removal of over 45% of the DD2 and HB3 RIFINs, mainly truncated sequences. We do not know whether these are simply pseudogenes within these genomes or if they appear as truncated due to the difficulties in sequencing and assembling subtelomeric regions of P. falciparum parasites. Considering this latter case, we prefer not to draw genome wide conclusions from possibly incomplete genomes.
Upon further investigation of the 3D7 RIFINs, B3sequences showed to be hybrid variants that have B1/B2 features in their C1 domains but A-type features in their V2 domains. Vice versa, two A-variant hybrids carrying Aspecific C1 domains and B1/B2-specific V2 domains were also found (Fig. 3). Recombination events and gene conversion are likely to serve as explanations for the formation of such hybrid sequences. The former are essential for the generation of antigenic diversity [11] and previously proposed to be responsible for the diversity of the var gene family [31]. These authors argue for recombination events restricted between genes grouped according to their chromosomal location and transcription orientation. In contrast to the var genes, there is no evidence for such specific recombination within the A-and B-rif gene groups: ≈70% of the A-rif and all B-rif genes have the same telomeredirected transcription orientation; the remaining ≈30% of A-rif genes do not cluster in our gene tree. Also, over 95% of all rif genes analyzed here are subtelomeric. Theoretically, recombination can thus occur between A-and Btypes of the same orientation. DePristo et al. showed that low-complexity regions are preferred sites for recombination events to occur in var genes [32]. Since low-complexity regions are commonly found within RIFIN sequences at the boundaries of the variable region, it is tempting to suggest these sites to have a role in the generation of such hybrid sequences. Gene conversion has been observed in P. falciparum [11,33,34] and is the other possible explanation for these sequences. However, gene conversion has a homogenizing effect that is not detected between B3-rif V2 regions and the sequences showing highest identity to them (66,6% average sequence identity). This might be an indication in favor of recombination events or, simply, that gene conversion is not as frequent as suggested for falcipain genes [34].
Whichever mechanism, both recombination and gene conversion events are known to interfere with phylogenetic reconstruction [35]. Another factor that influences the resolution of phylogenetic analysis is long branch attraction [36,37]. We have seen that A-and B3-RIFIN sequences have long branches (Fig. 2), which could also interfere in our phylogeny. To further confirm our proposed sub-grouping, we constructed phylogenetic trees of the UTRs of rif genes. Previous analysis of gene families has shown that long-term survival of paralogous genes allows for changes in the regulatory regions of those genes respectively, according to alignment position. The predicted consensus secondary structure is plotted with pink and green bars representing helices and loops, respectively. The heights of the bars indicate conserved predictions. Arrows correlate the highest scoring shifted sites with secondary structure predictions. [38]. Our analysis of rif gene UTRs demonstrated a significant segregation of these non-coding regions into similar A-and B-rif UTR groups (see Additional file 2). Taking all these facts into consideration, we conclude that despite a seemingly low bootstrap value of 61%, RIFIN proteins can be divided into A-and B-RIFIN proteins.
One question arises at this point: could there be an alternative grouping of rif/RIFIN sequences? var, the other major family in P. falciparum has been classified according to 5' UTR and genomic position [2,39,40]. Their classification into 3 major sub-groups (A ≈17%, B ≈42% and C ≈40%) mainly relies on the following features: (i) 5' UTR grouping (UPSA, B and C); (ii) gene position (A and B telomeric, C central); and (iii) transciption orientation (A and C towards the telomere, B towards the centromere) [39]. However, PfEMP1 proteins are more complicated than RIFINs by the fact they are modular. Recognizable  signatures allow for the identification of each module but intra-module similarity is limited [2]. The overall function of these proteins is accepted as adhesion to host receptors and is highly module dependent (reviewed in [13]).
A parallel analysis of rif genes shows that, on one hand, very few are not sub-telomeric and no obvious pattern regroups these sequences. In the absence of more conclusive evidence, we do not think this is a good criterion for sub-grouping rif genes. On the other hand, rif UTR sequences can be grouped into sub-clusters. Also, the 5' UTRs of A-rif genes transcribed towards the centromere are non-randomly distributed (see Additional file 2, crosses). These observations confirm previous reports of differential regulation of A-rif expression within the same parasite strain [21]. However the clustering of these A-rif UTR sequences is not congruent with the clustering of the protein-derived cDNA sequences. A recent study of yir genes, the largest P. yoelii yoelii multigene family, shows that some yir genes undergo alternative splicing events [41], which implies regulatory signals in addition to those controlling gene activation and silencing. Therefore, although it is tempting to further the sub-grouping of Arif genes, we believe additional experimental evidence of differential transcription is required to ascertain these sub-divisions.
A recent study has shown that the intracellular distribution of RIFIN molecules in the infected erythrocyte is more diverse than previously envisaged [21]. In order to address the issue of cross reactivity of the antisera used in this study, Petter et al. [21] tested recognition of the anti-RIF29 and anti-PFI0050c antisera against other recombinant proteins of each group. Also, their western blot analyses show that neither A-RIFIN antisera are cross-reactive. A-type RIFINs, detected by an antiserum directed against PFB1035w [8] as well as an antiserum directed against RIF29 [23] (both A-type RIFINs), are transported to Mauer's clefts and towards the surface of the infected cell [19,21], while B-type RIFINs, detected by an antiserum directed against PFB1040w [8] and an antiserum directed against PFI0050c [30] (both B-type RIFINs), are expressed inside the parasite [21], which is consistent with this group's previous report [30]. Additionally, both Aand B-RIFIN proteins were detected in merozoites, here again with different sub-cellular distributions [21]. The localization of B-RIFINs is concordant with the lower variability they exhibit in their V2 region, at least for the B1and B2-RIFIN proteins (shorter branch lengths in Fig. 2). This would be expected of sequences not exposed to the immune system for long periods of time, as they would be at the infected erythrocyte surface.
Although all RIFIN variants bear a motif for directing proteins onto the secretory route, out of the parasite and into the cytoplasm of the host cell, referred to as the Plasmodium Export Element (PEXEL) or Vacuolar Transport Signal [24,42], additional factors not yet characterized might enhance or interfere with protein export. Bioinformatics analyses of biochemical properties of the PEXEL motif and surrounding amino acids suggest possible modulations of the role of this motif (J. Hiss, J. Przyborski, F. Schwarte, K. Lingelbach and G. Schneider, personal communication). Alternatively, presence or absence of conserved motifs distributed elsewhere in the protein, such as the 25 AA stretch present in A-RIFINs, and/or different native 3D conformations of A-and B-RIFIN variants due to the highly conserved subtype specific cysteine residues (possibly involved in disulfide bonding), could impose restrictions on the export signal carried by the PEXEL motif. A previous study of synthetic constructs of the gene PFI0050c (a B-RIFIN) fused to a green fluorescent protein shows that this protein is retained in the parasite when its full length is expressed [30]. However truncated versions, notably when lacking the C-terminal conserved region, are exported to the Maurer's Clefts. It is not clear whether this difference of localization is due to missing motifs in the C-terminus or to changes in 3D conformation due to the truncation of the C-terminus, including a transmembrane domain, of the protein. Whichever their respective transport mechanism, A-and B-RIFIN proteins have a distinct pattern of distribution during the intraerythrocytic life cycle of the parasite, which in correlation with the divergence of their regulatory regions [38] is suggestive of functional differences.
To test this hypothesis, we carried out a function shift analysis [26] of our sub-groups. The evolution of protein families and the consequential evolution of their function are accompanied by the accumulation of mutations at individual sites throughout the protein sequence [43]. These sites may incur different types of selective pressures. A specific site may become important for the maintenance of the function, and therefore a specific amino acid is fixed in that position. In contrast, a fixed site may lose its importance, and become prone to mutation (typical RSS sites). Alternatively, a switch of functional specificity of a site may result in the switch from one amino acid to another accompanied by strict conservation (no further mutations allowed) in both sub-groups (typical CSS site). Finally, the remaining mutations are thought to be randomly accumulated at selectively neutral sites. However, recent studies have shown that mutations in non-essential residues can greatly influence protein stability and aggregation [44]. These types of mutations may build up a compensation mechanism for mutations in key functional sites.

Conclusion
So far, the RIFIN protein family has been considered to be one large family with an unknown function but our results argue for a cautious approach when studying such variable protein families. The RIFIN proteins have been long neglected, possibly in part because of the complexity involved in studying such a large group of proteins. Antigenic variation is mostly a secondary function, as seen with the PfEMP1 proteins, which main function is in cytoadhesion. While physiological functions of RIFIN proteins remain obscure, it is expected that future focus on RIFIN sub-families, the 25 AA insertion/deletion and the predicted conservation-shifted sites between these sub-groups will help to simplify the quest for understanding their biological roles in the parasite. Finally, the lower variability of B-RIFIN molecules and their expression throughout the cycle of the parasite (multi-stage) suggest these proteins as candidate vaccine targets. Further analysis of this family in wild isolates may confirm this hypothesis. Independent alignments and phylogenetic analyses were carried out, on one hand, for the 3D7 strain (134 sequences) and, on the other hand, for the combined 3D7, DD2 and HB3 strains (258 sequences).

Phylogenetic analysis and sequence representation
Five hundred base pairs of upstream and downstream untranslated regions (UTR) as well as the cDNA sequences of the 3D7 rif genes were retrieved from GeneDB [56]. The UTRs were aligned in the same manner as the protein sequences.
Protein sequences are easier to accurately align than cDNA, however the degeneracy of the genetic code makes cDNA more informative than the corresponding protein translation. We used cDNA alignments derived from our protein multiple sequence alignments in order to increase the precision of the phylogenetic analysis. The cDNA alignments were constructed by replacing the amino acids in the protein alignments with the corresponding P. falciparum gene specific codons using the PAL2NAL software [57]. All the alignments are available upon request to the authors.
The C1 domain starts at the PEXEL motif and ends 30 AA after the insertion/deletion. The V2 domain starts 31 AA after the insertion deletion and ends 57 AA before the Nterminus of the protein alignment.
The alignments were used to construct distance trees using the Neighbor Joining method with the MEGA3.1 software [58]. We used a p-distance model with gaps/missing data treated as pairwise deletion for the proteins and UTRs and complete deletion for cDNA alignments. No trees were cut down throughout the experiments. In order to estimate robustness, bootstrap proportions were computed after 1000 replications.
Protein motifs were generated using Protein Sequence Logos and Relative Entropy server [59,60].
Secondary structure predictions were computed using PSIPRED [61,62]. The predicted secondary structures were aligned according to the protein alignment and a consensus prediction was generated using the Jalview software [63].

Function shift analysis
The function shift analysis was carried out on each subfamily pair, of the 3D7 genome sequences (after exclusion of two A-RIFIN and four B-RIFIN sequences which are hybrid A/B sequences; see Discussion for further details), using a previously described method [26]. In this method, two types of sites, namely rate shifting sites [27] and conservation shifting sites [26] are detected and a combined measure is calculated to assess the level of function shift between the sub-groups under consideration. In order for the algorithms to calculate shifting sites, the sequences need to segregate into their predicted groups. Six sequences (two A-RIFIN and four B-RIFN proteins) clustered in the opposite sub-group creating systematic errors in the algorithm. These sequences are all hybrids and were excluded from the function shift analysis.