Different patterns of gene structure divergence following gene duplication in Arabidopsis

Background Divergence in gene structure following gene duplication is not well understood. Gene duplication can occur via whole-genome duplication (WGD) and single-gene duplications including tandem, proximal and transposed duplications. Different modes of gene duplication may be associated with different types, levels, and patterns of structural divergence. Results In Arabidopsis thaliana, we denote levels of structural divergence between duplicated genes by differences in coding-region lengths and average exon lengths, and the number of insertions/deletions (indels) and maximum indel length in their protein sequence alignment. Among recent duplicates of different modes, transposed duplicates diverge most dramatically in gene structure. In transposed duplications, parental loci tend to have longer coding-regions and exons, and smaller numbers of indels and maximum indel lengths than transposed loci, reflecting biased structural changes in transposed duplications. Structural divergence increases with evolutionary time for WGDs, but not transposed duplications, possibly because of biased gene losses following transposed duplications. Structural divergence has heterogeneous relationships with nucleotide substitution rates, but is consistently positively correlated with gene expression divergence. The NBS-LRR gene family shows higher-than-average levels of structural divergence. Conclusions Our study suggests that structural divergence between duplicated genes is greatly affected by the mechanisms of gene duplication and may not be proportional to evolutionary time, and that certain gene families are under selection for rapid evolution of gene structure.


Background
Gene duplication is an important mechanism for evolution of functional novelty and increase of genome complexity [1]. Gene duplication may occur by different modes such as whole-genome duplication (WGD) [2] and single-gene duplications [3][4][5]. For example, Arabidopsis thaliana has experienced at least three WGD events: two recent events (α and β) since its divergence from other members of the Brassicales clade and a more ancient event (γ) shared with most if not all eudicots [6]. Single-gene duplications including local (tandem or proximal) and dispersed duplications also contribute to the origin of a substantial portion of Arabidopsis genes [5,7,8]. Transposed gene duplications, which relocate duplicated genes to new chromosomal positions via either DNA or RNA-based mechanisms [7,9], may contribute to the widespread existence of dispersed duplicates in the Arabidopsis genome [5,7].
Since a likely consequence of gene duplication is reversion to single copy (singleton) status [1], mechanisms for the retention of duplicated genes have been extensively studied. The 'neo-functionalization' model suggests that each of two duplicated genes can be retained if at least one evolves modified or novel functions [1]. The 'sub-functionalization' model suggests that both duplicated genes can be preserved if they partition the functions of their ancestor, through accumulation of degenerative mutations [10,11]. More recent models for gene retention include genetic buffering [12], functional redundancy [13][14][15], dosage balance constraints [5,16,17], or need for enhanced expression levels [18,19].
Retention of duplicated genes does not occur randomly. Following duplication, genes belonging to some functional categories have been preferentially restored to singleton status across different eukaryotic lineages [20]. In plants, modes of gene duplication retain genes in a biased manner [5]. Genes related to transcription factors, protein kinases, and ribosomal proteins are preferentially retained following WGDs [4,21], while those genes related to abiotic and biotic stress are more likely to be retained following local duplications [22,23]. Gene transpositions are more frequent in some families such as F-box, MADS-box, NBS-LRR, and defensins than others [5,8].
Functional divergence between duplicated genes was presumed to be driven by nucleotide substitutions including enhancer/promoter mutations, and nonsynonymous and synonymous substitutions [24][25][26][27]. However, insertions/deletions (indels) between duplicated genes, which may cause shifts of reading frame [32], have greater effects on the divergence in protein secondary structures [33][34][35]. In addition, duplicated genes also diverge in exon-intron structures following gene duplication, which was suggested to play an important role during the evolution of duplicated genes [36]. These facts, taken together, suggest that divergence in gene structures such as exon configuration and indels may also drive the functional divergence between duplicated genes.
In this paper, we study structural divergence between duplicated genes in Arabidopsis thaliana. We describe levels of structural divergence between duplicated genes using four different measures. Structural divergence is compared among different modes of gene duplication including WGD, and tandem, proximal and transposed duplications, and then related to duplication epochs, nucleotide substitutions and expression divergence. Evolutionary mechanisms for gene-structure divergence are also investigated.

Comparison of structural divergence among different modes of gene duplication
Modes of gene duplication in Arabidopsis were classified into WGD (α, β and γ events) and tandem, proximal and transposed (<16 Mya, i.e. after Arabidopsis-Brassica divergence, and 16-107 Mya, i.e. between Arabidopsis-Brassica and Arabidopsis-Populus divergence) duplications, as described in Methods. Divergence between duplicated genes often increases with duplication age [24,26,27]. To compare the evolutionary effects of different modes of gene duplication, it may be helpful to take duplication age into account. Here, synonymous (Ks) substitution rates are used as a rough proxy of duplication age. The Ks distributions of different modes of gene duplication are shown in Figure 1. The duplicated genes belonging to α WGD, tandem duplication, proximal duplication and transposed duplication after Arabidopsis-Brassica divergence (<16 Mya) are younger than those belonging to β and γ WGDs and transposed duplication between Arabidopsis-Brassica and Arabidopsis-Populus divergence (16-107 Mya). Thus, to compare structural divergence among different modes of gene duplication, we restricted WGD duplicates to those retained from the α event, and transposed duplications to those that occurred after Arabidopsis-Brassica divergence (<16 Mya).
Structural divergence between duplicated genes was measured by differences in coding-region lengths and average exon lengths, and the number of indels and maximum indel length in their protein sequence alignment. Comparison of structural divergence among different modes of gene duplication is shown in Figure 2. When measured by differences in coding-region lengths and average exon lengths and the maximum indel length, structural divergence between duplicated genes shows the following trend: WGD < tandem < proximal < transposed (comparisons between consecutive gene duplication modes are significant at α = 0.05, Wilcoxon test). When measured by the number of indels, structural divergence between duplicated genes follows a slightly different trend: tandem < proximal < WGD < transposed (comparisons between consecutive gene duplication modes are significant at α = 0.05, Wilcoxon test). These comparisons, taken together, suggest that transposed duplications diverge more dramatically in gene structure than any other mode of gene duplication.
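The two length-based measures described above can be illustrated with a minimal Python sketch. It assumes that each gene is represented by a list of coding-exon intervals (1-based, inclusive) and that "difference" means the absolute difference between the two copies; the function names and the interval representation are illustrative assumptions, not the procedure used in this study.

```python
def exon_stats(exons):
    """Return (coding-region length, average exon length) for one gene.

    `exons` is a list of (start, end) coding-exon intervals,
    1-based and inclusive (an assumed representation).
    """
    lengths = [end - start + 1 for start, end in exons]
    total = sum(lengths)
    return total, total / len(lengths)


def length_divergence(exons_a, exons_b):
    """Absolute differences in coding-region length and average exon length."""
    cds_a, avg_a = exon_stats(exons_a)
    cds_b, avg_b = exon_stats(exons_b)
    return abs(cds_a - cds_b), abs(avg_a - avg_b)


# Hypothetical pair: copy A has two 300-bp exons, copy B one 450-bp exon.
copy_a = [(1, 300), (401, 700)]
copy_b = [(1, 450)]
print(length_divergence(copy_a, copy_b))  # (150, 150.0)
```

Taking the absolute difference makes the measure symmetric between the two copies; a signed difference would be needed for directional comparisons such as parental versus transposed loci.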

Transposed duplications are often associated with biased changes in gene structure
In transposed duplications, duplicated genes are transposed from ancestral (parental) loci to novel (transposed) loci [7]. Transposed duplications may occur via DNA or RNA-based mechanisms, and the latter mechanism, often referred to as retrotransposition, creates intronless retrocopies [9]. Comparison of gene structure between parental and transposed loci may help to better understand the genetic mechanisms and evolutionary effects of transposed duplications. We note that in this analysis we computed numbers of indels and maximum indel lengths for parental and transposed duplicates separately. We found that parental loci generally have longer coding-regions and exons, and fewer indels with smaller maximum indel lengths than transposed loci (Figure 3), suggesting that transposed duplications tend to be associated with biased changes in gene structure. In other words, transposed duplication is a singular mode of gene duplication in which gene structure not only undergoes intensive changes but also is biased toward smaller gene size and complexity. A trend toward shorter exons, more indels and bigger maximum indel lengths suggests that transposed duplications are not perfectly copied and losses of DNA segments frequently happen. This trend is contrary to the classical theory that duplicated genes are fully redundant immediately following gene duplication [1] but consistent with the observation that various types of transposable elements frequently only duplicate gene fragments [37,38].

Structural divergence and duplication epochs
To understand how structural divergence between duplicated genes changes over evolutionary time, we compared structural divergence among different epochs of gene duplications for WGDs (i.e. among α, β and γ events) and transposed duplications (i.e. between those occurring <16 Mya and 16-107 Mya). Figure 4 shows that the structural divergence between WGD duplicates, based on all measures, consistently increases across α, β and γ events; however, for transposed duplications, only the number of indels increases from <16 Mya to 16-107 Mya. Moreover, transposed duplications show a decrease of maximum indel lengths from <16 Mya to 16-107 Mya. Compared with WGDs, transposed duplications have a higher rate of gene losses, evidenced by an "L" shaped distribution of duplication age [11]. It is possible that the different changing patterns of structural divergence over evolutionary time between WGDs and transposed duplications are determined by the biased, high rate of gene losses associated with transposed duplications, e.g. those duplicates that experienced extreme structural changes are less likely to survive over long periods of evolutionary time than those that experienced more moderate structural changes. It is also worth mentioning that transposed duplicates that have been preserved for long times (16-107 Mya) still show higher structural divergence than WGD duplicates retained from the ancient γ event that occurred ~117 Mya.

Structural divergence and nucleotide substitutions
For duplicated genes, structural divergence and nucleotide substitution are two major types of sequence divergence [36]. We compared non-synonymous substitution rates (Ka) among different epochs of gene duplication within WGDs and transposed duplications, and found the following trend: α WGD < β WGD < transposed (<16 Mya) < γ WGD < transposed (16-107 Mya) (comparisons between consecutive gene groups are significant at α = 0.05, Wilcoxon test). However, structural divergence of recent transposed duplications (<16 Mya) tends to be higher (except when measured by numbers of indels) than that of γ WGD (Figure 4), suggesting that gene structure can evolve much faster than nucleotide substitutions.
To further understand the relationships between structural divergence and nucleotide substitutions, we computed the Pearson's correlations between the four measures for structural divergence and nucleotide substitution rates including Ka and Ks, based on all duplicated genes disregarding their modes (Table 1). Differences in coding-region lengths are significantly, positively correlated with Ka and Ka/Ks, indicating that the evolution of gene lengths is related to selection. Differences in average exon lengths are also positively, but more moderately, correlated with Ka and Ka/Ks, indicating that the evolution of exon lengths is also related to selection. However, the number of indels is more likely to be related to Ks than Ka or Ka/Ks, indicating that indels occur more or less randomly between duplicated genes. The correlations between maximum indel lengths and nucleotide substitution rates are generally trivial, perhaps because duplicated genes losing long coding segments are preferentially lost following duplication. Structural divergence between duplicated genes was previously suggested to occur more or less randomly, i.e. correlated with evolutionary time [36]. However, we show that structural divergence between duplicated genes is related to both neutral evolution and selection, indicating that structural divergence between duplicated genes is a complicated process subject to both intrinsic and extrinsic factors.

Structural divergence and gene expression divergence
Expression divergence between duplicated genes is presumed to be determined by their genetic divergence such as regulatory sequence and coding sequence divergence. Indeed, expression divergence between duplicated genes was previously shown to be slightly correlated with Ka and/or Ks [24][25][26]. To date, it is unclear whether structural divergence between duplicated genes also affects their expression divergence. We computed the Pearson's correlations between the four measures for structural divergence and expression divergence based on the pooled modes of gene duplication (Table 2). All four measures of structural divergence are positively correlated with expression divergence, indicating that structural divergence between duplicated genes is related to expression divergence. This analysis suggests that to study the genetic mechanisms for expression evolution between homologs, it is useful to look into changes in their gene structures.

The NBS-LRR gene family shows higher-than-average structural divergence
The NBS-LRR genes have experienced frequent gene transposition in Arabidopsis [8]. As we have shown that transposed duplications tend to result in dramatic and biased changes in gene structure, we propose the hypothesis that the structural divergence between duplicated genes belonging to the NBS-LRR family is higher than the genome average. We computed the average structural divergence between duplicated genes belonging to the NBS-LRR family and compared it to that of the whole set of gene duplications using a t-test (Table 3). The NBS-LRR gene family indeed shows higher-than-average structural divergence based on all four measures, suggesting that certain gene families may be under selection for rapid evolution of gene structure.

Discussion
Ks increases approximately linearly with time only for relatively low levels of sequence divergence [39], meaning that there is great uncertainty in using Ks to represent evolutionary time. Thus, to ensure more accurate analyses, we did not use the correlation between structural divergence and Ks to investigate how structural divergence changes over time. Patterns of gene colinearity conservation within and between genomes can be used to estimate the epochs for WGDs and gene transpositions as previously described [6,40,41]. After assigning different epochs to gene duplication modes, we used their Ks distributions only for confirming the order of their relative ages. Classical population genetic theories suggest that duplicated genes have identical sequences immediately following duplication, and then gradually diverge over evolutionary time [1]. The observation that structural divergence between WGD duplicates increases with time is consistent with this classical theory. Because most tandem/proximal duplicates are younger than the most recent, Arabidopsis-specific α WGD (Figure 1), comparison between different epochs of tandem/proximal duplications is not feasible in this work. However, the observation that transposed duplications show dramatic and biased structural changes is inconsistent with the classical theory but consistent with the observation that various types of transposable elements frequently only duplicate gene fragments [37,38].
The observation that there is a decrease of maximum indel lengths between the transposed duplications that occurred <16 Mya and 16-107 Mya suggests that structural divergence between duplicated genes may not be proportional to evolutionary time. More variation in maximum indel lengths in recently transposed genes could indicate that many transposed duplicates are essentially pseudogenes and not performing important functions [37], mixed in with the few that confer a striking, adaptive change that may render them finally preserved. However, it should be noted that the striking structural changes that are beneficial still require the intactness of key biological functions, and the transposed genes with extreme structural changes seldom survive over long evolutionary time. This study reveals that structural divergence between duplicated genes, measured in different ways, shows different patterns depending on modes of gene duplication, and can be affected by both neutral evolution and selection. Changes in gene structure between duplicated genes involve not only alteration of exon-intron structure [36,42] and gain/loss of introns [43], but also gain/loss of DNA segments within coding-regions [37,38], which occurs more extensively in transposed duplications. Certainly there can be more measures to describe structural divergence between duplicated genes, and new biological insights can be generated based on novel measures for structural divergence. For duplicated genes, structural divergence seems more complicated than nucleotide substitutions. Future studies toward better understanding of the evolutionary mechanisms for gene structure changes are necessary.

Conclusions
In this work, we investigated structural divergence between Arabidopsis duplicated genes. We found that transposed duplicates diverge more dramatically in gene structure than genes duplicated by other modes, and that the structural changes in transposed duplications are biased toward shorter length and lower complexity. Structural divergence increases with evolutionary time for WGDs, but not transposed duplications, possibly because genes experiencing severe changes are preferentially lost. Structural divergence between duplicated genes is related to nucleotide substitution rates in different manners, but consistently positively correlated with expression divergence. The NBS-LRR gene family shows higher-than-average levels of structural divergence. This study suggests that structural divergence between duplicated genes, greatly affected by the mechanisms of gene duplication, may not be proportional to evolutionary time, and that certain gene families are under selection for rapid evolution of gene structure.

Genome annotations
Genome annotations for Arabidopsis thaliana, Brassica rapa, Populus trichocarpa and Vitis vinifera were obtained from Phytozome v8.0 (http://www.phytozome.net). For genes with multiple transcripts, only the longest transcript was used in related analyses.
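The longest-transcript rule can be sketched as follows. This is a minimal illustration that assumes the annotation has already been parsed into (gene ID, transcript ID, length) records; the record layout, the identifiers and the tie-breaking choice (keep the first transcript seen) are assumptions for the example only.

```python
def longest_transcripts(records):
    """Map each gene ID to its longest transcript.

    `records` is an iterable of (gene_id, transcript_id, length) tuples.
    On ties, the transcript encountered first is kept (an assumption).
    """
    best = {}
    for gene_id, transcript_id, length in records:
        if gene_id not in best or length > best[gene_id][1]:
            best[gene_id] = (transcript_id, length)
    return best


# Hypothetical records; the identifiers are placeholders, not real annotations.
records = [
    ("geneX", "geneX.1", 900),
    ("geneX", "geneX.2", 1200),
    ("geneY", "geneY.1", 450),
]
print(longest_transcripts(records))
# {'geneX': ('geneX.2', 1200), 'geneY': ('geneY.1', 450)}
```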

Identification of gene duplication modes in Arabidopsis
Transposable element-related genes in Arabidopsis were excluded from analysis. Arabidopsis WGD duplicates were initially obtained from a previous study [6]. Then, α WGD duplicates were updated according to another study [44], to exclude tandemly-duplicated WGD duplicates which were shown to have very similar evolutionary patterns with tandem duplicates [45]. The WGD duplicate pairs included 3181 α, 1451 β and 521 γ pairs. Other modes of gene duplication were identified from the BLASTP result [46] of the Arabidopsis thaliana genome (E-value < 10^-10 and top five non-self hits for each gene). A total of 2130 tandem and 784 proximal duplications were obtained based on the following criteria: tandem duplications were BLASTP hits to consecutive genes in the genome; proximal duplications were BLASTP hits to nearby genes in the genome interrupted by fewer than ten non-paralogous genes.
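The tandem and proximal criteria can be illustrated with a simplified Python sketch. It assumes that every gene has been assigned a chromosome and a rank (its index in gene order along that chromosome), and it counts all intervening genes rather than checking whether each one is non-paralogous; both simplifications, as well as the function name, are assumptions of this example rather than the exact procedure used here.

```python
def classify_local_pair(pos_a, pos_b, max_intervening=9):
    """Classify a BLASTP hit pair as 'tandem', 'proximal' or None.

    Each position is (chromosome, rank), where rank is the gene's index
    in gene order along the chromosome (assumed input format).
    Tandem: consecutive genes. Proximal: 1 to `max_intervening` genes
    in between, i.e. fewer than ten with the default setting.
    """
    chrom_a, rank_a = pos_a
    chrom_b, rank_b = pos_b
    if chrom_a != chrom_b:
        return None
    intervening = abs(rank_a - rank_b) - 1
    if intervening == 0:
        return "tandem"
    if 1 <= intervening <= max_intervening:
        return "proximal"
    return None


print(classify_local_pair(("Chr1", 10), ("Chr1", 11)))  # tandem
print(classify_local_pair(("Chr1", 10), ("Chr1", 15)))  # proximal
print(classify_local_pair(("Chr1", 10), ("Chr2", 11)))  # None
```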
To identify Arabidopsis transposed duplications, WGD duplicate pairs and tandem and proximal duplications were removed from the BLASTP result. In Arabidopsis, ancestral loci were the colinear genes between Arabidopsis and its outgroups (related genomes showing colinearity with Arabidopsis), and the noncolinear genes were deemed to be novel loci. Arabidopsis transposed duplications were the BLASTP hits consisting of an ancestral chromosomal locus and a novel locus. Note that based on different sets of outgroups, transposed duplications that occurred within different epochs can be inferred [40,41].
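The pairing rule for transposed duplications (one ancestral locus plus one novel locus) can be sketched as below. The sketch assumes that colinear genes have already been identified against a chosen set of outgroups and that WGD, tandem and proximal pairs are supplied as a set of unordered pairs to exclude; the function and argument names are hypothetical.

```python
def find_transposed_pairs(blast_pairs, colinear_genes, excluded_pairs=frozenset()):
    """Return (parental, transposed) gene pairs.

    blast_pairs    : iterable of (gene_a, gene_b) BLASTP hit pairs
    colinear_genes : set of genes colinear with the outgroups (ancestral loci)
    excluded_pairs : set of frozenset({a, b}) pairs already assigned to
                     WGD, tandem or proximal duplication
    A pair is kept only if exactly one member is an ancestral locus.
    """
    result = []
    for gene_a, gene_b in blast_pairs:
        if frozenset((gene_a, gene_b)) in excluded_pairs:
            continue
        a_ancestral = gene_a in colinear_genes
        b_ancestral = gene_b in colinear_genes
        if a_ancestral and not b_ancestral:
            result.append((gene_a, gene_b))
        elif b_ancestral and not a_ancestral:
            result.append((gene_b, gene_a))
    return result


# Hypothetical gene names, for illustration only.
pairs = [("g1", "g2"), ("g3", "g4"), ("g5", "g6"), ("g7", "g1")]
colinear = {"g1", "g3", "g4"}
excluded = {frozenset(("g5", "g6"))}
print(find_transposed_pairs(pairs, colinear, excluded))
# [('g1', 'g2'), ('g1', 'g7')]
```

Passing a different `colinear_genes` set, derived from a different set of outgroups, would correspond to inferring transposed duplications for a different epoch.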

Indels between duplicated genes
The protein sequences of two duplicated genes were aligned using ClustalW [47] with default parameters. The ClustalW alignment was then transformed to a "fasta" format alignment, in which gaps, i.e. runs of consecutive "-", were deemed to be indels.
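Counting indels from a gapped pairwise alignment can be sketched in a few lines of Python. The example assumes that the two aligned protein sequences are available as equal-length strings and that each maximal run of "-" in either sequence counts as one indel, with its length measured in amino acids; the function names are illustrative.

```python
import re


def gap_runs(aligned_seq):
    """Lengths of all maximal runs of '-' in one aligned sequence."""
    return [len(match.group()) for match in re.finditer(r"-+", aligned_seq)]


def indel_stats(aligned_a, aligned_b):
    """Return (number of indels, maximum indel length) for a pairwise alignment."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    runs = gap_runs(aligned_a) + gap_runs(aligned_b)
    return len(runs), max(runs, default=0)


# Toy alignment: the first sequence has gap runs of length 2 and 1,
# the second has one gap run of length 1.
print(indel_stats("MKT--LLA-V", "MKTAALL-GV"))  # (3, 2)
```

Calling `gap_runs` on one aligned sequence at a time gives per-copy counts, which is the form needed when parental and transposed copies are compared separately.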

Coding sequence divergence
Coding sequence divergence was measured by nonsynonymous (Ka) and synonymous (Ks) substitution rates. The protein sequences of duplicate genes were aligned using ClustalW [47] with default parameters. Then, the protein sequence alignment was converted to a coding sequence alignment using the "Bio::Align::Utilities" module in the BioPerl package (http://www.bioperl.org/). Finally, Ka and Ks were calculated using the Yang & Nielsen method [48] via the "Bio::Tools::Run::Phylo::PAML::Yn00" module in the BioPerl package.
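The conversion from a protein alignment to a coding sequence alignment amounts to threading each gene's codons onto its gapped protein sequence. The Python sketch below illustrates that general idea only; it is not the BioPerl implementation, and it assumes that the coding sequence starts in frame and that any trailing stop codon can be ignored.

```python
def thread_codons(aligned_protein, cds):
    """Project a gapped protein sequence onto its coding sequence.

    Each residue is replaced by the next codon of `cds`; each gap
    character '-' becomes '---'. Codons beyond the last residue
    (e.g. a stop codon) are ignored.
    """
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    n_residues = sum(1 for char in aligned_protein if char != "-")
    if len(codons) < n_residues:
        raise ValueError("coding sequence is too short for the protein")
    codon_iter = iter(codons)
    return "".join("---" if char == "-" else next(codon_iter)
                   for char in aligned_protein)


# Toy example: three residues with one gap; the stop codon TAA is dropped.
print(thread_codons("MK-L", "ATGAAACTGTAA"))  # ATGAAA---CTG
```

Applying the function to both aligned protein sequences yields two codon-aligned strings of equal length, the kind of input that a Ka/Ks estimator expects.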

Gene expression data
Gene expression data generated from the Affymetrix Arabidopsis ATH1 Genome Array (GPL198) were obtained from previous studies [26,49]. The expression divergence between duplicated genes was measured by 1-r, where r is the Pearson's correlation coefficient between their expression profiles [26].
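The expression divergence measure 1-r can be written out as a short, dependency-free Python sketch. It assumes that the two expression profiles are equal-length lists of numeric values measured across the same samples; the function names are illustrative.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson's correlation coefficient between two equal-length profiles."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("profiles must have equal length of at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return sxy / sqrt(sxx * syy)


def expression_divergence(profile_a, profile_b):
    """Expression divergence between two duplicated genes, defined as 1 - r."""
    return 1.0 - pearson_r(profile_a, profile_b)


# Toy profiles: perfectly correlated copies give a divergence close to 0,
# perfectly anti-correlated copies give a divergence close to 2.
print(expression_divergence([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
print(expression_divergence([1.0, 2.0, 3.0], [3.0, 2.0, 1.0]))
```

Because r ranges from -1 to 1, this measure ranges from 0 (identical expression patterns) to 2 (opposite patterns).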

Additional file
Additional file 1: Arabidopsis duplicated genes of different modes.