Comprehensive analysis of tandem amino acid repeats from ten angiosperm genomes
© Zhou et al; licensee BioMed Central Ltd. 2011
Received: 2 August 2011
Accepted: 23 December 2011
Published: 23 December 2011
The presence of tandem amino acid repeats (AARs) is one of the signatures of eukaryotic proteins. AARs are thought to be frequently involved in bio-molecular interactions. Comprehensive studies that primarily focused on metazoan AARs have suggested that AARs evolve rapidly and are highly variable among species. However, the causal factors of this inter-species variation remain controversial. In this work, we investigate this topic mainly by comparing AARs in orthologous proteins from ten angiosperm genomes.
Angiosperm AAR content is positively correlated with the GC content of the protein coding sequence. However, based on observations from fungal AARs and insect AARs, we argue that the applicability of this kind of correlation is limited by AAR residue composition and species' life-history traits. Angiosperm AARs also tend to be fast evolving and structurally disordered, supporting the results of comprehensive analyses of metazoans. The functions of conserved long AARs are summarized. Finally, we propose that rapid mRNA decay, alternative splicing and tissue specificity are regulatory processes associated with angiosperm proteins harboring AARs.
Our investigation suggests that GC content is a predictor of AAR content in the protein coding sequence under certain conditions. Although angiosperm AARs lack conservation and 3D structure, a fraction of the proteins that contain AARs may be functionally important and are under extensive regulation in plant cells.
Tandem amino acid repeats (AARs), or homopeptides, are protein segments that comprise a continuous array of identical residues. As repetitive DNA is very abundant in eukaryotic genomes, AARs are frequently found in the proteomes of eukaryotes [2–4]. These simple peptides can be encoded by tandem repeats of the same codon, which are vulnerable to point mutations, or by a mixture of synonymous codons. These repetitive codon tracts are primarily introduced by either replication slippage or recombination.
AARs are often situated in disordered regions of proteins that lack regular 3D structures. Nevertheless, over the past two decades, increasing attention has been paid to the biological importance of AARs (see  for a recent review), which had long been regarded as junk sequences. AARs have been shown to be associated with several diseases. For example, the expansion of a glutamine repeat may induce Huntington's disease and other neurodegenerative diseases. Beneficial effects of AARs have also been uncovered. An example is the glutamine repeat in White Collar-1, a key component of the biological clock in the fungus Neurospora crassa. This AAR was suggested to control circadian period length. Large-scale analyses indicate that AARs tend to participate in the regulation of transcription [8, 13, 14] and are frequently involved in protein-protein interactions.
AARs are highly polymorphic and fast-evolving sequences [9, 16]. In line with the accelerated rate of evolution for protein segments that are in or near AARs, selective constraints are thought to be relaxed around AARs [8, 17]. There are competing interpretations of the rapid evolution of AARs. Some authors have argued that AARs evolve in a largely neutral fashion, partly as a consequence of the balance between replication slippage and point mutations. Based on shifts in the frequency distribution of coding tri-nucleotide repeats compared to that of non-coding tri-nucleotide repeats, Mularoni et al. proposed that selection plays an important role in AAR evolution. There is also evidence for positive selection on AARs from case studies of a few mammalian genes [20, 21].
The frequency and size of AARs show inter- and intra-species variation both in large-scale comparisons and in studies focused on vertebrates [8, 13, 19] or fruit flies. The causal factors underlying this variation are still a matter of dispute [13, 17], and some have attributed them to GC content bias [16, 18, 23]. In plants, repetitive DNA is widely used as a genetic marker, and its variation among transcripts has been observed. Nevertheless, in contrast to animal AARs, plant AARs have not been intensively investigated, except in a recent report based on two model plants, Arabidopsis (Arabidopsis thaliana) and rice (Oryza sativa).
In this report, we revisit the questions surrounding AARs in plants in light of the growing number of plant whole-genome sequences. A comparison of 1-to-1 orthologous proteins between ten sequenced angiosperm species revealed a positive correlation between AAR content and GC content, a finding that may be applicable to some other non-metazoan taxa. Other factors related to AAR content variation are also discussed. We attempt to summarize the functions of conserved long angiosperm AARs and their host genes in the context of the rapid evolution of AARs and their flanking regions. Our analysis also supports the idea that AARs are widely associated with protein structural disorder. Finally, we suggest that transcripts of repeat-containing proteins (RCPs) are under various levels of regulation in plant cells.
Summary of the ten angiosperm genomes included in this study (table columns include genome size in Mbp and number of proteins; of the table rows, only the species name Malus × domestica survives here)
Similar to other eukaryotes, angiosperm proteins are enriched in AARs (0.84 AARs per protein on average). Because short AARs may be derived from the interruption of a long AAR, we used repeated residues per 1000 amino acids (RRPK, Repeated Residues per Kilo Amino Acids, defined as the ratio of the total AAR length to the protein length, multiplied by 1000) to represent the AAR content of a protein or protein segment. For example, the RRPK of the peptide "QQQQQSTWQQQQAAE" is 9/15 × 1000 = 600. There is a nearly 3-fold variation in RRPK between these ten species, with values that range from 5.06 (grape) to 15.25 (rice). It is somewhat striking that large genomes or proteomes do not necessarily have higher RRPK.
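The RRPK definition can be illustrated with a minimal Python sketch. The function names and the `min_len` parameter are assumptions made for illustration: the minimum run length that qualifies as an AAR is not stated in this passage, so a value of 4 is assumed here, which is consistent with the worked example (the two glutamine runs count, the "AA" pair does not).

```python
import re

def find_aars(protein, min_len=4):
    """Return (start, end, residue) for every run of identical residues
    that is at least min_len long (min_len is an assumed threshold)."""
    pattern = re.compile(r"(.)\1{%d,}" % (min_len - 1))
    return [(m.start(), m.end(), m.group(1)) for m in pattern.finditer(protein)]

def rrpk(protein, min_len=4):
    """Repeated residues per 1000 amino acids:
    total AAR length / protein length * 1000."""
    if not protein:
        return 0.0
    total = sum(end - start for start, end, _ in find_aars(protein, min_len))
    return 1000.0 * total / len(protein)

# The peptide from the text: runs of 5 and 4 glutamines in 15 residues.
print(rrpk("QQQQQSTWQQQQAAE"))
```

Coordinates are 0-based and half-open, so a run's length is simply `end - start`.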
High GC content favors replication slippage and, thus, the generation of AARs, as has been proposed in a number of reports [13, 16, 38]. On the other hand, GC content has also been treated as an indicator of the local recombination rate. Angiosperm species with high recombination rates (more precisely, higher average centiMorgans per megabase), such as Arabidopsis and rice, are relatively enriched for AARs (Figure 1A). We attempted to test the association of RCPs with recombination hotspots in Arabidopsis by exploiting publicly available extensive SNP data. A total of 293 putative hotspot neighboring genes (see Materials and Methods) were identified. At the whole proteome level, the fraction of hotspot neighboring genes among genes encoding RCPs is higher than that among genes not encoding RCPs in our dataset (1.3% vs. 0.93%, Fisher's exact test, p = 0.01), indicating that the influence of recombination on AAR frequency, although limited, cannot be excluded.
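Comparing the fraction of hotspot neighboring genes between RCP and non-RCP genes amounts to a test on a 2×2 contingency table. The following is a minimal pure-Python sketch of a two-sided Fisher's exact test, offered only to show the shape of the calculation; the study ran its tests in R, and the function name and the toy counts below are illustrative assumptions, not data from this work.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the table [[a, b], [c, d]].

    Rows could be, e.g., RCP vs. non-RCP genes; columns hotspot
    neighboring vs. other genes. Sums the hypergeometric probabilities
    of all tables that are no more likely than the observed one.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lo, hi = max(0, col1 - row2), min(row1, col1)
    p_obs = prob(a)
    # Small relative tolerance guards against floating-point ties.
    total = sum(p for p in (prob(x) for x in range(lo, hi + 1))
                if p <= p_obs * (1 + 1e-9))
    return min(1.0, total)

# Toy counts only (NOT from the study).
print(fisher_exact_two_sided(3, 1, 1, 3))
```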
There is controversy over the relationship between GC content and AAR content. A positive correlation between GC content and AAR content has been observed in some mammalian species, while in a wider spectrum of taxa, a negative correlation was proposed. Thus, we examined the relationship between GC content and AAR content within two additional taxonomic groups from distinct eukaryotic clades, Sordariomycetes fungi and Diptera insects (summarized in Table S1 and Table S2 in Additional File 1, respectively). These 1-to-1 ortholog groups contain 4,047 and 3,680 proteins from fungi and insects, respectively.
However, GC content and residue composition are not the only factors that influence AAR content. For example, both maize and grape have relatively high GC content (Figure 1A) and few glutamine repeats (< 4.5%), but their RRPK values are the lowest among monocots and eudicots, respectively (Figure 1A). These two species share at least two life-history traits: (1) relatively "large body size" and (2) cross-pollination. We intentionally use quotes in this paragraph to emphasize that, owing to the high plasticity of plant development, caution should be used when linking body size to genomic signatures. Conversely, self-pollinating "small grasses", such as Arabidopsis and rice, have abundant AARs. Rice orthologs are so abundant in AARs that rice appears as an outlier in the linear regression (Grubbs' test, p = 0.015; Figure 1A). A recent survey showed that barley (Hordeum vulgare) has a higher fraction of RCPs than sugarcane (Saccharum officinarum). Interestingly, the former is a self-pollinating "small grass", whereas the latter is a cross-pollinating "large grass". In all, life-history traits appear to be cofactors of AAR content and deserve re-examination when more angiosperm genome sequences become available.
Taken together, the driving forces shaping the presence and content of different types of AARs or low-complexity sequences appear to be complex, as was recently suggested for Plasmodium falciparum. The interplay between GC content, AAR residue composition and life history remains complicated and needs further investigation.
Like their animal counterparts, many angiosperm AARs have not been conserved over a long period of evolution. This trend is indicated by the observation that approximately 75% of AARs fail to align to the corresponding region in any of the other orthologs (i.e., the corresponding regions in the multiple alignment of other orthologs are filled with gaps). A faster rate of evolution, as estimated by the average dN/dS ratio of the AAR flanking regions in comparison to RCPs as a whole, was also observed (Mann-Whitney U test, p < 1 × 10^-9 for all species; Table S3 in Additional File 2), supporting previous work that was conducted in other species [17, 43]. Although it has been suggested that purifying selection is relaxed in flanking regions of AARs, only about 3% of these flanking regions show signs of positive selection, i.e., a dN/dS greater than 1 (Table S4 in Additional File 2). Assuming that the fraction of regions under positive selection would be underestimated by the average dN/dS, we also calculated pairwise dN/dS for three pairs of species: (1) Arabidopsis and papaya, (2) rice and false brome and (3) maize and sorghum. The fractions were still limited (Welch's t-test, p > 0.05; Table S4), indicating that positive selection is not a ubiquitous evolutionary process in AAR flanking regions.
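The criterion for an AAR that "fails to align" (every other ortholog shows only gaps across the AAR's alignment columns) can be sketched as follows. This is a hedged illustration: the function name, the dictionary layout of the alignment and the toy sequences are assumptions, and the handling of partially gapped columns in the actual analysis is not described here.

```python
def aar_fails_to_align(alignment, species, start, end, gap="-"):
    """Return True if, in every ortholog other than `species`, the
    alignment columns [start, end) contain only gap characters.

    alignment: {species_name: aligned_sequence}, all of equal length.
    start, end: 0-based, half-open alignment columns of the AAR.
    """
    return all(
        set(seq[start:end]) <= {gap}
        for name, seq in alignment.items()
        if name != species
    )

# Toy alignment (hypothetical sequences): a Q5 repeat present in sp1 only.
toy = {
    "sp1": "MKQQQQQST",
    "sp2": "MK-----ST",
    "sp3": "MK-----SA",
}
print(aar_fails_to_align(toy, "sp1", 2, 7))
```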
Putative functions for conserved long AARs in Arabidopsis (each entry: function of the protein / function of the AAR; evidence in parentheses where given)
ABI3 is a key component of the ABA signal transduction pathway. / See text for details of AAR function.
SEU, together with LUG (LEUNIG), controls the development of several organs. / See text for details of AAR function.
LUG, see above. / The AAR is thought to be involved in the assembly of transcriptional co-repressors. (Speculation only)
FH6 (FORMIN HOMOLOG 6) binds profilin and is involved in actin-nucleating activity. / The AAR may directly contribute to its binding activity. (Speculation only)
PFT1 (PHYTOCHROME AND FLOWERING TIME 1) is a transcription factor that controls the flowering time. / The AAR may be involved in transcriptional activation. (Speculation only)
Fraction of fully disordered AARs
Angiosperm proteins are enriched in AARs whose content is positively correlated with the GC content of the coding sequences. We also suggest that the correlation between AAR content and GC content is influenced by the residue composition of AARs as well as by life-history traits. Like AARs in many sequenced eukaryotic species, angiosperm AARs evolve rapidly and tend to be disordered. Although AARs are usually not well conserved, we identified 18 conserved long AARs for further detailed analysis. As potentially promiscuous molecules, RCPs are under at least three putative transcript-level regulatory controls in plant cells, including faster transcript decay, alternative splicing and tissue specificity of gene expression.
Sequences of A. thaliana (Version 9) and O. sativa (Version 6.1) were downloaded from TAIR and RGAP, respectively. All of the other plant sequences were downloaded from the Phytozome 6.0 database. The sources of the non-plant genome sequences are summarized in Tables S1 and S2 in Additional File 1. For genes with multiple protein products (gene models), only the representative one (if available) or the longest one was retained.
To search for 1-to-1 orthologs between species within certain taxonomic groups, InParanoid 4.1, one of the algorithms with the lowest false-positive rates, was initially employed to identify pairwise orthologs between the reference proteomes (A. thaliana for plant species, N. crassa for fungal species and Drosophila melanogaster for insects, excluding proteins encoded by the mitochondrial/chloroplast genomes) and the proteomes of the other species, with a score cutoff of 40. Sets of 1-to-1 orthologs found in all of the species within each group were obtained by collecting the intersection of the ortholog pairs.
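The final step, collecting the intersection of the pairwise ortholog sets, can be sketched as below. This is only an illustration under stated assumptions: the input is taken to be already filtered to 1-to-1 pairs that pass the score cutoff, and the function name, data layout and protein identifiers are hypothetical rather than InParanoid's own output format.

```python
def one_to_one_groups(pairwise):
    """Build ortholog groups present in every species.

    pairwise: {species: {reference_protein: ortholog_protein}}, where each
    inner dict holds only 1-to-1 pairs against the reference proteome.
    Returns {reference_protein: {species: ortholog_protein}} for reference
    proteins that have an ortholog in all species.
    """
    if not pairwise:
        return {}
    shared = set.intersection(*(set(pairs) for pairs in pairwise.values()))
    return {
        ref: {sp: pairwise[sp][ref] for sp in pairwise}
        for ref in sorted(shared)
    }

# Hypothetical identifiers for illustration.
pairs = {
    "rice":  {"AT1": "OS1", "AT2": "OS2", "AT3": "OS3"},
    "grape": {"AT1": "VV1", "AT3": "VV3"},
}
print(one_to_one_groups(pairs))
```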
We used in-house PERL scripts to collect data on the length, composition and position of AARs in protein sequences and to calculate the GC content. All statistical tests were implemented in R 2.12.1.
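The in-house scripts are not reproduced here. As a rough stand-in, the GC content of a coding sequence can be computed as in the following Python sketch; the function name and the choice to ignore ambiguous bases such as "N" are assumptions, not details taken from the original scripts.

```python
def gc_content(cds):
    """Fraction of G and C among the unambiguous bases (A, C, G, T)
    of a coding sequence; ambiguous symbols are ignored (an assumption)."""
    seq = cds.upper()
    counted = sum(seq.count(base) for base in "ACGT")
    if counted == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / counted

print(gc_content("ATGGCC"))
```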
We deduced recombination hotspots in the Arabidopsis genome from SNP data described in . Informative SNP markers on each chromosome were selected by the TAGGER application in HaploView 4.1 with the "-maxDistance 20 -aggressiveTagging -tagrsqcutoff 0.8" options, excluding SNPs identified as "N" in more than three out of 20 Arabidopsis accessions. Mainly because of the greedy marker selection approach of TAGGER along whole chromosomes, the total number of informative SNP markers selected here is 60,904. Hotspots were searched in 40-marker-long sliding windows by PHASE 2.1.1, with the "-MR1 1 -X10" options. These windows moved 20 markers per step. Windows that were longer than 100 kb were discarded. A hotspot was defined as a two-marker interval with a Bayes factor higher than 10 in comparison with the background recombination rate. Positions of the hotspots were compared with the positions of genes in the Version 8 genome to collect genes that overlap with recombination hotspots. We called these genes putative hotspot neighboring genes (gene IDs were transferred to Version 9 for obsolete loci). We did not use the Version 9 genome here because approximately half of the markers fail to map to this version of the genome.
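The windowing scheme (40 markers per window, a step of 20 markers, windows spanning more than 100 kb discarded) can be sketched in Python as follows. The function name is hypothetical, and two details are assumptions: a trailing window with fewer than 40 markers is dropped, and window length is measured from the first to the last marker.

```python
def marker_windows(positions, size=40, step=20, max_span=100_000):
    """Sliding windows over sorted marker coordinates (bp) on one chromosome.

    Returns (first_marker_index, start_bp, end_bp) for each full window
    whose physical span does not exceed max_span.
    """
    windows = []
    for i in range(0, len(positions) - size + 1, step):
        chunk = positions[i:i + size]
        if chunk[-1] - chunk[0] <= max_span:
            windows.append((i, chunk[0], chunk[-1]))
    return windows

# Toy marker map: 100 markers spaced 1 kb apart.
toy_positions = list(range(0, 100_000, 1_000))
print(marker_windows(toy_positions))
```

Each retained window would then be passed to the hotspot search as a separate run.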
The alignment of orthologous coding sequences was guided by a multiple protein sequence alignment that was generated by MAFFT 6.849. A flanking region was defined as 33 amino acids on each side of an AAR; this region was truncated if the end of the protein was reached or if an adjacent AAR lay within 33 amino acids. The yn00 tool in PAML 4.3 was used to calculate the dN/dS ratio with default parameters.
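The flanking-region rule can be made concrete with a short Python sketch. The function name and coordinate convention (0-based, half-open) are assumptions, as is the exact truncation point: here a flank is cut at the boundary of the neighboring AAR or at the protein terminus, whichever comes first.

```python
def flanking_regions(protein_len, aars, flank=33):
    """Left and right flanks for each AAR in a protein.

    aars: list of (start, end) intervals, 0-based and half-open.
    Each flank is up to `flank` residues long and is truncated at the
    protein ends or at an adjacent AAR.
    Returns a list of ((left_start, left_end), (right_start, right_end)).
    """
    aars = sorted(aars)
    result = []
    for i, (start, end) in enumerate(aars):
        prev_end = aars[i - 1][1] if i > 0 else 0
        next_start = aars[i + 1][0] if i + 1 < len(aars) else protein_len
        left = (max(start - flank, prev_end), start)
        right = (end, min(end + flank, next_start))
        result.append((left, right))
    return result

# Two AARs in a 100-residue protein; the flanks between them are truncated.
print(flanking_regions(100, [(10, 15), (30, 36)]))
```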
PONDR® VSL2B and IUPred were used for predictions of disorder, with default parameters. We did not use PONDR® VSL2 because of limitations in computational capability. We used the default threshold (0.5) to predict disordered residues for both predictors.
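Given per-residue disorder scores from either predictor, the thresholding step can be sketched as below. This is an illustration under assumptions: the function name is hypothetical, a score equal to the threshold is counted as disordered, and an AAR is treated as "fully disordered" when every one of its residues is predicted to be disordered.

```python
def fraction_fully_disordered(scores, aars, threshold=0.5):
    """Fraction of AARs whose residues are all predicted to be disordered.

    scores: per-residue disorder scores for one protein.
    aars: list of (start, end) intervals, 0-based and half-open.
    """
    if not aars:
        return 0.0
    fully = sum(
        all(score >= threshold for score in scores[start:end])
        for start, end in aars
    )
    return fully / len(aars)

# Toy scores (hypothetical): the second AAR contains one ordered residue.
toy_scores = [0.2] * 5 + [0.9] * 5 + [0.6, 0.4, 0.7]
print(fraction_fully_disordered(toy_scores, [(5, 10), (10, 13)]))
```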
In the tissue specificity index, xi is the expression value in the ith tissue, xmax is the highest expression value among all of the tissues and N is the total number of tissues. For loci with multiple probes, the average tissue specificity index was used.
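The equation that these symbol definitions accompany is not preserved in this copy. The symbols match the widely used tissue specificity index τ; assuming that this is the index intended, it would read:

```latex
\tau = \frac{\sum_{i=1}^{N} \left( 1 - \dfrac{x_i}{x_{\max}} \right)}{N - 1}
```

Under this form, τ is 0 for a gene expressed equally in all tissues and 1 for a gene expressed in a single tissue.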
To calculate the RRPK of the protein segments that were encoded by different types of exons, the protein sequences of RCPs were mapped to their exon sequences using our in-house PERL scripts. Only exons that encode proteins were retained for calculation. The mRNA half-life for each probe was obtained from . For loci with multiple probes, the average mRNA half-life was used.
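Averaging probe-level values to one value per locus, as described for the mRNA half-life, can be sketched as follows. The function name, the probe and locus identifiers and the numeric values are placeholders for illustration only, and probes without a locus assignment are simply skipped (an assumption).

```python
from collections import defaultdict
from statistics import mean

def average_by_locus(probe_values, probe_to_locus):
    """Average a per-probe measurement (e.g. mRNA half-life) per locus.

    probe_values: {probe_id: value}
    probe_to_locus: {probe_id: locus_id}
    """
    grouped = defaultdict(list)
    for probe, value in probe_values.items():
        locus = probe_to_locus.get(probe)
        if locus is not None:
            grouped[locus].append(value)
    return {locus: mean(values) for locus, values in grouped.items()}

# Placeholder identifiers and values, not data from the study.
half_life = {"probe1": 2.0, "probe2": 4.0, "probe3": 5.0}
mapping = {"probe1": "locusA", "probe2": "locusA", "probe3": "locusB"}
print(average_by_locus(half_life, mapping))
```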
AAR: (Tandem) Amino Acid Repeat
RCP: Repeat Containing Protein
RRPK: Repeated Residues per One Kilo Amino Acids
SNP: Single Nucleotide Polymorphism.
The authors acknowledge the generosity of those who released genome data that were used in this study. We thank the anonymous referees whose constructive comments were helpful in improving the quality of this work. We are also grateful to Dr. Zhi-Ping Feng at Walter and Eliza Hall Institute of Medical Research (Australia), Dr. Deng-Ke Niu at Beijing Normal University, and Dr. Fei He and Xiao-Bao Dong at China Agricultural University for helpful discussions. InParanoid and IUPred were kindly provided by Dr. Gabriel Östlund at Stockholm University and Dr. Zsuzsanna Dosztányi at Institute of Enzymology, Hungarian Academy of Sciences, respectively. Jing Liu wishes to thank Dr. Stephen Ficklin at Washington State University for his help in downloading data. This research was supported by grants from the National Natural Science Foundation of China (31070259 and 30830058) and Innovation Fund for Graduate Student of China Agricultural University (KYCX2011026).
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.