Repeat-encoded poly-Q tracts show statistical commonalities across species
© Willadsen et al.; licensee BioMed Central Ltd. 2013
Received: 7 September 2012
Accepted: 18 January 2013
Published: 2 February 2013
Among repetitive genomic sequence, the class of tri-nucleotide repeats has received much attention due to their association with human diseases. Tri-nucleotide repeat diseases are caused by excessive sequence length variability; diseases such as Huntington’s disease and Fragile X syndrome are tied to an increase in the number of repeat units in a tract. Motivated by the recent discovery of a tri-nucleotide repeat associated genetic defect in Arabidopsis thaliana, this study takes a cross-species approach to investigating these repeat tracts, with the goal of using commonalities between species to identify potential disease-related properties.
We find that statistical enrichment in regulatory function associations for coding region repeats – previously observed in human – is consistent across multiple organisms. By distinguishing between homo-amino acid tracts that are encoded by tri-nucleotide repeats, and those encoded by varying codons, we show that amino acid repeats – not tri-nucleotide repeats – fully explain these regulatory associations. Using this same separation between repeat- and non-repeat-encoded homo-amino acid tracts, we show that poly-glutamine tracts are disproportionately encoded by tri-nucleotide repeats, and those tracts that are encoded by tri-nucleotide repeats are also significantly longer; these results are consistent across multiple species.
These findings establish similarities in tri-nucleotide repeats across species at the level of protein functionality and protein sequence. The tendency of tri-nucleotide repeats to encode longer poly-glutamine tracts indicates a link with the poly-glutamine repeat diseases. The cross-species nature of this tendency suggests that unknown repeat diseases are yet to be uncovered in other species. Future discoveries of new non-human repeat associated defects may provide the breadth of information needed to unravel the mechanisms that underpin this class of human disease.
Repetitive sequences are ubiquitous within eukaryotic genomes. While in some contexts these sequences are ignored, for example to avoid false positives when searching sequence databases , repetitive DNA tracts are not isolated to intergenic regions; repeat tracts also occur within genes and promoter regions, and length variability in some tracts has known phenotypic effects, including morphological variation in dogs  and strength of cell surface adhesion in yeast . Repeat tracts are unstable (i.e., have high mutation rates) in comparison with non-repetitive DNA , and the degree of instability varies widely between tracts.
Short tandem repeat tracts can be classified by length of the repeat unit; tracts where the repeated unit is up to six nucleotides long are referred to as microsatellites, with repeats consisting of longer units being referred to as minisatellites. A particular subset of microsatellites – those consisting of a repetitive three-base-pair unit, called tri-nucleotide repeats (TNRs) – have been the focus of much study due to their association with an important class of human diseases, commonly referred to as tri-nucleotide or triplet repeat disorders. Around thirty TNR diseases such as Huntington’s disease (a coding region repeat) and Friedreich’s ataxia (an intronic repeat) have been identified . Such diseases are caused by variation in the number of copies of the repeated sequence – most commonly expansion, though contraction diseases also exist. Many of these diseases affect the nervous system, and demonstrate genetic anticipation; that is, as the copy number of the repeat sequence diverges from the population norm, the age of onset decreases while symptoms increase in severity .
While the exact causes of excessive variability in a specific repeat tract remain an open question, several features are generally agreed to contribute to high variability of repetitive sequences: length (i.e., number of repeats), purity (i.e., number of interruptions to the repetitive pattern) and sequence (i.e., the nucleotide sequence being repeated) [7, 8]. However, these features are not sufficient to determine expansion; flanking sequences  and repeat orientation with respect to origin of replication initiation  have been shown to be factors affecting whether repeats will undergo expansion.
The prevalence of these repetitive tracts and their distinctive characteristics have made large-scale surveys an appealing avenue for identifying potentially useful features for explaining their variability [11, 12]. Such surveys have focussed on TNRs in the human genome, likely due to both the availability of data and the relevance to understanding repeat diseases.
Until recently, all characterised TNR diseases were human-specific, but the recent discovery of a TNR mediated genetic defect in Arabidopsis thaliana supports the idea that both the mechanisms and the underlying causes are cross-species phenomena. This discovery raises questions about whether there are cross-species commonalities between repetitive sequences – specifically TNRs – that may help us to understand what makes a specific TNR sequence prone to repeat number instability, and the diseases that can result. Identification of naturally-occurring TNR diseases in other organisms also expands the scope of possible study in those model organisms. (For a summary of model systems and their characteristics for TNR study, see supplementary information in .)
One identified characteristic of human TNR sequences is that genes containing these repeats – and more specifically repeats in coding regions – have been shown to be significantly associated with regulatory function through gene ontology (GO) term analyses . Given the increased instability of TNR tracts, do these sequences have specific properties that support or enable regulatory function? For example, similar regulatory function associations have been observed in proteins containing repetitive homo-amino acid (homo-AA) tracts [14, 15], a likely product of exonic TNRs. These observations raise the question of whether TNR sequences’ functional associations are properties of the repeat sequences themselves, or whether the observed associations can be explained by repetitive amino-acid tracts in the resulting proteins.
In this study, we investigate the functional associations of TNR sequences across different species to see whether cross-species analyses support the purported functional roles of repetitive sequences, and whether these functional roles can be explained by sequence properties common to multiple species. In particular, we ask whether the functionality of TNR sequences in multiple species can be explained by their associated proteins’ amino acid repeat tracts. Identifying the functional roles of existing TNR sequences is a crucial step in understanding repeat variability, and expanding such knowledge across multiple species provides valuable background knowledge in selecting model organisms for studying the mechanisms of repeat variability.
As a first step towards understanding species-specific characteristics of TNRs, we identify and analyse TNR tracts in six different species: Saccharomyces cerevisiae, Arabidopsis thaliana, Caenorhabditis elegans, Drosophila melanogaster, Mus musculus and Homo sapiens. Repeat tract scanning identified 247, 1947, 559, 3996, 79727 and 35736 TNR sequences in these species, respectively. As repeat length and repeat sequence are widely accepted factors in TNR variability, we compare these properties across organisms.
It is interesting to note the large differences in TNR frequencies among these genomes. Notably, the Drosophila melanogaster genome (∼165Mb) contains more than six times as many TNR sequences as the Caenorhabditis elegans genome (∼100Mb), and there are over twice as many TNRs in the mouse genome as in the human genome. The latter is particularly striking given that the two genomes are similar in size and share a large degree of homology.
These analyses suggest that simple uses of known correlates of TNR expansion are unlikely to produce informative cross-species patterns. While the differing distribution of triplet sequences may be curious, it does not provide any new insights into the structural characteristics or function of TNR sequences. As an alternative approach, we focus on higher-level characteristics such as the known functional associations and the homo-AA tract composition of TNR sequences.
Top-5 over-represented GO/Biological process terms in exonic repeat-associated genes by species
regulation of biological process
regulation of macromolecule metabolic process
regulation of cellular process
regulation of metabolic process
positive regulation of cellular process
RNA metabolic process
nucleic acid metabolic process
cellular macromolecule metabolic process
regulation of biosynthetic process
macromolecule metabolic process
Ras protein signal transduction
regulation of Ras protein signal transduction
regulation of cellular process
regulation of signal transduction
regulation of small GTPase mediated signal transduction
anatomical structure development
multicellular organismal development
nervous system development
multicellular organismal development
anatomical structure development
cellular developmental process
nervous system development
cell projection organization
cellular component morphogenesis
From these results we conclude that the previously-observed regulation association of coding region TNRs is not exclusively a human-specific trait, but can be seen as a cross-species phenomenon, even across a range of dissimilar organisms. Importantly, these results do not address the possibility that the functional associations are the result of a derivative sequence property, such as the homo-amino acid repeat tracts in corresponding proteins.
Previous GO over-representation analyses of TNR sequences have used whole organism gene sets as a statistical background . However, when looking at functional associations of genes containing coding sequence-localised TNR tracts, it must be noted that the protein sequences associated with these genes will constitute an unusual subset of the proteome, and may give a very different statistical background. Specifically, translations of TNR tracts will result in protein sequences enriched in homo-AA tracts.
Through their association with TNRs, protein homo-AA repeat tracts have been implicated in a range of human diseases  and are more likely to be involved in transcriptional regulation , possibly due to the structural characteristics of the homo-AA tract. It has been suggested that these tracts are inherently structurally disordered [17–19], and that such unstructured regions may act as flexible regions, increasing binding affinity . The prevalence of transcription factors amongst homo-AA repeat-containing proteins raises the question of whether functional associations previously ascribed to coding sequence TNRs may be explained by the homo-AA tracts they encode.
Due to the redundancy of the genetic code, a homo-AA tract is not necessarily encoded by a TNR tract; variant encodings may be used instead, and these are expected to be less prone to mutation-driven length variability. As such, using repetitive DNA sequence to encode important regulatory functional elements appears less than optimal, unless there is an associated functional difference.
In order to identify the degree to which homo-AA tracts in TNR-associated proteins explain the functional associations of these nucleotide repeats, we performed whole-proteome repeat scans of each organism’s non-redundant proteome and split the identified homo-AA tract containing proteins (hereafter simply referred to as homo-AA proteins) into two sets – those where the homo-AA tract was encoded by a repetitive DNA sequence, and those where variant encoding was in use – before identifying functional associations of these sets using GO terms.
GO term over-representation testing of the TNR-encoded homo-AA proteins was initially done using the variant-encoded homo-AA protein set as a background model; this test identifies whether the TNR-encoded set is significantly different from the variant-encoded set. We also used the whole set of homo-AA containing proteins as a background, identifying whether TNR-encoded proteins form an identifiably distinct subset of all homo-AA proteins in terms of functional associations.
Division of TNR- and variant-encoded homo-AA proteins
Over-represented GO terms in TNR encoded homo-AA tract containing proteins
adherens junction assembly
filtration diaphragm assembly
nephrocyte diaphragm assembly
Increased sequence instability is a distinguishing feature of TNR tracts as a whole; in TNR-encoded protein-coding repeat regions, such instability underlies the repeat diseases, but may also affect other aspects of protein function, such as the number and type of interactions the encoded protein is involved in. However, similar instability would not be expected in variant-encoded tracts. In order to identify whether sequence instability is a distinguishing factor of TNR-encoded homo-AA tract proteins, we investigated two characteristics related to sequence stability: protein-protein interaction (PPI) counts and estimated sequence conservation.
A protein’s number of PPIs and its evolutionary rate have been shown to be linked; it has been observed that proteins with more PPIs evolve more slowly, likely due to sequence constraints involved in maintaining existing interactions , though other factors such as expression levels also contribute . As such, the higher variability commonly associated with TNR tracts suggests that TNR-encoded homo-AA proteins should have lower PPI counts than their variant-encoded counterparts.
Using the same TNR- vs. variant-encoded distinction as above, and PPI data from the IntAct database, we identified the number of PPIs each homo-AA protein is involved in. Looking at the distribution of PPI counts in these proteins (see Additional file 2: Figure S2) we find that there is no significant difference between TNR- and variant-encoded homo-AA proteins in terms of the number of protein interactions associated with each set.
A different approach to investigating sequence stability is to directly assess the conservation of the homo-AA encoding sequence itself. We used pre-computed Drosophila and human PhastCons scores from UCSC (see Methods) to evaluate sequence-level conservation. Genomic loci corresponding to homo-AA encoding regions were obtained by reverse-mapping homo-AA tract boundaries onto exonic sequence. From these loci and PhastCons scores, we obtained conservation metrics for individual tracts. Segmenting these conservation scores as above, we found that conservation of homo-AA encoding DNA was not significantly affected by whether the sequence was classified as a TNR.
Note that this finding does not contradict previous findings that TNR sequences show higher variability. The comparison here is with variant-encoded homo-AA sequences, which constitute a very specific background model. In addition, PhastCons scores are not well-suited to identifying repeat length variation; as such, this method will not account for a major factor in the variability of TNR tracts.
As repeat unit and repeat length are central factors in determining TNR variability, considering these factors is also essential when investigating homo-AA proteins. Using the same TNR- and variant-encoded protein sets as above, we classified homo-AA proteins by the repeated residue and compared the frequency and length of residues between the sets. The hypothesis was that there would be no difference between the TNR- and variant-encoded sets in terms of amino acid make-up of repeat regions.
These analyses have demonstrated that there is evidence for a link between coding sequence TNRs and regulatory function, and that this link is not unique to humans, but can also be seen to different degrees in other species. However, we have also shown that these functional associations – previously characterised for human exonic sequences  – are entirely explainable in terms of the characteristics of the resulting proteins, and specifically the homo-amino acid tracts encoded by these sequences. Furthermore, few additional associations were found for either TNR- or variant-encoded homo-AA proteins, suggesting that the increased variability typically associated with tri-nucleotide repeat sequences appears to be neither a benefit nor a barrier in considering functional aspects of the resulting gene products.
While these findings do not contradict the suggestion that expanded exonic tandem repeat regions may be co-opted to fulfil a functional role as regulation-enhancing or -enabling structures, they do strongly suggest that there is nothing functionally unique about TNR sequences. Instead, we suggest that the strong GO term associations previously attributed to TNR tracts are indicative of opportunistic use of existing repeat sequences, a position supported by the cross-species nature of the associations observed above.
Relevant questions have been raised concerning why high-purity homo-AA tracts are so prevalent within structurally disordered proteins, given that repetitive tracts are not necessary for encoding disordered regions . One hypothesis is that high purity in amino acid repeats reflects evolutionary recency in underlying TNRs, driven by microsatellite proliferation and expansion processes . Our study indicates that there is no clear evidence for this hypothesis: no significant difference was found in the number of protein-protein interactions – used here as a proxy for evolutionary constraints – between TNR- and variant-encoded homo-AA proteins, and the nucleotide-level conservation of homo-AA-encoding exonic tracts was likewise unaffected by encoding distinctions. In addition, fewer than a third of homo-AA proteins could be directly linked to TNR encoding in any organism. These results suggest that evolutionary recency or other TNR-derived properties provide little explanation for the prevalence of pure repeats in structurally disordered proteins.
In contrast to the above results that discount observed or theorised TNR associations, our analysis of homo-AA tract composition shows striking differences between tracts that are TNR- and variant-encoded. In all organisms studied except Saccharomyces cerevisiae, glutamine repeats were more likely to be encoded by a TNR sequence than by a variant encoding; for all species, these repeats were also longer when repeat-encoded. This abundance of glutamine repeats among TNR-encoded homo-AA repeat tracts suggests that a correspondence may be drawn with the prevalence of poly-glutamine diseases among known human TNR diseases . Glutamine-encoding CAG·CTG repeats have been the focus of much research due to their disease associations, and here we show that TNR-encoding of glutamine repeats is associated with longer repeat tracts, without taking into account any disease associations. In addition, this pattern is evident in multiple organisms with no currently characterised poly-glutamine diseases. In combination with the recent characterisation of a TNR-associated genetic defect in Arabidopsis, this finding supports the notion that poly-glutamine and other protein repeat diseases may be found in non-human contexts, which would provide a wider range of model organisms for studying the mechanisms and determinants of repeat disease.
By taking a cross-species approach linking homo-amino acid repeat tracts in proteins with tri-nucleotide repeats, this study has explained the regulatory function associations seen among TNR-containing genes. Analysing homo-AA tract-containing proteins, we identified cross-species commonalities in TNR-encoded protein repeat tracts; specifically, that TNR-encoded poly-glutamine repeats show several consistent cross-species statistical patterns. These results raise questions about the existence of undiscovered repeat mediated phenotypes in other species, and whether such repeats may share a broader cross-species statistical profile.
The human (hg19), mouse (mm10), Drosophila melanogaster (dm3) and Caenorhabditis elegans (ce10) reference genomes and genomic feature locations were obtained from the UCSC Genome Browser  (RefSeq Genes track ). Annotations for Arabidopsis thaliana and Saccharomyces cerevisiae were from TAIR  (tair9) and SGD (S288c) , respectively. Mitochondrial, chloroplast and unassembled chromosome sets were excluded from further analysis. Multiple splice variants were not considered; in each case, all but one splice variant was discarded. For tair9, the first identified splice variant was retained in the absence of canonical splice information.
From each annotation, we retained the largest set of genes for which there is a unique mapping between the gene identifiers and UniProt protein identifiers. Feature locations were used to classify regions as intronic, exonic (i.e., coding region), UTR, upstream or intergenic; these mutually exclusive classifications were then used to annotate genomic repeat tracts.
Repeat tracts were identified using Tandem Repeats Finder 4.04 , with the following parameters: Match=2, Mismatch=7, Delta=7, PM=80, PI=10 and Minscore=40, and a maximum repeat period of 3. Identified repeats were further filtered to remove all repeats with a period of one or two; period-one repeats have multiple periodicities, but were here considered to be mono-nucleotide repeats and were excluded from further consideration.
For comparison with other definitions of a repeat, the minimum length under this scoring is 20 nucleotides (i.e., just under seven repeat units) with no mismatches. In subsequent tests involving amino acid tracts (see below), we used a stricter criterion to enable precise reverse mapping from amino acids to coding sequence.
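The 20-nucleotide threshold follows from the scoring: with a match weight of 2 and a minimum score of 40, an uninterrupted repeat needs 20 matching bases, which rounds up to seven whole tri-nucleotide units. The sketch below is an illustration only and is not Tandem Repeats Finder: it finds perfect period-three repeats with a regular expression and ignores the mismatches and indels that the real tool tolerates. The function name `find_pure_tnrs` and the example sequence are hypothetical.

```python
import re

MATCH_WEIGHT = 2   # match weight quoted in the text
MIN_SCORE = 40     # minimum alignment score quoted in the text

# A perfect repeat scores MATCH_WEIGHT per base, so it needs this many bases.
MIN_BASES = MIN_SCORE // MATCH_WEIGHT          # 20
# Round up to whole three-base units (assumption: only whole units counted).
MIN_UNITS = -(-MIN_BASES // 3)                 # 7

def find_pure_tnrs(seq, min_units=MIN_UNITS):
    """Return (start, end, unit) for perfect tri-nucleotide repeats.

    Coordinates are 0-based and half-open. Mono-nucleotide runs such as
    AAAAAA... are skipped, mirroring the period-one filter in the text.
    """
    pattern = re.compile(r"([ACGT]{3})\1{%d,}" % (min_units - 1))
    hits = []
    for m in pattern.finditer(seq.upper()):
        unit = m.group(1)
        if len(set(unit)) == 1:
            continue  # mono-nucleotide run, not a tri-nucleotide repeat
        hits.append((m.start(), m.end(), unit))
    return hits

example = "TT" + "CAG" * 8 + "GATTACA"   # hypothetical sequence
print(find_pure_tnrs(example))
```

A regular expression only captures uninterrupted tracts, so counts from a scan like this would be expected to be lower than those from an alignment-based tool run with the parameters above.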
For each genome, whole-genome and per-chromosome tri-nucleotide frequencies were determined using a custom script to obtain an order-two Markov background from the whole-genome and chromosome sequences respectively. In order to test whether TNR sequence frequencies were consistent with the (order-two Markov) background, a chi-squared test was used.
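As a rough sketch of this comparison, the code below counts overlapping tri-nucleotides to form a background distribution and computes a Pearson chi-squared statistic for observed TNR unit counts against it. The custom script used in the study is not available here, so the function names, the handling of ambiguous bases, and the renormalisation of the background over the observed units are all assumptions made for illustration.

```python
from collections import Counter

def trinucleotide_frequencies(seq):
    """Relative frequency of each overlapping 3-mer in seq."""
    seq = seq.upper()
    counts = Counter(seq[i:i + 3] for i in range(len(seq) - 2))
    # Assumption: 3-mers containing ambiguous bases (e.g. N) are ignored.
    counts = {k: v for k, v in counts.items() if set(k) <= set("ACGT")}
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}

def chi_squared_statistic(observed, background):
    """Pearson chi-squared of observed unit counts against the background.

    observed: dict mapping repeat unit -> number of TNR tracts with that unit.
    background: dict mapping 3-mer -> background frequency.
    Returns (statistic, degrees of freedom).
    """
    units = [u for u in observed if background.get(u, 0) > 0]
    n = sum(observed[u] for u in units)
    norm = sum(background[u] for u in units)  # renormalise over tested units
    stat = 0.0
    for u in units:
        expected = n * background[u] / norm
        stat += (observed[u] - expected) ** 2 / expected
    return stat, len(units) - 1

background = trinucleotide_frequencies("ACGTACGT")   # toy sequence
print(chi_squared_statistic({"ACG": 15, "CGT": 5}, background))
```

Converting the statistic to a p-value needs the chi-squared distribution; if SciPy is available, `scipy.stats.chi2.sf(stat, df)` is one way to do this.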
GO term associations with genes/gene products were obtained from the Gene Ontology project, as was the ontology itself (version from July 2012). Individual genes or gene products were annotated with each GO term appearing in the association set, and with the transitive closure of those terms. The transitive closure of the GO graph was constructed using only is_a and part_of relationships to avoid false positives (e.g., from has_part or regulates relationships). GO term over-representation was assessed using the Fisher exact test, with Bonferroni correction applied to adjust for multiple hypothesis testing. In preliminary studies, GO-term over-representation in non-coding regions was analysed. The associations discovered were weaker and semantically very similar to those of coding regions, which also have more common disease associations; as a result, non-coding regions were not included in further analyses.
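The annotation and testing steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the `parents` mapping is assumed to contain only is_a and part_of edges, the Fisher test is taken to be one-sided (over-representation), and all function names and GO identifiers in the example are hypothetical.

```python
from math import comb

def ancestors(term, parents):
    """All terms reachable from `term` via the given edges, term included.

    parents: dict mapping a GO term to its direct is_a/part_of parents.
    """
    seen, stack = {term}, [term]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def fisher_over_representation(k, n, K, N):
    """One-sided Fisher exact p-value, P(X >= k), for hypergeometric X.

    N: genes in the background set; K: background genes with the term;
    n: genes in the study set;     k: study genes with the term.
    """
    upper = min(n, K)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / comb(N, n)

def bonferroni(p_values):
    """Bonferroni adjustment: multiply by the number of tests, cap at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

parents = {"GO:c": ["GO:b"], "GO:b": ["GO:a"]}   # toy ontology
print(sorted(ancestors("GO:c", parents)))
print(fisher_over_representation(k=2, n=2, K=2, N=4))
```

Changing the background set, for example from the whole proteome to the variant-encoded homo-AA proteins, only changes the values of `K` and `N` passed to the test.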
The proteomes were downloaded from UniProt, using the UniProt/Swiss-Prot identifiers obtained by mapping gene identifiers from the annotations described above.
Protein-protein interaction data were sourced from the IntAct database .
Repeat tracts were identified by scanning all protein sequences for homo-amino acid runs of at least seven residues, for correspondence with the repeat unit thresholds identified by Tandem Repeats Finder. We then examined the coding sequence for each homo-AA tract to determine whether it is encoded by a TNR: a homo-AA tract is considered TNR-encoded if at least seven consecutive residues of the tract are encoded by the same codon. (A separate tool, XSTREAM , is available to identify homo-amino acid runs, but due to small discrepancies between what counts as a repeat in Tandem Repeats Finder and in XSTREAM we were unable to use them to map back and forth between genomic and proteomic repeat locations.) The complete proteome sets were scanned, and the repeat sequences identified were used as a base set for further study.
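The scan and the TNR-encoded criterion can be illustrated with a short sketch. It assumes the coding sequence is in frame with the protein (codon i encodes residue i) and uses hypothetical function names; it is not the code used in the study.

```python
import re

MIN_RUN = 7  # minimum run length quoted in the text

def homo_aa_tracts(protein, min_run=MIN_RUN):
    """Yield (start, end, residue) for single-residue runs of >= min_run.

    Coordinates are 0-based, half-open residue positions.
    """
    for m in re.finditer(r"(.)\1{%d,}" % (min_run - 1), protein):
        yield m.start(), m.end(), m.group(1)

def is_tnr_encoded(cds, start, end, min_run=MIN_RUN):
    """True if >= min_run consecutive residues in [start, end) share a codon."""
    codons = [cds[3 * i:3 * i + 3] for i in range(start, end)]
    run = best = 0
    prev = None
    for codon in codons:
        run = run + 1 if codon == prev else 1
        best = max(best, run)
        prev = codon
    return best >= min_run

# Toy example: the same poly-Q tract with two different encodings.
protein = "M" + "Q" * 8 + "K"
repeat_cds = "ATG" + "CAG" * 8 + "AAA"        # pure CAG repeat
variant_cds = "ATG" + "CAGCAA" * 4 + "AAA"    # alternating CAG/CAA codons
for start, end, residue in homo_aa_tracts(protein):
    print(residue, is_tnr_encoded(repeat_cds, start, end),
          is_tnr_encoded(variant_cds, start, end))
```

Both coding sequences translate to the same eight-glutamine tract, but only the first meets the TNR-encoded criterion, which is the distinction the two protein sets are built on.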
As a measure of per-site genomic conservation, pre-computed PhastCons  scores were used. These were sourced from the UCSC genome browser tables for Drosophila (phastCons15way) and human (phastCons46way).
We did not complete this analysis for the other four organisms.
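The reverse mapping from tract boundaries to genomic loci, and the summary of per-base scores over those loci, might look like the sketch below. Several details are assumptions: it handles plus-strand genes only, takes the coding exon blocks as already trimmed to the coding sequence, represents scores as a position-to-score dictionary, and summarises each tract by its mean score (the text says only that conservation metrics were obtained). Function names are hypothetical.

```python
def tract_to_genomic(cds_exons, aa_start, aa_end):
    """Map residues [aa_start, aa_end) to genomic intervals (plus strand).

    cds_exons: coding blocks as half-open (start, end) genomic intervals in
    5'->3' order; their lengths sum to the coding sequence length.
    """
    lo, hi = 3 * aa_start, 3 * aa_end      # tract bounds in CDS coordinates
    intervals, offset = [], 0
    for g_start, g_end in cds_exons:
        length = g_end - g_start
        s, e = max(lo, offset), min(hi, offset + length)
        if s < e:  # the tract overlaps this coding block
            intervals.append((g_start + s - offset, g_start + e - offset))
        offset += length
    return intervals

def mean_conservation(scores, intervals):
    """Mean per-base score over the intervals; None if no base is scored."""
    values = [scores[p] for a, b in intervals for p in range(a, b)
              if p in scores]
    return sum(values) / len(values) if values else None

# Toy gene with two coding blocks; the tract spans the exon junction.
exons = [(100, 110), (200, 220)]
loci = tract_to_genomic(exons, aa_start=2, aa_end=5)
print(loci)
print(mean_conservation({106: 1.0, 107: 1.0, 108: 0.0, 109: 0.0, 200: 0.5}, loci))
```

A minus-strand gene would need the block order and coordinates reversed before applying the same arithmetic.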
Analysis of tract composition was undertaken for TNR-encoded homo-AA tracts. The distribution of specific amino-acid repeats encoded by TNRs was assessed by a two-tailed binomial test for each amino acid; success counts were defined as the number of TNR-encoded repeats for that residue, with the probability of success defined as the proportion of TNR-encoded homo-AA proteins over the total set of homo-AA proteins.
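A minimal sketch of the per-residue binomial test follows, with k the number of TNR-encoded tracts for a residue, n the total number of tracts for that residue, and p the overall TNR-encoded proportion. The two-tailed p-value here sums all outcomes no more likely than the observed one, which is one common convention and an assumption about the exact variant used. The counts in the example are invented for illustration.

```python
from math import comb

def binomial_two_tailed(k, n, p):
    """Exact two-tailed binomial p-value.

    Sums the probability of every outcome that is no more likely than
    observing k successes in n trials with success probability p.
    """
    pmf = [comb(n, i) * p**i * (1 - p)**(n - i) for i in range(n + 1)]
    threshold = pmf[k] * (1 + 1e-7)   # tolerance for floating-point ties
    return min(1.0, sum(q for q in pmf if q <= threshold))

# Hypothetical counts: 30 of 50 glutamine tracts are TNR-encoded, while
# 25% of all homo-AA proteins are TNR-encoded.
print(binomial_two_tailed(k=30, n=50, p=0.25))
```

For counts in the thousands, the direct product above can overflow, and a library routine such as `scipy.stats.binomtest` would be a safer choice.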
Length of homo-AA repeats in TNR- and variant-encoded tracts was compared using the non-parametric Mann-Whitney U test. All homo-AA tract lengths were gathered, split into one set per residue, and annotated as being either TNR- or variant-encoded. A significant result indicates that homo-AA repeat tracts of a given amino acid are longer (or shorter) when TNR-encoded.
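For illustration, the length comparison for a single residue could be run as in the sketch below. It computes the U statistic with mid-ranks for ties and a two-sided p-value from the tie-corrected normal approximation, without continuity correction; whether the study used an exact or an approximate p-value is not stated, so this choice is an assumption. The tract lengths in the example are invented.

```python
from math import erfc, sqrt

def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test using the normal approximation.

    Returns (U for sample x, approximate p-value).
    """
    n1, n2 = len(x), len(y)
    if n1 == 0 or n2 == 0:
        raise ValueError("both samples must be non-empty")
    pooled = sorted((v, idx) for idx, v in enumerate(list(x) + list(y)))
    ranks = [0.0] * (n1 + n2)
    tie_term = 0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        mid_rank = (i + 1 + j) / 2          # average of ranks i+1 .. j
        for _, idx in pooled[i:j]:
            ranks[idx] = mid_rank
        t = j - i
        tie_term += t**3 - t                # tie correction for the variance
        i = j
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    n = n1 + n2
    variance = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if variance == 0:
        return u1, 1.0
    z = (u1 - n1 * n2 / 2) / sqrt(variance)
    return u1, erfc(abs(z) / sqrt(2))

# Hypothetical poly-Q tract lengths (residues) for one species.
tnr_lengths = [12, 15, 19, 23, 30]
variant_lengths = [7, 8, 8, 9, 11]
print(mann_whitney_u(tnr_lengths, variant_lengths))
```

With small samples such as these, an exact test is preferable to the normal approximation; `scipy.stats.mannwhitneyu` offers both.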
This research was supported by National Health and Medical Research Council (NHMRC) Project grant ID 1004112 and an Australian Research Council (ARC) Future Fellowship (S.B.). This material is the responsibility of the authors, and does not reflect the views of the NHMRC or the ARC.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.