Mononucleotide repeats are asymmetrically distributed in fungal genes

Background: Systematic analyses of sequence features have resulted in a better characterisation of the organisation of the genome. A previous study in prokaryotes on the distribution of sequence repeats, which are notoriously variable and can disrupt the reading frame in genes, showed that these motifs are skewed towards gene termini, specifically the 5' end of genes. For eukaryotes no such intragenic analysis has been performed, though this could indicate the pervasiveness of this distribution bias, thereby helping to expose the selective pressures causing it.
Results: In fungal gene repertoires we find a similar 5' bias of intragenic mononucleotide repeats, most notably for Candida spp., whereas e.g. Coccidioides spp. display no such bias. With increasing repeat length, ever larger discrepancies are observed in genome repertoire fractions containing such repeats, with up to an 80-fold difference in gene fractions at repeat lengths of 10 bp and longer. This species-specific difference in gene fractions containing large repeats could be attributed to variations in intragenic repeat tolerance. Furthermore, long transcripts experience an even more prominent bias towards the gene termini, with possibly a more adaptive role for repeat-containing short transcripts.
Conclusion: Mononucleotide repeats are intragenically biased in numerous fungal genomes, similar to earlier studies on prokaryotes, indicative of a similar selective pressure in gene organization.


Background
Genetic patterns could not be studied comprehensively until whole genome sequences became available. The first such genome-wide sequence analysis focused on both the abundance and distribution of competence-associated sequence motifs in the prokaryote Haemophilus influenzae [1]. Since then, as more genome sequences became available, numerous other genetic features were studied in prokaryotes [2][3][4] and eukaryotes [5,6]. Many such genome-wide analyses concentrated on the genomic distribution of simple sequence repeats (SSRs, also known as microsatellites), which are stretches of mono- and oligonucleotide repeats [7][8][9][10].
SSRs generally occur more frequently in non-coding regions of the genome. One of the reasons for this avoidance in the protein coding regions is that many SSRs predispose for disruptive frameshifts via strand-slippage during replication, transcription or translation [8,11]. Recently Ackermann and Chao postulated that selection for sequence stability in coding regions has been a pervasive force in the distribution biases of mononucleotide repeats (MNRs) in both prokaryotic and eukaryotic genome sequences [12]. More recently, we discovered that MNRs also display a biased distribution within coding sequences: in a wide range of both bacterial and archaeal genomes, mononucleotide repeats were predominantly biased towards the 5' end of genes [13], presumably to prevent or reduce the expression of toxic or costly frameshifted proteins. However, for eukaryotes no such intragenic distribution analysis has been carried out.
The Fungi represent a kingdom within the eukaryotic domain with many fully sequenced representatives. These can vary in genome size (less than 3 Mbp for Encephalitozoon cuniculi, up to an estimated 82 Mbp for Puccinia graminis), cellular organization (single cells or multicellular organisms) as well as life-style (saprophytic or pathogenic) [14,15]. With respect to SSRs, a previous analysis in sequenced fungal genomes described different patterns of their occurrence [9], but did not study their intragenic distribution.
Here we analyze genome-wide sets of predicted coding sequences from fully sequenced fungal genomes, and assess the intragenic distribution patterns of mononucleotide repeats. MNRs are more commonly present than the more complex repeats, which makes their interspecific distribution comparisons possible. This could illuminate the similarities and differences between the distributions of disruptive repeats in prokaryotes and eukaryotes, and help identify the selective forces that brought about these biases.

Methods
Transcript data of the coding regions (excluding introns) from 47 fungal genomes were obtained from the Broad Institute (http://www.broad.mit.edu/annotation/). The datasets, their size and the description of the strains are given in Table 1. Truncated genes, transcripts with annotated internal stop codons and genes whose length is not a multiple of three were excluded from the analyses, which rely on a genome-wide quantitative assessment of MNRs in the five quintiles of the annotated protein-coding sequences. Sequence motif analyses and codon usage profiles were carried out using in-house Perl scripts, which are available upon request.
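The filtering and quintile assignment described above can be illustrated with a minimal Python sketch. This is not the in-house Perl code, which is not shown in the text; the function names are hypothetical, truncated genes are not detected here because that requires annotation data, and assigning a repeat to a quintile by its midpoint is an assumption.

```python
import re

STOP_CODONS = {"TAA", "TAG", "TGA"}

def is_usable_cds(seq):
    """Apply two of the Methods filters: length divisible by three, no internal stop."""
    seq = seq.upper()
    if len(seq) == 0 or len(seq) % 3 != 0:
        return False
    codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    # A stop codon anywhere before the final codon counts as internal.
    return not any(codon in STOP_CODONS for codon in codons[:-1])

def find_mnrs(seq, min_len):
    """Yield (start, length, base) for every homopolymer run of at least min_len."""
    pattern = re.compile(r"A{%d,}|C{%d,}|G{%d,}|T{%d,}" % ((min_len,) * 4))
    for match in pattern.finditer(seq.upper()):
        yield match.start(), match.end() - match.start(), match.group()[0]

def quintile_of(start, length, seq_len):
    """Quintile index 0..4 of the repeat midpoint (midpoint rule is an assumption)."""
    midpoint = start + length / 2.0
    return min(4, int(5 * midpoint / seq_len))

def quintile_counts(sequences, min_len):
    """Count repeats per quintile over all sequences that pass the filters."""
    counts = [0] * 5
    for seq in sequences:
        if not is_usable_cds(seq):
            continue
        for start, length, _base in find_mnrs(seq, min_len):
            counts[quintile_of(start, length, len(seq))] += 1
    return counts
```

Calling `quintile_counts(transcripts, 10)` on a list of coding sequences would return five counts, one per quintile, which is the form of data the distribution analyses operate on.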
Fungal transcripts were compared to the KOG database [16] according to the nearest-neighbor method (using raw scores) [17] in order to distinguish between transcripts with a well-defined ortholog and transcripts that lacked an ortholog in the database.

Intragenic distribution biases of mononucleotide repeats in fungal coding sequences
For each genome we tested the intragenic distribution profiles of the longest MNRs in the five quintiles of all predicted protein coding genes, with a minimal occurrence of 100 repeats of that length in the genomic transcript data (see Figure 1 for 6 representative cases, and Additional File 1 for all repeat distributions). The gene repertoires of most strains (25/47 genomes) show a non-proportional distribution of MNRs over the five quintiles (Chi-square, 4 degrees of freedom, p < 0.05, Additional File 1). The strongest bias is observed in Candida parapsilosis, where 83% (104/125 repeats) of all repeats of 10 bp or longer are in the first quintile of the genes. In several cases, the majority of repeats are in the last quintile (positions 80-100% of the gene length) of the genomic gene set (e.g., Botrytis cinerea and Chaetomium globosum).
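As an illustration of the test named above, the following sketch computes a chi-square goodness-of-fit statistic against an equal expectation of 20% per quintile and compares it with the critical value for 4 degrees of freedom at p = 0.05 (9.488). The count of 104 out of 125 repeats in the first quintile is taken from the text; how the remaining 21 repeats are divided over the other quintiles is not reported, so the split used here is purely hypothetical.

```python
# Critical value of the chi-square distribution, 4 degrees of freedom, alpha = 0.05.
CRITICAL_DF4_P05 = 9.488

def chi_square_uniform(counts):
    """Goodness-of-fit statistic against equal expected counts in every class."""
    total = sum(counts)
    expected = total / len(counts)
    return sum((observed - expected) ** 2 / expected for observed in counts)

# 104 of 125 in the first quintile is from the text; the split of the
# remaining 21 repeats (6, 5, 5, 5) is a hypothetical placeholder.
counts = [104, 6, 5, 5, 5]
statistic = chi_square_uniform(counts)
is_biased = statistic > CRITICAL_DF4_P05
```

With five quintiles there are 5 - 1 = 4 degrees of freedom, which matches the test parameters given in the text.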

Genes with mononucleotide repeats of 15 bp or longer
The 294 fungal transcripts with repeats of 15 residues or longer from all genomes combined are given in Additional File 2. Again, when analyzing the intragenic distribution of the 298 repeats in these 294 fungal genes, a strong bias towards the first quintile of the genes is observed, both in transcripts with well-described functional orthologs (KOGs) and in transcripts without orthologs (nKOGs) in the KOG database (Figure 2). As for the extreme cases, Aspergillus fumigatus, Aspergillus niger, Aspergillus terreus, Candida lusitaniae, Histoplasma capsulatum, Paracoccidioides brasiliensis and Rhizopus oryzae each contain genes with mononucleotide repeats over 30 nucleotides long. The gene with the longest repeat is encountered in A. terreus (transcript ATET_00185), which contains a stretch of 68 consecutive adenine residues, encoding 22 consecutive lysines. Among the intragenic repeats of over 15 nucleotides in length, guanine and cytosine tracts are relatively rare with only 80 out of 298 repeats, similar to what was found previously [18]. Still, some genomes harbour transcripts with repeats that consist solely of long guanine or cytosine tracts: Chaetomium globosum (6 genes), Coprinus cinereus (4 genes), Neurospora crassa (4 genes) and Sclerotinia sclerotiorum (5 genes). Other genomes have repeat-containing genes with only adenine or thymine tracts of over 15 residues in length: Candida albicans wo1 (11 genes), Candida lusitaniae (13 genes), C. tropicalis (41 genes), Coccidioides immitis rs (8 genes) and R. oryzae (11 genes).
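The relation between the 68-adenine tract and the 22 lysines can be checked with a few lines of arithmetic. The sketch below counts the complete AAA codons (lysine in the standard genetic code) that fit inside a homopolymer run for each possible position of the run relative to the reading frame; the function name and the `offset` parameter are hypothetical.

```python
def full_codons_in_run(run_length, offset):
    """Complete codons inside a homopolymer run.

    offset is the number of nucleotides (0, 1 or 2) between the preceding
    codon boundary and the first nucleotide of the run.
    """
    # Nucleotides of the run needed to finish a codon that started upstream.
    lead = (3 - offset) % 3
    return (run_length - lead) // 3

lysines_by_offset = [full_codons_in_run(68, offset) for offset in (0, 1, 2)]
```

For a run of 68 residues the result is the same in all three cases, so the number of consecutive lysines does not depend on where the tract starts relative to the frame.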

Trinucleotide repeats in Neurospora crassa coding regions
Previous analyses have shown that fungal genomes also harbour oligonucleotide repeats, which on occasion have been associated with particular processes [19]. These repeats are mostly typical for the individual species, and though abundant, the total repeat-specific counts are still often less than 100 in the coding regions of a genome. However, in N. crassa, numerous trinucleotide repeats have been identified [20], of which only a few are encountered in large numbers in the coding regions: (GGT)n (117 repeats of n > 4), (TTG)n (144 repeats of n > 2) and (ACA)n (117 repeats of n > 7). Changes in the copy number of trinucleotides in a gene do not cause a shift in the reading frame, and therefore we hypothesized that these repeats need not be intragenically biased. Nevertheless, we observed that all three trinucleotide repeats occur more frequently at the gene termini of coding regions, with over half of all GGT repeats in the last quintile of the coding regions (Figure 3). This suggests that the bias of repeats may not be caused solely by their risk for frameshifts.
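Tandem trinucleotide repeats such as (GGT)n can be located with a simple pattern search. The sketch below is one possible way to do this and is not taken from the original analysis; the function name is hypothetical, and a threshold such as n > 4 is expressed as a minimum of five copies.

```python
import re

def find_trinucleotide_repeats(seq, unit, min_copies):
    """Return (start, copies) for each tandem run of `unit` with at least min_copies."""
    pattern = re.compile(r"(?:%s){%d,}" % (re.escape(unit), min_copies))
    return [
        (match.start(), (match.end() - match.start()) // len(unit))
        for match in pattern.finditer(seq.upper())
    ]

# Toy sequence: five GGT copies, a spacer, then two further copies.
example = "ATG" + "GGT" * 5 + "GCA" + "GGT" * 2
hits = find_trinucleotide_repeats(example, "GGT", 5)
```

The start position of each hit, divided by the transcript length, gives the relative position from which a quintile can be derived.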

Intragenic repeat resistance as indicated by genome repertoire analyses and consecutive homogenous codon usage profiles
The fraction of genes per genome that contain a mononucleotide repeat decreases rapidly with increasing repeat length in all tested genomes (Figure 4). Nevertheless, a substantial difference may exist between the tolerances of the species to disruptive intragenic mononucleotide repeats. Short repeats (5 bp) are encountered in most (72-97%) of the protein-coding gene repertoires of all 47 tested species, and are never intragenically biased. However, with increasing repeat lengths, ever larger discrepancies arise between the percentages of the different genomic gene repertoires that contain repeats of such lengths (Figure 4, Additional File 3). C. tropicalis has an 80× larger gene fraction containing repeats of 10 residues or longer than Neosartorya fischeri (~3% and ~0.03%, respectively), although N. fischeri contains almost twice as many genes. This higher gene fraction with repeats suggests that C. tropicalis enjoys a much higher tolerance for disruptive intragenic repeats.
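The gene fractions discussed here can be computed per repeat length as sketched below. This is an illustrative reconstruction under the assumption that a gene is counted once if it contains at least one homopolymer run of the given minimum length; the function names are hypothetical.

```python
import re

def fraction_with_mnr(sequences, min_len):
    """Fraction of sequences containing at least one homopolymer run >= min_len."""
    if not sequences:
        return 0.0
    pattern = re.compile(r"A{%d,}|C{%d,}|G{%d,}|T{%d,}" % ((min_len,) * 4))
    hits = sum(1 for seq in sequences if pattern.search(seq.upper()))
    return hits / len(sequences)

def repeat_length_profile(sequences, lengths=range(5, 16)):
    """Gene fraction for each minimum repeat length, e.g. to plot fraction vs length."""
    return {n: fraction_with_mnr(sequences, n) for n in lengths}
```

Comparing `fraction_with_mnr(genes, 10)` between two gene repertoires yields the kind of fold difference reported for C. tropicalis and N. fischeri.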
Comparing the expected versus observed frequencies of neighbouring lysine or phenylalanine codons, we find that N. fischeri avoids flanking homogeneous codons (i.e., AAA-AAA-AAA for lysine, and TTT-TTT-TTT for phenylalanine) to a much higher extent than C. tropicalis, corroborating the higher repeat tolerance of the latter species (data not shown).
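Since the underlying data are not shown, the following is only a sketch of how such an observed versus expected comparison could be set up. It assumes that the expected number of runs of k identical codons follows from the overall frequency of that codon under independence between neighbouring positions; the exact null model used in the analysis is not stated, and the function name is hypothetical.

```python
def homogeneous_run_ratio(sequences, codon, k=3):
    """Observed / expected number of windows of k consecutive identical codons.

    Expected counts assume neighbouring codons are independent (an assumption).
    Returns None when the ratio is undefined.
    """
    codon_count = total_codons = observed = windows = 0
    for seq in sequences:
        seq = seq.upper()
        usable = len(seq) - len(seq) % 3
        codons = [seq[i:i + 3] for i in range(0, usable, 3)]
        total_codons += len(codons)
        codon_count += codons.count(codon)
        n_windows = max(0, len(codons) - k + 1)
        windows += n_windows
        observed += sum(
            1 for i in range(n_windows)
            if all(c == codon for c in codons[i:i + k])
        )
    if total_codons == 0 or windows == 0:
        return None
    expected = (codon_count / total_codons) ** k * windows
    return observed / expected if expected > 0 else None
```

A ratio below 1 for AAA or TTT would indicate avoidance of homogeneous codon runs, and a lower ratio in one species than in another would be in line with the difference described above.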

Intrageneric comparisons of intragenic mononucleotide repeats
The dataset contains the gene repertoires from different species and strains from the same genus, which allows us to test for intrageneric heterogeneity of repeat distribution biases. Three genera are tested: Aspergillus, Candida and Coccidioides. Firstly, both Aspergillus and Candida species show a 5' bias for intragenic repeat distribution patterns, which is relatively mild in Aspergillus and strong in Candida. The sole outlier in the Aspergilli is Aspergillus flavus, which contrary to its relatives shows no (strong) 5' bias. In Candida, there is a very strong bias of repeats to the 5' end, except for the species Debaryomyces hansenii, which shows a 3' bias. The two species with the lowest observed 5' bias are Candida guilliermondii and C. lusitaniae, which are closely related species, and both branch off with Debaryomyces hansenii in one of the so-called CTG subclades [21]. In contrast, none of the Coccidioides species (eight strains) show a distribution bias in mononucleotide repeats.

Authenticity of the intragenic repeat bias
Of the 294 fungal transcripts that contain repeats of 15 bp or longer, 36% (105/294) could be assigned to a eukaryotic cluster of orthologous genes (KOG, Additional File 2, [16]). In this set of 105 orthologs to bona fide protein-coding genes, we observe a significant distribution bias of the repeats towards the 5' end (61% of all repeats are present in the first quintile of these 105 genes, p < 0.05, Figure 5A). Moreover, we discern that the genes with repeats at the gene termini are on average much longer than genes with the repeat in the middle of the gene (Figure 5B). A similar trend was observed for the fungal genes that could not be assigned to a KOG (termed nKOG), i.e., 34% of all repeats are in the first gene quintile, and transcripts with repeats at the gene termini are longer than transcripts with a repeat in the middle of the gene (Figure 5A and 5B). Interestingly, a large proportion of the 105 genes with long repeats that also have homologs in the KOG database are found in the pathogenic species C. tropicalis and C. parapsilosis (48 genes, 47 (98%) of which have the long repeat in the first gene quintile).
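The comparison of transcript lengths by repeat position amounts to a grouped mean with its standard error. A minimal sketch follows, assuming each repeat-containing transcript is represented by a hypothetical (quintile index, transcript length) pair; the lengths in the example are invented for illustration and are not data from this study.

```python
from math import sqrt
from statistics import mean, stdev

def length_by_quintile(records):
    """Mean transcript length and standard error of the mean per repeat quintile.

    records: iterable of (quintile_index 0..4, transcript_length) pairs.
    """
    groups = {quintile: [] for quintile in range(5)}
    for quintile, length in records:
        groups[quintile].append(length)
    summary = {}
    for quintile, lengths in groups.items():
        if len(lengths) >= 2:
            sem = stdev(lengths) / sqrt(len(lengths))
            summary[quintile] = (mean(lengths), sem)
        elif lengths:
            # A single observation has a mean but no standard error.
            summary[quintile] = (float(lengths[0]), None)
        else:
            summary[quintile] = (None, None)
    return summary

# Invented example lengths (bp), for illustration only.
example_records = [(0, 3000), (0, 3600), (2, 900), (2, 1500), (2, 1200)]
example_summary = length_by_quintile(example_records)
```

Running the function separately on the KOG and nKOG transcripts would give the two series of means and standard errors needed for a plot of length against repeat position.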

Discussion and conclusion
Analyses of prokaryotic gene repertoires revealed a persistent bias of disruptive sequence repeats towards gene termini, potentially to curtail the metabolic costs or toxicity associated with transcribing and (or) translating nonfunctional genes [13]. In order to explore this phenomenon for the eukaryotic domain, we investigated the genome repertoires from 47 sequenced fungi, and discovered that this pattern is also evident in the fungal kingdom.
Still, large discrepancies exist between the intragenic distribution biases of MNRs in the different genomes: Coccidioides spp. show no intragenic preference for sequence repeats, whereas the Candida spp. demonstrate a very strong 5' end bias, with up to 83% of all intragenic repeats of 10 bp or longer in C. parapsilosis in the first gene quintile. Some genomes display a strong 3' end bias for intragenic repeats, such as Botrytis cinerea, which has 36% of its mononucleotide repeats in the last gene quintile. A bias of these disruptive repeats to either gene terminus agrees with a selection pressure to remove potential toxic side products, or alternatively to reduce metabolic costs, similar to what was found by Akashi and Gojobori in a comparison of 'expensive' and 'cheap' amino acid usage in highly expressed genes [22].
Figure 2 The distribution of very long repeats (>15 bp mononucleotide repeats, 298 repeats, 294 genes) in the quintiles of the predicted coding regions from all tested fungal genomes.
The species-specific gene fractions that contain long repeats could be a proxy for the species' tolerance for disruptive repeats. These differences are substantial, as the gene fraction of C. tropicalis with repeats of ten nucleotides or longer is 80 times higher than the gene fraction in N. fischeri. The adjacent homogeneous codon usage in these species also shows a higher avoidance of contiguous AAA and TTT codons in the latter species. A mechanistic explanation for a higher repeat-tolerance is still unknown, but could be studied by a more detailed functional characterisation of the genes that contain these long repeats.
When analysing the most abundant intragenic trinucleotide repeats in N. crassa, we find a strong bias of these repeats to the gene termini, even though differences in trinucleotide repeats do not cause a shift in the reading frame. This suggests that a selection pressure to remove potential toxic side products, or alternatively to reduce metabolic costs, may not be the only explanation for the bias of repeats to gene termini. Previous analyses on amino acid repeats in Drosophila spp. also showed an avoidance of these repeats in the middle of genes [23]. Numerous trinucleotide repeat disorders are known to cause disease in humans [24,25], and studies into the intragenic repeat location biases could help determine mechanistic aspects of repeat expansions and contractions.
Figure 3 Distribution of trinucleotide repeats GGT, TTG and ACA in the quintiles of the protein coding genes from Neurospora crassa. Represented are the deviations from the expected value (i.e., 20%).
Sequence repeats have been thought to convey adaptive benefits due to their potential to facilitate rapid changes in the coding content [26], while one study also suggested that repeats are preferentially located towards recombination hot spots [27]. We find that larger genes have a more prominent bias of repeats at their gene termini. This could mean that many of the smaller transcripts with their mononucleotide repeat in the middle are misannotations, or gene remnants after the erosion of repeat-containing genes. Alternatively, these small repeat containing genes could represent foci of genetic novelty.
This focus on intragenic distribution biases is new [13,28], and could be expanded to other features previously not analysed as such, like methylation patterns, or targets for transcriptional or translational regulation. This can help resolve the origin and organization of new genetic features, as well as the evolutionary forces governing them.

Authors' contributions
MWJvP conceived the project, carried out the analyses, interpreted the results and wrote the paper. LHdG interpreted the results and wrote the paper. Both authors read and approved the final manuscript.
Figure 4 Percentages of the gene repertoires that contain homopolymeric tracts in fungal species. Only the three species with the highest (C. albicans sc5314, C. tropicalis and S. cerevisiae) and the three with the lowest (F. verticillioides, N. fischeri and P. tritici-repentis) genome fractions that contain homopolymeric tracts of said length (x-axis) are depicted, as well as the average of all 47 strains.
Figure 5 Genes with a functional KOG annotation (KOG) and those without such an annotation (nKOG) are depicted with blue circles and red crosses, respectively. B) The average lengths of the transcripts (and the standard error of the means) are depicted for genes with a KOG annotation (blue) and genes without a KOG annotation (red), with respect to the position of the repeat in that transcript.