Algorithm
1. Score calculation
The flowchart given in figure 1 describes the score calculation process that can be divided into three principal steps.
Step1 - From each coding sequence (CDS) of a complete assembled genome, the corresponding protein sequence was compared with its upstream and downstream chromosomal regions using a TBLASTN search (expectation value threshold = 1.0). The length of the surrounding DNA sequences was three times longer than that of the CDS considered (see the Methods section). If the calculated length was below the minimal length (Lmin) of 1500 bp, the length retained was equal to Lmin. The bit score was retained for each TBLASTN sequence alignment. These initial scores, four per CDS considering its two flanking sequences and both DNA strands, were called TB (Tblastn Bit) scores. When several high-scoring segment pairs (HSPs) were obtained between a given CDS protein sequence and one of its flanking region, the total TB score was calculated as the sum of bit scores of all HSPs that correspond to the same strand (plus or minus) and do not overlap by more than 20%.
Step2 - Four final scores, called FTB (Final Tblastn Bit) scores, were calculated for each CDS from the TB scores of both strands of its two surrounding sequences (Figure 2). Each TB score was first divided by a TB self-score to express it in percentage. This TB self-score was the TBLASTN bit score obtained when the protein sequence of the considered CDS was compared with its own genomic sequence. Its calculation also depended on the number and overlapping of HSPs generated by the BLAST program. The FTB scores were then obtained by subtracting two TB scores expressed as a percentage corresponding to both complementary strands of the same surrounding region.
Step3 - A filter was necessary to reduce the high background score generated by the small proteins or low sequence complexity proteins. If TB self-score < 250 and 0 < FTB score < 50, this filter was applied: filtered FTB score = FTB score - ((250 - TB self-score)/4).
2. TGA extraction
All CDSs of a genome were examined, in the ascending order of their chromosomal coordinates, to select the CDSs that have at least one FTB score equal to or greater than the threshold value of 10 (for the determination of the threshold see the Methods section). To decide if a selected CDS (position n) belongs to a TGA, its own FTB scores were compared with those of its two adjacent CDSs (positions n-1 and n+1) (Figure 3). The algorithm of score comparison determines the position of each CDS within the TGA. A TGA is created when a selected CDS occupies the "first position". As long as the CDSs located downstream from this "first CDS" occupy a "middle position", the TGA is extended. But as soon as a next CDS occupies the "last position", TGA elongation comes to an end. When no directly adjacent CDS fits exactly the criteria of a tandem arrayed gene, a "relic" tag was first attributed to the CDS at position n. Then, the n-2 and n+2 CDSs were analyzed, permitting the presence of one intervening spacer gene within a TGA. The direct or opposite relative orientation of gene pairs in TGAs was determined by comparing their corresponding FTB scores (Figure 3).
3. Minisatellite detection
Richard and Dujon [11] found, in the S. cerevisiae genome, 49 protein-coding genes containing consecutive sequence repetitions, called minisatellites, whose repeat unit size ranges from 10 bp to 192 bp. These internal tandem repeats can cause problems during TGA extraction, since some of them persist outside the coding sequence and can contribute to a FTB score ≥ 10. Moreover, in some cases, the chromosomal distance between two non-homologous genes containing minisatellites is so small that they can be related to the same TGA. Therefore, all CDSs involved in a TGA were submitted to the equicktandem (maxrepeat = 600, threshold = 20) and etandem (minrepeat = 10, maxrepeat = 300) programs of the EMBOSS package [12] to find tandem repeats in their DNA sequences. Equicktandem rapidly identifies potential repeat sizes whereas etandem searches genuine tandem repeats, calculates a consensus for the repeat block and takes gaps into account.
The research of minisatellites and the "relic" tagging in the step of TGA extraction are useful when performing the manual curation of TGAs (described in the Methods section). This enables the researcher to eliminate false positive TGAs due to short tandem repeats in gene sequences and to identify the TGAs containing gene relics, respectively.
Implementation
Score calculation, TGA extraction and minisatellite detection are the three automated steps of our algorithm and have been implemented in Python version 2.4 http://www.python.org/. The output file of the first step is a table of the FTB scores calculated for each CDS of a given genome. From this table, another script extracts CDSs organized in tandem arrays and then identifies among these CDSs those containing minisatellites. Data of this second script can be manually analyzed to eliminate the false positive TGAs (step of manual curation). Scripts are freely available from the corresponding author who will make recommendations to users and inform them on improvements added to the computational program.
Application: identification of TGAs in hemiascomycete yeast genomes
Databases
We applied our method to nine completely sequenced genomes of hemiascomycete yeasts. The Saccharomyces cerevisiae chromosome sequences and information on annotated genes were downloaded from Saccharomyces Genome Database (SGD) http://www.yeastgenome.org/ [13]. The Eremothecium (Ashbya) gossypii genome sequence and its corresponding annotations were taken from Ashbya Genome Database (AGD) http://agd.vital-it.ch/Ashbya_gossypii/index.html/ [14]. Génolevures website http://cbi.labri.fr/Genolevures/ [15] provided all the data relative to complete genome sequences of the species Candida glabrata, Debaryomyces hansenii, Kluyveromyces lactis, Kluyveromyces thermotolerans, Saccharomyces kluyveri (genome sequenced in collaboration with Mark Johnston, Washington University Department of Genetics), Yarrowia lipolytica and Zygosaccharomyces rouxii.
Computation of all TGAs in the nine yeast genomes
We define a TGA as a structure of contiguous paralogous gene copies, which are either functional or degenerated copies (gene relics), allowing for the presence of one heterologous gene inserted between the two homologous copies. In accordance with this definition, we developed an in silico method to detect such TGAs in complete eukaryotic genomes. This method was applied to nine hemiascomycete yeast genomes. 469 TGAs comprising 999 CDSs were identified in this set of genomic data, which represents 2% of CDSs involved in TGAs in relation to the total number of CDSs present in these nine genomes (Figure 4 and Additional file 1). This proportion is independent of the phylogenetic position of species and is not higher in the post-WGD (whole genome duplication) species S. cerevisiae and C. glabrata. Interestingly, the D. hansenii genome shows a percentage of CDSs in TGAs twice higher than the other species. Another remarkable result is the percentage of TGAs that include a duplicated gene relic (11%).
On an average 6% of yeast genes contain intron sequences, generally only one intron per gene. Among all CDSs that belong to TGAs, 2.5% (25 CDSs in 16 TGAs) have an intron. In half of the 16 TGAs concerned, an intron was not present in all the duplicated copies. The physical distribution of TGAs is relatively homogeneous along the different chromosomes of each species (data not shown).
A first assessment of functional bias for genes in tandem arrays was based on Gene Ontology (GO) term annotation information [16]. Functions such as cellular homeostasis, cell wall organization and biogenesis, conjugation and response to stress, are overrepresented in TGAs. The proportion of genes involved in TGAs is very low in the following functional classes: transcription, translation, RNA metabolic process and ribosome biogenesis and assembly (data not shown).
Distributions of TGA sizes and CDS orientations in TGAs
TGAs were classified according to the number of CDSs they contain. Among the nine yeast genomes, the size of TGAs ranged from one CDS to 16 CDSs (Figure 5). TGAs belonging to the 1CDS-relic class consist of one CDS and one gene relic and represents ~10% of the total number of TGAs, a proportion equivalent to that of the 3CDSs-TGA class. The majority of TGAs (76.5%) comprise two CDS copies and the percentage in four copies falls to 1.5. Large TGAs of six and eight CDSs (one and two arrays, respectively) are concentrated in C. glabrata, but the largest one containing 16 CDSs is in D. hansenii. The latter (TGA n°234), located on the chromosome E of D. hansenii, has been manually reconstructed since it was disrupted 3 times by heterologous genes. In the sequenced strain of S. cerevisiae, S288C, the TGA size does not exceed five CDSs and the 1CDS-relic class of TGAs is absent. Nevertheless, three TGAs made up of more than one CDS plus one relic (nCDSs-relic class) are present in this reference strain.
CDSs in TGAs are principally in a direct orientation (78.9%) with approximately the same proportion of sense and antisense orientations (Figure 5). In C. glabrata TGAs, the antisense orientation is represented 2.4 times more than the sense orientation, although the number of CDSs on the minus strand (2649 CDSs) and on the plus strand (2551 CDSs) is identical. TGAs consisting of two CDSs oppositely oriented represent 19.0% of total TGAs. The low frequency of mixed TGAs (2.1%) is in correlation with the low percentage of TGAs containing at least 3 CDSs (13.9%) and of which at least two copies have a convergent or divergent orientation, this being the less frequent orientation of a CDS pair (expected frequency = 0.190 × 0.139 = 0.026).
Data curation
The originality of our method of TGAs detection lies in the possibility to identify TGAs containing degenerated gene copies. From the nine yeast genomes analyzed, two different sets of CDSs, possibly close to a gene relic, appeared during the TGA extraction step and received a relic tag: (i) 360 isolated CDSs with at least one FTB score ≥ 10 and (ii) 23 CDSs, located on one extremity of a TGA comprising at least two CDS, with two FTB scores ≥ 10 (see Additional file 2). 15% (53 CDSs) and 22% (5 CDSs) of these two respective sets of tagged CDSs were first retrieved by an automated curation step to create new TGAs. After this procedure, a manual data curation produced supplementary TGAs or some TGAs recomposed from 24% (85 CDSs) and 48% (11 CDSs) of the two respective sets of tagged CDSs. No false positive CDS was found among the 23 tagged CDSs included in a TGA, whereas approximately half the isolated CDSs (177 false positives) were eliminated. Finally, 52 CDSs were located near a homologous gene relic: 45 constitute the class of TGAs that contain just one CDS (1CDS-relic) and 7 belong to larger TGAs (nCDSs-relic).
A possible source of false positive TGAs is the presence of minisatellites in the coding sequences. Among the 999 CDSs implicated in a TGA, 199 (20%) have internal tandem repeats (see Additional file 2). The manual examination of these suspicious CDSs revealed that 6.6% of the total number of TGAs before curation (33/502) were false positives. Nevertheless, it was difficult in some cases to decide whether the similarity between contiguous CDSs is partially or totally due to minisatellites. This problem occurred when long stretches of tandem repeats covered a large part of the CDSs. Therefore, the percentage of false positives is probably slightly underestimated.
Distributions of FTB scores
The degree of gene conservation in a TGA can be estimated by comparing BLAST bit scores between two genes. FTB scores were calculated from the result of TBLASTN sequence alignments. Two FTB scores ≥ 10 (S2 and S3) were obtained per gene pairs in a TGA. Two additional FTB scores (S1 and S4) characterize both chromosomal regions surrounding each TGA. The TGA flanking regions can contain CDSs, but these coding sequences have no homology with members of the TGA (Figure 6A). For each pair of tandem repeated CDSs (TGP for tandem gene pair), except those manually recomposed, the FTB score S2 was compared with the FTB score S3 (Figure 6B). These two scores are mostly equal or equivalent. In 6.2% of cases, the difference between both scores is > 20 and is due principally (60% of cases) to the size difference between both genes of a TGP. The comparison between the FTB scores S1 and S4 shows that in 93% of cases (323 TGAs) at least one of these two scores is equal to 0 (Figure 6C). Few TGAs (6 in total) have one score, S1 or S4, greater than 10. They correspond to TGAs that start or end with a homologous gene relic. Therefore, they belong to the nCDSs-relic class of TGAs. One of the seven TGAs in this class is not represented here because it was corrected manually.
Evolution of TGAs in comparison with the one of the other duplicated genes
Distribution of amino acid sequence identities between pairs of homologous proteins was computed to evaluate the divergence between paralogous (tandem arrayed and dispersed) and orthologous gene copies (Figure 7). Orthologous proteins show monomodal distributions and average identities comprised between 48-58%. They represent the most sequence-conserved homologous proteins. The distribution of tandem paralogs is intermediate between the distribution of other paralogs and the distribution of orthologs. Thus, paralogs in TGAs are less divergent than dispersed paralogs. Although the global distributions of these two sets of paralogs are not bimodal, the protein identities do not follow a normal distribution since the paralogs exhibiting high sequence conservation are found more frequently than their poorly conserved counterparts. This feature is amplified in the case of tandem duplicated genes.