Program: VNTRfinder
VNTRfinder uses the TRF program [6] to detect repeats. Both variant and invariant repeats are reported, the former defined as a tandem repeat from a reference sequence that is found in a target sequence by VNTRfinder and observed to be of different length. The e-PCR [7] program is used to match known repeat regions in a query sequence to the apparently matching region in the target. The program permits variation in the length of the chosen flanks, and the number of permitted mismatches. The e-PCR wordsize parameter is set to 0 so that a maximum number of candidate matches are considered. The user can specify TRF repeat detection parameters and paste or upload two files with sequences of interest. The first file contains the reference sequence(s), the second the target sequence(s) – the sequences across which to look for repeat variation, of which an unlimited number can be entered. TRF is run on the reference(s) and any redundancy from overlapping repeats is eliminated according to the method described by Denoeud and colleagues [8]. For VNTRfinder, flanking regions of each tandem repeat identified in the reference sequence(s) are then aligned to the related regions of target sequence(s). For cases where there is one reference and one target sequence, there is the option to automatically run the search in the reverse direction.
For computational efficiency, the method implemented requires that there are no insertions or deletions in the TR flanks. This has the drawback that for more distantly related homologous repeats, the program will miss them. This feature of our approach will be disadvantageous in certain contexts, i.e. when the flanks of tandem repeats are strongly diverged, and contain indels. However, for many practical applications this is not a major issue: where it is a problem, the alternative BLAST based approach of Denoeud and Vergnaud is more optimal.
A mismatch parameter defines the number of permitted mismatches in flanking sequences. In locating the best matching sequence, a search is conducted by starting with zero mismatches, which is increased, if necessary, until a single, unambiguous match between reference and target sequences is obtained, or until the value for the maximum mismatches permitted is reached. If two regions of an identical degree of base similarity are detected, then the results are excluded. The motivation for taking this approach, rather than systematically extending the repeat flanks until an apparently unique orthologous region is defined, was as follows; a search of 215,000 repeats from the human genome against itself with a flanklength of 20 revealed approximately 2% that had more than one match. A simple expectation would be ~1 × 10-7. Since the great excess of such matches are likely to represent recently duplicated regions of the genome, simply extending the flanks to find the better of the two matches may be prone to erroneous matching of paralogous regions. By avoiding such extension the user is less likely to be misled by such mismatches. In our experience with the smaller bacterial datasets, systematic extension of flanks to identify a match had little effect on the ability to match many additional TRs.
The speed of VNTRfinder decreases with the number of sequences being searched against; as each repeat in the references(s) is searched against each target sequence, the more targets, the longer the search will take because each reference repeat must be searched against each target. For instance, a search of tandem repeats across the genome of Mycobacterium tuberculosis H37Rv against the genome of Mycobacterium tuberculosis CDC1551 took approximately 32 minutes for a dataset of 690 repeats detected using Tandem Repeats Finder default parameters. When Mycobacterium bovis was added as an additional target sequence, the search time increased to 62 minutes and when Mycobacterium leprae was added as another target, the search time increased to 127 minutes computational time. Also, increases are proportional to the degree of evolutionary divergence of the tandem repeat flanks
It is possible to scan the identified related region of the reference sequence between the flanks for repeats using TRF to enrich for length variations that represent a genuine tandem repeat array length difference rather than alternative splices, genome rearrangements etc. This is done by specifying to keep results where the hit "represents length difference consistent with change in the repeat copy-number" is selected. In this instance, the repeat unit length and motif will be the same in both reference and target (if this option is not selected, these may differ). Specifically, the sequence of the hit is scanned with TRF and tandem repeats are checked to ensure that the observed copy-number agrees with the expected one, given the length of the hit tandem array and the length of the repeat unit as follows:
For cases where the length of the hit tandem array is greater than that of the reference, we calculate I, a measure of the inconsistency of the variant with any multiple of the repeat copy number. This is calculated as
when the length of the hit is shorter than that of the reference, this is represented as
where ExpV is the expected copy-number for the variant (given the length of the variant and the length of the repeat unit), ObsV is the observed copy-number for the variant, as estimated by TRF, and ObsR is the observed copy-number for the reference repeat. This gives a distribution close to 0 for variations consistent with a change in repeat copy-number and close to 1 for other variants. Only variants with a value of I below 0.5 are retained. The equations serve to describe how closely a length variant matches the expected lengths seen from changes that represent precise changes in copy-number. The cut-off chosen is arbitrary, however, the vast majority of datapoints lie quite close to 0 or 1, and therefore alternative choices of the chosen cut-off are unlikely to alter the results obtained. It is important to note that this option represents a stringent search for repeat variation. The option to search for other types of variations is thus also provided, which identifies significantly more matches.
Descriptive statistics are gathered to describe the reference repeat and the extent to which the repeat varies across the reference and target sequences. These are the identifier of the reference sequence containing the repeat(s), the repeat unit length and tandem array length, the repeat unit sequence and tandem array sequence, the start and stop coordinates of the repeat in the reference sequence, the repeat copy-number, the left and right flanking sequence of the repeat, the total tandem array length of the reference repeat and all the hits, represented as a "population" of tandem array lengths, e.g. 26|28|26, and the number of mismatches tolerated in aligning the flanks to the target sequence.
Summary statistics that describe variability detected are also provided. These include a simple binary metric describing whether or not a repeat was observed to be variable and also a heterozygosity score (equivalent to "gene diversity" [7]) that describes the variability of the repeat in the population of sequences analyzed. The standard deviation of the heterozygosity is also given, which is clearly excessively large in the case of a variant detected between only two sequences, but becomes a useful statistic to establish the reliability of the heterozygosity in the situation where the user has as input a large number of sequences, which have a large number of allelic variants. The standard deviation and standard error of all observed tandem repeat array length alleles is also provided, which provides an indication of the spread of allele sizes. In many cases, the heterozygosity is not particularly informative unless derived from a large sample size and therefore close attention should be paid to the standard error of the repeat array length alleles. The lengths of all unique tandem array lengths and their frequency of occurrence are also provided. Examples of the output files can be viewed on our website.
Program: PolyPredictR
PolyPredictR predicts potentially polymorphic tandem repeats using simple rules previously described [4, 5]. The first set of rules, described by Wren and colleagues [4] use information on the length of the tandem repeat, its homogeneity and the copy-number of the repeat unit to predict polymorphism. Specifically, if a repeat is 100% homogenous (all units are identical), the copy-numbers 12, 6.5, 5.5, 4.5, 3.5 and 2.5 represent thresholds beyond which the repeats of unit length 1, 2, 3, 4, 5–9 and >=10 are predicted to be potentially polymorphic, respectively. If all units are not identical, these rules may not be applied. An earlier implementation of these rules does tolerate imperfect repeats (>90% homogenous) [9] but has higher false positive rates. These rules are also implemented by PolyPredictR. The second set of rules, described by Naslund and colleagues [5] uses a larger number of criteria as a predictive model. These include copy-number, entropy (summarises the percentage of different nucleotides in the repeat), GC dinucleotide bias and percentage match between repeats in the tandem array. These very simple rules provide a crude indicator, but the authors' validations of their efficiency have been relatively limited. Therefore, the utility of these predictors are not well understood and should be treated with some caution. Other attempts to understand the relationship between polymorphism and repeat sequence characteristics have had limited success, highlighting the difficulty of obtaining a set of rules applicable and relevant to all genomes [1, 8, 10]. Notably, it has been shown that sensitivity of various measures, such as the total length of the tandem repeat, the percentage matches between adjacent copies of the repeat unit, and the GC content of a tandem repeat, while significant predictors of polymorphism, can vary in their predictive power between different species [1, 10].