Using paired-end sequences to optimise parameters for alignment of sequence reads against related genomes
© Ratnakumar et al. 2010
Received: 24 November 2009
Accepted: 3 August 2010
Published: 3 August 2010
Skip to main content
© Ratnakumar et al. 2010
Received: 24 November 2009
Accepted: 3 August 2010
Published: 3 August 2010
The advent of cheap high through-put sequencing methods has facilitated low coverage skims of a large number of organisms. To maximise the utility of the sequences, assembly into contigs and then ordering of those contigs is required. Whilst sequences can be assembled into contigs de novo, using assembled genomes of closely related organisms as a framework can considerably aid the process. However, the preferred search programs and parameters that will optimise the sensitivity and specificity of the alignments between the sequence reads and the framework genome(s) are not necessarily obvious. Here we demonstrate a process that uses paired-end sequence reads to choose an optimal program and alignment parameters.
Unlike two single fragment reads, in paired-end sequence reads, such as BAC-end sequences, the two sequences in the pair have a known positional relationship in the original genome. This provides an additional level of confidence over match scores and e-values in the accuracy of the positional assignment of the reads in the comparative genome. Three commonly used sequence alignment programs: MegaBLAST, Blastz and PatternHunter were used to align a set of ovine BAC-end sequences against the equine genome assembly. A range of different search parameters, with a particular focus on contiguous and discontiguous seeds, were used for each program. The number of reads with a hit and the number of read pairs with hits for the two end sequences in the tail-to-tail paired-end configuration were plotted relative to the theoretical maximum expected curve. Of the programs tested, MegaBLAST with short contiguous seed lengths (word size 8-11) performed best in this particular task. In addition the data also provides estimates of the false positive and false negative rates, which can be used to determine the appropriate values of additional parameters, such as score cut-off, to balance sensitivity and specificity. To determine whether the approach also worked for the alignment of shorter reads, the first 240 bases of each BAC end sequence were also aligned to the equine genome. Again, contiguous MegaBLAST performed the best in optimising the sensitivity and specificity with which sheep BAC end reads map to the equine and bovine genomes.
Paired-end reads, such as BAC-end sequences, provide an efficient mechanism to optimise sequence alignment parameters, for example for comparative genome assemblies, by providing an objective standard to evaluate performance.
With the availability of the so-called Next Generation Sequencing (NGS), relatively cheap high-throughput short molecule sequencing technologies such as Illumina GA and ABI SOLiD, and medium length sequencing technologies such as Roche 454 is giving non-specialist laboratories the ability to sequence large genomes. However, the large number of reads produced by these NGS technologies creates problems for the utilisation of the sequence data. In the last few years a number of new programs for the alignment of short reads, for example in the range 30-150 bases, have been described, these include Maq , SOAP  and Bowtie . In general, these programs are designed for resequencing projects, where few nucleotide sequence differences are expected between the sequence reads and the reference genome.
However, many projects are likely to be low coverage skims of previously unsequenced genomes  possibly combining identification of SNPs with a survey of the genome sequence. The optimal design of SNP chips and effective utilisation of the chips in whole genome association analyses requires the relative order of and the distance between the SNPs and their association with genes to be known. Obtaining this information is likely to rely on comparative genomics by utilising the assemblies of related genomes to order and orientate sequence reads and contigs. The assembly of the cat genome based on Sanger sequencing used such a process to build an assembly from a 1.9 fold coverage of the genome . For the cat, a combination of MegaBLAST and Blastz was used to generate the genome assembly, which utilised alignments to a number of other genomes such as human, chimpanzee, mouse, rat, dog, and bovine .
In recent years a wide range of different programs have emerged to complement BLAST, itself a compromise between specificity and sensitivity relative to the Smith-Waterman algorithm. True Smith-Waterman is too slow for large scale projects, but in an effort to approach its speed, sensitivity and specificity, MegaBLAST  and PatternHunter [7, 8], amongst others, have been developed. A key to increasing the speed of the sequence alignment programs has been the utilisation of discontiguous seeds [7, 9], allowing the matches to be spread over longer sequences with internal mismatches and therefore the utilisation of longer seeds for the same sensitivity. This approach has been implemented in MegaBLAST, Blastz  and PatternHunter amongst other programs. Using discontiguous seeds improves the specificity and sensitivity of the programs. Further innovations have included using multiple discontiguous seeds and refining the patterns of the seeds [11, 12]. However, much of the analysis and comparisons of approaches have been carried out on mRNA/EST sequence sets [9, 12] and not on genomic DNA which, in the eukaryotes, has quite different distributions of repeats. The new sequence alignment programs that have been developed for the alignment of sequence reads against reference genome sequences for resequencing projects (see above) do not appear to be suitable for comparative genomics approaches. For aligning medium length genomic sequence reads (150-500 bases) against related genomes, it is not immediately clear which program and which parameters would yield the best compromise between sensitivity and specificity.
Here we use the example of the analysis of the effectiveness of three widely used DNA sequence alignment programs to position ovine BES reads against the equine and bovine genome assemblies to demonstrate the utility of the approach. We use the information about the positional relationship of the end sequences of each BAC in the ovine genome to estimate the sensitivity and specificity of the methods of determining the positions in the related, but not identical genomes.
Programs and parameters used
-r 1 -q -1 -X 40 -W 16
-r 1 -q -1 -X 40 -W 12
-r 1 -q -1 -X 40 -W 11
-r 1 -q -1 -X 40 -W 9
-r 1 -q -1 -X 40 -W 8
-t 21 -W 11 -q -3 -r 2 -G 5 -E 2 -N 2
-t 18 -W 11 -q -3 -r 2 -G 5 -E 2 -N 2
-t 16 -W 11 -q -3 -r 2 -G 5 -E 2 -N 2
K = 4500, L = 4500, M = 50
K = 2200, L = 2200, M = 50
K = 2500 L = 2500 M = 50 T = 0
-db 0 -mi -mj -b 2 -N 1
PatternHunter  (Table 1) performed poorly, with high sensitivity, but very low specificity. This is probably due to the inability of PatternHunter to utilise "soft masking" of the repeats. That is, PatternHunter does not have the ability to initiate matches in unmasked (upper case) sequence and then extend the matches into repetitive (lower case) sequence. No further testing of PatternHunter was undertaken. As expected running MegaBLAST without soft masking also leads to significantly reduced specificity (data not shown). Thus for positioning sequences from genomes with multicopy repeats (i.e. most large eukaryote genomes) "soft masking" and associated extension options are essential for maximum specificity.
Blastz  has been widely used for aligning genome sequences including whole genome alignments and allows "soft masking" of the sequences. Running Blastz with a contiguous seed of length 8 bases, see BZc1 (Table 1), had a lower sensitivity than PatternHunter, with a very similar percentage of the BACs in the total dataset positioned in the tail-to-tail configuration and therefore an increased specificity, i.e. fewer BACs with positions and a higher probability that the BESs position using Blastz are correctly positioned (Figure 1A). Using Blastz with a discontiguous seed of 12 matches/19 bases, whilst keeping the remaining parameters from the previous case unchanged, returned a very similar sensitivity and specificity to search BZc1 (data not shown). Relaxing the parameters, see BZd2 (Table 1), increased the sensitivity, but with a loss of specificity (Figure 1A). Increasing the stringency of the parameters with the discontiguous seed, see BZd1 (Table 1), improved specificity to the theoretical maximum, but with a substantial penalty to the sensitivity (Figure 1A). For a detailed description of the use of seeds in sequence alignment programs see Brown 2007 .
Effect of match reward and mismatch penalty on specificity and sensitivity of contiguous and discontiguous MegaBLAST searches
% total BESs
% BESs positioned
in tail-to-tail BACs
in tail-to-tail BACs
match reward, -r
mismatch penalty, -q
Overall, for this particular comparison, the best parameters were short contiguous word size settings for MegaBLAST. A small improvement in sensitivity over searches was observed when using a word size of 8 instead of 9 (Figure 1A). Although the increase in sensitivity versus a word size of 16 was small (from ~70% BESs positioned to ~80%) the increase in specificity was large (from ~20% in tail-to-tail pairs to ~35%).
In comparison, the virtual ovine genome approach , which aimed to maximise matches and by applying a filter of tail-to-tail BACs, to minimize incorrectly positioned BACs, positioned 45.5% of ovine BACs (and thus BESs) in tail-to-tail organisation on the human genome assembly. This result considerably exceeds the performance of the approaches described here, although its aim is position as many BACs as possible, not position unpaired reads, for which it is not suitable. However, this approach relied on the availability of tools for the conversion of coordinates between genome builds that may not be available for emerging genome sequences and therefore it may not be generally applicable.
Calculation of true and false positive and false negative rates from search results.
BESs with positions
BESs in tail-to-tail BACs
BESs predicted to be in tail-to-tail BACs2
To test the applicability of the approach described here with sequences of shorter length the ovine BESs were trimmed to the first 240 bases, whilst retaining the pairing information. A smaller number of searches of program and parameter values were undertaken for the alignment of trimmed ovine BESs against the equine genome assembly (Figure 1B). As with the full length sequences, contiguous MegaBLAST performed substantially better than discontiguous MegaBLAST. In contrast to the full length BESs, where a word size of 8 was slightly better than 11, the improvement in using a word size of 8 over a word size of 11 was far greater for medium length sequences, especially once the score cut off was applied.
Using an earlier version of the bovine genome assembly (Btau3.0) reduced the sensitivity and specificity scores for all search programs and parameter sets tested relative to Btau4.0, but did not alter the relative ranking of the programs and parameters (data not shown). Thus where the objective is to optimise search parameters, as long as the order of contigs and scaffolds in the comparison genome is approximately correct, this approach will be effective.
Paired-end sequences provide a very useful dataset for optimising program and parameters for positioning sequence reads (or contigs) with a range of different lengths from one genome against another genome. Parameter estimation can be undertaken with genomes in various stages of assembly, although a substantial number of scaffolds much longer than the average length of the inserts in the paired-end reads are required. Surprisingly, MegaBLAST with contiguous seeds performs better than discontiguous seed MegaBLAST and Blastz for alignment of ovine reads to the equine and bovine genomes. PatternHunter would perform much better if it were able to effectively utilise soft masking. A range of programs and parameter settings can be quickly surveyed with datasets appropriate to the particular objective. The optimal balance between yield and specificity of positioning chosen will depend on the subsequent use of the results.
The objective of the general methodology we present here is to compare a number of alignment programs to maximise the number of sequences generated as part of a sequencing project of species A, that are correctly positioned on the already assembled genome of a related species B. A set of paired-end reads must be obtained for species A as part of the genome sequencing project. Then the paired-end reads for species A can be used to optimise the choice of DNA sequence alignment program and parameters to align all unpaired end reads to the framework genome B, thus enabling rapid and accurate construction of sequence contigs for species A. The general approach is as follows; with the selected DNA sequence alignment programs and a range of parameters position the paired-end reads from species A on to the genome of species B. Then count the total number of sequences positioned by each program and set of parameters and calculate the number of sequences that retain the original positional relationship of paired end reads within species A within the genome of species B. Plot the results as in Figure 1 and determine the program and parameters that achieved the required balance between the total number of sequences positioned and the false positive and false negative rates. Position the remaining sequences using the chosen program and parameters.
The set of ovine BESs downloaded from GenBank were filtered as described previously  and "soft" masked at the most sensitive settings of RepeatMasker  using a repeats database built from the ovine genome sequence reads (unpublished data, personal communication Alan McCulloch). The ovine BESs were aligned to the equine genome assembly (build Equcab1.0) and bovine genome assemblies (builds Btau3.0 and Btau4.0) downloaded from the UCSC genome browser website  using the programs and parameters described in Table 1. BES matches to the genome assemblies were assigned to the tail-to-tail category of BACs as previously described , otherwise they were assigned as unpaired. For all analyses if multiple matches were returned the highest scoring match was taken as the genomic position of the BES. If more than one top match had the same score, i.e. a tie, all of the positions with the same score were checked to determine if a tail-to-tail BAC was generated with the other BES. If a tail-to-tail BAC was generated from one of the tied score positions the BES was counted as contributing to a tail-to-tail BACs, otherwise the BESs from both ends of the BAC were counted as unpaired.
BLAST version 2.2.16 was obtained from the NCBI website . Blastz version 2004-Dec-22 was downloaded from the Miller laboratory website . PatternHunter 2.0 was down-loaded from the Bioinformatics Solutions Inc. website .
X = number of BESs with positions
t = Total number of BESs
y = Expected number of total BESs that are correctly positioned in a pair
Built into this approach is the assumption that positioning one BES correctly is independent of positioning the other end of the BAC, or of positioning it correctly. Where this is not true the number of BACs with both ends correctly positioned may exceed the theoretical maximum curve by a small amount.
The percentage of BACs in tail-to-tail configuration is always expressed as a percentage of the total BACs in the dataset, not as a percentage of BACs with at least one BES positioned.
o = number of BESs actually in tail-to-tail BACs
e = number of BESs predicted to be in tail-to-tail BACs
Bacterial Artificial Chromosome
BAC end sequence.
The authors would like to thank the Alan McCulloch for access to the sheep genome repeats library, Nathan Watson-Haigh for comments on the draft manuscript and members of the International Sheep Genomics Consortium (ISGC), in particular John McEwan and James Kijas for useful discussions. The authors also gratefully acknowledge the early pre-publication access under the Fort Lauderdale conventions to the draft equine genome sequence provided by the Broad Institute and to the draft bovine genome sequence provided by the Baylor College of Medicine Human Genome Sequencing Center and the Bovine Genome Sequencing Project Consortium. This work was partly funded by SheepGenomics (a joint venture of Meat and Livestock Australia and Australian Wool Innovation). The work was undertaken as part of the development of sheep genomics tools by the ISGC. The authors would also like to that the anonymous reviewers whose comments contributed significantly to the effective presentation of this work.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.