High-throughput sequencing (HTS) technology has developed rapidly in recent years, and gigabases of sequence can now be produced in a few hours for only a fraction of the former cost. HTS has produced an explosion of knowledge in genetics and genomics thanks to the development of specific applications such as genome re-sequencing (whole genome sequencing and targeted sequencing). This technological evolution was paralleled by the development of new algorithms to deal with the quantity and the quality of reads produced. A fundamental analysis step in re-sequencing approaches is the mapping of the reads onto a reference genome. This step, which involves the accurate positioning of reads onto a reference genome sequence, is highly important because it determines the global quality of downstream analyses. The algorithms used for this step are called mappers. Mappers have to be sensitive and accurate and, if possible, fast and not too computationally demanding. They should be able to find the true position of each read on a reference genome and ideally distinguish between technical sequencing errors and natural genetic variations.
In recent years many mappers have been developed and distributed (more than 60 mappers have been listed). Two studies [2, 3] have classified mappers using a wide variety of features that include: the type of data, their application, the sequencing platform, the read length, the allowed error rate, parallel implementation, the ability to deal with multi-mapped reads (i.e. reads aligned to multiple locations), the input and output formats, and the available parameters. Mappers have multiplied and so has the range of possible settings. Hence, the growing difficulty in selecting a mapper has been raised in recent studies that evaluated mapper performance against a multiplicity of comparison criteria. Some of these studies have focused on mapper sensitivity (ability to correctly map reads) [4–6]. Schbath et al. studied the ability of mappers to identify unique versus multi-mapped reads using a well-controlled benchmark containing reads with exactly three mismatches. Hatem et al. introduced a benchmarking suite to analyze mapping tools, which consists of tests that cover input properties and algorithmic features.
In addition to the difficulty in determining evaluation criteria, choosing an appropriate evaluation method (i.e. how to compare mappers according to the evaluation criteria) and appropriate metrics is also problematic. Using real datasets to evaluate mapper performance allows only a rough assessment and classification of mappers by comparing the percentage of mapped reads, but does not reveal the actual accuracy of mappers. Attempts have been made to avoid this pitfall using simulated datasets in which the original read positions are known. Another difficulty lies in the accurate definition of what a correctly mapped read is. The basic definition is to consider a read as correctly mapped if the original location is retrieved. Ruffalo et al. broadened this definition by adding a condition on the quality score, which had to be above a given threshold. In a more recent paper, a new definition was introduced in which a read was considered to be correctly mapped if the mapping criteria were not violated, i.e. the alignment contained fewer errors than the threshold parameter set by the user.
Using simulated data allows numerical values to be obtained and compared between a set of mappers. However, simulated data do not have the same characteristics as real data, even when an error model based on real data is used. Real HTS data present biases that can be very difficult to simulate. Additionally, the current definition of mapping correctness based only on the original start location presents some weaknesses: a read can have several correct positions on the reference sequence, and sequencing errors or true genetic variations can lead to a better alignment at a genome position different from the original one. Holtgrewe et al. introduced the interval definition, rather than the genome position, to describe a read mapping and used a full-sensitivity algorithm to identify all possible matching intervals within a given error rate range for each read. This method has been implemented in RABEMA (Read Alignment BEnchMArk), a tool that evaluates the result of arbitrary read mappers that support the SAM output format with real and simulated datasets. Our analysis of the published literature on mapper evaluation led us to conclude that for a complete and robust comparison of mappers, both real and simulated datasets should be used. Using real datasets avoids simulation biases and gives a real picture of mapper behavior, whereas simulated datasets are benchmarks in which all parameters can be controlled. Additionally, a sound, more complete definition of what constitutes a correctly mapped read needs to be considered (see below).
In all the previous studies, mapper performance was evaluated using large eukaryotic genomes (mainly the human genome) and, for the most part, short Illumina or Illumina-like read data were used, except in [4, 6] where 454 datasets were evaluated with a reduced number of mappers and metrics. The type and rate of sequencing errors are inherent to the sequencing technology and more precisely to the nucleotide elongation detection methods used. For example, Life Technologies sequencing by oligonucleotide ligation and detection (SOLiD) technology showed a strong bias in its coverage of repetitive elements, whereas the Illumina reversible dye-terminator sequencing technology (HiSeq) mainly caused substitutions. Pyrosequencing on solid support (454/Roche) and ion semiconductor sequencing technology (Ion Torrent, Life Technologies) produced indel errors associated with homopolymer regions. In the published evaluations, the criteria that were tested and the default parameters of the mappers were usually chosen to deal with substitution-type errors and are, therefore, less informative for mapping the reads from new technologies like the Ion Torrent platform.
Furthermore, the analysis of small microbial genomes compared with the analysis of large eukaryotic genomes poses other challenges because microbial genomes contain a wide range of GC content, which is sometimes extreme. Very high or very low GC content means that there is a high probability of encountering homopolymers in a genome sequence and this is known to be a specific problem for pyrosequencing and ion semiconductor sequencers. A recent development in HTS technologies has made available benchtop sequencers targeted at the quick and inexpensive sequencing of small to moderate-sized genomes, mainly bacteria, viruses, fungi, and parasites. Small microbial genome sequences could be considered to present a simpler, less demanding mapping process compared with the mapping process for larger eukaryotic genomes. However, this is only partially true because the characteristics of small microbial genomes are not the same as those of eukaryotic genomes. The questions of interest are also usually different and, consequently, the expected mapping quality criteria are not exactly the same. Whole genome sequencing or re-sequencing is an important application in the new field of microorganism characterization using HTS. For instance, clinical diagnosis and the epidemiological study of microbial strain circulation will be profoundly remodeled in the near future by the use of HTS, which should, very soon, be used as a characterization approach for pathogens and which will probably slowly replace the present PCR-based and biochemical characterization methods [13, 14]. In this particular context the re-sequencing applications and derived analyses are at the forefront of research and development. The focus includes the sequencing of the entire length of a microbial genome and the analysis of the obtained reads by mapping them onto one or several reference strains to identify potentially relevant changes in the studied genome.
The aim is to accurately identify the gain or loss in genetic elements (genes or parts of genes, prophages, and plasmids) as well as small changes (mutations and indels) to predict a potential new phenotype or a derived new pathogenicity profile. This requirement poses several challenges, the most important of which is the necessity to distinguish true genetic variations from sequencing errors.
In this paper, we focus on the evaluation of mappers in the context of whole genome sequencing or re-sequencing for small microbial, mainly bacterial genomes. We tested 14 mappers, mostly using their default settings to be in the general context of non-expert users. We selected four criteria to match this context: (i) computational resource and time requirements, (ii) robustness of mapping through the evaluation of precision, recall and F-measure, (iii) ability to report positions for reads in repetitive regions, and (iv) ability to retrieve true genetic variation positions. To evaluate a mapper’s robustness on simulated datasets, we introduced a new definition of a correctly mapped read. In addition to the original start position (i.e. the position from which a read is simulated) that was used in most previous studies, the end position as well as the numbers of insertions, deletions, and substitutions in the alignment were also used to classify the mapping of a read as correct. This definition is more stringent than the previous ones because it implies that it is a full-length read alignment and that the error count is correct. Indeed, sequencing errors can mean that the original location of a read is not necessarily the best alignment location. Using mappers tuned to report all possible hits (‘all’ mode) and to accept a higher error rate than the error rate introduced in simulated reads, it should be possible to retrieve the original location in addition to potential equivalent or better hits. With the new definition of a correctly mapped read used in this study, we ensured that the mapper was able to retrieve the expected original alignment despite inevitable sequencing errors in the reads, thereby allowing a true evaluation of the mapper’s robustness.
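The stricter definition of a correctly mapped read described above can be illustrated with a minimal Python sketch. It is not the CuReSimEval implementation: the `Alignment` record, the function names, and the exact-match comparison are hypothetical, and the precision and recall formulas used here (correct reads over mapped reads, and correct reads over all simulated reads) are an assumption, since this section does not spell them out.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class Alignment:
    """Hypothetical record of one read alignment on the reference."""
    start: int
    end: int
    insertions: int
    deletions: int
    substitutions: int


def is_correctly_mapped(truth: Alignment, reported: Alignment) -> bool:
    """A read counts as correct only if start, end, and all three error
    counts match the simulated truth (exact match is an assumption)."""
    return truth == reported


def evaluate(
    truth: Dict[str, Alignment],
    reported: Dict[str, Optional[Alignment]],
) -> Tuple[float, float, float]:
    """Return (precision, recall, F-measure) over the simulated reads.

    Assumed definitions: precision = correct / mapped,
    recall = correct / total simulated reads.
    """
    mapped = 0
    correct = 0
    for read_id, true_aln in truth.items():
        aln = reported.get(read_id)
        if aln is None:  # read left unmapped by the mapper
            continue
        mapped += 1
        if is_correctly_mapped(true_aln, aln):
            correct += 1
    precision = correct / mapped if mapped else 0.0
    recall = correct / len(truth) if truth else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_measure
```

Under this sketch, a read reported at the right start position but with a different end position or a different error count is classified as incorrect, which is what makes the definition more stringent than one based on the start position alone.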
The analysis was applied to data generated by the Ion Torrent Personal Genome Machine (PGM), a newly arrived technology dedicated mainly to small genome sequencing, for which mapper performances have not yet been evaluated. Reads from real datasets and artificially simulated reads were used. Simulated reads were generated using a new customizable read simulator, CuReSim, which can generate reads of user-determined lengths with insertions, deletions, and substitutions introduced at a controlled rate and with an adjustable error distribution along the read. CuReSim and CuReSimEval, a script that can be used to evaluate mapping quality, were developed in Java to run on all operating systems (see Section 2 of Additional file 1 for more details) and are freely available at http://www.pegase-biosciences.com/tools/curesim/. We show that in microbial genome sequencing, some mappers, such as segemehl, are more robust than others, especially when the number of sequencing errors is high. Other mappers are more robust for other applications that demand other quality criteria. For example, BWASW, SHRiMP2, SMALT, SSAHA2 and TMAP might perform particularly well for sequencing focused on rare variant discovery because they show a robust discrimination of variations. SMALT can localize most of the positions of reads located in repeated regions. Some mappers, such as Novoalign, SMALT and SRmapper, needed very small memory resources (about 20 MB), while SNAP was very fast and required only about two minutes to process the larger datasets used in this study. These results emphasize the observation that mapper choice is application dependent and users should carefully consider the targeted aim before choosing a mapper.
The evaluation approach presented here, together with the developed tools (CuReSim to generate simulated reads and CuReSimEval to evaluate mapping quality), can be considered as a general method to evaluate existing or in-development mappers. It could also prove useful for evaluating mapper performance on reads from the coming third generation of sequencers, which may have yet another type and rate of errors.
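The principle of simulating reads with a controlled rate of insertions, deletions, and substitutions, as described for CuReSim, can be sketched in a few lines of Python. This is an illustrative toy, not CuReSim itself (which is written in Java): the function name, its parameters, and the uniform error distribution along the read are all assumptions, whereas CuReSim offers an adjustable distribution. The reference is assumed to be an uppercase string over A, C, G, and T.

```python
import random
from typing import Dict, Optional, Tuple

BASES = "ACGT"


def simulate_read(
    reference: str,
    start: int,
    length: int,
    sub_rate: float = 0.0,
    ins_rate: float = 0.0,
    del_rate: float = 0.0,
    rng: Optional[random.Random] = None,
) -> Tuple[str, Dict[str, int]]:
    """Copy reference[start:start+length] and inject errors uniformly.

    Returns the simulated read and the counts of each error type, which
    together with the start and end positions form the ground truth.
    Hypothetical helper, not part of any published tool.
    """
    if start < 0 or length <= 0 or start + length > len(reference):
        raise ValueError("requested window falls outside the reference")
    rng = rng or random.Random()
    read = []
    counts = {"insertions": 0, "deletions": 0, "substitutions": 0}
    for base in reference[start:start + length]:
        u = rng.random()
        if u < del_rate:
            # Deletion: the reference base is skipped in the read.
            counts["deletions"] += 1
        elif u < del_rate + sub_rate:
            # Substitution: replace with a different base.
            read.append(rng.choice([b for b in BASES if b != base]))
            counts["substitutions"] += 1
        else:
            read.append(base)
        if rng.random() < ins_rate:
            # Insertion: add a random extra base after this position.
            read.append(rng.choice(BASES))
            counts["insertions"] += 1
    return "".join(read), counts
```

Because the start position, the window length, and the error counts are recorded at simulation time, each simulated read carries the information needed to check a mapper's reported alignment against the original one.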