Efficient and scalable scaffolding using optical restriction maps

In the next generation sequencing techniques millions of short reads are produced from a genomic sequence at a single run. The chances of low read coverage to some regions of the sequence are very high. The reads are short and very large in number. Due to erroneous base calling, there could be errors in the reads. As a consequence, sequence assemblers often fail to sequence an entire DNA molecule and instead output a set of overlapping segments that together represent a consensus region of the DNA. This set of overlapping segments are collectively called contigs in the literature. The final step of the sequencing process, called scaffolding, is to assemble the contigs into a correct order. Scaffolding techniques typically exploit additional information such as mate-pairs, pair-ends, or optical restriction maps. In this paper we introduce a series of novel algorithms for scaffolding that exploit optical restriction maps (ORMs). Simulation results show that our algorithms are indeed reliable, scalable, and efficient compared to the best known algorithms in the literature.


Introduction
To conduct basic biological research such as but not limited to diagnostic, biotechnology, forensic biology, biological pathways and knowledge of DNA sequences has become inevitable. Scientists need to know the sequence of bases to reveal genetic information that is hidden in a particular segment of a DNA molecule. For example, they can use sequence information to identify which stretches of DNA molecule contain genes, as well as analyze those genes to detect potential changes in the sequence that may cause diseases. So, to obtain an in-depth knowledge of a particular DNA molecule, sequencing of that molecule is the primary step. DNA sequencing is any process that is used to map out the precise order of the nucleotides within a single strand of a DNA molecule. The structure of DNA was modeled as a double helix in 1953. The first notable method for sequencing DNA was developed during the 1970s known as Sanger sequencing. It is a method of DNA sequencing based on the selective incorporation of chain-terminating dideoxynucleotides by DNA polymerase during in vitro DNA replication [6,7]. It was developed by Frederic Sanger and his colleagues in 1977 and was the most widely used sequencing technology until the advent of NGS technologies. An alternative to Sanger was shotgun sequencing [8,9]. By the time the Human Genome Project (HGP) began in 1990, only a few scientific laboratories had the ability to sequence a mere 100k bases, and the total cost of sequencing remained very high. Since then, technological improvements and computerized automation have increased the sequencing speed and lowered the cost to the point where individual genes can be sequenced routinely, and some laboratories managed to sequence well over 100 million bases per year. Beginning in the late 1990s, the scientific community has developed a number of new DNA sequencing technologies including the first of the "next-generation" sequencing methods.
High-throughput or next-generation sequencing technologies parallelizes the sequencing process and produce thousands or millions of short reads (25-100 bp) simultaneously at a single run. Some of the sequencing technologies dominating the NGS market today are Massively parallel signature sequencing (MPSS), 454 pyrosequencing, Illumina (Solexa) sequencing, SOLiD sequencing, Ion semiconductor sequencing, etc. An introductory review of these techniques can be found in [1]. After generating NGS reads, they can either be assembled de novo or aligned to a known reference sequence [2]. The choice solely depends on the biological application of interest as well as cost, effort, and time constraints. For example, if the intended application of interest is to determine a complete genomic sequence of a new species, we have to follow de novo sequencing strategy. On the contrary, identifying genetic variations in multiple strains of highly related genomes can be accomplished by aligning NGS reads to their reference genomes. This approach is cheaper and faster than de novo sequencing. But there are some limitations and challenges associated with this alignment approach. One of the most important challenges is in placing the reads within repetitive regions in the reference genome. Besides this, some of the regions existing in the source genome may not even exist in the reference genome. This could happen because of gaps in the reference genome [3]. The problem of aligning reads in repetitive regions can be solved by exploiting matepair reads information. De novo sequencing techniques also face challenges in repetitive regions and from low read coverages that result in gaps in the constructed sequence. The former can be overcome by employing mate-pair reads [4] or optical restrictions maps [5] information and the later can be solved increasing the read coverage.
In sequencing DNA is first shredded randomly into numerous smaller fragments. The resulting fragments are sequenced using the chain termination method to obtain reads. Multiple overlapping reads for the target DNA are obtained by performing several rounds of this fragmentation and sequencing. The resulting reads of these fragments are then reassembled into their original order based on overlaps. Reassembly is done by a computer program ultimately yielding the complete and continuous sequence. A contig is a series of overlapping DNA sequences used to make a physical map that reconstructs the original DNA sequence of a chromosome or a region of a chromosome. It is a set of overlapping DNA segments that together represent a consensus region of DNA. If the coverage is large enough and the sequenced reads are error free, there should be only one contig containing the entire genome. But in the next generation sequencing technologies the coverage can be low resulting in gaps and the reads also can be erroneous. As a consequence sequence assemblers typically produce multiple contigs. Obtaining the exact orientation and precise order of the contigs is the next challenging task. This step is known as scaffolding.
In genomic mapping, a scaffold is a series of contigs that are in the correct order but not necessarily connected in one continuous stretch of the genomic sequence. So, a scaffold is not only composed of contigs but also gaps. The problem of finding the correct order of the contigs can be posed as the problem of finding a permutation of these contigs that optimizes an objective criterion. Scaffolding is known to be NP-hard. Any information about the orderings such as the sizes of fragments of the DNA molecule can indeed help in devising an efficient algorithm for scaffolding. We can get fragment size information by employing restriction enzymes. A restriction enzyme (or restriction endonuclease) is an enzyme that cuts DNA at or near some specific recognition nucleotide sequences known as restriction sites. Restriction enzymes are of three types and found in bacteria and archaea. A restriction enzyme acts against invading viruses by electively cutting up a foreign DNA in a process called restriction. In general, restriction enzymes recognize a specific sequence of nucleotides and produce a double-stranded cut in the DNA. The recognition sequences usually vary between 4 and 8 nucleotides, and they are generally palindromic sequences. The locations of these specific sequences of nucleotides on a DNA molecule are called restriction sites. A restriction map detects known restriction sites within a sequence of DNA by cleaving it with a specific restriction enzyme. A restriction map provides a number of fragment sizes which collectively serve as a unique "fingerprint" or "barcode" for that sequence [10]. An optical restriction map (ORM) [11] is also similar to a restriction map with only one difference. It provides an ordered list of fragment sizes and this method has been combined with the assembly process to sequence whole genomes. Some of the recent research on ORMs in the context of contigs assembly can be found in [12], [13], [14], or [5].
We have employed ORMs in the context of scaffolding to find the relative order and correct placement of contigs produced by sequence assemblers. In this paper we propose several algorithms for scaffolding. We use a two phase strategy for scaffolding (just like the authors of [5]). In the first phase we compute a score for each contig corresponding to each possible placement of the contig in the ORM. In the second phase we utilize the scores computed in the first phase to come up with a non-overlapping placement of (possibly a subset of) the contigs in the ORM. In brief, we transform each contig into an ordered sequence of fragment sizes (just like the ORM). A greedy scoring scheme is then applied to find a score for each contig for each possible placement of the contig in the ORM. Greedy placement algorithms are then used to place the contigs in a correct order by using the matching scores. To validate the robustness of our proposed algorithms we have introduced different types of errors. Our simulation results on both real and synthetic data show that our algorithms are indeed scalable and efficient in terms of both accuracy and time. The rest of this paper is organized as follows: Section 2 contains the algorithms we propose. Simulation results and relevant discussions are presented in Section 3. Section 4 concludes the paper.

Methods
There are two phases in our algorithm. In the first phase we compute a score for each contig corresponding to each possible placement of the contig in the ORM. In the second phase we utilize the scores computed in the first phase to come up with a non-overlapping placement of the contigs in the ORM. These two phases are described in Sections and, respectively.

A scoring scheme Overview
To effectively and correctly order the contigs we need a reliable scoring mechanism. As the genomic sequence can be composed of millions or even billions of characters, we should also consider the time spent by the proposed algorithms. There is a trade off between the time an algorithm takes and the accuracy it gives. We achieve a very good balance between these two. This is done by carefully formulating the scoring algorithm. For each contig, we generate an ordered list of fragment sizes. Since we know the sequence of the restriction enzyme from the ORM of the genomic sequence, we can easily find the ordered restriction fragment sizes of any contig by incorporating in silico digest of the restriction enzyme. The resulting list of ordered fragment sizes can be mapped with the ORM. Assuming that there are no errors either in the ORM or the fragment sizes of the contigs, for any given ordered fragment sizes of a contig, in general, there will exist a subset of matching ordered fragment sizes in the ORM. Exploiting this information we can order the contigs. But in a real world scenario the data often may not be error free. Errors could occur due to the omission of some restriction sites or a change in some fragment sizes (due to sequencing errors). To quantify the effect of the errors a scoring mechanism is introduced. Let . . , c i n i } be the set of the ordered in silico fragment sizes of a contig C i and B = {o l , o l+1 , o l+2 , . . . , o m−l+1 } be the set of ordered fragment sizes of a particular region of the ORM stretching from the lth fragment to the (m − l + 1) th fragment. The score of the contig Ci for the region stretching from the lth fragment to the (m − l + 1) th fragment of the ORM is defined as follows: where P and M RS are the penalty term and number of missed restriction sites, respectively. Penalty term P is user defined and should be very large. Under ideal circumstances where there are no errors in reads, there are no errors in the ORM, the assembly is perfect, etc., we should not tolerate any missed restriction sites. In this case P could be even ∞. But in practice, depending on the technology employed, we could expect to see some errors in every process. As a result, we have to use a finite penalty. The value of P will thus depend on the error rates in the different technologies. If the expected error rate is low, then P has to be large. If the expected error rate is high, then P has to be low. In our experiments a value of 999 for P seems to yield good results.
More details on our algorithms are given in the next section.

A greedy scoring algorithm
The input to the Greedy Scoring algorithm are an ORM of the genomic sequence of interest, an ordered list of fragment sizes for each contig, and a penalty term. The fragment sizes may not be known exactly. Each fragment size in general can be thought of as a random variable for which we know the mean and the standard deviation. For simplicity assume that the standard deviation is the same (say s) for all the fragment sizes. The algorithm proceeds greedily to calculate the score of each contig. In fact, the algorithm computes multiple scores for each contig. If m is the number of fragment sizes in the ORM, then the algorithm computes m scores for each contig.
Let o 1 , o 2 , . . . , o m be the fragment sizes in the ORM. Let C be any contig and let the fragment sizes of C be c 1 , c 2 , . . . , c n . A score for C is computed by matching c 1 with o 1 ; Another score is computed by matching c 1 with o 2 ; and so on. In other words, we compute a score for C by matching c 1 with o i for each possible value of i, 1 ≤ i ≤ m. What is the score when c 1 is matched with o i (for some specific value of i)? We correlate a prefix of C (of minimum length) with a prefix of o i , o i+1 , . . . , o m such that the two prefix sums are nearly the same (within s).
In other words, we identify the least integer u and an we find such u and q, we match c u with o i+q−1 . Now we proceed recursively, i.e., we look for a prefix (of least length) of c u+1 , c u+1 , . . . , c n and a prefix of o i+q , o i+q+1 , . . . , o m whose sums are nearly the same (up to s); and so on.
The score for the resultant mapping of the contig C is obtained using Equation 1. For example, the partial score corresponding to the mapping of c u with o i+q−1 is partial scores are computed and added. Note that when we map c 1 with o i , the last fragment c n of the contig will be mapped with some fragment o t in the ORM. Corresponding to this mapping of the contig C, we refer to o i as the starting fragment and o t as the ending fragment. For the base case when i = m, we match c n with o m . Also when n j=1 c j > m j=i o j we match c n with o m . More details of the algorithm can be found in Algorithm 1. The run time of our greedy scoring algorithm is O(mnr), where m is the number of fragments in the optical map, r is the number of contigs and n is the maximum number of fragments in any contig.

Placement schemes
The placement scheme utilizes the matching scores of the contigs to find the correct order. We propose three different placement algorithms that are described below.

Some notations
The list of ordered fragment sizes in the ORM is o 1 , o 2 , . . . , o m . The number of contigs is denoted as r. Let the cotigs be C 1 , C 2 , . . . , C r . The number of fragments in

Greedy placement algorithm 1 -GPA1
GPA1 takes as input the contigs and the ORM together with the output of Algorithm 1. If m is the number of ordered fragments in the ORM, then the number of scores associated with each contig will be m, as described in the previous section. The algorithm proceeds as follows: At first the matching scores associated with each contig are sorted individually in increasing order. The first position of the sorted list of each contig contains the minimum score among all the scores. As the penalty term is very large, this matching score is the best score for placing this contig anywhere in the ORM. We now sort the contigs based on the indices of the starting fragments corresponding to the best scores. As an example, assume that there are 5 contigs C 1 , C 2 , . . . , C 5 and consider their best scores. For each such score there is a starting fragment and an ending fragment. If the starting fragments of these contigs are o 5 , o 11 , o 3 , o 22 , and o 7 , respectively, then the sorted order of the fragments will be o 3 , o 5 , o 7 , o 11 , and o 22 . So the corresponding contigs with respect to its starting fragments will be C 3 , C 1 , C 5 , C 2 , C 4 . In general, let this sorted order be C 1 , C 2 , . . . , C r . Followed by this, we attempt to place the contigs in the ORM in this order (using the mapping corresponding to the best score). Specifically, we first try to place C 1 ; Next we attempt to place C 2 ; and so on. When we try to place any contig C, we check whether the starting and/or ending fragments of C will overlap with any of the already placed contigs. If there is such an overlap, we discard C and move onto the next contig in the sorted list.
A detailed pseudocode is supplied in Algorithm 2. Let m be the number of fragments in the optical map, and r be the number of contigs. Intuitively the number of matching scores of each contig C i is at most O(m). Since the matching score is an integer, sorting matching scores of each contig C i takes at most O(m) time. So, the execution time of lines 2-7 in Algorithm 2 is O(mr). Sorting contigs with respect to starting fragment takes O(r) time (line 8). In the worst case lines 9-12 take O(r 2 ) time. Since r ≪ m, the run time of Algorithm 2 is O(mr).

Greedy placement algorithm 2 -GPA2
GPA2 proceeds as follows: At first the matching scores associated with each contig are sorted individually in increasing order. Note that we consider m possible matchings for each contig and hence each contig has a list of m mappings and scores. Let the list of mappings (in sorted order of the matching scores) for contig C be L C .
The number of matching sites for a contig mapping is defined to be the number of fragments in the contig that are matched with fragments in the ORM. For each contig, we know that there are m scores (with one score per starting fragment or mapping). Corresponding to each starting fragment (i.e., mapping) we can also compute the number of matching sites. Thus for every contig, we have a list of m numbers of matching sites. We identify for each contig the mapping that has the largest number of matching sites. Let b C be this number for contig C. We order the contigs based their b C values in non-increasing order. Let the sorted list be C 1 , C 2 , . . . , C r based on their b C values.
Place the contigs one-by-one based on the above sorted list starting from C 1 For any contig C, mappings for this contig will be considered as per the list L C . In other words, the first time when we try to place C, we will use the mapping found in L C [1]. When we try to place C using this specific mapping, we check whether the starting and/or ending fragments of the contig will overlap with already placed contigs. If there is no overlap, we process the next contig. If there is an overlap while placing C (using the mapping in L C [1]), we move to the next entry in L C , i.e., L C [2]. If successful, we process the next contig. If not, we move on to the next entry in L C , and so on. We make repeated attempts to place C at most d times (where d is a user-specified parameter). If we are not successful in these d attempts, we ignore C and proceed to process the next contig.
Additional details of the algorithm are supplied in  Place the contig c i at the end of the list C 12 Return C

Greedy placement algorithm 3 -GPA3
GPA3 takes as input the contigs and the ORM together with the output of Algorithm 1. If m is the number of ordered fragments in the ORM, then the number of scores (or mappings) associated with each contig will be m, as described in the previous section. The algorithm proceeds as follows: At first the matching scores associated with each contig are sorted individually in increasing order. The first position of the sorted list of each contig contains the minimum score (i.e., the best score) among all the scores. We now sort the contigs based on their best scores. Let this sorted order be C 1 , C 2 , . . . , C r . Followed by this, we place the contigs in the ORM in this order. Specifically, we first try to place C 1 ; Next we try to place C 2 ; and so on. Note that for any given contig and a corresponding score, we know the starting fragment as well as ending fragment (in the ORM). While trying to place any contig C, we check if there will be any overlaps with any of the contigs already placed. If so, we move on to the next entry in C's list and check if C can be placed based on the corresponding starting and ending fragments without any overlaps. We make a total of at most d such attempts to place C (where d is a userdefined parameter). If C cannot be placed successfully within these attempts, we drop C from further considerations and move on to the placement of the next contig.
A pseudocode of the algorithm can be found in Algorithm 4. Let m be the number of fragments in the optical map, and r be the number of contigs. Place the contig c i at the end of the list C 17 Break Sort the array C with respect to the starting position in increasing order 22 Return

Results and Discussions
To prove the effectiveness of our proposed algorithms we have done rigorous simulations on both real and synthetic datasets. The simulation results show that the algorithms are indeed scalable and efficient. We have also compared our algorithm with one of the best known algorithms [5]. Our algorithm outperforms the aforementioned algorithm in terms of run time, by more than two orders of magnitude, and accuracy. The run time of the scaffolding algorithm of [5] is O(m 2 n 2 r), where m is the number of fragments in the optical map, r is the number of contigs and n is the maximum number of fragments in any contig. In comparison, the run time of our algorithm is O(mnr). In this section we present our experimental results. All the programs have been run on an Intel Core i5 2.3 GHz machine with 4 GB of RAM. if (c i is not overlapped with already placed contigs in C){ 16 Place the contig c i at the end of the list C 17 Break Sort the array C with respect to the starting position in increasing order 22 Return C

Real datasets
Real datasets are comprised of two strains of yersiniae bacteria, namely, Yersinia pestis, and Yersinia enterocolitica. The yersiniae are Gram-negative rods belonging to the family Enterobacteriaceae. They consist of 11 species of which three are pathogenic to humans. Those are Yersinia pestis, Yersinia pseudo-tuberculosis, and Yersinia enterocolitica. The genomic sequences of Yersinia pestis and Yersinia enterocolitica contain 4,653,728 bp and 4,615,899 bp, respectively. Each of the genomic sequences is randomly fragmented into a number of non-overlapping substrings/contigs of different lengths. We then permute the resulting contigs randomly to break the relative order existing among them. As we know the placement of the contigs when we generate them, we can easily detect whether our algorithms reconstruct the correct orderings from the randomly permuted contigs. To show the robustness of our proposed algorithms we introduce errors by discarding restriction sites with some probability. We also introduce errors by resizing, i.e., by increasing or decreasing the fragment sizes of the contigs. We have generated 50, 100, 200, and 400 contigs from the genomic sequence of Yersinia pestis and 200, and 400 contigs from Yersinia enterocolitica. Accuracy is defined as the fraction of the contigs placed correctly. If a contig cannot be placed, i.e, if the placement overlaps with other contigs, we call it a conflict. On the contrary when the placement of a contig is out of order (i.e. when the contig is misplaced) we call it wrong placement. From Table 1 and Table 2 it is evident that if there are no errors in the datasets, the accuracy found by applying the different methods is in the range: [97%, 100%]. The less the number of contigs, the more accurate the resulting placement of the contigs are. In this case, the algorithms are more resilient with errors. It is also the case that GPA3 is more robust against the errors introduced in the datasets.
To simulate practical scenarios, we have randomly generated reads of size 100 bp each from the two Yersinia strains. Contigs were created employing the String Graph Assembler (SGA) [15]. These contigs were then ordered using GPA2. After ordering we concatenated the ordered contigs to find the scaffold. As the sequences are very long, it is infeasible to calculate the edit distance between the original sequence and resulting scaffold. So, the genomic sequence and the corresponding scaffold are aligned using MUMmer [16]. The acronym "MUMmer" comes from "Maximal Unique Matches", or MUMs. It is based on the suffix tree data structure designed to find maximal exact matches of two input sequences. In Figure 1 we have aligned ordered contigs of Yersinia pestis onto the original sequence of Yersinia pestis. We have aligned   ordered contigs of Yersinia enterocolitica onto the original sequence of Yersinia enterocolitica. The plots [Please see Figure 1 and 2] represent the set of all MUMs between the two input sequences. Forward MUMs are plotted as red lines/dots while reverse MUMs are plotted as blue lines/dots (encircled). A line of dots with unit slope represents an undisturbed segment of conservation between the two sequences, while a line of dots with negative unit slope represents an inverted segment of conservation between the two sequences. As is evident, the alignments ordered contigs (i.e. scaffold) are nicely placed onto the original sequences. The coverage of these two alignments is approximately 92% which proves the effectiveness of our algorithms.

Synthetic datasets
We have generated four genomic sequences of various sizes by choosing each character randomly from a uniform distribution. We generated reads of size 100 bp from each of the datasets such that the average coverage of the reads to a particular position of the sequence is around 5. Reads were generated by taking substrings of size 100 bp from randomly selected positions in the sequence. SGA [15] was used to generate contigs from the reads. The contigs were then ordered using our algorithms. ORM is created in silico by choosing a 4-bp long sequence acting as a restriction enzyme. The ordered fragment sizes of each contig are also created by employing the same procedure stated above. After getting the scaffold we calculate the edit distance between the original sequence and the resulting scaffold. It is intuitive that if the placement of the contigs in the scaffold is correct, then the following statement holds: | Size(original_sequence) − Size(constructed_sequence)| ≈ edit_distance(original_sequence, constructed_sequence).
Our simulation results show that this is indeed the case [Please see Table 3].

Comparison
We have compared our algorithms with one of the the best known algorithms existing in the literature [5]. The simulation results show that our proposed algorithms are superior in terms of both run time as well as accuracy. As the size of the sequence is increased more and more, our algorithms are faster and faster than [5]. We have compared our proposed algorithms with [5] by using synthetic datasets. The ground truth of exact ordering of contigs is unknown in the case of real datasets as we do not know the placement of the contigs in prior. As optimal ordering is NP-hard, computationally it is impossible to find the correct placement when the number of contigs is large. So, to compare with [5] we have generated 4 artificial sequences of various sizes. 50 contigs were generated from each of the sequences. Contigs generation process is described in Section. Accuracy is calculated as the fraction of contigs placed  correctly. As is evident from the simulation results, our algorithms are two orders of magnitude faster and our placements are also better [Please see Table 4] than [5].
In some cases we did not calculate the accuracy as it was taking an indefinite amount of time compared to our algorithms. '-' indicates this issue in Table 4.

Conclusions
Contig assembly is a very challenging task. In de novo assembly it is one of the most important steps to construct an entire genomic sequence from millions of reads produced by the sequencers. A series of algorithms has been proposed in this paper to order the contigs. ORM is used to calculate matching scores between the sequence and contigs. Contigs are then placed so that the overall cumulative matching scores are minimized. We have performed rigorous simulations on both real and synthetic datasets. The results show that our algorithms are efficient in terms of both run time and accuracy.