With the emergence of next-generation sequencing, the availability of prokaryotic genome sequences is expanding rapidly. A total of 5,276 genomes have been released since 2008, yet only 1,692 genomes were complete. The final phase of microbial genome sequencing, particularly gap closing, is frequently the rate-limiting step either because of complex genomic structures that cause sequence bias even with high genomic coverage, or the presence of repeat sequences that may cause gaps in assembly.
We have developed a Cytoscape plugin to facilitate gap closing for high-throughput sequencing data from microbial genomes. This plugin is capable of interactively displaying the relationships among genomic contigs derived from various sequencing formats. The sequence contigs of plasmids and special repeats (IS elements, ribosomal RNAs, terminal repeats, etc.) can be displayed as well.
Displaying relationships between contigs using graphs in Cytoscape rather than tables provides a more straightforward visual representation. This will facilitate a faster and more precise determination of the linkages among contigs and greatly improve the efficiency of gap closing.
The emergence of next-generation sequencing (NGS) technology greatly facilitated genome sequencing. The long reads produced by Roche 454 or PacBio SMRT makes de novo assembly easier to complete. Despite the symmetrical representation of sequences produced by 454 or other NGS methods, tens to hundreds of contigs still exist due to repeat sequences or GC/AT-rich regions in the genomes. Therefore, determining the order of contigs and filling in the gaps among them using PCR are two essential and rate-limiting steps in the final phase of whole-genome sequencing. The ‘Newbler Assembler’ developed by Roche 454 has strict parameters to avoid mis-assembly and thus results in the breakdown of some contigs. For example, one read would be separated and placed into two contigs due to base-calling variation in different reads, and in some extreme cases, no gap truly existed between two such “contigs”. Several existing scaffolders for high throughput sequencing (HTS) genome assemblies, such as GRASS , SSPACE , OPERA  and MIP Scaffolder , may provide effective scaffolding; however, they lack global visualization and have to face the balance between scaffold length and accuracy. Most visualization tools, such as Consed , DNASTAR lasergene  and Gap , which are often used for genome completion and enable users to verify the assembly of contigs, can only display a linear relationship of contigs . To provide a genome-level overview, ABySS-Explorer  and TGNet  were developed. TGNet incorporates several scripts for converting transcripts to facilitate assembly and represents contigs graphically using points. ABySS-Explorer  is another global viewer of contig assembly. However, neither program was designed to treat repeat contigs or display the reads that link contigs and imply the location of gaps and repeat contigs [8, 10] (Table 1). These programs also lack special functions for microbial genome analysis. Therefore, we developed ContigScape, a Cytoscape  plugin that can be used to display all relationships of contigs, including each contig and linked reads in a microbial genome; the gaps and repetitive sequences can then be confirmed by users. Our goal is to display the original relationships of all contigs instead of a manually trimmed result, as the real association of contigs should be depicted as a network rather than a linear linkage. Furthermore, repeat contigs, gaps and even plasmids can be highlighted, filtered, and customized.
From reference, From scaffolding information, 454 repeat reads or other database
From reference, From other database
Fungi,,bacteria, plasmid, virus etc.
One edge and two nodes
ContigScape is a convenient Java plugin based on Cytoscape , which is an established, free, and open-source software platform for the visualization and analysis of molecular interaction networks and can be used on Windows, Linux and Mac platforms. ContigScape is a simple and efficient plugin that makes gap closing during microbial genome sequencing more efficient.
Sequencing of samples, de novo assembly of the genomes, and scaffolding
All genome sequences used in Table 2 had been released in GenBank and were generated by different laboratories in China and sequenced by the Chinese National Human Genome Center ast Shanghai. In our approach, genome sequencing was conducted using the Roche 454 GS FLX system and the GS FLX Titanium Sequencing Kit. Reads were then de novo assembled using Newbler v2.3. We constructed the mate-pair DNA libraries with insert sizes larger than 3 kb and sequenced using the Illumina Hiseq 2000 sequencing platform. A random subset of mate-pair reads were used for mapping and analysis with scaffold.pl (perl script, see Additional file 1, using BWA , Samtools , FASTX-Toolkit and BEDTools  programs).
Strains used in this study and general sequence information
Programming language, systems, and external programs
ContigScape was developed based on Cytoscape, which is available for Linux, Windows and MacOS X. The core programming language of ContigScape is Java. Users are provided with a comprehensive manual that explains all functions (see Additional file 1).
Counting contig abundance and copy number, and display
Our interest lies in estimating the abundance of repeat contigs. We define a repeat contig as one at least having twice as much read coverage than the average genome coverage. Average genome coverage is the ratio of the total bases of reads assembled into contigs and the total size of all contigs. When users input Contig Relationship Scape (CRS) file in our plugin without original assembly result, the default arithmetic for genome coverage is to count the average coverage of all contigs with size bigger than 20 kb (In our experience, the repeat contig bigger than 20 kb is rare in microbial genome except plasmid). Each copy number is calculated as the ratio of contig abundance and average genomic coverage, which represents the corresponding repetition rate of the contigs. Contig abundance is the ratio of total bases of reads assembled into this contig and the contig size. We define a specific contig as one having read coverage less than 1.5 fold average (default value is 1.5, which can be set by users). So, the contigs whose coverage is greater than 1.5 and less than 2 are probable repeats. They need to be confirmed by counting the connections at the end of the contig or PCR method. Like Figure 1B7, 106S-106E is a repeat contig verified by two linkages in each end. PCR needs to be used to determine the relationship “37S-37E-106S-106E-41S-41E-106E-106S-42E-42S” or “37S-37E-106S-106E-41E-41S-106E-106S-42E-42S”.
Meanwhile, the average number of linkages between contigs can be computed by , where Z is the average number of linkages and n is the number of relationships conforming to the requirements. As above, the ratio of link number and Z indicates the width of edge representing linkage in CytoScape.
Principles of displaying Roche 454 genome assembly results
Roche 454 reads now exceed 700 base pairs in length and thus can be used to resolve gaps caused by small repeats. The ‘Newbler Assembler’ may produce a ‘454Contigs.ace’ file, which contains all assembly information and can be shown by ‘Consed’ . As indicated in Supporting Figure 2, when a read was separated into two contigs, the coordinate of the read in each contig was shown after the read name, followed by the contig number with which this read was linked. The general principle to label the reads spanning the linked contigs is to use ‘fmX’ to represent the 5’ end of the reads located in contigX and ‘toY’ to represent the 3’ end of the reads located in contigY. This unique feature of the ‘Newbler Assembler’ labeling system in conjunction with long reads from 454 enables us to extract all the information of ‘fm’ and ‘to’ from the ‘454Contigs.ace’ file. This information can then be arranged into a relationship table (Figure 2C, D), such as ‘5’-end-Contig1’ linked to ‘3’-end-Contig2’. This relationship table can then be displayed by ContigScape as shown in Figure 3D.
Principles of displaying scaffolds constructed by mate-pair reads
A scaffold is a consensus sequence formed by ordered contigs using ‘N’ to fill any gaps. The most common method uses the mate-pair information to assemble contigs into scaffolds. Scaffolding programs can determine the separation of two contigs depending on the fragment size of the mate-pair reads. For example, if two contigs were separately mapped by a pair of 3-kb mate-pair reads, the two contigs could be joined into a scaffold, and the gap size would be 3 kb minus the distance between the mapping loci and the end of contigs. This method would allow repeat regions less than 3 kb in length to be bridged. However, ambiguous linkages can occur if the repeat region was longer than the fragment size of the mate-pair library (Figure 4). Similar to the results from ‘Newbler’ for 454 reads, ContigScape can display a relationship network within scaffolds by counting the number of mate-pair reads linking to large contigs (>500 bp).
Results and discussion
Repeats are usually assembled into single contigs and thus cause gaps. After sequencing, two repeat regions (R1 and R2, Figure 3A) were assembled into the R1/R2 repeat contig (Figure 3B), and ContigScape reported all of its possible linkages with other regions (1–4, Figure 3C–D). Further PCR validation guided by this predicted linkage would exclude the incorrect relationships and result in a final correct consensus sequence. The repeat contigs in ContigScape are shown in red (Figure 3D) to distinguish them from the normal contigs shown in dark blue (default setting). In addition, the number of reads connecting two contigs is labeled with linkage edges, and the linkage reliability is illustrated by variable edge thickness.
The key feature of ContigScape is to determine the linkage of two contigs assembled from 454 or Illumina reads. An ‘Ace’ file can be opened directly by ContigScape and the relationship of contigs can be saved as a CRS format (see sample, tabbed.txt, tabbedCov.txt, Additional file 1). The CRS format includes two files, and each contains three columns. ‘tabbed.txt’ contains the number of connections among contigs, and ‘tabbedCov.txt’ describes the length and coverage of contigs. The ‘tabbed.txt’ is similar to AGP file and describes how the chromosomes and scaffolds were assembled from the component contigs, but does not require contigs to be sorted in advance. It will produce an original graph after loading the two files, and a final graph needed for the layout function of Cytoscape. Researchers can also obtain the CRS information by converting the results from GRASS, SSPACE, OPERA and MIP scaffolders.
Another prominent characteristic of ContigScape is the calculation of the coverage of contigs and the subsequent definition of the contig whose coverage exceeded two fold above the average, denoted as ‘repeat contig’. Each contig is represented by one edge and two nodes, with ‘XS’ and ‘XE’ indicating the 5’ end (Start) and 3’ end (End) of contigX (X represents a number), respectively. The linkage (reads) is represented by a sole edge whose thickness varies based on the number of supporting reads. The number on the edge of contigs indicates the contig length, whereas the number on the edge of linkages indicates the number of linking reads.
Application of technology to display 454 contigs and scaffolding by mate-pair reads
We have used this tool for the visualization of eleven genomes (Table 2, Figure 5), accelerating the completion of these genomes (nine of them have been published). After de novo assembly by 454 Newbler, researchers can estimate the complexity of specific genomes and the difficulty of gap closing with global views. In Figure 5, we see significant differences in the assembly of eleven genomes due to variance in the number of total contigs and repeat contigs. In addition, ContigScape has been applied to gap closing of an additional 40 genomes (Figure 6); the network of contigs in Streptomyces, Leptospira and Ralstonia is complex, whereas the contig graphs of Brucella, Mycoplasma and Ketogulonicigeniumis is simple. These genomes comprised bacteria, archaea, virus and fungi. It was clear that the gap closing for A. hospitalis W1 was easy. In the graph of A. hospitalis W1, we saw that the 28-kb contig3 was a tandem repeat, which had previously been identified as an integrated plasmid . It is easy to determine if the plasmid is circular and if the copy number exceeds two, such as A. orientalis HCCB10007 and E.tarda EIB202 . The 24th graph of Figure 6 shows four circular plasmids composed of only one contig. There was also a high-copy-number contig in the graph of Mycobacterium tuberculosis CCDC5079, and BLAST identified it as IS6110, an insertion element. The 14 rRNA operons of Bacillus thuringiensis BMB171 , each of approximately 5 kb in length, can also be clearly displayed (Figure 5). There were also many independent and closed rings in the assembly graph of Cotesia vestalis Bracovirus, which were identified as 35 non-redundant circular genome segments. The number of contigs in the fungus Cordyceps militaris exceeded 2,000, therefore the contigs need further scaffolding.
We applied ContigScape to a recently assembled Streptomyces sp genome with 111 contigs sequenced by Roche 454 without scaffolding. We added seven contigs (contig140, 141, 142, 143, 144, 145 and 146) into the two CRS files to show different plasmids (Figure 1B). After processing, we found 25 repeat contigs, constituting six plasmids, 8 rRNA operons and one telomere (contig28, Figure 1B2). The remaining repeats include IS elements, phage or other sequences. Figure 1A shows that 52 nodes have no linkage, and they need additional scaffolding information. Therefore, PCR is necessary to fill the remaining gaps. Any relationships requiring validation are indicated by a green edge.
Judging whether a repeat contig was from chromosome or plasmid mainly depended on the linkage information of two ends of this contig. Four different types were shown in Figure 1B: 1). Repeat contigs connected in a circular fashion (Panel 3), 2). Individual contig connected itself without anyone else (Panel 4 and 6), 3). One end of repeat contig having no linkage to any other contigs, usually representing linear chromosome telomere or linear plasmid end (Panel 1 and 2), 4). A linear plasmid composed of only one repeat contig without connections to any contigs (Panel 5). While if a plasmid is linear and single copy, ContigScape cannot distinguish it. We can estimate whether or not a contig was a plasmid effectively based on above described situation in our experience. Of course researcher must confirm whether it is a plasmid or not by PCR, sequencing and annotation.
In Figure 1B, 143E has connections with 142E and 144E (Panel 3). But the number of connections (800) between 143E and 142E is more than that (10) between 143E and 144E. In this case, the latter might be a nonspecific connection caused by little overlap among the reads. Additionally, Figure 1B shows that contig78 in the linear plasmid 80E-80S-78E-78S-54E-54S also has another copy in the chromosome (Panel 1).
We also applied this program to another Streptomyces sp genome with 145 contigs sequenced by Roche 454 with mate-pair information (Figure 7). We can better interpret the relationship between contigs by using mate-pair reads. Figure 7C represents a linear chromosome with an 18 kb repeat at the ends (telomeres).
Display functionality of ContigScape
There are several unique features of ContigScape for microbial genome analysis (Figure 1). In particular, the “find genomic features” function may identify contigs belonging to plasmid/terminal repeats, determine whether the plasmid was linear or circular, and counting the read coverage of this plasmid (Figure 1B). Second, ContigScape may determine the locations of the ends of linear chromosomes based on a repeat contig where in one end has two edges and the other has none. After the ‘Ace’ file is loaded, the genomic structure network can be displayed, including the linkage of contigs, contig size and number of repeats. Meanwhile, another plugin called Network Analyzer  can be used to determine the complexity of the network (genome), and thus estimate the amount of work required to complete the genome. When viewing the graph, the 1,000 base pairs of both 5’-end and 3’-end can be loaded, with 20 “N” linking them representing the middle sequences. Clicking the edge of two contigs, the sequence containing corresponding contigs’ ends can also be displayed. The displayed sequence can be used to design primers in ContigScape and perform blast against NCBI database. In addition, the user can open “edit panel” to edit the connections of the network. In addition to gap closing in bacterial genomes, complete BAC or plasmid sequences can also be finished using ContigScape. It can also display if a CRS file, converted from scaffolding results using different methods, was imported. The workflow of ContigScape is shown in Figure 8. Other functions of ContigScape are described in an Additional file 1 (see ContigScape manual).
Comparative assembly  utilizes a reference genome sequence as a guide to discern repeat contigs. However, there are three obvious weaknesses regarding comparative assembly: (1) the target species must have previously been sequenced and assembled; (2) structural variations exists in different references; (3) it cannot resolve large insertions. For example, we resequenced Amycolatopsis mediterranei S699 and assembled the genome de novo. Comparing with the previously released A. mediterranei S699 assembly , which was assembled using A. mediterranei U32 as a reference, the genome we sequenced contained a 10-kb insertion. The differences can likely be attributed to the different strategies used for genome assembly . De novo assembly is a reliable way to avoid these weaknesses of comparative assembly.
Each sequencing technology has its own biases that result in coverage gaps. As coverage increases, the number of gaps decreases. However, gaps can occur if reads that would typically be assembled into one contig cannot span a large repeat area. Therefore, utilizing repeat contigs is important. During scaffold construction, repeat contigs usually cause errors in scaffolding or in the creation of linkages. Some programs may elect to link two unique contigs with one repeat contig, thus the individual repeat contig is used only once. Therefore, correct judgment will greatly reduce the efforts invested in genome assembly. Displaying straightforward graph-based relationships of contigs in Cytoscape rather than tables also facilitates a faster and more precise determination of the linkages among contigs. Our goal is to display the original relationships of all contigs rather than the manually trimmed results because the true association of contigs should be depicted as a network rather than a linear linkage.
ContigScape isn’t an assembly program and cannot replace phred/phrap/consed package, indeed they are complementary to each other. Consed  and its process “autofinish”  are very useful in gap closing. Actually, all contigs’ PHD files together with ABI3730 data sequenced after PCR must be assembled using phrap and edited by consed at last in our finishing strategy. ContigScape looks like a canvas used to judge and edit the order among contigs and can evaluate the complexity of shot-gun assembly in global visually. The plugin can only process several NGS assembly data directly like 454Conitgs.ace and mate-pair reads, while the assembly result made by other programs should be transformed into CRS file as input.
Using ContigScape, contigs can be displayed and repeat contigs, gaps, and even plasmids can be highlighted, filtered, and customized. We designed unique functions for microbial genome analysis in ContigScape, such as the identification of plasmids, whether they are linear or circular and an estimation of their read coverage. We believe with the development of the third-generation sequencing technologies, gap closing will be much easier due to fewer assembled contigs. Long repeats will still hamper the assembly, especially in larger genomes; however, ContigScape will play an important role in gap closing for these genomes.
The genome sequences have been deposited at NCBI under the accession numbers:
Other requirements: Java 1.6 or higher, Cytoscape 2.8.3 (After Java and Cytoscape are installed, put ContigScape.jar under cytoscape2.8.3/plugins folder).
Restriction for non-academics: Users willing to use ContigScape for non-academic purposes should contact the corresponding author for details.
High throughput sequencing
Contig relationship scape.
We would like to thank the students of gap closing group in Chinese National Human Genome Center at Shanghai for suggestions about the plugin. This work was supported by the grants from National Natural Science Foundation of China (30830002, 31121001 ,31270056), from National Basic Research Program of China (2012CB721102) and the Shanghai Rising-Star Program (11QA1404600).
State Key Laboratory of Genetic Engineering, Department of Microbiology, School of Life Sciences, Fudan University
Shanghai-MOST Key Laboratory of Health and Disease Genomics, Chinese National Human Genome Center at Shanghai
CAS Key Laboratory of Pathogenic Microbiology & Immunology, Institute of Microbiology, Chinese Academy of Sciences
CAS Key Laboratory of Synthetic Biology, Institute of Plant Physiology and Ecology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences
Department of Microbiology and Li KaShing Institute of Health SciencesThe Chinese University of Hong Kong, Prince of Wales Hospital, Shatin, New Territories
Graduate School of Chinese Academy of Sciences
Gritsenko AA, Nijkamp JF, Reinders MJ, de Ridder D: GRASS: a generic algorithm for scaffolding next-generation sequencing assemblies.Bioinformatics 2012,28(11):1429–1437.PubMedView Article
Boetzer M, Henkel CV, Jansen HJ, Butler D, Pirovano W: Scaffolding pre-assembled contigs using SSPACE.Bioinformatics 2011,27(4):578–579.PubMedView Article
Bonfield JK, Smith K, Staden R: A new DNA sequence assembly program.Nucleic Acids Res 1995,23(24):4992–4999.PubMedView Article
Nielsen CB, Cantor M, Dubchak I, Gordon D, Wang T: Visualizing genomes: techniques and challenges.Nat Methods 2010,7(3 Suppl):S5-S15.PubMedView Article
Nielsen CB, Jackman SD, Birol I, Jones SJ: ABySS-Explorer: visualizing genome sequence assemblies.IEEE Trans Vis Comput Graph 2009,15(6):881–888.PubMedView Article
Riba-Grognuz O, Keller L, Falquet L, Xenarios I, Wurm Y: Visualization and quality assessment of de novo genome assemblies.Bioinformatics 2011,27(24):3425–3426.PubMedView Article
Shannon P, Markiel A, Ozier O, Baliga NS, Wang JT, Ramage D, Amin N, Schwikowski B, Ideker T: Cytoscape: a software environment for integrated models of biomolecular interaction networks.Genome Res 2003,13(11):2498–2504.PubMedView Article
Bonfield JK, Whitwham A: Gap5–editing the billion fragment sequence assembly.Bioinformatics 2010,26(14):1699–1703.PubMedView Article
Kent WJ, Sugnet CW, Furey TS, Roskin KM, Pringle TH, Zahler AM, Haussler D: The human genome browser at UCSC.Genome Res 2002,12(6):996–1006.PubMed
Stalker J, Gibbins B, Meidl P, Smith J, Spooner W, Hotz HR, Cox AV: The Ensembl Web site: mechanics of a genome browser.Genome Res 2004,14(5):951–955.PubMedView Article
Huang W, Marth G: EagleView: a genome assembly viewer for next-generation sequencing technologies.Genome Res 2008,18(9):1538–1543.PubMedView Article
Schatz MC, Phillippy AM, Shneiderman B, Salzberg SL: Hawkeye: an interactive visual analytics tool for genome assemblies.Genome Biol 2007,8(3):R34.PubMedView Article
Li H, Durbin R: Fast and accurate short read alignment with Burrows-Wheeler transform.Bioinformatics 2009,25(14):1754–1760.PubMedView Article
Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R: The sequence alignment/Map format and SAMtools.Bioinformatics 2009,25(16):2078–2079.PubMedView Article
Quinlan AR, Hall IM: BEDTools: a flexible suite of utilities for comparing genomic features.Bioinformatics 2010,26(6):841–842.PubMedView Article
Tang B, Zhao W, Zheng H, Zhuo Y, Zhang L, Zhao GP: Complete genome sequence of Amycolatopsis mediterranei S699 based on de novo assembly via a combinatorial sequencing strategy.J Bacteriol 2012,194(20):5699–5700.PubMedView Article
Xu J, Zheng HJ, Liu L, Pan ZC, Prior P, Tang B, Xu JS, Zhang H, Tian Q, Zhang LQ: Complete genome sequence of the plant pathogen Ralstonia solanacearum strain Po82.J Bacteriol 2011,193(16):4261–4262.PubMedView Article
Mi S, Song J, Lin J, Che Y, Zheng H: Complete genome of Leptospirillum ferriphilum ML-04 provides insight into its physiology and environmental adaptation.J Microbiol 2011,49(6):890–901.PubMedView Article
He J, Shao X, Zheng H, Li M, Wang J, Zhang Q, Li L, Liu Z, Sun M, Wang S: Complete genome sequence of Bacillus thuringiensis mutant strain BMB171.J Bacteriol 2010,192(15):4074–4075.PubMedView Article
Yang M, Lv Y, Xiao J, Wu H, Zheng H, Liu Q, Zhang Y, Wang Q: Edwardsiella comparative phylogenomics reveal the new intra/inter-species taxonomic relationships, virulence evolution and niche adaptation mechanisms.PLoS One 2012,7(5):e36987.PubMedView Article
You XY, Liu C, Wang SY, Jiang CY, Shah SA, Prangishvili D, She Q, Liu SJ, Garrett RA: Genomic analysis of Acidianus hospitalis W1 a host for studying crenarchaeal virus and plasmid life cycles.Extremophiles 2011,15(4):487–497.PubMedView Article
Chen YF, Gao F, Ye XQ, Wei SJ, Shi M, Zheng HJ, Chen XX: Deep sequencing of Cotesia vestalis bracovirus reveals the complexity of a polydnavirus genome.Virology 2011,414(1):42–50.PubMedView Article
Li Y, Zheng H, Liu Y, Jiang Y, Xin J, Chen W, Song Z: The complete genome sequence of Mycoplasma bovis strain Hubei-1.PLoS One 2011,6(6):e20999.PubMedView Article
Zheng P, Xia Y, Xiao G, Xiong C, Hu X, Zhang S, Zheng H, Huang Y, Zhou Y, Wang S: Genome sequence of the insect pathogenic fungus Cordyceps militaris, a valued traditional Chinese medicine.Genome Biol 2011,12(11):R116.PubMedView Article
Assenov Y, Ramirez F, Schelhorn SE, Lengauer T, Albrecht M: Computing topological parameters of biological networks.Bioinformatics 2008,24(2):282–284.PubMedView Article
Pop M, Phillippy A, Delcher AL, Salzberg SL: Comparative genome assembly.Brief Bioinform 2004,5(3):237–248.PubMedView Article
Verma M, Kaur J, Kumar M, Kumari K, Saxena A, Anand S, Nigam A, Ravi V, Raghuvanshi S, Khurana P: Whole genome sequence of the rifamycin B-producing strain Amycolatopsis mediterranei S699.J Bacteriol 2011,193(19):5562–5563.PubMedView Article
Gordon D: Viewing and editing assembled sequences using consed.Curr Protoc Bioinformatics 2003,Chapter 11(Unit11):12.
Gordon D: Automated finishing with autofinish.Genome Res 2001,11(4):614–625.PubMedView Article
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.