ContigScape: a Cytoscape plugin facilitating microbial genome gap closing
- Biao Tang†1, 2,
- Qi Wang†3, 6,
- Minjun Yang2,
- Feng Xie3,
- Yongqiang Zhu2,
- Ying Zhuo3,
- Shengyue Wang2,
- Hong Gao3,
- Xiaoming Ding1,
- Lixin Zhang3 (corresponding author),
- Guoping Zhao1, 2, 4, 5 (corresponding author) and
- Huajun Zheng2 (corresponding author)
© Tang et al.; licensee BioMed Central Ltd. 2013
Received: 29 December 2012
Accepted: 20 April 2013
Published: 30 April 2013
With the emergence of next-generation sequencing, the availability of prokaryotic genome sequences is expanding rapidly. A total of 5,276 genomes have been released since 2008, yet only 1,692 of them are complete. The final phase of microbial genome sequencing, particularly gap closing, is frequently the rate-limiting step, either because complex genomic structures cause sequence bias even at high genomic coverage or because repeat sequences cause gaps in the assembly.
We have developed a Cytoscape plugin to facilitate gap closing for high-throughput sequencing data from microbial genomes. This plugin is capable of interactively displaying the relationships among genomic contigs derived from various sequencing formats. The sequence contigs of plasmids and special repeats (IS elements, ribosomal RNAs, terminal repeats, etc.) can be displayed as well.
Displaying relationships between contigs using graphs in Cytoscape rather than tables provides a more straightforward visual representation. This will facilitate a faster and more precise determination of the linkages among contigs and greatly improve the efficiency of gap closing.
Keywords: ContigScape, Repeat contig, Microbial, Visualization, Linkage, Gap closing
Table: Comparison to other genomic display tools (the row layout was lost in extraction; only the entries below survive).
Column headings: Connections between contigs; Connections between scaffolds; Global contig display; The weight of contigs’ relationship.
Surviving cell entries: From paired reads; Any producing ACE files; Any producing AFG files; From paired reads viewed within a single scaffold; From transcripts, From scaffolding information; From reference, From scaffolding information, 454 repeat reads or other database; From reference, From other database; Fungi, bacteria, plasmid, virus etc.; One edge and two nodes.
Tools cited in the table: Gordon et al.; Bonfield et al.; Kent et al.; Stalker et al.; Robinson et al.; Huang, Marth; Schatz et al.; Nielsen et al.; Oksana et al.
ContigScape is a Java plugin for Cytoscape, an established, free, open-source software platform for the visualization and analysis of molecular interaction networks that runs on Windows, Linux and Mac platforms. The plugin is simple to use and makes gap closing during microbial genome sequencing more efficient.
Sequencing of samples, de novo assembly of the genomes, and scaffolding
Strains used in this study and general sequence information
Amycolatopsis mediterranei S699 
Ralstonia solanacearum Po82 
Amycolatopsis orientalis HCCB10007
Mycobacterium tuberculosis CCDC5079
Leptospirillum ferriphilum ML-04 
Bacillus thuringiensis BMB171 
Edwardsiella tarda EIB202 
Acidianus hospitalis W1 
Programming language, systems, and external programs
ContigScape was developed based on Cytoscape, which is available for Linux, Windows and MacOS X. The core programming language of ContigScape is Java. Users are provided with a comprehensive manual that explains all functions (see Additional file 1).
Counting contig abundance and copy number, and display
Meanwhile, the average number of linkages between contigs can be computed by $Z = \frac{1}{n}\sum_{i=1}^{n} L_i$, where Z is the average number of linkages, n is the number of relationships conforming to the requirements, and $L_i$ is the link number of the i-th such relationship. As above, the ratio of the link number to Z determines the width of the edge representing a linkage in Cytoscape.
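The width calculation described here can be illustrated with a minimal Python sketch. It is not ContigScape's own code: the function name `edge_widths` and the `base_width` parameter are hypothetical, and because the text does not state which relationships "conform to the requirements", the sketch simply averages over every relationship it is given. The two link counts are the ones reported for Figure 1B.

```python
from statistics import mean

def edge_widths(link_counts, base_width=1.0):
    """Return a display width per linkage, proportional to (link number / Z)."""
    if not link_counts:
        return {}
    # Z: average number of links over the n relationships supplied.
    z = mean(link_counts.values())
    return {pair: base_width * count / z for pair, count in link_counts.items()}

# Link counts between contig ends, keyed by (end_a, end_b).
links = {("143E", "142E"): 800, ("143E", "144E"): 10}
widths = edge_widths(links)
for pair, width in widths.items():
    print(pair, round(width, 3))
```

A strongly supported linkage is drawn much thicker than a weakly supported one, which is what makes doubtful connections easy to spot in the graph.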
Principles of displaying Roche 454 genome assembly results
Principles of displaying scaffolds constructed by mate-pair reads
Results and discussion
Repeats are usually assembled into single contigs and thus cause gaps. After sequencing, two repeat regions (R1 and R2, Figure 3A) were assembled into the R1/R2 repeat contig (Figure 3B), and ContigScape reported all of its possible linkages with other regions (1–4, Figure 3C–D). Further PCR validation guided by these predicted linkages would exclude the incorrect relationships and yield a correct final consensus sequence. The repeat contigs in ContigScape are shown in red (Figure 3D) to distinguish them from the normal contigs, which are shown in dark blue (default setting). In addition, the number of reads connecting two contigs is labeled on each linkage edge, and linkage reliability is indicated by edge thickness.
The key function of ContigScape is to determine the linkage between two contigs assembled from 454 or Illumina reads. An ‘Ace’ file can be opened directly by ContigScape, and the relationships among contigs can be saved in CRS format (see the samples tabbed.txt and tabbedCov.txt, Additional file 1). The CRS format comprises two files, each containing three columns: ‘tabbed.txt’ contains the number of connections among contigs, and ‘tabbedCov.txt’ describes the length and coverage of contigs. The ‘tabbed.txt’ file is similar to an AGP file in that it describes how the chromosomes and scaffolds were assembled from the component contigs, but it does not require the contigs to be sorted in advance. Loading the two files produces an initial graph; the final graph is obtained with the layout function of Cytoscape. Researchers can also obtain the CRS information by converting the results from the GRASS, SSPACE, OPERA and MIP scaffolders.
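As a rough illustration of the two-file CRS layout, the Python sketch below reads both tables. The column order is an assumption, since the text states only that each file has three columns: here ‘tabbed.txt’ is taken to hold two contig ends followed by the number of connections, and ‘tabbedCov.txt’ a contig name followed by its length and coverage, with tab separators and no header line. The sample rows are invented for the example; the authoritative layout is the sample files in Additional file 1.

```python
import csv
import io

def read_links(handle):
    """Parse a tabbed.txt-like table: end_a, end_b, number of connections (assumed order)."""
    links = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) != 3:
            continue  # skip blank or malformed lines
        end_a, end_b, count = row
        links[(end_a, end_b)] = int(count)
    return links

def read_contigs(handle):
    """Parse a tabbedCov.txt-like table: contig, length, coverage (assumed order)."""
    contigs = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) != 3:
            continue  # skip blank or malformed lines
        name, length, coverage = row
        contigs[name] = {"length": int(length), "coverage": float(coverage)}
    return contigs

# Invented sample data standing in for the two CRS files.
links_text = "contig1E\tcontig2S\t35\ncontig2E\tcontig3S\t12\n"
cov_text = "contig1\t5200\t21.5\ncontig2\t830\t64.0\ncontig3\t12040\t19.8\n"

links = read_links(io.StringIO(links_text))
contigs = read_contigs(io.StringIO(cov_text))
print(links)
print(contigs)
```

Output from a scaffolder could be brought into the same two dictionaries by a small converter, which is the role the CRS format plays as a common input.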
Another prominent feature of ContigScape is that it calculates the coverage of each contig and designates any contig whose coverage is more than two-fold above the average as a ‘repeat contig’. Each contig is represented by one edge and two nodes, with ‘XS’ and ‘XE’ indicating the 5’ end (Start) and 3’ end (End) of contigX (X represents a number), respectively. Each linkage (supported by reads) is represented by a single edge whose thickness varies with the number of supporting reads. The number on a contig edge indicates the contig length, whereas the number on a linkage edge indicates the number of linking reads.
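The representation described here can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the plugin's implementation: the names `flag_repeats` and `build_edges` are hypothetical, the repeat threshold is read as "coverage greater than twice the mean coverage", and the contig data are invented.

```python
def flag_repeats(coverage, fold=2.0):
    """Return the contigs whose coverage is more than `fold` times the mean."""
    average = sum(coverage.values()) / len(coverage)
    return {name for name, cov in coverage.items() if cov > fold * average}

def build_edges(lengths, links):
    """Build (node, node, attributes) tuples for contig and linkage edges."""
    edges = []
    for name, length in lengths.items():
        # One edge per contig, from its 5' end (XS) to its 3' end (XE),
        # labeled with the contig length.
        edges.append((f"{name}S", f"{name}E", {"kind": "contig", "label": length}))
    for (end_a, end_b), reads in links.items():
        # One edge per linkage, labeled with the number of linking reads.
        edges.append((end_a, end_b, {"kind": "linkage", "label": reads}))
    return edges

# Invented example: contig 4 has much higher coverage than the others.
coverage = {"1": 20.0, "2": 22.0, "3": 18.0, "4": 100.0}
lengths = {"1": 5200, "2": 830, "3": 12040, "4": 1500}
links = {("1E", "4S"): 30, ("4E", "2S"): 28, ("4E", "3S"): 25}

repeats = flag_repeats(coverage)
edges = build_edges(lengths, links)
print(repeats)
print(len(edges))
```

Modeling each contig as an edge between its two ends lets a linkage attach to a specific end, so contig orientation is carried by the graph itself.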
Application of technology to display 454 contigs and scaffolding by mate-pair reads
We applied ContigScape to a recently assembled Streptomyces sp. genome with 111 contigs sequenced by Roche 454 without scaffolding. We added seven contigs (contig140, 141, 142, 143, 144, 145 and 146) into the two CRS files to show different plasmids (Figure 1B). After processing, we found 25 repeat contigs, constituting six plasmids, eight rRNA operons and one telomere (contig28, Figure 1B2). The remaining repeats include IS elements, phage or other sequences. Figure 1A shows that 52 nodes have no linkage and therefore need additional scaffolding information, so PCR is necessary to fill the remaining gaps. Any relationships requiring validation are indicated by a green edge.
Whether a repeat contig belongs to the chromosome or to a plasmid is judged mainly from the linkage information at its two ends. Figure 1B shows four types: (1) repeat contigs connected in a circular fashion (Panel 3); (2) an individual contig connected only to itself (Panels 4 and 6); (3) a repeat contig with one end lacking linkage to any other contig, usually representing a linear chromosome telomere or a linear plasmid end (Panels 1 and 2); (4) a linear plasmid composed of a single repeat contig without connections to any other contig (Panel 5). However, if a plasmid is linear and single-copy, ContigScape cannot distinguish it. In our experience, the situations described above allow an effective estimate of whether a contig is a plasmid. Researchers must still confirm whether it is a plasmid by PCR, sequencing and annotation.
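The end-linkage reasoning above can be expressed as a small heuristic. The Python sketch below is only an illustration: the function name `classify_contig`, the category labels and the example linkages are hypothetical, and it inspects one contig at a time, so it does not by itself detect a circle formed by several contigs (type 1), which would require following the chain of linkages.

```python
from collections import defaultdict

def classify_contig(contig, links):
    """Classify one contig from the linkages at its two ends (heuristic sketch)."""
    start, end = f"{contig}S", f"{contig}E"
    partners = defaultdict(set)
    for end_a, end_b in links:
        partners[end_a].add(end_b)
        partners[end_b].add(end_a)
    start_links, end_links = partners[start], partners[end]
    if not start_links and not end_links:
        return "unlinked"             # candidate for type 4
    if start_links == {end} and end_links == {start}:
        return "self-circular"        # type 2: connected only to itself
    if not start_links or not end_links:
        return "free end"             # type 3: telomere or linear plasmid end
    return "linked at both ends"      # chromosome-internal or part of a longer chain

# Hypothetical linkages between contig ends.
links = [("1S", "2E"), ("2S", "3E"), ("9S", "9E")]
labels = {name: classify_contig(name, links) for name in ["1", "2", "9", "10"]}
print(labels)
```

Such labels are only hints for choosing PCR targets; as the text notes, plasmid status has to be confirmed experimentally.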
In Figure 1B, 143E has connections with both 142E and 144E (Panel 3). However, the number of connections between 143E and 142E (800) is far greater than that between 143E and 144E (10). In this case, the latter might be a nonspecific connection caused by a short overlap among the reads. Additionally, Figure 1B shows that contig78 in the linear plasmid 80E-80S-78E-78S-54E-54S also has another copy in the chromosome (Panel 1).
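The comparison made here, in which a weakly supported linkage is weighed against a much stronger one at the same contig end, can be sketched as a simple filter. The function name `weak_links` and the 5% cutoff (`min_ratio`) are assumptions chosen for illustration; the article does not define a numerical threshold. The link counts are those quoted for 143E.

```python
from collections import defaultdict

def weak_links(links, min_ratio=0.05):
    """Flag linkages far weaker than the strongest linkage at a shared contig end."""
    by_end = defaultdict(list)
    for pair, count in links.items():
        for contig_end in pair:
            by_end[contig_end].append((pair, count))
    flagged = set()
    for edges in by_end.values():
        strongest = max(count for _, count in edges)
        for pair, count in edges:
            if count < min_ratio * strongest:
                flagged.add(pair)
    return flagged

links = {("143E", "142E"): 800, ("143E", "144E"): 10}
suspect = weak_links(links)
print(suspect)
```

A flagged linkage is a candidate for a nonspecific connection, not a verdict; it marks where PCR validation is most useful.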
Display functionality of ContigScape
Comparative assembly utilizes a reference genome sequence as a guide to discern repeat contigs. However, comparative assembly has three obvious weaknesses: (1) the target species must have previously been sequenced and assembled; (2) structural variations exist among different references; (3) it cannot resolve large insertions. For example, we resequenced Amycolatopsis mediterranei S699 and assembled the genome de novo. Compared with the previously released A. mediterranei S699 assembly, which was assembled using A. mediterranei U32 as a reference, the genome we sequenced contained a 10-kb insertion. The differences can likely be attributed to the different strategies used for genome assembly. De novo assembly is a reliable way to avoid these weaknesses of comparative assembly.
Each sequencing technology has its own biases that result in coverage gaps. As coverage increases, the number of gaps decreases. However, gaps can still occur if reads that would typically be assembled into one contig cannot span a large repeat area. Therefore, utilizing repeat contigs is important. During scaffold construction, repeat contigs usually cause errors in scaffolding or in the creation of linkages. Some programs may elect to link two unique contigs with one repeat contig, so that the individual repeat contig is used only once. Correct judgment of repeat contigs will therefore greatly reduce the effort invested in genome assembly. Displaying straightforward graph-based relationships of contigs in Cytoscape rather than tables also facilitates a faster and more precise determination of the linkages among contigs. Our goal is to display the original relationships of all contigs rather than manually trimmed results, because the true association of contigs should be depicted as a network rather than as a linear linkage.
ContigScape is not an assembly program and cannot replace the phred/phrap/consed package; rather, the two are complementary. Consed and its “autofinish” process are very useful in gap closing. Indeed, in our finishing strategy, the PHD files of all contigs, together with the ABI3730 data sequenced after PCR, must ultimately be assembled using phrap and edited with consed. ContigScape acts as a canvas on which to judge and edit the order of contigs, and it gives a global visual assessment of the complexity of a shotgun assembly. The plugin can directly process only a few kinds of NGS assembly data, such as 454Contigs.ace and mate-pair reads; assembly results produced by other programs must be converted into CRS files as input.
Using ContigScape, contigs can be displayed, and repeat contigs, gaps, and even plasmids can be highlighted, filtered, and customized. We designed functions specific to microbial genome analysis in ContigScape, such as identifying plasmids, determining whether they are linear or circular, and estimating their read coverage. We believe that with the development of third-generation sequencing technologies, gap closing will become much easier because fewer contigs will be assembled. Long repeats will still hamper assembly, especially in larger genomes; however, ContigScape will play an important role in gap closing for these genomes.
The genome sequences have been deposited at NCBI under the accession numbers:
[GenBank: CP003729], [GenBank: CP002819], [GenBank: CP002820], [GenBank: CP003410], [GenBank: CP002884], [GenBank: CP002919], [GenBank: CP001903], [GenBank: CP001904], [GenBank: CP001135], [GenBank: CP002535], [GenBank: HQ009524-HQ009558], [GenBank: CP002513], [GenBank: AEVU00000000].
Availability and requirements
Project name: ContigScape
Project home page: http://sourceforge.net/projects/contigscape/.
Operating systems: Windows, Linux, MacOSX.
Programming language: Java, Perl
Software packages (Linux): Fastx_toolkit 0.0.13, BEDTools 2.14.3, BWA 0.5.7, Samtools 0.1.18
Other requirements: Java 1.6 or higher, Cytoscape 2.8.3 (After Java and Cytoscape are installed, put ContigScape.jar under cytoscape2.8.3/plugins folder).
Restriction for non-academics: Users willing to use ContigScape for non-academic purposes should contact the corresponding author for details.
HTS: High throughput sequencing
CRS: Contig relationship scape.
We would like to thank the students of the gap closing group at the Chinese National Human Genome Center at Shanghai for suggestions about the plugin. This work was supported by grants from the National Natural Science Foundation of China (30830002, 31121001, 31270056), the National Basic Research Program of China (2012CB721102) and the Shanghai Rising-Star Program (11QA1404600).
- Gritsenko AA, Nijkamp JF, Reinders MJ, de Ridder D: GRASS: a generic algorithm for scaffolding next-generation sequencing assemblies. Bioinformatics. 2012, 28 (11): 1429-1437. 10.1093/bioinformatics/bts175.
- Boetzer M, Henkel CV, Jansen HJ, Butler D, Pirovano W: Scaffolding pre-assembled contigs using SSPACE. Bioinformatics. 2011, 27 (4): 578-579. 10.1093/bioinformatics/btq683.
- Gao S, Sung WK, Nagarajan N: Opera: reconstructing optimal genomic scaffolds with high-throughput paired-end sequences. J Comput Biol. 2011, 18 (11): 1681-1691. 10.1089/cmb.2011.0170.
- Salmela L, Makinen V, Valimaki N, Ylinen J, Ukkonen E: Fast scaffolding with small independent mixed integer programs. Bioinformatics. 2011, 27 (23): 3259-3265. 10.1093/bioinformatics/btr562.
- Gordon D, Abajian C, Green P: Consed: a graphical tool for sequence finishing. Genome Res. 1998, 8 (3): 195-202.
- Burland TG: DNASTAR’s Lasergene sequence analysis software. Methods Mol Biol. 2000, 132: 71-91.
- Bonfield JK, Smith K, Staden R: A new DNA sequence assembly program. Nucleic Acids Res. 1995, 23 (24): 4992-4999. 10.1093/nar/23.24.4992.
- Nielsen CB, Cantor M, Dubchak I, Gordon D, Wang T: Visualizing genomes: techniques and challenges. Nat Methods. 2010, 7 (3 Suppl): S5-S15.
- Nielsen CB, Jackman SD, Birol I, Jones SJ: ABySS-Explorer: visualizing genome sequence assemblies. IEEE Trans Vis Comput Graph. 2009, 15 (6): 881-888.
- Riba-Grognuz O, Keller L, Falquet L, Xenarios I, Wurm Y: Visualization and quality assessment of de novo genome assemblies. Bioinformatics. 2011, 27 (24): 3425-3426. 10.1093/bioinformatics/btr569.
- Shannon P, Markiel A, Ozier O, Baliga NS, Wang JT, Ramage D, Amin N, Schwikowski B, Ideker T: Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res. 2003, 13 (11): 2498-2504. 10.1101/gr.1239303.
- Bonfield JK, Whitwham A: Gap5–editing the billion fragment sequence assembly. Bioinformatics. 2010, 26 (14): 1699-1703. 10.1093/bioinformatics/btq268.
- Kent WJ, Sugnet CW, Furey TS, Roskin KM, Pringle TH, Zahler AM, Haussler D: The human genome browser at UCSC. Genome Res. 2002, 12 (6): 996-1006.
- Stalker J, Gibbins B, Meidl P, Smith J, Spooner W, Hotz HR, Cox AV: The Ensembl Web site: mechanics of a genome browser. Genome Res. 2004, 14 (5): 951-955. 10.1101/gr.1863004.
- Robinson JT, Thorvaldsdottir H, Winckler W, Guttman M, Lander ES, Getz G, Mesirov JP: Integrative genomics viewer. Nat Biotechnol. 2011, 29 (1): 24-26. 10.1038/nbt.1754.
- Huang W, Marth G: EagleView: a genome assembly viewer for next-generation sequencing technologies. Genome Res. 2008, 18 (9): 1538-1543. 10.1101/gr.076067.108.
- Schatz MC, Phillippy AM, Shneiderman B, Salzberg SL: Hawkeye: an interactive visual analytics tool for genome assemblies. Genome Biol. 2007, 8 (3): R34. 10.1186/gb-2007-8-3-r34.
- Li H, Durbin R: Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics. 2009, 25 (14): 1754-1760. 10.1093/bioinformatics/btp324.
- Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R: The sequence alignment/Map format and SAMtools. Bioinformatics. 2009, 25 (16): 2078-2079. 10.1093/bioinformatics/btp352.
- Quinlan AR, Hall IM: BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics. 2010, 26 (6): 841-842. 10.1093/bioinformatics/btq033.
- Tang B, Zhao W, Zheng H, Zhuo Y, Zhang L, Zhao GP: Complete genome sequence of Amycolatopsis mediterranei S699 based on de novo assembly via a combinatorial sequencing strategy. J Bacteriol. 2012, 194 (20): 5699-5700. 10.1128/JB.01295-12.
- Xu J, Zheng HJ, Liu L, Pan ZC, Prior P, Tang B, Xu JS, Zhang H, Tian Q, Zhang LQ: Complete genome sequence of the plant pathogen Ralstonia solanacearum strain Po82. J Bacteriol. 2011, 193 (16): 4261-4262. 10.1128/JB.05384-11.
- Mi S, Song J, Lin J, Che Y, Zheng H: Complete genome of Leptospirillum ferriphilum ML-04 provides insight into its physiology and environmental adaptation. J Microbiol. 2011, 49 (6): 890-901. 10.1007/s12275-011-1099-9.
- He J, Shao X, Zheng H, Li M, Wang J, Zhang Q, Li L, Liu Z, Sun M, Wang S: Complete genome sequence of Bacillus thuringiensis mutant strain BMB171. J Bacteriol. 2010, 192 (15): 4074-4075. 10.1128/JB.00562-10.
- Yang M, Lv Y, Xiao J, Wu H, Zheng H, Liu Q, Zhang Y, Wang Q: Edwardsiella comparative phylogenomics reveal the new intra/inter-species taxonomic relationships, virulence evolution and niche adaptation mechanisms. PLoS One. 2012, 7 (5): e36987. 10.1371/journal.pone.0036987.
- You XY, Liu C, Wang SY, Jiang CY, Shah SA, Prangishvili D, She Q, Liu SJ, Garrett RA: Genomic analysis of Acidianus hospitalis W1 a host for studying crenarchaeal virus and plasmid life cycles. Extremophiles. 2011, 15 (4): 487-497. 10.1007/s00792-011-0379-y.
- Chen YF, Gao F, Ye XQ, Wei SJ, Shi M, Zheng HJ, Chen XX: Deep sequencing of Cotesia vestalis bracovirus reveals the complexity of a polydnavirus genome. Virology. 2011, 414 (1): 42-50. 10.1016/j.virol.2011.03.009.
- Li Y, Zheng H, Liu Y, Jiang Y, Xin J, Chen W, Song Z: The complete genome sequence of Mycoplasma bovis strain Hubei-1. PLoS One. 2011, 6 (6): e20999. 10.1371/journal.pone.0020999.
- Zheng P, Xia Y, Xiao G, Xiong C, Hu X, Zhang S, Zheng H, Huang Y, Zhou Y, Wang S: Genome sequence of the insect pathogenic fungus Cordyceps militaris, a valued traditional Chinese medicine. Genome Biol. 2011, 12 (11): R116. 10.1186/gb-2011-12-11-r116.
- Assenov Y, Ramirez F, Schelhorn SE, Lengauer T, Albrecht M: Computing topological parameters of biological networks. Bioinformatics. 2008, 24 (2): 282-284. 10.1093/bioinformatics/btm554.
- Pop M, Phillippy A, Delcher AL, Salzberg SL: Comparative genome assembly. Brief Bioinform. 2004, 5 (3): 237-248. 10.1093/bib/5.3.237.
- Verma M, Kaur J, Kumar M, Kumari K, Saxena A, Anand S, Nigam A, Ravi V, Raghuvanshi S, Khurana P: Whole genome sequence of the rifamycin B-producing strain Amycolatopsis mediterranei S699. J Bacteriol. 2011, 193 (19): 5562-5563. 10.1128/JB.05819-11.
- Gordon D: Viewing and editing assembled sequences using consed. Curr Protoc Bioinformatics. 2003, Chapter 11 (Unit11): 12-
- Gordon D: Automated finishing with autofinish. Genome Res. 2001, 11 (4): 614-625. 10.1101/gr.171401.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.