Comparing thousands of circular genomes using the CGView Comparison Tool
© Grant et al.; licensee BioMed Central Ltd. 2012
Received: 16 March 2012
Accepted: 11 May 2012
Published: 23 May 2012
Continued sequencing efforts coupled with advances in sequencing technology will lead to the completion of a vast number of small genomes. Whole-genome comparisons represent an important part of the analysis of any new genome sequence, as they can provide a better understanding of the biology and evolution of the source organism. Visualization of the results is important, as it allows information from a variety of sources to be integrated and interpreted. However, existing graphical comparison tools lack features needed for efficiently comparing a new genome to hundreds or thousands of existing sequences. Moreover, existing tools are limited in terms of the types of comparisons that can be performed, the extent to which the output can be customized, and the ease with which the entire process can be automated.
The CGView Comparison Tool (CCT) is a package for visually comparing bacterial, plasmid, chloroplast, or mitochondrial sequences of interest to existing genomes or sequence collections. The comparisons are conducted using BLAST, and the BLAST results are presented in the form of graphical maps that can also show sequence features, gene and protein names, COG (Clusters of Orthologous Groups of proteins) category assignments, and sequence composition characteristics. CCT can generate maps in a variety of sizes, including 400 Megapixel maps suitable for posters. Comparisons can be conducted within a particular species or genus, or all available genomes can be used. The entire map creation process, from downloading sequences to redrawing zoomed maps, can be completed easily using scripts included with the CCT. User-defined features or analysis results can be included on maps, and maps can be extensively customized. To simplify program setup, a CCT virtual machine that includes all dependencies preinstalled is available. Detailed tutorials illustrating the use of CCT are included with the CCT documentation.
CCT can be used to visually compare a reference sequence to thousands of existing genomes or sequence collections (next-generation sequencing reads for example) on a standard desktop computer. It provides analysis and visualization functionality not available in any existing circular genome visualization tool. By visually presenting sequence conservation information along with functional classifications and sequence composition characteristics, CCT can be a useful tool for identifying rapidly evolving or novel sequences, horizontally transferred sequences, or unusual functional properties in newly sequenced genomes. CCT is freely available for download at http://stothard.afns.ualberta.ca/downloads/CCT/.
The analysis of a newly sequenced bacterial genome often involves comparing the sequence to previously characterized genomes. Such comparisons can help to identify novel or rapidly evolving sequences, horizontal sequence transfer events, and interesting functional differences or relationships. Tools that can be used to visualize sequence conservation in conjunction with other sequence characteristics, such as functional classifications and nucleotide composition, are particularly popular [1–3]. Continual advances in sequencing technology have contributed to the availability of numerous bacterial genome sequences—currently there are thousands of bacterial and archaeal genome sequences available, and undoubtedly many more on the way. Given the accessibility of genome sequences there is an increasing need for comparative genomics tools that can handle large numbers of sequences. The CGView Comparison Tool (CCT) is a software package designed for visually comparing bacterial, plasmid, chloroplast, or mitochondrial genomes to thousands of other genomes or sequence collections. CCT can also display sequence feature information, COG classifications (which it determines itself), sequence analysis results, and base composition plots. Maps are highly customizable, both in terms of how comparisons are conducted and how results are presented, and can be generated in several sizes and file formats, suitable for publications, presentations, and posters.
CCT consists of several custom Bash and Perl scripts that are used to create a local COG sequence database, download sequences of interest from NCBI, extract and translate ORFs, extract CDS feature translations, perform BLAST searches, assign COG functional categories to proteins, generate CGView XML files, and create graphical maps. Additional scripts are included for redrawing maps and for drawing zoomed maps showing regions of interest in more detail. Wrapper scripts simplify the use of CCT by allowing a single command to run many of the CCT components in succession. The usage of these scripts is described in the “commands” section of the CCT documentation and demonstrated in the CCT tutorials. Although a command-line interface can be intimidating, the availability of a Linux virtual machine and clear documentation and tutorials should help users with little command-line experience take advantage of CCT. Furthermore, a few simple commands can be used to create several complex maps, thanks to the included wrapper scripts. Command-line tools are widespread and popular in bioinformatics, and offer important advantages related to reproducibility, automation, and remote execution.
CCT maps typically consist of several rings depicting a reference genome and its features (coding sequences and ORFs for example), and the results of BLAST comparisons between the reference sequence and one or more comparison sequences. A separate BLAST ring is drawn for each comparison genome. When there is similarity between a portion of the reference sequence and any part of a comparison sequence, a coloured arc is drawn beneath the region of the reference sequence showing similarity, in the appropriate comparison ring. User defined features, analysis results, and base composition plots may also be displayed.
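The geometry behind a comparison ring can be illustrated with a short Python sketch that converts a BLAST hit on the reference into an arc on a circular map. This is not CCT or CGView code; the function name `hit_to_arc`, the choice of degrees, and the convention that base 1 sits at angle 0 are assumptions made for illustration.

```python
def hit_to_arc(start, stop, genome_length):
    """Return (start_angle, sweep) in degrees for a hit on a circular map.

    Positions are 1-based and inclusive. Angle 0 corresponds to base 1
    (an assumption of this sketch). A hit with stop < start is treated
    as wrapping past the origin of the circular sequence.
    """
    if not (1 <= start <= genome_length and 1 <= stop <= genome_length):
        raise ValueError("positions must lie within the genome")
    if stop >= start:
        span = stop - start + 1
    else:
        # The hit runs from start to the end, then from base 1 to stop.
        span = (genome_length - start + 1) + stop
    start_angle = 360.0 * (start - 1) / genome_length
    sweep = 360.0 * span / genome_length
    return start_angle, sweep
```

For a 4,000 bp reference, a hit covering bases 1 to 1,000 would occupy the first quarter of the ring, and a hit from 3,501 to 500 would span the origin.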
An important and potentially time-consuming step when creating maps, regardless of the mapping software used, is obtaining the reference and comparison genome sequence records to be visualized. To simplify this process CCT includes several scripts for downloading sequences of interest directly from NCBI. These scripts make use of the NCBI Entrez Utilities Web Service API and can be used to download a single record based on accession number (fetch_genome_by_accession.sh), all the records corresponding to a particular species or genus (fetch_refseq_bacterial_genomes_by_name.sh), or all the records from a particular organelle or domain (e.g. fetch_all_refseq_mitochondrial_genomes.sh). Options are included for further restricting sequence retrieval according to sequence length. These options allow, for example, plasmid sequences to be avoided when downloading bacterial chromosomes. As with the other CCT scripts, the use of these utilities is demonstrated in the tutorials and all script options are described in the “commands” section of the CCT documentation.
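As a rough illustration of accession-based retrieval with length filtering, the following Python sketch builds a request URL for the public NCBI E-utilities `efetch` endpoint and filters candidate records by length. It does not reproduce the internals of the CCT download scripts, which are not described here; the helper names, the record layout, and the example accession are assumptions, and no network request is made.

```python
from urllib.parse import urlencode

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(accession, rettype="gb"):
    """Build an efetch URL for one nucleotide record in GenBank text format."""
    params = {
        "db": "nuccore",
        "id": accession,
        "rettype": rettype,
        "retmode": "text",
    }
    return EFETCH_URL + "?" + urlencode(params)


def keep_by_length(records, min_length=None, max_length=None):
    """Keep records whose 'length' falls within the optional bounds.

    `records` is a list of dicts with 'accession' and 'length' keys
    (a hypothetical layout used only in this sketch).
    """
    kept = []
    for record in records:
        if min_length is not None and record["length"] < min_length:
            continue
        if max_length is not None and record["length"] > max_length:
            continue
        kept.append(record)
    return kept
```

Setting a minimum length of, say, one million bases is one way a filter of this kind could exclude plasmids while retaining bacterial chromosomes.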
Several subtle but important enhancements have been made to CCT during its development to increase its utility as a genome visualization tool. For example, features parsed from the reference sequence file are drawn with partial opacity by default, so that the boundaries of features are apparent even if they overlap. When CCT assigns COG functional categories, it is able to appropriately handle cases where a protein appears to fall into multiple categories (based on sequence comparisons). In such situations the coloured arrow used to indicate COG class is divided into the appropriate number of smaller arrows, each coloured to represent a single COG category. This behaviour differs from that used to create virtually all existing bacterial maps that colour proteins according to functional class. When BLAST atlases are built by CCT (using the build_blast_atlas.sh or build_blast_atlas_all_vs_all.sh scripts), the reference genome is included automatically as one of the comparison genomes. The inclusion of the reference leads to a seemingly superfluous reference vs. reference BLAST analysis that actually serves an important purpose: it reveals portions of the reference that are unable to produce BLAST hits, due to ambiguous bases, BLAST filtering, or an absence of protein-coding sequences (in the case of protein or translated BLAST searches). Without the reference vs. reference ring there is the potential for comparison rings to be interpreted incorrectly. Unlike the other BLAST atlas tools we have encountered, CCT adjusts the order of the comparison genome rings automatically (when the build_blast_atlas.sh or build_blast_atlas_all_vs_all.sh scripts are used), so that the most similar genomes tend to be placed closest to the reference sequence ring. Similarity is determined using a heuristic that considers the total number of comparison genome bases contributing to hits as well as the scores of the hits.
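The exact ring-ordering heuristic is not specified here, so the following Python sketch shows only one plausible way such a ranking could work. It assumes each hit is a dict with hypothetical `length` and `bit_score` keys, and ranks genomes by summed bit score, using the total number of bases in hits as a tie-breaker. Both the data layout and the way the two quantities are combined are assumptions, not the method used by CCT.

```python
def order_rings(hits_by_genome):
    """Return genome names ordered from most to least similar.

    hits_by_genome maps a genome name to a list of hits, each a dict
    with 'length' (bases contributing to the hit) and 'bit_score'.
    The ranking rule is an illustrative assumption.
    """
    def similarity(name):
        hits = hits_by_genome[name]
        total_score = sum(hit["bit_score"] for hit in hits)
        total_bases = sum(hit["length"] for hit in hits)
        return (total_score, total_bases)

    # Highest similarity first, so that genome is drawn nearest the reference.
    return sorted(hits_by_genome, key=similarity, reverse=True)
```

Under any rule of this kind, the reference vs. reference comparison would be expected to rank first, since it yields the longest and highest-scoring hits.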
In practice this sorting makes it much easier to visualize sequence divergence trends for the genes or proteins in a reference sequence. For example, the most labile or divergent portions of a reference genome stand out as light-coloured regions adjacent to the reference ring, while well-conserved portions of the reference give rise to darkly-coloured arcs that form “spikes” of conservation extending towards the centre of the map. This sorting can also reveal genome segments whose similarity is inconsistent with the general trends revealed by the map. For example, a horizontally transferred segment can appear as a dark red or black arc in a portion of the map consisting of otherwise weakly similar genomes. Multiple regions of a comparison genome can be similar to a given region of the reference genome. For this reason CCT arranges the arcs within each BLAST ring so that low-similarity arcs do not obscure higher-similarity arcs. Finally, CCT has the ability to divide protein-search BLAST rings into six sub-rings corresponding to the six different reading frames of the reference sequence. This feature, when used with zoomed maps and CCT’s ORF drawing option, can be used to distinguish which of the ORFs or CDS features in an overlapping set are conserved.
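Two of the ideas above, drawing arcs so that weaker hits do not hide stronger ones and assigning hits to reading-frame sub-rings, can be sketched in Python as follows. The draw-order rule (weakest first, so stronger arcs are painted on top) and the frame-numbering convention (the one commonly used for translated BLAST searches on a linear coordinate system) are assumptions of this sketch, as are the function names and the `identity` key; CCT may implement these steps differently.

```python
def reading_frame(start, stop, strand, seq_length):
    """Return the reading frame (+1..+3 or -1..-3) of a hit.

    start <= stop are 1-based reference coordinates; strand is 1 for the
    forward strand and -1 for the reverse strand. Reverse-strand frames
    are counted from the end of the sequence.
    """
    if strand == 1:
        return (start - 1) % 3 + 1
    return -((seq_length - stop) % 3 + 1)


def draw_order(arcs):
    """Order arcs so that higher-identity arcs are drawn last (on top)."""
    return sorted(arcs, key=lambda arc: arc["identity"])
```

With six sub-rings, each hit would be drawn in the sub-ring matching its frame, which is what allows overlapping ORFs in different frames to be told apart on a zoomed map.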
CCT writes a configuration file to each project when it is initialized. This simple file can be edited to specify, among other things, which types of BLAST comparisons are performed. In total there are currently 16 comparison scenarios available. These are described in detail in the CCT documentation, and differ primarily in terms of which regions of the sequences are compared and whether the comparisons are done at the DNA or protein level. For example, CCT can compare the entire reference sequence to each comparison genome at the nucleotide level using blastn, or it can conduct the comparisons using the 6-frame translations of the sequences, using tblastx. Alternatively, CCT can identify and translate ORFs in each sequence, and compare the translations using blastp. Instead of determining ORFs, CCT can simply extract the CDS feature translations from the input files and compare them using blastp. Finally, instead of single genomes the comparison sequence files can consist of multiple DNA sequences in FASTA format (next-generation sequencing reads for example) or multiple protein sequences in FASTA format (a custom collection of bacteriophage proteins for example). There are options specific to such multi-FASTA files for controlling how the comparisons with the reference are performed. CCT uses file extensions to determine which sequence files should be used for a given BLAST comparison. When there are multiple files with the same extension, a separate BLAST comparison is conducted for each, and the results are shown in separate rings on the resulting map. Multiple comparison types can be shown on a single map. In this manner conserved non-translated sequences can be visualized along with conserved coding regions.
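The extension-based selection of comparison files can be pictured with the following Python sketch, which groups file names by extension and plans one ring per file. The extensions used in the usage note (`.gbk`, `.fna`) are illustrative assumptions; the extensions that CCT actually recognizes are described in its documentation.

```python
from collections import defaultdict
from pathlib import PurePath


def group_by_extension(filenames):
    """Group comparison files by lower-cased extension, sorted by name."""
    groups = defaultdict(list)
    for name in sorted(filenames):
        groups[PurePath(name).suffix.lower()].append(name)
    return dict(groups)


def plan_rings(filenames, extension):
    """Return one planned ring (one BLAST comparison) per matching file."""
    matching = group_by_extension(filenames).get(extension, [])
    return [{"ring": index + 1, "file": name} for index, name in enumerate(matching)]
```

For example, a directory holding `a.gbk`, `b.gbk`, and `reads.fna` would yield two planned rings for the `.gbk` comparison and one for the `.fna` comparison.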
There are numerous data types users may wish to visualize on CCT maps, such as gene expression measurements, the positions of horizontally transferred segments revealed by specialized programs, the positions of SNPs identified by sequencing, and so on. To accommodate user-supplied information, all CCT projects include a “features” and an “analysis” directory. Simple tab-delimited or comma-delimited text files can be added to these directories. Files describing the positions of genes and other regions of interest are generally placed in the features directory, whereas files assigning numerical scores (positive or negative) to genome regions are placed in the analysis directory. CCT parses the files and creates a separate ring for each on the final map. The regions obtained from the features directory are drawn much like the features extracted from GenBank files, whereas those parsed from analysis files are scaled according to the accompanying score value.
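To make the distinction between the two file types concrete, the following Python sketch reads a small delimited file and scales analysis scores relative to the largest absolute score. The column names (`start`, `stop`, `score`) and the scaling rule are assumptions made for illustration; the file layouts that CCT expects are given in its documentation.

```python
import csv
import io


def read_rows(text):
    """Parse tab- or comma-delimited text with a header row into dicts."""
    delimiter = "\t" if "\t" in text else ","
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))


def scale_scores(rows):
    """Return (start, stop, scaled_score) tuples with scores in [-1, 1].

    Scores are divided by the largest absolute score, so positive and
    negative values keep their sign and relative size.
    """
    scores = [float(row["score"]) for row in rows]
    peak = max((abs(score) for score in scores), default=0.0) or 1.0
    return [
        (int(row["start"]), int(row["stop"]), score / peak)
        for row, score in zip(rows, scores)
    ]
```

A feature file would be handled the same way by `read_rows`, but its regions would be drawn at a fixed height rather than scaled.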
The maps generated by CCT are designed to be visually appealing and informative. Nonetheless the default colours, font sizes, line widths, etc. may not always be appropriate. CCT includes functionality that allows maps to be customized without the need to repeat the computationally intensive analysis steps used to assign COG categories and identify sequence similarities. This functionality is implemented using more than 70 "customization keys", which can be passed to certain CCT commands along with their desired values. For example, the "backgroundColor = rgb(0,0,0)" key-value pair can be supplied to the build_blast_atlas.sh script, to specify that maps should be drawn with a black background. The script can reuse existing BLAST results if they are available in the relevant project directories (using the --start_at_xml option). A comprehensive list of the CCT customization keys is available on the CCT web site, and commands that make use of these keys are included in the CCT tutorials.
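The key-value form of these customization settings is simple enough to sketch in Python. The parser below splits strings such as "backgroundColor = rgb(0,0,0)" into a settings dictionary. It is an illustrative sketch only: the function name is hypothetical, and how the CCT scripts actually read and validate these pairs is not described here.

```python
def parse_customization(pairs):
    """Parse 'key = value' strings into a dict of map settings.

    Only the first '=' separates key from value, so values may
    themselves contain '=' or commas.
    """
    settings = {}
    for pair in pairs:
        key, separator, value = pair.partition("=")
        key = key.strip()
        if not separator or not key:
            raise ValueError(f"expected 'key = value', got {pair!r}")
        settings[key] = value.strip()
    return settings
```

Because settings like these affect only how a map is drawn, they can be applied to existing results without repeating the BLAST searches.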
The CCT map creation process generates an XML file for input to the CGView program. This intermediate XML file, which uses a simple syntax to fully describe the features and appearance of the map, can be edited as another way of adjusting CCT maps. The redraw_maps.sh script can be used to quickly redraw maps directly from the modified XML file.
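Edits to the intermediate XML can be made by hand or scripted. The following Python sketch changes one attribute on the root element of a small, simplified fragment written in the style of CGView XML. The fragment is an assumption made for illustration and is far shorter than a file produced by CCT, and the helper name is hypothetical.

```python
import xml.etree.ElementTree as ET

# Simplified, hand-written fragment; not actual CCT output.
SAMPLE_XML = """<cgview sequenceLength="5000" backgroundColor="white">
  <featureSlot strand="direct">
    <feature color="blue" decoration="arc">
      <featureRange start="100" stop="900"/>
    </feature>
  </featureSlot>
</cgview>"""


def set_root_attribute(xml_text, name, value):
    """Return the XML text with one attribute set on the root element."""
    root = ET.fromstring(xml_text)
    root.set(name, value)
    return ET.tostring(root, encoding="unicode")


updated_xml = set_root_attribute(SAMPLE_XML, "backgroundColor", "black")
```

The modified text could then be written back to the XML file and passed to the redraw step.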
A final option for customization is to edit the resulting images themselves. CCT maps are drawn in PNG format by default but can instead be saved in a vector-based format (SVG). Use of SVG format allows map elements to be individually adjusted using a vector graphics editor.
There are many visual comparative genomics tools available, each with different advantages and disadvantages. For bacterial genomes, we previously created the CGView Server, which can compare a reference genome to up to three comparison genomes, using nucleotide or translated nucleotide BLAST searches. CCT greatly exceeds the capabilities of this server in all regards. For example, far more genomes can be compared, comparisons can be conducted at the level of CDS features or ORFs, much larger maps can be drawn, maps can be saved in vector or raster-based formats, COG functional categories can be assigned and shown, map appearance (colours, font sizes, feature widths etc.) can be customized extensively, comparison genome sets can be prepared much more easily, subsets of genes can be labeled, and multiple maps can be generated automatically for a set of sequences. Users also have access to the raw BLAST results, CGView XML files, and program source code. Depending on the reasons for performing the sequence comparison, other tools may be preferred. For example, ACT and Circos use lines to connect regions of sequence similarity between sequence ideograms, and can thus be used to visualize sequence rearrangements. Some users may prefer the default colour scheme and appearance of maps produced using the BLASTAtlas service over those generated by CCT. The MultiPipmaker server can compare larger genomes and create linear figures. The BLAST Ring Image Generator (BRIG) uses CGView to render maps and is operated using a graphical user interface. The graphical interface and included functionality for draft genomes and next-generation sequencing data files make BRIG an excellent alternative to CCT for some users and data sets.
It is worth noting, however, that CCT has advantages over all of these tools related to its visual enhancements, diversity of included analyses, the number of genomes it can handle, its integrated scripts for downloading genomes, support for map customization, and its command-line interface for rapid creation of complex maps and pipeline integration.
In summary, the CGView Comparison Tool (CCT) is a package for visually comparing circular sequences of interest to existing genomes or sequence collections. The capacity and capabilities of CCT distinguish it from existing sequence visualization tools. The ease with which large and complex maps can be created should make CCT appealing to anyone who aims to learn more about bacterial, plasmid, chloroplast, or mitochondrial genome sequences.
Project name: CGView Comparison Tool
Project home page: http://stothard.afns.ualberta.ca/downloads/CCT/
Operating system(s): Unix/Linux
Programming language: Perl
License: GNU GPL
Any restrictions to use by non-academics: none
This work was supported by funds from the Alberta Livestock and Meat Agency and the Natural Sciences and Engineering Research Council of Canada to PS.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.