Comparative genomics of 274 Vibrio cholerae genomes reveals mobile functions structuring three niche dimensions
- Bas E Dutilh1, 2, 3, 4,
- Cristiane C Thompson5,
- Ana CP Vicente5,
- Michel A Marin5,
- Clarence Lee7,
- Genivaldo GZ Silva2, 6,
- Robert Schmieder2, 6,
- Bruno GN Andrade5,
- Luciane Chimetto4,
- Daniel Cuevas2, 7,
- Daniel R Garza2,
- Iruka N Okeke8,
- Aaron Oladipo Aboderin9,
- Jessica Spangler7,
- Tristen Ross7,
- Elizabeth A Dinsdale1,
- Fabiano L Thompson4,
- Timothy T Harkins7 and
- Robert A Edwards1, 2, 4, 6, 10Email author
© Dutilh et al.; licensee BioMed Central Ltd. 2014
Received: 20 February 2014
Accepted: 30 July 2014
Published: 5 August 2014
Vibrio cholerae is a globally dispersed pathogen that has evolved with humans for centuries, but also includes non-pathogenic environmental strains. Here, we identify the genomic variability underlying this remarkable persistence across the three major niche dimensions space, time, and habitat.
Taking an innovative approach of genome-wide association applicable to microbial genomes (GWAS-M), we classify 274 complete V. cholerae genomes by niche, including 39 newly sequenced for this study with the Ion Torrent DNA-sequencing platform. Niche metadata were collected for each strain and analyzed together with comprehensive annotations of genetic and genomic attributes, including point mutations (single-nucleotide polymorphisms, SNPs), protein families, functions and prophages.
Our analysis revealed that genomic variations, in particular mobile functions including phages, prophages, transposable elements, and plasmids underlie the metadata structuring in each of the three niche dimensions. This underscores the role of phages and mobile elements as the most rapidly evolving elements in bacterial genomes, creating local endemicity (space), leading to temporal divergence (time), and allowing the invasion of new habitats. Together, we take a data-driven approach for comparative functional genomics that exploits high-volume genome sequencing and annotation, in conjunction with novel statistical and machine learning analyses to identify connections between genotype and phenotype on a genome-wide scale.
KeywordsFunctional genomics Mobile elements Phages Niche adaptation Vibrio Genome evolution Genotype-phenotype association Random forest
Species exist in a multi-dimensional niche space with dimensions representing their environmental habitat, their geographical location, and their presence in time. Within this space, each species occupies a volume with a specific shape and size. Some species may occur as relatively tight clusters, if they have a strict habitat preference (habitat), are incapable of distant migration (space), and quickly go extinct (time). Other species persist, forming large structured networks in niche space. One example of the latter is Vibrio cholerae, the causative agent of cholera that has persisted globally for centuries in clinical as well as environmental habitats.
Here, we ask the question which elements of the V. cholerae genome reflect its structured occurrence in each of the three niche dimensions. Glimpses of an answer can be found in the literature. For example, a recent phylogeographic analysis distinguished two gene pools : the vertically inherited core genome and horizontally acquired mobile or mobilizable elements. These mobile elements sweep through local subpopulations of environmental Vibrio species that occupy the same geographical niche, leading to geographic structuring . Temporal structuring of the V. cholerae genome was shown in a large-scale sequencing effort of V. cholerae strains from the sixth (1899-1923) and seventh (1961 onwards) cholera pandemics , that identified genomic elements characterizing the different temporal waves of the pandemic. Finally, habitat occupancy is mainly regulated by the presence of two phage-encoded factors. Clinical, epidemic V. cholerae strains differ from non-pathogenic environmental ones depending on the presence of the toxin co-regulated pilus (TCP) that allows the bacterium to colonize the gut and form protective aggregates; and cholera toxin (CTX), encoded by the phage CTXφ, which uses the TCP to attach to the bacterium .
We have recently outlined an approach for genome-wide association of genotypes to phenotypes (GWAS for microbes or GWAS-M) that is capable of exploiting large-scale draft genome sequencing . Here, we employ this approach for the data-driven discovery of genomic elements that structure V. cholerae across the niche dimensions space, time, and habitat. Our approach classifies the bacterial strains by phenotype by using Random Forest (RF) machine learning [6, 7], identifying which genomic elements are most important for the classification. While consistently annotated phenotypes are often lacking for sequenced bacterial strains , niche metadata are readily available for most publically available genome sequences, allowing us to include 235 published complete and draft genomes in our analysis. Moreover, we sequenced 39 additional genomes using the new Ion Torrent Personal Genome Machine (PGM) benchtop sequencer  to obtain a more complete sampling of the niche dimensions. We identify how the genomic landscape of V. cholerae, consisting of genotypic variables including SNPs, protein families, functions, and phages varies across the major niche dimensions time, space, and habitat.
Results and discussion
Vibrio choleraegenomes across niche dimensions
Breadth of the strains based on genome-wide marker SNPs
Identifying genomic structuring by random forest
Summary of the genotypic variables
We use this wide range of variables to classify the genomes by their niche, in each of the three niche dimensions of space, time, and habitat. We use the RF method, an advanced machine learning approach that uses many decision trees in parallel . For space and habitat dimensions, we applied classification of the strains into six continents and two habitats, respectively. For the time dimension, we applied regression of all genotypic variables against the year of origin. All RFs contained 10,000 decision trees.
In general, all RFs have high prediction accuracies, indicating that the annotated genotypic variables contain enough structure in each of the niche dimensions to allow for classification. The habitat-RFs classified the genomes as clinical or environmental with 89.2% accuracy. The space-RFs classified the genomes by continent with an average of 45.3% accuracy, which is high compared to an expected 16.7% accuracy if the genomes were randomly assigned to one of the six continents. The time-RF explained 62.0% of variation.
Here, we investigate how the versatile phenotype of V. cholerae, a bacterium that persists across the three major niche dimensions space, time, and habitat, is reflected in its genome. We take a data-driven approach for comparative functional genomics  that includes automated annotation of genotypic variables in 28 low-level functional categories  and RF machine learning for prioritizing the variables . This approach allows us to fully exploit the genotypic information in many complete genomes simultaneously, and identify which genotypic variables structure the strains in niche space. In each of the three niche dimensions, we find that variations in phages, prophages, transposable elements, and plasmids underlie the diversity and explain most of the structure in the data. These results confirm previous investigations of the separate niche dimensions, showing the importance of these mobile functions in shaping the V. cholerae genome. In the spatial dimension, they drive geographic endemism [1, 2]; in time they structure temporal epidemics [3, 23]; and habitat preference is determined by the presence of phage-encoded pathogenicity factors .
High-volume genome sequencing is becoming more affordable, and high-throughput pipelines like those employed here are capable of rapidly annotating thousands of genomic elements including SNPs, protein families, functions, and prophages. Exploiting these developments, novel statistical and machine learning analyses can be used to identify connections between genotype and phenotype. Whereas here we apply our approach to niche annotations which are widely available for sequenced strains, it can similarly be used for matching any phenotypic measurement to genotypes (for examples see ), and it will only gain in statistical power by including more genomes.
DNA isolation and genome sequencing
Total genomic DNA was isolated using a Wizard Genomic DNA Purification Kit (Promega). Fragment libraries were prepared using the Ion Fragment Library Kit (Life Technologies) with some modifications. First, when necessary the starting material was reduced to 200 ng of genomic DNA (the standard protocol requires 5 μg). Second, we used multiplexed sequencing using molecular barcode adaptors. Genomes were sequenced using Ion Torrent semiconductor sequencing  following standard operating procedures for the Personal Genome Machine (PGM).
Modified emulsion PCR conditions for mate pair library
Assembly, scaffolding, and RAST annotation
Ion Torrent sequencing reads obtained in this study were first cleaned of the internal adaptor (IA) tag using TagCleaner  (CTGAGACT allowing up to 1 mismatch occurring within 31 nucleotides (nt) of the 3'-end). Short (<90 nt) reads and long (>200 nt) reads, as well as regions with a quality score <10, were deleted. Reads were then assembled using gsAssembler  (version 2.6, default parameters with -notrim, or default options for the mate pair assembly). The recently published V. cholerae Amazonia strain R-18332 genome  was assembled in the same way. The 203 [3, 10, 11] strains sequenced with an Illumina platform were assembled by using Velvet  with a hash length of 31 nt and short paired read types. Assembly and scaffolding statistics are presented in Additional file 2.
For each set of assembled contigs, the most suitable reference genome was selected from the five completely closed V. cholerae genomes (2 chromosomes; N16961, O395, MJ-1236, M66-2 and LMA3984-4). To choose the reference, all the contigs were queried against those genomes using blastn (BLAST 2.2.25+ ), and the genome with the highest bitscore hit was selected. The contigs were then scaffolded against this reference using scaffold_builder 1.0 . The scaffolded contigs were uploaded to RAST  for automated gene calling (including frameshift correction) and annotation. Annotation statistics for all strains are presented in Additional file 2.
To create the Circos plot , the raw reads from the 38 other genomes sequenced here were mapped to the genome of R-18308 using blastn (BLAST 2.2.25+ ), requiring ≥80% nucleotide identity over ≥100 nt for mapping.
Sequence coverage of the genomes was estimated by locating orthologous regions that we expected to be present in all the V. cholerae genomes. These regions were identified independently of gene annotations, as follows. First, we used progressiveMauve  (default parameters) to construct a multiple genome alignment of the 31 V. cholerae genomes obtained from the SEED database (Additional file 1). We then extracted 886 local multiple alignment blocks containing all 31 genomes (length between 19 nt and 46,404 nt, average 2,688 nt) and constructed hidden Markov models (HMMs) for each region using hmmbuild (HMMER 3.0 , default parameters). These HMMs represent orthologous regions that are present in all V. cholerae genomes, whose combined length of 2,381,489 nt covers ~60% of the input genomes. Finally, we queried all 210 genomes and scaffolds with these HMMs using hmmsearch (HMMER 3.0 , default parameters) and composed multiple alignments from the hits. We included the nucleotide from the highest scoring hit domain at every position of the HMM alignment and added gaps in regions lacking hits. These orthologous genome regions were used to estimate the sequencing coverage of each genome by calculating the fraction of all the sequence alignments that did not contain gaps.
Genome-wide marker SNPs
Mutreja et al. previously identified 1,970 reliable marker SNPs in the V. cholerae genome . These SNPs are in regions of the genome that are unlikely to have been the subject of horizontal gene transfer, and can be used as reliable genomic markers for the seventh pandemic strains. The 1,970 SNP positions in the corrected version of the N16961 genome (kindly provided by Ankur Mutreja) were mapped to the 274 genomes analyzed in this study by using blastn 2.2.25+  with default parameters. Functions were annotated to SNPs that fell within a gene on the N16961 genome (Additional file 3). We inserted a gap position if a genome did not produce a blastn hit for that SNP. The same procedure was used to obtain the corresponding SNP positions in the four V. mimicus genomes (MB-451, VM223, VM573 and VM603), a species closely related to V. cholerae, that we included as an outgroup. The full list of SNPs and their variants in all the genomes is provided in Additional file 3. We created a tree from the alignment of high-resolution marker SNPs identified in the 210 V. cholerae genomes and four V. mimicus genomes (Figure 3), using PhyML 3.0  (NNI tree topology search, initial BioNJ tree, HKY85 model of nucleotide substitutions, 4 discrete categories in the gamma model with an estimated shape parameter (0.680), estimated ts/tv ratio (3.879) and 100 bootstrap iterations).
Protein families, functions, and subsystems
We constructed protein families using CD-HIT 4.5.7  with a 0.85 sequence identity threshold and a word length of 5. Families that were present in less than five genomes were excluded, yielding a total of 21,146 protein families, which were present in an average of 47.2 genomes each. Because CD-HIT greedily adds proteins to a family, this approach groups protein fragments together that might have been called separately because of frame shift errors that were missed in the automated annotation. The number of 21,146 CD-HIT clusters provides a rough estimate of the V. cholerae pan-genome. However, it should be noted that CD-HIT is not a sophisticated orthology algorithm, so we expect that this number over-estimates the size of the pan-genome, and several of the clusters should actually be merged into larger protein families. Where possible, the clusters were annotated functionally by taking the function(s) that were most often annotated to the proteins in the cluster. The full list is presented in Additional file 3.
In addition to these homology-based protein families, we scored the presence of 4,260 functions and 706 subsystems in the V. cholerae genomes using the subsystems annotated by RAST . The full list is presented in Additional file 3.
Phages are important horizontal gene transfer agents that are capable of conveying genetic material between strains that occupy the same niche. We therefore included complete prophages in the V. cholerae genomes as genomic variables. Because of their repetitive nature, prophages are one of the principal reasons that contigs cannot be combined during genome assembly. This obstacle is exemplified by the two prophages RS phage (Repeat Sequence, also known as RS1 ) and CTX (cholera toxin) in V. cholerae El Tor genomes. These two prophages share a stretch of 2,732 nucleotides with 100% identity. The RS prophage is not much longer than that length; it contains an additional 362 nt that differ from the CTX prophage, while the latter has an additional 4,548 nt that are not found in the RS prophage. Thus, the long repeat sequence makes it difficult to accurately assemble and identify these prophages. Moreover, prophages have a high recombination rate, which leads to a mosaic structure. For example, the Kappa prophage varies in length from 32,970 to 35,021 nt, and the Mu-like phage varies in length from 33,001 to 34,078 nt, depending on the host genome.
Phages present in the set of 31 V. cholerae genomes from the SEED database
Initially identified in genome
PhAnToMe id 141904.3, RefSeq id NC_015209
PhAnToMe id 141904.3, RefSeq id NC_015209
PhAnToMe id 493906.2; RefSeq id NC_010275
1587, AM-19226, B33, MJ-1236, MO10, NCTC-8457, RC9, RC27
Identified using PhiSpy 
12129(1), TM-11079-80, TMA-21, V51
Identified using PhiSpy 
Random Forest (RF) is a machine learning approach to classifying data that was developed by Breiman and Cutler . The approach is to build a forest of decision trees, each built using a different subset of the data for training. The accuracy of each decision tree is measured using the remaining testing data (cross-validation). The importance of each variable can then be calculated as the average decrease in GINI purity  of the trees that results from removing or randomizing the variable. Developed in 2001, this classification approach is becoming more popular in 'omics research [6, 34] because it can incorporate large and noisy datasets with many features, prioritizing the ones that are most important for a certain prediction. Besides the high prediction accuracy, a unique advantage of RF compared to other supervised machine learning techniques (such as support vector machines, SVM) is that many variables can be compared to relatively few classes (the niche dimensions) without the risk of over-training . The technique has been applied to classify proteomics data , genome-scale responses to extracellular stimuli , microarray data , interactome analyses  and genome-wide association studies .
Here, we use RFs to compare genomic attributes across multiple niche dimensions from hundreds of genomes to reveal which response variables (genotypes) are the most important for separating the observations (genomes) in each dimension. We assembled a matrix of classes (niches or phenotypes) containing the niche metadata for all 274 isolates (Additional file 1), including sampling date (year, time), sampling location (continent, space), and source (clinical or environmental, space). Cases that could not be retrieved were labeled “unknown” and were excluded from the corresponding RFs. We also assembled a matrix of genotypic variables and their presence or absence in every genome (see Table 1). Monotonous variables with identical values in all, or all-but-one of the draft genomes (e.g. universal housekeeping genes) were removed as they would not provide any information about genotypic clustering . Variables with redundant, highly correlating profiles were merged and added as clustered genotypic variables to the matrix (Additional file 6 and Additional file 7). This yielded a total of 25,305 informative, non-redundant genotypic variables. Where, possible, all annotated genotypes including SNPs, protein families, functions, and prophages were assigned to the SEED level-1 subsystems , providing consistent, low-level annotations in 28 categories (Table 1 and Additional file 3).
For the space and habitat dimensions, we used classification RFs of the strains into six continents (Africa, Asia, Australia, Europe, North America, and South America) and two habitats (clinical or environmental), respectively. Class imbalance was prevented by jackknife subsampling of genomes from the more abundant classes to a maximum of two times the size of the smallest class . Results are the average of 100 jackknife iterations. As the sampling year is a continuous variable, the RF was trained using regression to fit the data rather than using the classification approach as for the other two niche dimensions. All RFs consisted of 10,000 trees each (the importance scores are robust with this number of trees ) and were calculated using the randomForest package 4.5-34  (default parameters) in R version 2.11.0 (The R Project for Statistical Computing; http://www.r-project.org). The importance scores of every genomic variable in each of the RFs are shown in Additional file 5.
All the genomic variables were ranked by the importance score obtained from the RF analysis (Additional file 5). For those variables with a level-1 subsystem annotation (including CD-HIT clusters, functions, subsystems, and SNPs falling within an annotated gene) the fraction of the variables in each of the level-1 subsystem classes found among the top ranked variables was counted. Figure 4 shows the percentage of subsystems in each class found at 5% of the ranks (1,265/25,305 most important variables) for each of the three predictors (time, space, and habitat).
Availability of supporting data
Cholera toxin phage
Genome-wide association studies for microbes
Hidden Markov model
Ion torrent personal genome machine
Support vector machine
Toxin co-regulated pilus
West Africa-South America phage.
The authors thank Ankur Mutreja for sharing the most recent version of the N16961 genome. BED is supported by the Dutch Science foundation (NWO) Veni grant 016.111.075 and CAPES/BRASIL. ACPV, CCT, MAM and BNA are supported by CAPES, CNPq and IOC-FIOCRUZ. RAE is supported by NSF grants DBI: 0850356 from the Division of Biological Infrastructure and DEB: 1046413 from the Division of Environmental Biology, and by grant P116M100007 from the Department of Education. EAD is supported by NSF TUES grant 1044453. We thank Anca Segall, Forest Rohwer, and Stanley Maloy for thoughtful discussions and critical reviews of the manuscript.
- Boucher Y, Cordero OX, Takemura A, Hunt DE, Schliep K, Bapteste E, Lopez P, Tarr CL, Polz MF: Local mobile gene pools rapidly cross species boundaries to create endemicity within global vibrio cholerae populations. MBio. 2011, 2 (2): e00335-10-PubMed CentralPubMedView ArticleGoogle Scholar
- Shapiro BJ, Friedman J, Cordero OX, Preheim SP, Timberlake SC, Szabo G, Polz MF, Alm EJ: Population genomics of early events in the ecological differentiation of bacteria. Science. 2012, 336 (6077): 48-51. 10.1126/science.1218198.PubMed CentralPubMedView ArticleGoogle Scholar
- Mutreja A, Kim DW, Thomson NR, Connor TR, Lee JH, Kariuki S, Croucher NJ, Choi SY, Harris SR, Lebens M, Niyogi SK, Kim EJ, Ramamurthy T, Chun J, Wood JL, Clemens JD, Czerkinsky C, Nair GB, Holmgren J, Parkhill J, Dougan G: Evidence for several waves of global transmission in the seventh cholera pandemic. Nature. 2011, 477 (7365): 462-465. 10.1038/nature10392.PubMed CentralPubMedView ArticleGoogle Scholar
- Waldor MK, Mekalanos JJ: Lysogenic conversion by a filamentous phage encoding cholera toxin. Science. 1996, 272 (5270): 1910-1914. 10.1126/science.272.5270.1910.PubMedView ArticleGoogle Scholar
- Dutilh BE, Backus L, Edwards RA, Wels M, Bayjanov JR, van Hijum SA: Explaining microbial phenotypes on a genomic scale: GWAS for microbes. Brief Funct Genomics. 2013, 12: 366-380. 10.1093/bfgp/elt008.PubMed CentralPubMedView ArticleGoogle Scholar
- Bayjanov JR, Molenaar D, Tzeneva V, Siezen RJ, van Hijum SA: PhenoLink–a web-tool for linking phenotype to ~ omics data for bacteria: application to gene-trait matching for Lactobacillus plantarum strains. BMC Genomics. 2012, 13: 170-10.1186/1471-2164-13-170.PubMed CentralPubMedView ArticleGoogle Scholar
- Breiman L: Random forests. Mach Learn. 2001, 45 (1): 5-32. 10.1023/A:1010933404324.View ArticleGoogle Scholar
- Rothberg JM, Hinz W, Rearick TM, Schultz J, Mileski W, Davey M, Leamon JH, Johnson K, Milgrew MJ, Edwards M, Hoon J, Simons JF, Marran D, Myers JW, Davidson JF, Branting A, Nobile JR, Puc BP, Light D, Clark TA, Huber M, Branciforte JT, Stoner IB, Cawley SE, Lyons M, Fu Y, Homer N, Sedova M, Miao X, Reed B, et al: An integrated semiconductor device enabling non-optical genome sequencing. Nature. 2011, 475 (7356): 348-352. 10.1038/nature10242.PubMedView ArticleGoogle Scholar
- Overbeek R, Begley T, Butler RM, Choudhuri JV, Chuang HY, Cohoon M, de Crecy-Lagard V, Diaz N, Disz T, Edwards R, Fonstein M, Frank ED, Gerdes S, Glass EM, Goesmann A, Hanson A, Iwata-Reuyl D, Jensen R, Jamshidi N, Krause L, Kubal M, Larsen N, Linke B, McHardy AC, Meyer F, Neuweger H, Olsen G, Olson R, Osterman A, Portnoy V, et al: The subsystems approach to genome annotation and its use in the project to annotate 1000 genomes. Nucleic Acids Res. 2005, 33 (17): 5691-5702. 10.1093/nar/gki866.PubMed CentralPubMedView ArticleGoogle Scholar
- Hasan NA, Choi SY, Eppinger M, Clark PW, Chen A, Alam M, Haley BJ, Taviani E, Hine E, Su Q, Tallon LJ, Prosper JB, Furth K, Hoq MM, Li H, Fraser-Liggett CM, Cravioto A, Huq A, Ravel J, Cebula TA, Colwell RR: Genomic diversity of 2010 Haitian cholera outbreak strains. Proc Natl Acad Sci U S A. 2012, 109 (29): E2010-E2017. 10.1073/pnas.1207359109.PubMed CentralPubMedView ArticleGoogle Scholar
- Hendriksen RS, Price LB, Schupp JM, Gillece JD, Kaas RS, Engelthaler DM, Bortolaia V, Pearson T, Waters AE, Upadhyay BP, Shrestha SD, Adhikari S, Shakya G, Keim PS, Aarestrup FM: Population genetics of Vibrio cholerae from Nepal in 2010: evidence on the origin of the Haitian outbreak. MBio. 2011, 2 (4): e00157-00111.PubMed CentralPubMedView ArticleGoogle Scholar
- Thompson CC, Marin MA, Dias GM, Dutilh BE, Edwards RA, Iida T, Thompson FL, Vicente AC: Genome sequence of the human pathogen Vibrio cholerae Amazonia. J Bacteriol. 2011, 193 (20): 5877-5878. 10.1128/JB.05643-11.PubMed CentralPubMedView ArticleGoogle Scholar
- Chain PS, Grafham DV, Fulton RS, Fitzgerald MG, Hostetler J, Muzny D, Ali J, Birren B, Bruce DC, Buhay C, Cole JR, Ding Y, Dugan S, Field D, Garrity GM, Gibbs R, Graves T, Han CS, Harrison SH, Highlander S, Hugenholtz P, Khouri HM, Kodira CD, Kolker E, Kyrpides NC, Lang D, Lapidus A, Malfatti SA, Markowitz V, Metha T, et al: Genomics. Genome project standards in a new era of sequencing. Science. 2009, 326 (5950): 236-237. 10.1126/science.1180614.PubMedView ArticleGoogle Scholar
- Krzywinski M, Schein J, Birol I, Connors J, Gascoyne R, Horsman D, Jones SJ, Marra MA: Circos: an information aesthetic for comparative genomics. Genome Res. 2009, 19 (9): 1639-1645. 10.1101/gr.092759.109.PubMed CentralPubMedView ArticleGoogle Scholar
- Popovic T, Fields PI, Olsvik O, Wells JG, Evins GM, Cameron DN, Farmer JJ, Bopp CA, Wachsmuth K, Sack RB, Albert MJ, Nair GB, Shimada T, Feeley JC: Molecular subtyping of toxigenic Vibrio cholerae O139 causing epidemic cholera in India and Bangladesh, 1992-1993. J Infect Dis. 1995, 171 (1): 122-127. 10.1093/infdis/171.1.122.PubMedView ArticleGoogle Scholar
- Waldor MK, Colwell R, Mekalanos JJ: The Vibrio cholerae O139 serogroup antigen includes an O-antigen capsule and lipopolysaccharide virulence determinants. Proc Natl Acad Sci U S A. 1994, 91 (24): 11388-11392. 10.1073/pnas.91.24.11388.PubMed CentralPubMedView ArticleGoogle Scholar
- Johnson JA, Salles CA, Panigrahi P, Albert MJ, Wright AC, Johnson RJ, Morris JG: Vibrio cholerae O139 synonym bengal is closely related to Vibrio cholerae El Tor but has important differences. Infect Immun. 1994, 62 (5): 2108-2110.PubMed CentralPubMedGoogle Scholar
- Reimer AR, Van Domselaar G, Stroika S, Walker M, Kent H, Tarr C, Talkington D, Rowe L, Olsen-Rasmussen M, Frace M, Sammons S, Dahourou GA, Boncy J, Smith AM, Mabon P, Petkau A, Graham M, Gilmour MW, Gerner-Smidt P: Comparative genomics of Vibrio cholerae from Haiti, Asia, and Africa. Emerg Infect Dis. 2011, 17 (11): 2113-2121.PubMed CentralPubMedView ArticleGoogle Scholar
- Huang Y, Niu B, Gao Y, Fu L, Li W: CD-HIT Suite: a web server for clustering and comparing biological sequences. Bioinformatics. 2010, 26 (5): 680-682. 10.1093/bioinformatics/btq003.PubMed CentralPubMedView ArticleGoogle Scholar
- Aziz RK, Bartels D, Best AA, DeJongh M, Disz T, Edwards RA, Formsma K, Gerdes S, Glass EM, Kubal M, Meyer F, Olsen GJ, Olson R, Osterman AL, Overbeek RA, McNeil LK, Paarmann D, Paczian T, Parrello B, Pusch GD, Reich C, Stevens R, Vassieva O, Vonstein V, Wilke A, Zagnitko O: The RAST Server: rapid annotations using subsystems technology. BMC Genomics. 2008, 9: 75-10.1186/1471-2164-9-75.PubMed CentralPubMedView ArticleGoogle Scholar
- Akhter S, Aziz RK, Edwards RA: PhiSpy: a novel algorithm for finding prophages in bacterial genomes that combines similarity- and composition-based strategies. Nucleic Acids Res. 2012, 40 (16): e126-10.1093/nar/gks406.PubMed CentralPubMedView ArticleGoogle Scholar
- Li W, Jaroszewski L, Godzik A: Clustering of highly homologous sequences to reduce the size of large protein databases. Bioinformatics. 2001, 17 (3): 282-283. 10.1093/bioinformatics/17.3.282.PubMedView ArticleGoogle Scholar
- Garza DR, Thompson CC, Loureiro EC, Dutilh BE, Inada DT, Junior EC, Cardoso JF, Nunes MR, de Lima CP, Silvestre RV, Nunes KN, Santos EC, Edwards RA, Vicente AC, de Sa Morais LL: Genome-wide study of the defective sucrose fermenter strain of Vibrio cholerae from the Latin American cholera epidemic. PLoS One. 2012, 7 (5): e37283-10.1371/journal.pone.0037283.PubMed CentralPubMedView ArticleGoogle Scholar
- Schmieder R, Lim YW, Rohwer F, Edwards R: TagCleaner: identification and removal of tag sequences from genomic and metagenomic datasets. BMC Bioinformatics. 2010, 11: 341-10.1186/1471-2105-11-341.PubMed CentralPubMedView ArticleGoogle Scholar
- Margulies M, Egholm M, Altman WE, Attiya S, Bader JS, Bemben LA, Berka J, Braverman MS, Chen YJ, Chen Z, Dewell SB, Du L, Fierro JM, Gomes XV, Godwin BC, He W, Helgesen S, Ho CH, Irzyk GP, Jando SC, Alenquer ML, Jarvie TP, Jirage KB, Kim JB, Knight JR, Lanza JR, Leamon JH, Lefkowitz SM, Lei M, Li J, et al: Genome sequencing in microfabricated high-density picolitre reactors. Nature. 2005, 437 (7057): 376-380.PubMed CentralPubMedGoogle Scholar
- Zerbino DR, Birney E: Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 2008, 18 (5): 821-829. 10.1101/gr.074492.107.PubMed CentralPubMedView ArticleGoogle Scholar
- Camacho C, Coulouris G, Avagyan V, Ma N, Papadopoulos J, Bealer K, Madden TL: BLAST+: architecture and applications. BMC Bioinformatics. 2009, 10: 421-10.1186/1471-2105-10-421.PubMed CentralPubMedView ArticleGoogle Scholar
- Silva GGZ, Dutilh BE, Matthews TD, Elkins K, Schmieder R, Dinsdale EA, Edwards RA: Combining de novo and reference-guided assembly with scaffold_builder. Source Code Biol Med. 2013, 8 (1): 23-10.1186/1751-0473-8-23.PubMed CentralPubMedView ArticleGoogle Scholar
- Darling AE, Mau B: Perna NT: progressiveMauve: multiple genome alignment with gene gain, loss and rearrangement. PLoS One. 2010, 5 (6): e11147-10.1371/journal.pone.0011147.PubMed CentralPubMedView ArticleGoogle Scholar
- Eddy SR: A new generation of homology search tools based on probabilistic inference. Genome Inform. 2009, 23 (1): 205-211.PubMedGoogle Scholar
- Guindon S, Gascuel O: A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood. Syst Biol. 2003, 52 (5): 696-704. 10.1080/10635150390235520.PubMedView ArticleGoogle Scholar
- Hassan F, Kamruzzaman M, Mekalanos JJ, Faruque SM: Satellite phage TLCphi enables toxigenic conversion by CTX phage through dif site alteration. Nature. 2010, 467 (7318): 982-985.PubMed CentralPubMedView ArticleGoogle Scholar
- Morais LL, Garza DR, Loureiro EC, Nunes KN, Vellasco RS, da Silva CP, Nunes MR, Thompson CC, Vicente AC, Santos EC: Complete genome sequence of a sucrose-nonfermenting epidemic strain of vibrio cholerae O1 from Brazil. J Bacteriol. 2012, 194 (10): 2772-10.1128/JB.00300-12.PubMed CentralView ArticleGoogle Scholar
- Touw WG, Bayjanov JR, Overmars L, Backus L, Boekhorst J, Wels M, van Hijum SA: Data mining in the Life Sciences with Random Forest: a walk in the park or lost in the jungle?. Brief Bioinform. 2012, 14: 315-326.PubMed CentralPubMedView ArticleGoogle Scholar
- Barrett JH, Cairns DA: Application of the random forest classification method to peaks detected from mass spectrometric proteomic profiles of cancer patients and controls. Stat Appl Genet Mol Biol. 2008, 7 (2): Article4-PubMedGoogle Scholar
- Bienkowska JR, Dalgin GS, Batliwalla F, Allaire N, Roubenoff R, Gregersen PK, Carulli JP: Convergent Random Forest predictor: methodology for predicting drug response from genome-scale data applied to anti-TNF response. Genomics. 2009, 94 (6): 423-432. 10.1016/j.ygeno.2009.08.008.PubMed CentralPubMedView ArticleGoogle Scholar
- Stiglic G, Rodriguez JJ, Kokol P: Rotation of random forests for genomic and proteomic classification problems. Adv Exp Med Biol. 2011, 696: 211-221. 10.1007/978-1-4419-7046-6_21.PubMedView ArticleGoogle Scholar
- Ananthasubramanian S, Metri R, Khetan A, Gupta A, Handen A, Chandra N, Ganapathiraju M: Mycobacterium tuberculosis and Clostridium difficille interactomes: demonstration of rapid development of computational system for bacterial interactome prediction. Microb Inform Exp. 2012, 2: 4-10.1186/2042-5783-2-4.PubMed CentralPubMedView ArticleGoogle Scholar
- Chung RH, Chen YE: A two-stage random forest-based pathway analysis method. PloS one. 2012, 7 (5): e36662-10.1371/journal.pone.0036662.PubMed CentralPubMedView ArticleGoogle Scholar
- Calle ML, Urrea V: Letter to the editor: Stability of Random Forest importance measures. Brief Bioinform. 2011, 12 (1): 86-89. 10.1093/bib/bbq011.PubMedView ArticleGoogle Scholar
- Liaw A, Wiener M: Classification and regression by random Forest. R News. 2002, 2 (3): 18-22.Google Scholar
- Smoot ME, Ono K, Ruscheinski J, Wang PL, Ideker T: Cytoscape 2.8: new features for data integration and network visualization. Bioinformatics. 2011, 27 (3): 431-432. 10.1093/bioinformatics/btq675.PubMed CentralPubMedView ArticleGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.