- Methodology article
- Open Access
A gene-by-gene population genomics platform: de novo assembly, annotation and genealogical analysis of 108 representative Neisseria meningitidis genomes
© Bratcher et al.; licensee BioMed Central. 2014
Received: 2 October 2014
Accepted: 4 December 2014
Published: 18 December 2014
Highly parallel, ‘second generation’ sequencing technologies have rapidly expanded the number of bacterial whole genome sequences available for study, permitting the emergence of the discipline of population genomics. Most of these data are publicly available as unassembled short-read sequence files that require extensive processing before they can be used for analysis. The provision of data in a uniform format, which can be easily assessed for quality, linked to provenance and phenotype and used for analysis, is therefore necessary.
The performance of de novo short-read assembly followed by automatic annotation using the pubMLST.org Neisseria database was assessed and evaluated for 108 diverse, representative, and well-characterised Neisseria meningitidis isolates. High-quality sequences were obtained for >99% of known meningococcal genes among the de novo assembled genomes and four resequenced genomes, and less than 1% of reassembled genes had sequence discrepancies or misassembled sequences. A core genome of 1600 loci, present in at least 95% of the population, was determined using the Genome Comparator tool. Genealogical relationships compatible with, but at a higher resolution than, those identified by multilocus sequence typing were obtained with core genome comparisons and ribosomal protein gene analysis, which revealed a genomic structure for a number of previously described phenotypes. This unified system for cataloguing Neisseria genetic variation in the genome was implemented and used for multiple analyses, and the data are publicly available in the PubMLST Neisseria database.
The de novo assembly, combined with automated gene-by-gene annotation, generates high quality draft genomes in which the majority of protein-encoding genes are present with high accuracy. The approach catalogues diversity efficiently, permits analyses of a single genome or multiple genome comparisons, and is a practical approach to interpreting WGS data for large bacterial population samples. The method generates novel insights into the biology of the meningococcus and improves our understanding of the whole population structure, not just disease causing lineages.
The widespread application of parallel high-throughput ‘next generation’ sequencing (NGS) technologies has made whole genome sequence (WGS) data available for tens of thousands of bacterial isolates. Increasingly, these data are publicly available only as depositions in short-read sequence archives: in December 2013 the European Bioinformatics Institute (EBI) Sequence Read Archive (SRA) contained more than 100,000 bacterial WGS records, over 90% of which comprised millions of short sequence reads each of fewer than 200 bases in length. These data represent a major resource for studies of bacterial diversity, evolution and function; however, as the throughput of genome finishing and annotation technologies has not kept pace with sequence determination, the genomes have to be reassembled to be interpreted. Typically, this is done either by mapping to a reference sequence or by de novo assembly to generate draft genomes comprising multiple contiguous sequences (contigs).
The approach of mapping short-read sequences to a reference sequence has been effectively used to analyse WGS data from closely related isolates in numerous studies [2–9], especially by using the data obtained to reconstruct genealogies based on phylogenetic trees. This approach has a number of limitations: it requires a high-quality reference sequence with which to make the comparison; sequence variation not present in the reference cannot be detected; the approach scales poorly; analyses typically have to be re-run as new genomes are obtained; and finally, the density of sequence polymorphisms in the majority of bacterial populations is such that this approach is not feasible for the study of isolates that are not genetically closely related. The use of de novo assembly methods represents an alternative, more broadly applicable approach, with assemblers based on de Bruijn graphs being widely used as they deal effectively with large volumes of data [10, 11] and can assemble short-read sequences of fewer than 100 bases in length into contigs that contain the majority of the genome. Further, when paired-end sequencing strategies are employed, ‘high quality draft’ bacterial genomes can be assembled [2–9, 12–14]. Once they have been assembled, these sequences can be annotated by comparisons to known genes or genome databases, using an approach similar to that used in multilocus sequence typing (MLST), which has been widely employed for sequence-based analyses at the population scale since 1998. The Bacterial Isolate Genome Sequence Database (BIGSdb) platform provides this functionality for WGS data.
Neisseria meningitidis, the meningococcus, is a pathogen of global significance and an informative model organism for investigating the relationship between genotype and phenotype, as it is highly diverse phenotypically and genotypically. Due to the importance of the disease, the most studied meningococcal phenotype is the propensity to invade, although most episodes of meningococcal infection result in asymptomatic carriage, which typically occurs in 10-20% of the human population [19, 20]. Only a very small number of infections result in devastating and rapidly progressing disease, in the form of septicaemia, meningitis, or both. For reasons that are incompletely understood, some meningococcal genotypes are much more likely to cause invasive disease than others. Nucleotide sequence-based typing, especially MLST and antigen sequence typing (AGST), has established that these genotypes correspond to certain genealogies, known as the ‘hyperinvasive lineages’. There are a number of factors known to contribute to the hyperinvasive phenotype, particularly the possession of certain capsular polysaccharides, but species-level comparisons suggest that the majority of the pan-genome is widely shared among invasive and non-invasive genotypes. This has led to the conclusion that the ability to cause invasive disease is both polygenic and different among hyperinvasive lineages [22–24], but the determinants associated with particular lineages remain poorly defined. Comparative WGS of meningococcal isolate collections that include representative disease and carriage isolates has the potential to define the genetic differences which determine the hyperinvasive phenotypes.
Here, WGS data collected by NGS technology were investigated with de novo assembly and population annotation to characterize 108 diverse meningococcal genomes, including the major hyperinvasive lineages observed worldwide over the last 60 years. The draft genomes were analysed for accuracy and coverage using the BIGSdb platform, which enabled comparison with 24 antigen and MLST typing loci previously characterised with Sanger sequencing and with four finished reference genomes, cross-validating these technologies. These data established the robustness and reliability of using de novo draft genomes for population-wide analysis of meningococcal genomes and presented a WGS description of the major hyperinvasive lineages, providing insights into their structure, evolution, and function.
Short-read sequences were assembled into draft genomes using the Velvet and VelvetOptimiser programs, using 54 or 76 base read files. The total length of assembled contigs ranged from 1,975,180 bp to 2,211,536 bp, with a G + C content between 51% and 52%, consistent with previously finished meningococcal genomes (see Additional file 1: Table S1). Assemblies consisted of 291 to 407 contigs, with a mean of 367. The average N50 across all genomes was 19,495 bp; the N50 is the contig length such that contigs of equal or greater length together contain at least 50% of the assembled sequence. This statistic provided an indication of the total genome coverage; however, it was not a measure of genome assembly quality. Overall, a higher k-mer setting for the assembly was associated with higher N50 values and, within the bounds of the read length, assembled repeat regions within the genome that were under 100 bp in length.
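The N50 definition given above can be illustrated with a short calculation. The following is a minimal sketch only: the function name `n50` and the toy contig lengths are illustrative assumptions and are not taken from the study or from the Velvet software.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths.

    Contigs are sorted from longest to shortest and their lengths are
    accumulated; the N50 is the length of the contig at which the running
    total first reaches at least half of the total assembled length.
    """
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("no contigs supplied")
    total = sum(lengths)
    running = 0
    for length in lengths:
        running += length
        # Compare doubled running total to avoid fractional arithmetic.
        if 2 * running >= total:
            return length


# Toy example (made-up lengths, in bp): the total is 300, so the N50 is
# the contig at which the running total first reaches 150.
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_contigs))
```

Because the statistic depends only on contig lengths, it says nothing about whether the contigs are correctly assembled, which is why it is treated in the text as an indicator of contiguity rather than of assembly quality.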
Table: Velvet assembly statistics of 108* genomes analyzed at 1605 core meningococcal loci.
Columns: Multiplex group (number of libraries); Sequence read length; Most common k-mer; Average contig count; Average longest contig; Average genome length; Average number of core loci identified; Average number of incomplete loci found.
Summary row: Average de novo Assembly Statistics.
Table: Comparison of Sanger derived MLST and AGST loci to their respective de novo assembled genome.
Columns: Original Sanger derived allele; Illumina derived allele; Retested Sanger derived allele; Number of bases, likely cause of discrepancy.
Entries in the final column (number of bases, likely cause), in row order: 1, editing error; 1, editing error; 1, editing error; 2, editing error; 2, editing error; 1, editing error; 1, editing error; 2, editing error; 2, editing error; 3, editing error; 1, editing error; 3, repeat sequence*.
The four draft sequences for which finished genome sequences were available (H44/76, FAM18, Z2491, and G2136) were compared to the published closed sequences using the BIGSdb Genome Comparator tool. Sequence discrepancies were found between all four resequenced draft genomes and their respective finished reference genomes. The H44/76 and G2136 reference genomes, created with Roche 454 technology and finished using capillary sequencing, had sequence differences in thirty hypothetical proteins, thirty-five annotated CDS, nine pseudogenes and five putative proteins, a total of 79 loci for these two published genomes (see Additional file 2: Table S2, sections B-E). The FAM18 and Z2491 reference genomes, obtained using ABI 3700 and a combination of ABI373 and 377 respectively, had sequence discrepancies among twenty-two annotated CDS, ten pseudogenes, five putative protein sequences and fourteen hypothetical proteins, a total of 51 loci of the published CDS sequences for these genomes. The majority of the affected CDS (69.1%) had a single nucleotide change each and the remainder (approximately 30%) had two or more nucleotide changes. The differences were categorized as non-synonymous or synonymous changes (see Additional file 3: Table S3). Other differences were caused by assembly failures (24 loci) or by cross-identification of reads between paralogous loci (23 locus pairs) (see Additional file 2: Table S2, sections E & F). Paralogous gene cross-identification occurred most often in CDS annotated as hypothetical proteins, a total of ten. These, plus six additional paralogous loci, were manually curated and defined using up- and down-stream sequence so that the BIGSdb scanning function could subsequently distinguish the divergent regions of the paralogous genes without further manual curation. A list identifying all CDS with sequence differences, and those loci missing in the draft genomes, was generated (see Additional file 2: Table S2, sections A-E).
Table: Re-sequenced genome comparisons: sequence differences identified among four re-sequenced genomes and their respective finished sequence.
Column groups: Failed assembly of repeat sequence tracts; Paralogous cross identification. Column headings: Total number of bases; number of bases §; total number of bases §; Number of bases affected.
Row labels: Loci with identical sequence match; Loci with nucleotide sequence discrepancy; Loci that are present but incomplete.
While it is possible to revise the assembly parameters and recover some of the missing data in the assemblies, this would potentially come at a cost to overall assembly quality by trading specificity for sensitivity, and could reduce the N50 value; this option was therefore not implemented for this analysis. The underrepresentation of these regions in the sequence reads has many possible technical sources: for example, GC bias affects the stability of the DNA strand, which could influence readability or modify the probability of fragmentation. Optimized or PCR-free protocols have been shown to reduce the effects of GC bias [32–35], and if these genomes were resequenced using a PCR-free approach it is possible that overall genome coverage would increase.
All of the draft genome assemblies were annotated with a gene-by-gene approach using the BIGSdb platform as described previously [17, 36]. Each genome was scanned against defined loci contained in the PubMLST Neisseria sequence definition database using the default parameters (70% minimum identity; 50% minimum alignment; and a BLASTN word size of 15). Alleles previously identified were assigned an allele number automatically, in a process referred to as ‘tagging’, and new alleles were manually curated and submitted to the sequence definition database for allele number assignment. The genome data were subsequently rescanned to assign the new alleles to the genomes in which they were found. Partially assembled loci, those found at the end of a contig, were tagged as present in the genome but flagged as incomplete. The average number of incomplete coding sequences (CDS) found per genome was 41 (see Additional file 1: Table S1). BIGSdb also identified sixteen paralogous CDS pairs; these included six recognized CDS (including two ribosomal protein genes), ten hypothetical genes and one putative lipoprotein (see Additional file 2: Table S2, section F).
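The scanning outcomes described above can be sketched as a simple decision rule. This is an illustrative simplification and not the BIGSdb implementation: the function `classify_hit`, its arguments, and the interpretation of ‘minimum alignment’ as a percentage of the locus length are assumptions made for the example; only the 70% and 50% thresholds are taken from the text.

```python
MIN_IDENTITY = 70.0   # percent identity, default quoted in the text
MIN_ALIGNMENT = 50.0  # percent; assumed here to be relative to locus length


def classify_hit(identity_pct, aligned_length, locus_length,
                 at_contig_end, allele_number=None):
    """Classify one hypothetical BLASTN hit for a defined locus.

    allele_number is the number of an exactly matching, previously defined
    allele, or None when the sequence matches no known allele.
    """
    alignment_pct = 100.0 * aligned_length / locus_length
    if identity_pct < MIN_IDENTITY or alignment_pct < MIN_ALIGNMENT:
        return "not found"
    if at_contig_end and aligned_length < locus_length:
        # Locus runs off the end of a contig: present but flagged.
        return "incomplete"
    if allele_number is not None:
        return f"tagged as allele {allele_number}"
    return "new allele: needs curation"


# Made-up hits illustrating the four outcomes.
print(classify_hit(100.0, 1200, 1200, False, allele_number=7))
print(classify_hit(98.5, 1200, 1200, False))
print(classify_hit(99.0, 700, 1200, True))
print(classify_hit(65.0, 1200, 1200, False))
```

The ordering of the checks reflects the workflow in the text: known alleles are tagged automatically, new alleles are routed to manual curation, and truncated loci are recorded as present but incomplete.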
The meningococcal core genome
The higher resolution of rMLST and cgMLST, as opposed to seven- or twenty-locus MLST, also resolved the substructure characteristic of lineage 3 (ST-41/44 complex). This lineage sub-structure is captured in MLST by the designation of two central genotypes that are differentially associated with invasive disease and that, at the sequence type level, share five of the seven MLST alleles [27, 41]. Analysis of this lineage also showed that isolates associated with ST-41 belonged to a well-defined monophyletic lineage, while the ST-44 associated isolates formed a more diverse but distinct lineage. Further exploration of this complex is necessary to define more fully the relationships within this clade and the variable pathogenic nature associated with each group. At the cgMLST level (1605 core genes), the association of capsule loci with lineage 11 (ST-11 complex) and lineage 8 (ST-8 complex) places the serotype B and C associated genomes on different branches; only lineage 11 (ST-11 complex) maintains this separation at the rMLST level (53 ribosomal genes). The remaining lineages did not contain sufficient numbers of isolates to differentiate capsule associations clearly, and additional studies with larger strain collections will be required to establish these associations.
Four sets of lineage-specific draft genomes, thirty-four in total, were assessed for genome coverage using one of four reference genome annotations and the BIGSdb Genome Comparator tool. Each of the four sets of de novo assembled genomes contained over 98% of the CDS defined by their closest reference genome. Seven isolates in the collection belonged to lineage 5 (ST-32 complex) and were compared to the H44/76 reference genome. All 1976 of the H44/76 CDS were identified across the seven de novo assembled genomes, with an average of 1951 CDS (98.7%) identified per genome. Seven isolates belonging to lineage 8 (ST-8 complex) were compared to the G2136 reference genome. All 1911 G2136 CDS were identified across the de novo assembled genomes, with an average of 1865 CDS (97.6%) found per genome. A further ten isolates belonged to lineage 11 (ST-11 complex) and were compared to the FAM18 genome sequence. The comparison identified 1912 (99.8%) of the 1915 FAM18 genome CDS across the ten genomes, with an average of 1879 CDS (98.1%) per genome. Ten isolates also belonged to lineage 4 (ST-4 complex) and were compared to the Z2491 genome, identifying 1936 (99.9%) of 1937 CDS across all ten genomes. An average of 1899 CDS (98%) was identified per genome, and the only CDS not found in all ten of the lineage 4 genomes was the coenzyme A gene, coaD.
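The per-genome coverage percentages quoted above follow directly from the reported CDS counts. The short calculation below reproduces them; the counts are those given in the text, while the helper name `coverage_pct` and the dictionary layout are illustrative choices.

```python
def coverage_pct(found, total):
    """Percentage of reference CDS identified, rounded to one decimal."""
    return round(100 * found / total, 1)


# (average CDS identified per draft genome, CDS in reference annotation),
# as reported in the text for each reference genome.
reference_sets = {
    "H44/76 (lineage 5)": (1951, 1976),
    "G2136 (lineage 8)": (1865, 1911),
    "FAM18 (lineage 11)": (1879, 1915),
    "Z2491 (lineage 4)": (1899, 1937),
}

for name, (found, total) in reference_sets.items():
    print(f"{name}: {found}/{total} = {coverage_pct(found, total)}%")
```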
Although the sequence read lengths employed here were relatively short (54-76 bp) and the meningococcus has a complex genome comprising many short tandem repeats (STR) and homopolymeric tracts [46–48], the Velvet algorithm was consistently capable of assembling the majority of protein coding sequences (over 1850 complete loci per genome) to extremely high levels of accuracy. Indeed, where comparable data were available for genes previously used for sequence-based typing, the majority of the discrepancies were due to errors in the editing or labelling of the specimens used in the original Sanger sequences, and the remainder were the result of STR sequence compression during assembly. Once these errors had been taken into account, the two approaches were in complete agreement. There was also very good agreement with complete reference genomes, although this depended on the read length of the short-read sequence data, with substantial improvement as read length increased. Read lengths of 100 bp, which are now routinely available, would reduce the missing data substantially [44, 50]. Data quality was also determined by the details of the chemistry and procedures used [51, 52], showing that NGS data are optimally useful when this information is deposited with them. Some coverage effects were seen, with sequences near the origin of replication consistently sequenced to a higher depth than others, but the genome of each assembly was adequately covered.
The BIGSdb platform accommodates sequence data derived from a particular isolate ranging from a single gene through multiple genes and contigs up to and including complete genomes. The Genome Comparator tool can use either the annotations from a reference genome, as was done here to compare the reference genomes with the assembled genomes, or sets of loci defined in the PubMLST sequence definition database, for which it maintains a complete catalogue of the diversity described to date. To enable consistent referencing, each complete defined locus in the PubMLST Neisseria database, which can be any identifiable sequence string, was identified with a unique and arbitrary ‘NEIS’ number, which can be associated with other designations such as conventional Demerec gene names. Additional loci that represent gene fragments used in typing schemes, and peptide loci representing typing antigen variable regions [31, 55], are also indexed within the database. The BIGSdb ‘autotagger’ function identified and automatically annotated an average of 1899 CDS from each assembled genome, with only a small number of paralogous loci (no more than 20) in the pan-genome. The currently identified paralogous loci require additional manual annotation; they have been found to vary between the Neisseria species and may vary among meningococcal lineages. In conclusion, the approach can be used to analyse large numbers of WGS datasets consistently and is generally applicable for use across the bacterial domain.
Ultimately the PubMLST Neisseria database can be expanded, through a process of iterative gene discovery, to become a catalogue of the meningococcal ‘pan genome’, i.e. all of the genes present in the species or genus [56, 57]. This database will develop over time by a process of community annotation but, by definition, the members of the meningococcal ‘core genome’, i.e. genes present in all meningococci, will already be present. Because every bacterial isolate is potentially an unrepresentative mutant and due to the imperfect nature of NGS assemblies, the core genome cannot be simply defined as the genes present in all isolates; however, the estimate of a core genome comprising 1605 genes generated here is in good agreement with other estimates (1532-1706) which were based on substantially fewer genomes [23, 37, 58, 59]. A total of 37% of the meningococcal core genes had been assigned an EC number at the time of writing, indicating the magnitude of the annotation task which NGS data generate. While the membership of the core genome will be refined over time, it is unlikely to be very different from that proposed here. An updated list of meningococcal core genes will be maintained in the database.
The genealogies reconstructed with the NeighborNet algorithm using Genome Comparator data for cgMLST and rMLST were consistent with those previously generated with MLST and a variety of other approaches [40, 60, 61]. The ribosomal gene (rMLST) and core genome (cgMLST) data provide more resolution, demonstrating that the six major hyperinvasive lineages included in this dataset cluster into a number of larger groups. Some lineages are more closely related to each other, although the star phylogeny demonstrates a highly diverse and recombining population from which invasive lineages have emerged independently on several occasions. As suggested from multilocus enzyme electrophoresis (MLEE) and other data, the serogroup A-associated lineages 1, 4 and 10 (ST-1, ST-4 and ST-5 complexes respectively) likely share a common ancestor, as do lineage 8 (ST-8 complex) and lineage 11 (ST-11 complex), and lineage 5 (ST-32 complex) and lineage 2 (ST-269 complex). Lineage 3 (ST-41/44 complex) is a diverse lineage comprising both more and less invasive types. These data confirm that the invasive lineages are defined by sequence variation in the core genome, although certain members of the accessory genome, for example the capsule, the meningococcal disease associated island phage [67, 68], and restriction modification systems [37, 69], are differentially distributed among lineages.
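Because the core genome cannot simply be the set of genes present in every isolate, a presence threshold is used; the abstract gives this as presence in at least 95% of the population. The following minimal sketch shows how such a threshold could be applied to a presence table. The function `core_loci`, the data layout and the locus and isolate names are hypothetical and do not represent the Genome Comparator implementation.

```python
from collections import Counter


def core_loci(presence, min_percent=95):
    """Return loci found in at least min_percent of isolates.

    presence maps an isolate name to the set of loci identified in its
    draft genome. Integer arithmetic avoids floating-point edge cases
    at the threshold itself.
    """
    n_isolates = len(presence)
    counts = Counter(locus for loci in presence.values() for locus in loci)
    return sorted(
        locus for locus, count in counts.items()
        if count * 100 >= min_percent * n_isolates
    )


# Toy data: 20 made-up isolates. LOCUS_A is present in all 20, LOCUS_B is
# missing from one (95%), and LOCUS_C is missing from two (90%).
toy = {f"isolate_{i}": {"LOCUS_A", "LOCUS_B", "LOCUS_C"} for i in range(20)}
toy["isolate_0"].discard("LOCUS_B")
toy["isolate_0"].discard("LOCUS_C")
toy["isolate_1"].discard("LOCUS_C")

print(core_loci(toy))
```

Relaxing the threshold below 100% tolerates loci that are missing from a few genomes because of assembly gaps or atypical isolates, which is the rationale given in the text.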
Table: Proposed Whole Genome Lineage Nomenclature. Row labels: Lineage 11 ^; Lineage 3 ^; Lineage 23 ^.
WGS data have the potential to unify studies of bacteria by providing comprehensive descriptions of genomic variation. To achieve this it is necessary to: (i) make the data available in a comprehensible way, along with information describing their completeness and accuracy; and (ii) link them to provenance and phenotype information, which describes the source of the sample and its properties, as well as the known properties of the genes identified and the deduced product. These datasets will grow in completeness and accuracy over time; however, it is also necessary for these data to be presented in a stable context, enabling even incomplete information to be explored. The approach described and validated here for the meningococcus is one way of achieving this, which employs generic, freely accessible and widely used tools. The use of the web interface within the PubMLST Neisseria database enables a process of community annotation whereby different members of the community can participate in the maintenance and improvement of sequence annotation and interpretation.
Bacterial strains and genomic DNA extraction
Genomic DNA from 108 diverse Neisseria meningitidis isolates was prepared from archive stocks which have been extensively characterized and previously reported [16, 27, 30, 70]; this data set includes the 107 MLST global reference collection isolates and FAM18 [16, 47]. Cultures were incubated on Columbia horse-blood agar (Oxoid) at 37°C in an atmosphere of 5% CO2 for 24 hours and sub-cultured, and genomic DNA was extracted using the Wizard® Genomic DNA Purification Kit (Promega).
Standard Illumina multiplex libraries, grouped A-K, were generated. Adapter ligated DNA was amplified by PCR using Taq or Phusion® DNA polymerase and primers from the Illumina multiplexing sample preparation oligonucleotide kit, creating up to 12 libraries per group. Before and after each of these steps DNA was simultaneously cleaned up and size selected using a 1:1 (sample:beads) ratio of Ampure beads (Beckman Coulter Genomics). Libraries were pooled in equimolar ratio and a maximum of twelve tagged, paired-end library aliquots were run per flowcell lane; every eighth lane contained the control genome, phiX 174. A standard Illumina clustering protocol was used with an additional QC step after cluster amplification. Passing flowcells were sequenced using the Illumina Genome Analyser II platform. Sequence reads have been deposited in the European Nucleotide Archive [EMBL: ERS006904 to ERS007010].
Short-read sequences were assembled using the VelvetOptimiser de novo short-read assembly program optimisation script using the default parameters [25, 26]. Once generated, there was no further manipulation of the assembled draft genome sequences.
For each step of the process where variation or patterning not associated with, or inherent in, the genome biology could be introduced, non-biological run nodes were recorded. These included notations of: date; technician; reagent lot used; manual and robotic library preparation methods, including plate lane; and sequencing steps, specifically noting chemistry changes, flowcell lane, number of samples per multiplex group, and the machine used.
BIGSdb genome annotation and locus tagging
The sequence definition database was seeded using the core loci identified in finished Neisseria meningitidis genome annotations. The locus tag identifier, ‘NEIS’ followed by an integer, was adopted in order to allow automated accessioning of loci as they are identified and added to the database. The NEIS (short for ‘Neisseria genus’) loci list was determined using the genome annotations of FAM18, H44/76, G2136, Z2491 and MC58 and represents, notionally, the pan-genome of the meningococcus. This included the ribosomal protein loci, a subset of the core loci which are also orthologous across all bacterial species. The NEIS identifiers are linked to an alias table that contains additional locus nomenclature associated with each locus; this table is searchable and therefore cross-compatible with various annotations, such as specific finished genome locus tags, KEGG EC numbers or common names. The number of loci contained in the list of NEIS locus identifiers is not static and will change as loci are curated and added to the database over time.
The draft genome sequences were queried within BIGSdb using BLAST against the sequence definition database to identify defined allelic variation. Alleles were automatically annotated and assigned the appropriate allele number for those loci for which definitions exist, in a process referred to as ‘tagging’, while new alleles were manually curated and assigned a new allele accession number. Gene sequences with frame shift mutations, internal stop codons, etc., were assigned an allele designation and flagged as having an internal stop codon. Any gene sequences with missing data, i.e. those at the ends of contigs, were flagged as incomplete and not assigned an allele number. Once identified, the locus allelic variant was linked to the isolate metadata.
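The flagging rules described above can be illustrated with a small check on a coding sequence. This sketch is an assumption-laden simplification rather than the BIGSdb code: the function `flag_cds`, its return strings and the use of the three standard stop codons are choices made for the example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard stop codons (assumed)


def flag_cds(sequence, at_contig_end=False):
    """Return a flag for a candidate CDS, read in frame from its start.

    Sequences truncated at a contig end are reported as incomplete; a
    length that is not a multiple of three suggests a frame shift; a stop
    codon before the final codon is reported as an internal stop codon.
    """
    if at_contig_end:
        return "incomplete"
    seq = sequence.upper()
    if len(seq) % 3 != 0:
        return "length not a multiple of three"
    # Examine every codon except the last, which is the expected stop.
    internal_codons = [seq[i:i + 3] for i in range(0, len(seq) - 3, 3)]
    if any(codon in STOP_CODONS for codon in internal_codons):
        return "internal stop codon"
    return "complete"


# Made-up sequences illustrating each outcome.
print(flag_cds("ATGAAATAA"))
print(flag_cds("ATGTAGAAATAA"))
print(flag_cds("ATGAAAT"))
print(flag_cds("ATGAAA", at_contig_end=True))
```

In the workflow described in the text, only the ‘incomplete’ case withholds an allele number; sequences with internal stop codons still receive an allele designation together with the flag.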
Reference to de novo genome comparisons
Assembled draft genome sequences were compared to their reference genome using the BIGSdb Genome Comparator tool and assessed using the finished genome CDS sequence annotation. Genes from each genome were also compared to previously typed loci, including conventional and extended MLST loci, three antigen loci, PorA VR1 and VR2, FetA VR, and fHbp, a surface antigen being explored as a vaccine candidate.
Sanger sequencing was performed to resolve typing loci conflicts found between Illumina and Sanger derived sequences, using a reserved sample of the DNA used for Illumina sequencing. Reserved Illumina DNA was amplified and sequenced using previously published methods and primers for conventional MLST, eMLST, PorA VR1, PorA VR2, or fHbp loci [16, 27, 29, 31]. Sanger trace files were assembled using the Staden sequence assembly package and compared to the Illumina derived sequence using the MEGA5 alignment tools.
Bowtie and Tablet
Read depth and sequence conflicts were checked by remapping using the Bowtie short-read aligner. For target sequence assessment, the contig containing the typing loci was extracted from the sequence bin and used as the reference segment, and the FAM18, Z2491, H44/76 and G2136 finished genomes were used as references for read mapping of their respective resequenced genomes. Briefly, the short reads were converted to SAM files and mapped against the reference segment using a randomized alignment order to avoid mapping bias. Aligned SAM files were visualized using the Tablet software package. Read depth and conflicting nucleotides of interest were identified and investigated.
Assembled contigs and annotation information can be accessed at PubMLST Neisseria database [http://pubmlst.org/neisseria/] using the query search, project ‘107 global collection’. Sequence reads have also been deposited in the European Nucleotide Archive (ENA) EMBL: ERS006904 to ERS007010 inclusive.
References to the data sets supporting the results of this article are included within the article and its 3 additional .pdf files.
Theresa Feltwell, John Burton, Michael Quail, and Stephen D Bently of The Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, UK for sequencing coordination, assistance and technical consultations. MCJM is a Wellcome Trust Senior Fellow in Basic Biomedical Sciences. This publication made use of the Neisseria Multi Locus Sequence Typing website (http://pubmlst.org/neisseria/) developed by Keith Jolley and sited at the University of Oxford. The development of this site has been funded by the Wellcome Trust and European Union.
- Medini D, Serruto D, Parkhill J, Relman DA, Donati C, Moxon R, Falkow S, Rappuoli R: Microbiology in the post-genomic era. Nat Rev Microbiol. 2008, 6 (6): 419-430.PubMedGoogle Scholar
- Dohm JC, Lottaz C, Borodina T, Himmelbauer H: SHARCGS, a fast and highly accurate short-read assembly algorithm for de novo genomic sequencing. Genome Res. 2007, 17 (11): 1697-1706. 10.1101/gr.6435207.
- Butler J, MacCallum I, Kleber M, Shlyakhter IA, Belmonte MK, Lander ES, Nusbaum C, Jaffe DB: ALLPATHS: De novo assembly of whole-genome shotgun microreads. Genome Res. 2008, 18 (5): 810-820. 10.1101/gr.7337908.
- Hernandez D, Francois P, Farinelli L, Osteras M, Schrenzel J: De novo bacterial genome sequencing: Millions of very short reads assembled on a desktop computer. Genome Res. 2008, 18 (5): 802-809. 10.1101/gr.072033.107.
- Farrer RA, Kemen E, Jones JD, Studholme DJ: De novo assembly of the Pseudomonas syringae pv. syringae B728a genome using Illumina/Solexa short sequence reads. FEMS Microbiol Lett. 2009, 291 (1): 103-111. 10.1111/j.1574-6968.2008.01441.x.
- Nishito Y, Osana Y, Hachiya T, Popendorf K, Toyoda A, Fujiyama A, Itaya M, Sakakibara Y: Whole genome assembly of a natto production strain Bacillus subtilis natto from very short read data. BMC Genomics. 2010, 11: 243-10.1186/1471-2164-11-243.
- Nagarajan H, Butler JE, Klimes A, Qiu Y, Zengler K, Ward J, Young ND, Methe BA, Palsson BO, Lovley DR, Barrett CL: De Novo Assembly of the Complete Genome of an Enhanced Electricity-Producing Variant of Geobacter sulfurreducens Using Only Short Reads. PLoS One. 2010, 5 (6): e10922-10.1371/journal.pone.0010922.
- Silva A, Schneider MPC, Cerdeira L, Barbosa MS, Ramos RTJ, Carneiro AR, Santos R, Lima M, D'Afonseca V, Almeida SS, Santos AR, Soares SC, Pinto AC, Ali A, Dorella FA, Rocha F, de Abreu VAC, Trost E, Tauch A, Shpigel N, Miyoshi A, Azevedo V: Complete Genome Sequence of Corynebacterium pseudotuberculosis I19, a Strain Isolated from a Cow in Israel with Bovine Mastitis. J Bacteriol. 2011, 193 (1): 323-324. 10.1128/JB.01211-10.
- Cerdeira LT, Carneiro AR, Ramos RTJ, de Almeida SS, D'Afonseca V, Schneider MPC, Baumbach J, Tauch A, McCulloch JA, Azevedo VAC, Silva A: Rapid hybrid de novo assembly of a microbial genome using only short reads: Corynebacterium pseudotuberculosis I19 as a case study. J Microbiol Meth. 2011, 86 (2): 218-223. 10.1016/j.mimet.2011.05.008.
- Flicek P, Birney E: Sense from sequence reads: methods for alignment and assembly (vol 6, pg S6, 2009). Nat Methods. 2010, 7 (6): 479-479.
- Ronen R, Boucher C, Chitsaz H, Pevzner P: SEQuel: improving the accuracy of genome assemblies. Bioinformatics. 2012, 28 (12): i188-196. 10.1093/bioinformatics/bts219.
- Chain PSG, Grafham DV, Fulton RS, FitzGerald MG, Hostetler J, Muzny D, Ali J, Birren B, Bruce DC, Buhay C, Cole JR, Ding Y, Dugan S, Field D, Garrity GM, Gibbs R, Graves T, Han CS, Harrison SH, Highlander S, Hugenholtz P, Khouri HM, Kodira CD, Kolker E, Kyrpides NC, Lang D, Lapidus A, Malfatti SA, Markowitz V, Metha T, et al: Genome Project Standards in a New Era of Sequencing. Science. 2009, 326 (5950): 236-237. 10.1126/science.1180614.
- Rodrigue S, Malmstrom RR, Berlin AM, Birren BW, Henn MR, Chisholm SW: Whole Genome Amplification and De novo Assembly of Single Bacterial Cells. PLoS One. 2009, 4 (9): e6864-10.1371/journal.pone.0006864.
- Earl AM, Eppinger M, Fricke WF, Rosovitz MJ, Rasko DA, Daugherty S, Losick R, Kolter R, Ravel J: Whole-Genome Sequences of Bacillus subtilis and Close Relatives. J Bacteriol. 2012, 194 (9): 2378-2379. 10.1128/JB.05675-11.
- Maiden MC, van Rensburg MJ, Bray JE, Earle SG, Ford SA, Jolley KA, McCarthy ND: MLST revisited: the gene-by-gene approach to bacterial genomics. Nat Rev Microbiol. 2013, 11 (10): 728-736. 10.1038/nrmicro3093.
- Maiden MCJ, Bygraves JA, Feil E, Morelli G, Russell JE, Urwin R, Zhang Q, Zhou J, Zurth K, Caugant DA, Feavers IM, Achtman M, Spratt BG: Multilocus sequence typing: a portable approach to the identification of clones within populations of pathogenic microorganisms. Proc Natl Acad Sci USA. 1998, 95 (6): 3140-3145. 10.1073/pnas.95.6.3140.
- Jolley KA, Maiden MC: BIGSdb: Scalable analysis of bacterial genome variation at the population level. BMC Bioinformatics. 2010, 11 (1): 595-10.1186/1471-2105-11-595.
- Caugant DA: Population genetics and molecular epidemiology of Neisseria meningitidis. APMIS. 1998, 106 (5): 505-525.
- Yazdankhah SP, Caugant DA: Neisseria meningitidis: an overview of the carriage state. J Med Microbiol. 2004, 53 (Pt 9): 821-832.
- Neal KR: Changing carriage rate of Neisseria meningitidis among university students during the first week of term: cross sectional study. BMJ. 2000, 320 (7238): 846-849. 10.1136/bmj.320.7238.846.
- Caugant DA, Maiden MC: Meningococcal carriage and disease - population biology and evolution. Vaccine. 2009, 27 (Suppl 2): B64-70.
- Marri PR, Paniscus M, Weyand NJ, Rendon MA, Calton CM, Hernandez DR, Higashi DL, Sodergren E, Weinstock GM, Rounsley SD, So M: Genome sequencing reveals widespread virulence gene exchange among human Neisseria species. PLoS One. 2010, 5 (7): e11835-10.1371/journal.pone.0011835.
- Schoen C, Blom J, Claus H, Schramm-Gluck A, Brandt P, Muller T, Goesmann A, Joseph B, Konietzny S, Kurzai O, Schmitt C, Friedrich T, Linke B, Vogel U, Frosch M: Whole-genome comparison of disease and carriage strains provides insights into virulence evolution in Neisseria meningitidis. Proc Natl Acad Sci USA. 2008, 105 (9): 3473-3478. 10.1073/pnas.0800151105.
- Joseph B, Schneiker-Bekel S, Schramm-Gluck A, Blom J, Claus H, Linke B, Schwarz RF, Becker A, Goesmann A, Frosch M, Schoen C: Comparative genome biology of a serogroup B carriage and disease strain supports a polygenic nature of meningococcal virulence. J Bacteriol. 2010, 192 (20): 5363-5377. 10.1128/JB.00883-10.
- Zerbino DR, Birney E: Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 2008, 18 (5): 821-829. 10.1101/gr.074492.107.
- Zerbino D: Using the Velvet de novo Assembler for Short-Read Sequencing Technologies. Curr Protoc Bioinformatics. 2010, 11 (5): 1-12.
- Didelot X, Urwin R, Maiden MC, Falush D: Genealogical typing of Neisseria meningitidis. Microbiology. 2009, 155 (10): 3176-3186. 10.1099/mic.0.031534-0.
- Holmes EC, Urwin R, Maiden MCJ: The influence of recombination on the population structure and evolution of the human pathogen Neisseria meningitidis. Mol Biol Evol. 1999, 16 (6): 741-749. 10.1093/oxfordjournals.molbev.a026159.
- Russell JE, Jolley KA, Feavers IM, Maiden MC, Suker J: PorA variable regions of Neisseria meningitidis. Emerg Infect Dis. 2004, 10 (4): 674-678. 10.3201/eid1004.030247.
- Thompson EAL, Feavers IM, Maiden MCJ: Antigenic diversity of meningococcal enterobactin receptor FetA, a vaccine component. Microbiology. 2003, 149 (Pt 7): 1849-1858.
- Brehony C, Wilson DJ, Maiden MC: Variation of the factor H-binding protein of Neisseria meningitidis. Microbiology. 2009, 155: 4155-4169. 10.1099/mic.0.027995-0.
- Benjamini Y, Speed TP: Summarizing and correcting the GC content bias in high-throughput sequencing. Nucleic Acids Res. 2012, 40 (10): e72-10.1093/nar/gks001.
- Dohm JC, Lottaz C, Borodina T, Himmelbauer H: Substantial biases in ultra-short read data sets from high-throughput DNA sequencing. Nucleic Acids Res. 2008, 36 (16): e105-10.1093/nar/gkn425.
- Kozarewa I, Ning Z, Quail MA, Sanders MJ, Berriman M, Turner DJ: Amplification-free Illumina sequencing-library preparation facilitates improved mapping and assembly of (G + C)-biased genomes. Nat Methods. 2009, 6 (4): 291-295. 10.1038/nmeth.1311.
- Aird D, Ross MG, Chen WS, Danielsson M, Fennell T, Russ C, Jaffe DB, Nusbaum C, Gnirke A: Analyzing and minimizing PCR amplification bias in Illumina sequencing libraries. Genome Biol. 2011, 12 (2): R18-10.1186/gb-2011-12-2-r18.
- Jolley KA, Hill DM, Bratcher HB, Harrison OB, Feavers IM, Parkhill J, Maiden MC: Resolution of a meningococcal disease outbreak from whole genome sequence data with rapid web-based analysis methods. J Clin Microbiol. 2012, 50 (9): 3046-3053. 10.1128/JCM.01312-12.
- Budroni S, Siena E, Hotopp JCD, Seib KL, Serruto D, Nofroni C, Comanducci M, Riley DR, Daugherty SC, Angiuoli SV, Covacci A, Pizza M, Rappuoli R, Moxon ER, Tettelin H, Medini D: Neisseria meningitidis is structured in clades associated with restriction modification systems that modulate homologous recombination. Proc Natl Acad Sci USA. 2011, 108 (11): 4494-4499. 10.1073/pnas.1019751108.
- Ogata H, Goto S, Sato K, Fujibuchi W, Bono H, Kanehisa M: KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res. 1999, 27 (1): 29-34. 10.1093/nar/27.1.29.
- Kanehisa M, Goto S: KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res. 2000, 28 (1): 27-30. 10.1093/nar/28.1.27.
- Jolley KA, Bliss CM, Bennett JS, Bratcher HB, Brehony CM, Colles FM, Wimalarathna HM, Harrison OB, Sheppard SK, Cody AJ, Maiden MC: Ribosomal Multi-Locus Sequence Typing: universal characterization of bacteria from domain to strain. Microbiology. 2012, 158: 1005-1015. 10.1099/mic.0.055459-0.
- Jolley KA, Maiden MC: Using MLST to study bacterial variation: prospects in the genomic era. Future Microbiol. 2014, 9: 623-630. 10.2217/fmb.14.24.
- Loman NJ, Constantinidou C, Chan JZM, Halachev M, Sergeant M, Penn CW, Robinson ER, Pallen MJ: High-throughput bacterial genome sequencing: an embarrassment of choice, a world of opportunity. Nat Rev Microbiol. 2012, 10 (9): 599-606. 10.1038/nrmicro2850.
- Aury JM, Cruaud C, Barbe V, Rogier O, Mangenot S, Samson G, Poulain J, Anthouard V, Scarpelli C, Artiguenave F, Wincker P: High quality draft sequences for prokaryotic genomes using a mix of new sequencing technologies. BMC Genomics. 2008, 9: 603-10.1186/1471-2164-9-603.
- Reuter S, Ellington MJ, Cartwright EJ, Koser CU, Torok ME, Gouliouris T, Harris SR, Brown NM, Holden MT, Quail M, Parkhill J, Smith GP, Bentley SD, Peacock SJ: Rapid bacterial whole-genome sequencing to enhance diagnostic and public health microbiology. JAMA Intern Med. 2013, 173 (15): 1397-1404. 10.1001/jamainternmed.2013.7734.
- Bratcher HB, Bennett JS, Maiden MCJ: Evolutionary and genomic insights into meningococcal biology. Future Microbiol. 2012, 7 (7): 873-885. 10.2217/fmb.12.62.
- Parkhill J, Achtman M, James KD, Bentley SD, Churcher C, Klee SR, Morelli G, Basham D, Brown D, Chillingworth T, Davies RM, Davis P, Devlin K, Feltwell T, Hamlin N, Holroyd S, Jagels K, Leather S, Moule S, Mungall K, Quail MA, Rajandream MA, Rutherford KM, Simmonds M, Skelton J, Whitehead S, Spratt BG, Barrell BG: Complete DNA sequence of a serogroup A strain of Neisseria meningitidis Z2491. Nature. 2000, 404 (6777): 502-506. 10.1038/35006655.
- Bentley SD, Vernikos GS, Snyder LA, Churcher C, Arrowsmith C, Chillingworth T, Cronin A, Davis PH, Holroyd NE, Jagels K, Maddison M, Moule S, Rabbinowitsch E, Sharp S, Unwin L, Whitehead S, Quail MA, Achtman M, Barrell B, Saunders NJ, Parkhill J: Meningococcal genetic variation mechanisms viewed through comparative analysis of serogroup C strain FAM18. PLoS Genet. 2007, 3 (2): e23-10.1371/journal.pgen.0030023.
- Tettelin H, Saunders NJ, Heidelberg J, Jeffries AC, Nelson KE, Eisen JA, Ketchum KA, Hood DW, Peden JF, Dodson RJ, Nelson WC, Gwinn ML, DeBoy R, Peterson JD, Hickey EK, Haft DH, Salzberg SL, White O, Fleischmann RD, Dougherty BA, Mason T, Ciecko A, Parksey DS, Blair E, Cittone H, Clark EB, Cotton MD, Utterback TR, Khouri H, Qin H, et al: Complete genome sequence of Neisseria meningitidis serogroup B strain MC58. Science. 2000, 287 (5459): 1809-1815. 10.1126/science.287.5459.1809.
- Wetzel J, Kingsford C, Pop M: Assessing the benefits of using mate-pairs to resolve repeats in de novo short-read prokaryotic assemblies. BMC Bioinformatics. 2011, 12: 95-10.1186/1471-2105-12-95.
- Treangen TJ, Salzberg SL: Repetitive DNA and next-generation sequencing: computational challenges and solutions. Nat Rev Genet. 2012, 13 (2): 36-
- Quail MA, Kozarewa I, Smith F, Scally A, Stephens PJ, Durbin R, Swerdlow H, Turner DJ: A large genome center's improvements to the Illumina sequencing system. Nat Methods. 2008, 5 (12): 1005-1010. 10.1038/nmeth.1270.
- Quail MA, Swerdlow H, Turner DJ: Improved protocols for the illumina genome analyzer sequencing system. Curr Protoc Hum Genet. 2009, 18: doi:10.1002/0471142905.hg1802s62
- Paszkiewicz K, Studholme DJ: De novo assembly of short sequence reads. Brief Bioinform. 2010, 11 (5): 457-472. 10.1093/bib/bbq020.
- Demerec M, Adelberg EA, Clark AJ, Hartman PE: A proposal for a uniform nomenclature in bacterial genetics. Genetics. 1966, 54 (1): 61-76.
- Bambini S, De Chiara M, Muzzi A, Mora M, Lucidarme J, Brehony C, Borrow R, Masignani V, Comanducci M, Maiden MC, Rappuoli R, Pizza M, Jolley KA: Neisseria adhesin A variation and revised nomenclature scheme. Clin Vaccine Immunol. 2014, 21: 966-971. 10.1128/CVI.00825-13.
- Medini D, Donati C, Tettelin H, Masignani V, Rappuoli R: The microbial pan-genome. Curr Opin Genet Dev. 2005, 15 (6): 589-594. 10.1016/j.gde.2005.09.006.
- Bennett JS, Bentley SD, Vernikos GS, Quail MA, Cherevach I, White B, Parkhill J, Maiden MCJ: Independent evolution of the core and accessory gene sets in the genus Neisseria: insights gained from the genome of Neisseria lactamica isolate 020-06. BMC Genomics. 2010, 11: 652-10.1186/1471-2164-11-652.
- Joseph B, Schwarz RF, Linke B, Blom J, Becker A, Claus H, Goesmann A, Frosch M, Muller T, Vogel U, Schoen C: Virulence evolution of the human pathogen Neisseria meningitidis by recombination in the core and accessory genome. PLoS One. 2011, 6 (4): e18441-10.1371/journal.pone.0018441.
- Hotopp JC, Grifantini R, Kumar N, Tzeng YL, Fouts D, Frigimelica E, Draghi M, Giuliani MM, Rappuoli R, Stephens DS, Grandi G, Tettelin H: Comparative genomics of Neisseria meningitidis: core genome, islands of horizontal transfer and pathogen-specific genes. Microbiology. 2006, 152 (Pt 12): 3733-3749.
- Brehony C, Jolley KA, Maiden MC: Multilocus sequence typing for global surveillance of meningococcal disease. FEMS Microbiol Rev. 2007, 31 (1): 15-26. 10.1111/j.1574-6976.2006.00056.x.
- Brehony C, Trotter CL, Ramsay ME, Chandra M, Jolley KA, van der Ende A, Carion F, Berthelsen L, Hoffmann S, Harðardóttir H, Vazquez J, Murphy K, Toropainen M, Caniça M, Ferreira E, Diggle M, Edwards G, Taha M-K, Stefanelli P, Kriz P, Gray S, Fox A, Jacobsson S, Claus H, Vogel U, Tzanakaki G, Heuberger S, Caugant DA, Frosch M, Maiden MCJ: Differential age distribution of disease-associated meningococcal lineages-Implications for vaccine development. Clin Vaccine Immunol. 2014, 21 (6): 847-853. 10.1128/CVI.00133-14.
- Watkins ER, Maiden MC: Persistence of hyperinvasive meningococcal strain types during global spread as recorded in the PubMLST database. PLoS One. 2012, 7 (9): e45349-10.1371/journal.pone.0045349.
- Didelot X, Falush D: Inference of bacterial microevolution using multilocus sequence data. Genetics. 2007, 175 (3): 1251-1266.
- Caugant DA, Mocca LF, Frasch CE, Frøholm LO, Zollinger WD, Selander RK: Genetic structure of Neisseria meningitidis populations in relation to serogroup, serotype, and outer membrane protein pattern. J Bacteriol. 1987, 169 (6): 2781-2792.
- Olyhoek T, Crowe BA, Achtman M: Clonal population structure of Neisseria meningitidis serogroup A isolated from epidemics and pandemics between 1915 and 1983. Rev Infect Dis. 1987, 9: 665-682. 10.1093/clinids/9.4.665.
- Harrison OB, Claus H, Jiang Y, Bennett JS, Bratcher HB, Jolley KA, Corton C, Care R, Poolman JT, Zollinger WD, Frasch CE, Stephens DS, Feavers I, Frosch M, Parkhill J, Vogel U, Quail MA, Bentley SD, Maiden MCJ: Description and nomenclature of Neisseria meningitidis capsule locus. Emerg Infect Dis. 2013, 19 (4): 566-573. 10.3201/eid1904.111799.
- Bille E, Ure R, Gray SJ, Kaczmarski EB, McCarthy ND, Nassif X, Maiden MC, Tinsley CR: Association of a bacteriophage with meningococcal disease in young adults. PLoS One. 2008, 3 (12): e3885-10.1371/journal.pone.0003885.
- Bille E, Zahar JR, Perrin A, Morelle S, Kriz P, Jolley KA, Maiden MC, Dervin C, Nassif X, Tinsley CR: A chromosomally integrated bacteriophage in invasive meningococci. J Exp Med. 2005, 201 (12): 1905-1913. 10.1084/jem.20050112.
- Claus H, Friedrich A, Frosch M, Vogel U: Differential distribution of novel restriction-modification systems in clonal lineages of Neisseria meningitidis. J Bacteriol. 2000, 182 (5): 1296-1303. 10.1128/JB.182.5.1296-1303.2000.
- Urwin R, Russell JE, Thompson EA, Holmes EC, Feavers IM, Maiden MC: Distribution of Surface Protein Variants among Hyperinvasive Meningococci: Implications for Vaccine Design. Infect Immun. 2004, 72 (10): 5955-5962. 10.1128/IAI.72.10.5955-5962.2004.
- Staden R: The Staden sequence analysis package. Mol Biotechnol. 1996, 5: 233-241. 10.1007/BF02900361.
- Tamura K, Peterson D, Peterson N, Stecher G, Nei M, Kumar S: MEGA5: molecular evolutionary genetics analysis using maximum likelihood, evolutionary distance, and maximum parsimony methods. Mol Biol Evol. 2011, 28 (10): 2731-2739. 10.1093/molbev/msr121.
- Langmead B, Trapnell C, Pop M, Salzberg SL: Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 2009, 10 (3): R25-10.1186/gb-2009-10-3-r25.
- Milne I, Bayer M, Cardle L, Shaw P, Stephen G, Wright F, Marshall D: Tablet–next generation sequence assembly visualization. Bioinformatics. 2010, 26 (3): 401-402. 10.1093/bioinformatics/btp666.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.