Global Catalogue of Microorganisms (GCM): a comprehensive database and information retrieval, analysis, and visualization system for microbial resources
© Wu et al.; licensee BioMed Central Ltd. 2013
Received: 4 September 2013
Accepted: 27 December 2013
Published: 30 December 2013
Throughout the long history of industrial and academic research, many microbes have been isolated, characterized and preserved (whenever possible) in culture collections. With the steady accumulation of observational biodiversity data and microbial sequencing data, bio-resource centers have to function as data and information repositories that serve academia, industry, and regulators on behalf of and for the general public. Hence, the World Data Centre for Microorganisms (WDCM) has taken responsibility for constructing an effective information environment that promotes and sustains microbial research data activities and bridges the gaps that currently exist within and outside the microbiology communities.
Strain catalogue information was collected from collections by online submission. We developed tools for automatic extraction of strain numbers and species names from various sources, including GenBank, PubMed, and SwissProt. These tools connect strain catalogue information with the corresponding nucleotide and protein sequences, as well as with genome sequences and references citing a particular strain. All information was processed and compiled to create a comprehensive database of microbial resources, named the Global Catalogue of Microorganisms (GCM). The current version of GCM contains information on 273,933 strains, which include 43,436 bacterial, fungal and archaeal species from 52 collections in 25 countries and regions.
A number of online analysis and statistical tools have been integrated, together with advanced search functions, which should greatly facilitate the exploration of the content of GCM.
A comprehensive, dynamic database of microbial resources has been created. It reveals the resources preserved in culture collections, especially those whose informatics infrastructures are still under development, and should foster cumulative research and facilitate the work of microbiologists worldwide in both public and industrial research centres. The database is available at http://gcm.wfcc.info.
Keywords: Microbial resources, Data management, Data sharing
Microbial culture collections play an essential role in collecting, maintaining, and distributing quality-assured living microbial strains. The World Federation for Culture Collections (WFCC) is a Multidisciplinary Commission of the International Union of Biological Sciences (IUBS) and a Federation within the International Union of Microbiological Societies (IUMS). The WFCC promotes the interests of culture collections, develops shared resources, and organizes the International Conference on Culture Collections every three years. As one of its longstanding activities, the WFCC participated in the development of the WFCC World Data Centre for Microorganisms (WDCM) in the late 1960s. With additional input from the United Nations Educational, Scientific and Cultural Organization Microbial Resources Centers (MIRCEN) project, the WDCM was maintained as the WFCC-MIRCEN WDCM and became accessible as an internet page in 1997. The WDCM serves as the data center of the WFCC and provides an important information resource for all microbiological activities. Additionally, the WDCM acts as a coordination center for data activities among WFCC members. As one of the main databases in WDCM, CCINFO (Culture Collection INFOrmation database) lists 652 culture collections from 70 countries, which together maintain more than 1.9 million strains (http://www.wfcc.info/ccinfo/, accessed 12/3/2013).
Increasing demands on culture collections for authenticated, reliable biological material and associated information were accompanied by the growth of biotechnology and basic science. The WFCC guidelines recommend that every collection publish an online or printed catalogue regularly, both to disseminate information about strains and to promote scientific and industrial usage of materials held in their collection. However, according to the available statistics, fewer than one-sixth of collections registered in CCINFO post their catalogue online and this greatly hinders the visibility and hence the accessibility of strains in these collections without public electronic catalogs.
To help all collections establish an online catalogue, the WDCM has constructed a data management system and a global catalogue to organize, make public, and explore the data resources of its member collections. This data management system, called the WFCC Global Catalogue of Microorganisms (GCM), is a scalable, reliable, dynamic and user-friendly system that helps culture collections manage, disseminate and share the information related to their holdings. It also provides a uniform interface through which the scientific and industrial communities can access comprehensive microbial resource information.
Construction and content
The Global Catalogue of Microorganisms database contains information from a variety of sources:
Information provided by culture collection staff
Data from public data sources such as the US National Library of Medicine (PubMed) and the Patent database
Links to external databases
Tools for bioinformatics analysis including a search engine to enhance exploration of GCM data.
Participant list of GCM collections
BIOTEC Culture Collection
BCCM Diatom Collection Gent
Belgian Coordinated Collections of Microorganisms / IHEM Fungi Collection
Belgian Coordinated Collections of Microorganisms / LMBP Plasmid Collection
Belgian Coordinated Collections of Microorganisms / LMG Bacteria Collection
Mycotheque de l’Universite catholique de Louvain
BCCM/ULC Culture Collection of (sub)polar cyanobacteria
Bioresource Collection and Research Center
Belarusian Collection of non-pathogenic microorganisms
Centraalbureau voor Schimmelcultures, Filamentous fungi and Yeast Collection
Culture Collection of Algae and Protozoa
Culture Collection of Antimicrobial Resistant Microorganisms
Culture Collection of Cryophilic Algae
Coleccion Espanola de Cultivos Tipo
China General Microbiological Culture Collection Center
The Collection of the Institut Pasteur
Centre International de Ressources Microbiennes - Champignons Filamenteux
Centre International de Ressources Microbiennes - Levures (CLBP)
Centre International de Ressources Microbiennes - Levures
Coleccion de Microorganismos del Centro Nacional de Recursos Geneticos
Centro Venezolano de Colecciones de Microorganismos
Herbarium of Kharkov University (CWU) – Micro Algae Cultures Collection
Medical importance fungi culture collection
Leibniz-Institut DSMZ-Deutsche Sammlung von Mikroorganismen und Zellkulturen GmbH
Freshwater Algae Culture Collection, Chinese Academy of Sciences
Fungal Genetics Stock Center
Coleção de Leishmania do Instituto Oswaldo Cruz
Guangdong Culture Collection Centre of Microbiology
Helicobacter pylori Korean Type Culture Collection
CABI Genetic Resource Collection
Industrial Technology Development Institute Microbial Culture Collection
Belgian Coordinated Collections of Microorganisms Mycobacterial Culture Collection
Japan Collection of Microorganisms
KCTC Korean Collection for Type Cultures
Korea national Environmental Microorganisms Bank
Korea Marine Microalgae Culture Center
Korea Lichen & Allied Bioresource Center
Lembaga Ilmu Pengetahuan Indonesia, Indonesian Institute for Sciences
Microbial Culture Collection - Museum of Natural History, Museum of Natural History (MNH)
NITE Biological Resource Center
Philippine National Collection of Microorganisms
Plant Virus GenBank
TISTR Culture Collection, Bangkok MIRCEN
Ukrainian Collection of Cholera Aetiological Agents O1 and non O1 serogroups
Phaff Yeast Culture Collection
The UNILAB Clinical Culture Collection, United Laboratories
Micoteca da Universidade do Minho
UOA/HCPF University of Athens/Hellenic Collection of Pathogenic Fungi
Natural Sciences Research Institute Culture Collection
MICROBIAL CULTURE COLLECTION UNIT
All-Russian Collection of Microorganisms
Vietnam Type Culture Collection
Summary of GCM strain data
The GCM implements the WDCM Minimum Data Sets (MDS) and Recommended Data Sets (RDS), which are based on widely applied standards such as the OECD Best Practice Guidelines for Biological Resource Centres, the Microbial Information Network Europe (MINE), and the Common Access to Biological Resources and Information (CABRI). A detailed description, together with examples of 15 WDCM MDS items, can be found at http://gcm.wfcc.info/datastandards/index.jsp (last accessed 12/3/2013). To build the GCM, each participating collection transferred its catalogue information by one of several pathways. Some collections sent Excel or XML files, while others provided direct access to their database files. WDCM integrated the data into a global dataset, processed the data to identify relationships among collections (for example, strains held in multiple collections), and published the strain information on the GCM web page (http://www.gcm.wfcc.info). Because not all collections use the same data schema, some of the data items provided by culture collection staff were manually reclassified by GCM staff to ease the integration of catalogue information.
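The schema harmonization step described above can be illustrated with a minimal sketch. The collection names, source column names, and target field names below are hypothetical; they are not the actual WDCM MDS item names, and the paper does not describe how GCM implements this mapping.

```python
# Hypothetical per-collection column names mapped onto one shared schema.
# None of these names come from the WDCM MDS; they are for illustration only.
FIELD_MAPS = {
    "collection_a": {
        "Strain No.": "strain_number",
        "Name": "species_name",
        "Source": "isolation_source",
    },
    "collection_b": {
        "acc": "strain_number",
        "taxon": "species_name",
        "isolated_from": "isolation_source",
    },
}


def harmonize(collection, row):
    """Rename the fields of one catalogue row to the shared schema."""
    mapping = FIELD_MAPS[collection]
    record = {
        target: row[source].strip()
        for source, target in mapping.items()
        if source in row
    }
    record["collection"] = collection  # keep track of where the row came from
    return record


rows = [
    ("collection_a", {"Strain No.": "A 101", "Name": "Escherichia coli ", "Source": "soil"}),
    ("collection_b", {"acc": "B-7", "taxon": "Escherichia coli", "isolated_from": "wastewater"}),
]
merged = [harmonize(name, row) for name, row in rows]
```

Keeping the mapping in a plain table means that a data item reclassified by hand only needs a change to the table, not to the code.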
Publications concerning strains are collected from PubMed using both strain number and species name for keyword queries. Nucleotide sequences are extracted from GenBank, protein sequence data are collected from UniProt, and information about protein 3D structure is extracted from the PDB database. Genome sequencing information is collected from NCBI Microbial Genomes Resources (NCBI).
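As an illustration of a keyword query that combines strain number and species name, the sketch below builds a request URL for the NCBI E-utilities ESearch endpoint. The paper does not say which interface GCM uses or how the two keywords are combined. The `AND` combination, the helper name `pubmed_query_url`, and the example strain are assumptions.

```python
from urllib.parse import urlencode

# Public NCBI E-utilities search endpoint.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def pubmed_query_url(strain_number, species_name, retmax=20):
    """Build an ESearch URL that requires both keywords (an assumed combination)."""
    # Quote each value so it is searched as a phrase.
    term = f'"{strain_number}" AND "{species_name}"'
    params = {"db": "pubmed", "term": term, "retmode": "json", "retmax": retmax}
    return f"{ESEARCH}?{urlencode(params)}"


# Example values chosen for illustration only.
url = pubmed_query_url("DSM 30083", "Escherichia coli")
```

The function only constructs the URL; fetching and parsing the response, and respecting NCBI usage limits, are left out of the sketch.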
Organization of data
Isolation sources of strains sorted by type of organism (table; surviving labels: isolation source type, genetic engineering strain)
Top 20 countries from which strains were collected (figure)
Data quality control
Because original catalogue data are sometimes non-validated, quality control measures are necessary before data can be published in GCM online. The most frequent quality problem is the misspelling or non-standard naming of species. For example, “Absidia psychrophilia” was wrongly spelled as “Absidia psychrophila” in certain collections. In such cases, GCM uses standard microbial nomenclature databases to perform a quality check of its taxonomic data. These databases include the List of Prokaryotic names with Standing in Nomenclature (LPSN), “Species 2000”, and NCBI Taxonomy for bacteria and archaea, and MycoBank for fungi and yeasts.
Result summary of species name check (table; surviving column labels: un-matched species names, percentage of un-matched names)
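The species name check can be sketched as a lookup against a reference name list, with approximate string matching to suggest a likely correct spelling. This is an illustrative approach using Python's standard `difflib` module, not a description of GCM's actual implementation. The function name, the cutoff value, and the three-name reference list are assumptions.

```python
import difflib


def check_species_name(name, reference_names, cutoff=0.85):
    """Return (status, suggestion) for one catalogue species name."""
    if name in reference_names:
        return "match", name
    # Look for the single closest reference name above the similarity cutoff.
    close = difflib.get_close_matches(name, reference_names, n=1, cutoff=cutoff)
    if close:
        return "possible misspelling", close[0]
    return "unmatched", None


# Tiny stand-in for a nomenclature database such as LPSN or MycoBank.
reference = ["Absidia psychrophilia", "Candida viswanathii", "Escherichia coli"]

result = check_species_name("Absidia psychrophila", reference)
```

Names flagged as possible misspellings or as unmatched would still need review by a curator, since a close string match does not prove that two names refer to the same taxon.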
The second type of problem concerns data content. For example, some “Escherichia coli” strains were wrongly classified as “Fungi” in the host collection databases. The GCM system compares the descriptions of cultures in one collection with those of the same strains in other collections and compiles a list of the differences.
History information was also used to check species names. In total, 12,147 strains in GCM have detailed history information. The system listed the species name of each of these strains and compared it with the species name recorded for the corresponding strain in other collections. Of the 12,147 strains, 1,746 had a species name that differed from that of their history strains. Further analysis showed that 267 of these mismatches were misspellings, such as “Candida viswanathii” written as “Candida viswannathii”. The remainder were errors or name changes that occurred during strain transfer between collections.
Divergent results are forwarded to the curators of the respective collections for correction. Performing such controls for all fields of the database greatly assists collections in correcting existing mistakes.
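A minimal sketch of the history-based comparison follows. It assumes that each record lists, in a hypothetical `history` field, the strain numbers under which the same strain is held in other collections. The record layout and the strain numbers are invented for illustration.

```python
def name_conflicts(records):
    """List pairs of history-linked strains whose species names differ."""
    by_number = {rec["strain_number"]: rec for rec in records}
    conflicts = []
    for rec in records:
        for other_number in rec.get("history", []):
            other = by_number.get(other_number)
            # Skip links to strains that are not in the dataset.
            if other and other["species_name"] != rec["species_name"]:
                conflicts.append(
                    (rec["strain_number"], rec["species_name"],
                     other_number, other["species_name"])
                )
    return conflicts


# Invented records: "BBB 2" was received from "AAA 1".
records = [
    {"strain_number": "AAA 1", "species_name": "Candida viswanathii", "history": []},
    {"strain_number": "BBB 2", "species_name": "Candida viswannathii", "history": ["AAA 1"]},
]
conflicts = name_conflicts(records)
```

Each reported pair could then be classified as a misspelling, a name change, or an error before being forwarded to the curators.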
Interface and web tools
The database homepage contains a world map which indicates the countries and regions that have already joined the GCM project. Statistics and graphics indicate the continuing acquisition of data into the GCM. A simplified search interface allows the querying of the database by using the strain number and species name. In addition, a variety of tools have been implemented to enhance its use. The main web tools that were integrated into the GCM are the following:
Three query options are available in the advanced search section. Users may search for strains within a range of values for one or several properties, including cultivation temperature, substrate, or application, before retrieving the corresponding results.
Since GCM maintains nucleotide sequence data associated with individual strains, a sequence alignment tool based on the Basic Local Alignment Search Tool (BLASTN) is included. Results are ordered by similarity.
Bibliographic and patent queries are also possible and allow users to search by keywords in the titles and abstracts of articles or patents. Search results are listed as strain numbers, strain names, publication abstracts and titles, and can be exported in text file format.
With the advanced search tools, the system can perform the following searches:
➢ Searching for type strains for some taxa in certain culture collections
➢ Searching for strains with specific characteristics in the list of Culture Collections (CC) or Biological Resource Centers (BRC), such as range of growth temperature, transfer history, collection location and others
➢ Searching for strains with specific properties
➢ Searching for strains isolated from various substrates, including sludge or wastewater, soils, sediment, and fermentation products. Results are listed in table format, with the organism type used as the column name
➢ Searching strains with particular protein coding genes
Results are listed by strain number, species name, culture collections, and isolation sources. A few filter windows are provided in the result page to allow users to refine the results by collections, growth temperature, isolation sources or organism type.
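A range search of the kind described above can be sketched as a simple filter over strain records. The field names (`growth_temp_c`, `isolation_source`) and the sample records are hypothetical, and the real system presumably runs such queries against its database rather than in memory.

```python
def search_strains(strains, temp_range=None, substrate=None):
    """Filter strain records by growth temperature range and isolation substrate."""
    low, high = temp_range if temp_range else (float("-inf"), float("inf"))
    results = []
    for strain in strains:
        if not (low <= strain["growth_temp_c"] <= high):
            continue
        # Case-insensitive substring match on the isolation source.
        if substrate and substrate.lower() not in strain["isolation_source"].lower():
            continue
        results.append(strain)
    return results


# Invented sample records.
strains = [
    {"strain_number": "X 1", "growth_temp_c": 28, "isolation_source": "Forest soil"},
    {"strain_number": "X 2", "growth_temp_c": 37, "isolation_source": "Wastewater"},
    {"strain_number": "X 3", "growth_temp_c": 55, "isolation_source": "Hot spring sediment"},
]
hits = search_strains(strains, temp_range=(25, 40), substrate="soil")
```

Leaving a criterion as `None` disables it, which mirrors a search form in which unused fields are left blank.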
Species tree viewer
A Species 2000 taxonomy tree is used for the organization of strain information. Species names are used to map between GCM data and Species 2000 names (http://www.sp2000.org/), and a taxonomic tree containing the number of strains for each genus is then constructed. Users can browse the taxonomy tree itself or search for a species name within it.
A variety of analysis tools are also provided on both the strain information and species pages. The BLAST program (Altschul et al.) is used for sequence homology searches within the database. For sequences related to the same strain or species, the ClustalW program is provided to perform multiple sequence alignment.
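The construction of a tree with per-genus strain counts can be sketched as follows. The lineage lookup stands in for the mapping to the reference taxonomy and is abbreviated to three ranks. Both the lookup table and the function name are assumptions made for illustration.

```python
def build_tree(strain_species, lineage_of):
    """Build a nested dict of ranks whose genus-level leaves hold strain counts."""
    tree = {}
    for species in strain_species:
        lineage = lineage_of.get(species)
        if lineage is None:
            continue  # species name not matched in the reference taxonomy
        node = tree
        for rank_name in lineage[:-1]:
            node = node.setdefault(rank_name, {})
        genus = lineage[-1]
        node[genus] = node.get(genus, 0) + 1
    return tree


# Abbreviated lineages (kingdom, phylum, genus), for illustration only.
lineage_of = {
    "Escherichia coli": ("Bacteria", "Proteobacteria", "Escherichia"),
    "Bacillus subtilis": ("Bacteria", "Firmicutes", "Bacillus"),
}
# One entry per strain; the species name is all that matters for counting.
strain_species = ["Escherichia coli", "Escherichia coli", "Bacillus subtilis", "Unknown sp."]
tree = build_tree(strain_species, lineage_of)
```

Strains whose names do not map to the reference taxonomy are skipped here; in practice they would be routed to the name quality check.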
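To illustrate how homology search results might be ordered by similarity, the sketch below parses BLAST tabular output (the standard 12-column format) and sorts the hits. The paper does not state which output format or similarity measure GCM uses. Sorting by percent identity with bit score as a tie-breaker is an assumption, and the sample hits are invented.

```python
def rank_hits(tabular_text):
    """Parse 12-column BLAST tabular output and sort hits by identity, then bit score."""
    hits = []
    for line in tabular_text.strip().splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        cols = line.split("\t")
        hits.append({
            "subject": cols[1],
            "identity": float(cols[2]),   # percent identity
            "evalue": float(cols[10]),
            "bitscore": float(cols[11]),
        })
    return sorted(hits, key=lambda h: (h["identity"], h["bitscore"]), reverse=True)


# Invented sample output: query, subject, % identity, length, mismatches,
# gap opens, query start/end, subject start/end, e-value, bit score.
sample = (
    "query1\tstrainA\t98.50\t200\t3\t0\t1\t200\t1\t200\t1e-100\t350\n"
    "query1\tstrainB\t99.50\t200\t1\t0\t1\t200\t1\t200\t1e-105\t365\n"
)
ranked = rank_hits(sample)
```

Other orderings, such as by e-value or bit score alone, only require changing the sort key.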
Data update and management
Discussion and conclusion
A large number of microbial resources are preserved as living strains in collections; however, information describing these strains is often unavailable. Each culture collection is independently responsible for maintaining the data associated with its microbes, and there is presently no enforced mechanism for data harmonization and information sharing. This situation hinders both the efficient management of collections and the ability to explore statistics about world microbial resources. Therefore, there is great demand for a mechanism for digital, online resource sharing, which would provide a fundamental tool for best practices in information management.
The major target groups for such a system are culture collection staff and academic and industrial microbiologists. We believe that GCM will assist collections that lack the required human resources and information technology to publish their stock information in an efficient and standardized way that is most useful for the scientific and industrial communities. Database queries via a user-friendly, web-based interface should greatly promote the sharing and use of microbial resources.
While this project is still in its early stage, we are confident that it will continue to grow with the further addition of data, analytical tools and other functionalities. In the future, additional database management tools will be provided to allow more culture collections to share their data via GCM. These tools will lead to the increased availability of accessible data pertaining to microbial strains held in public collections and their utilization for bioindustry, medicine, and research. As it grows, GCM will incorporate information related to enzymatic and metabolic pathways using developing genomics and bioinformatics tools. Ultimately, GCM is a comprehensive data platform on microbial resources that is available to the public.
Availability and requirements
The GCM database runs on a platform built on Java and a MySQL server. Catalogue information gathered from associated collections is centralized on WDCM servers, which are hosted at the Institute of Microbiology of the Chinese Academy of Sciences.
The BLAST program is used for sequence homology searches in the database (BLASTN 2.2.25). Multiple sequence alignments are performed using the ClustalW program (version 2.1). GCM is available at http://gcm.wfcc.info.
The GCM project was initiated by WDCM and approved by the WFCC board. WDCM acknowledges the contributions of all participating collections to the GCM project. At the time of writing, 52 collections from 25 countries had joined the effort.
- Satoru Miyazaki HS: Networking of biological resource centers: WDCM Experiences. Data Science Journal. 2002, 1 (2): 102-107.
- OECD Best Practice Guidelines for Biological Resource Centres. 2007, http://www.oecd.org/health/biotech/oecdbestpracticeguidelinesforbiologicalresourcecentres.htm
- Gams W, Hennebert GL, Stalpers JA, Janssens D, Schipper MA, Smith J, Yarrow D, Hawksworth DL: Structuring strain data for storage and retrieval of information on fungi and yeasts in MINE, the Microbial Information Network Europe. J Gen Microbiol. 1998, 134: 1667-1689.
- CABRI: Guidelines for Catalogue Production. Common Access to Biological Resources and Information (CABRI). 1998, http://www.cabri.org/guidelines/catalogue/CPcover.html
- Benson DA CM, Clark K, Karsch-Mizrachi I, Lipman DJ, Ostell J, Sayers EW: GenBank. Nucleic Acids Res. 2013, 41 (D1): D36-D42. 10.1093/nar/gks1195.
- Consortium TU: Reorganizing the protein space at the Universal Protein Resource (UniProt). Nucleic Acids Res. 2012, 40: 71-75.
- Rose PW, Beran B, Bi C, Bluhm WF, Dimitropoulos D, Goodsell DS, Prlic A, Quesada M, Quinn GB, Westbrook JD, Young J, Yukich B, Zardecki C, Berman HM, Bourne PE: The RCSB protein data bank: redesigned web site and web services. Nucleic Acids Res. 2011, 39: D392-401. 10.1093/nar/gkq1021.
- Cachuela-Palacio M: Towards an index of all known species: the Catalogue of Life, its rationale, design and use. Integrative Zoology. 2006, 1 (1): 418-421.
- Hiroshi M: Metagenome and Microbes Environmental Ontology. 2013, http://bioportal.bioontology.org/ontologies/ME
- Eamonn OT: Meeting report: hackathon-workshop on Darwin Core and MIxS standards alignment. Stand Genomic Sci. 2012, 7 (1): 166-170. 10.4056/sigs.3166513.
- Euzéby JP: List of bacterial names with standing in nomenclature: a folder available on the internet. Int J Syst Evol Microbiol. 1997, 47 (2): 590-592.
- Federhen S: The NCBI Taxonomy database. Nucleic Acids Res. 2012, 40 (1): D136-D143.
- Crous PW, Walter G, Stalpers JA, Vincent R, Gerrit S: MycoBank: an online initiative to launch mycology into the 21st century. Stud Mycol. 2004, 50: 19-22.
- Altschul SF GW, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. Journal of molecular biology. 1990, 215: 403-410.
- Larkin MA, Blackshields G, Brown NP, Chenna R, McGettigan PA, McWilliam H, Valentin F, Wallace IM, Wilm A, Lopez R, Thompson JD, Gibson TJ, Higgins DG: ClustalW and ClustalX version 2. Bioinformatics. 2007, 23 (21): 2947-2948. 10.1093/bioinformatics/btm404.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.