- Open Access
Supporting community annotation and user collaboration in the integrated microbial genomes (IMG) system
BMC Genomics volume 17, Article number: 307 (2016)
The exponential growth of genomic data from next generation technologies renders traditional manual expert curation effort unsustainable. Many genomic systems have included community annotation tools to address the problem. Most of these systems adopted a “Wiki-based” approach to take advantage of existing wiki technologies, but encountered obstacles in issues such as usability, authorship recognition, information reliability and incentive for community participation.
Here, we present a different approach, relying on a tightly integrated method rather than a “wiki-based” method, to support community annotation and user collaboration in the Integrated Microbial Genomes (IMG) system. The IMG approach allows users to use the existing IMG data warehouse and analysis tools to add gene, pathway and biosynthetic cluster annotations, to analyze and reorganize contigs, genes and functions using workspace datasets, and to share private user annotations and workspace datasets with collaborators. We show that the annotation effort using IMG can be part of the research process, which overcomes the user incentive and authorship recognition problems and thus fosters collaboration among domain experts. The usability and reliability issues are addressed by the integration of curated information and analysis tools in IMG, together with DOE Joint Genome Institute (JGI) expert review.
By incorporating annotation operations into IMG, we provide an integrated environment for users to perform deeper and extended data analysis and annotation in a single system that can lead to publications and community knowledge sharing as shown in the case studies.
Traditional genomic annotation relies heavily on manual expert curation [1, 2]. With the arrival of next generation technologies, genomic data grows exponentially while expert curation increasingly lags behind. Many systems have been developed to facilitate community-based curation to address this problem. Most of these systems are “wiki-based”; e.g., Gene Wiki and WikiGenes for gene annotations, RNA WikiProject and miRBase for RNA annotations, WikiProteins and TOPSAN for protein annotations, and WikiPathways for pathway annotations. In general, these systems use wiki technologies to create a wiki page or “stub” for each gene/protein/etc., remove potential duplicates, create hyperlinks to relevant information, and let registered users contribute community annotations. Even though wiki technologies enable community annotations, there are additional issues to be resolved:
Some wiki pages need to remain private for a short period of time before they are ready for publication.
The usability issue needs to be addressed to encourage community participation.
The level of community participation is not high due to lack of incentive.
Some areas require special expertise and therefore are not suitable for annotation by the community at large.
In order to address the above issues, various systems introduced mechanisms to track authorship, to limit editing to registered expert users, to incorporate expert review and validation, and to improve usability [4, 5, 7, 8, 10, 11]. Proposals have also been made to provide incentive and recognition of authorship [4, 5].
There are also non-wiki based systems such as ORegAnno and BioGPS, which are standalone curation systems implemented using database technology and a web-based user interface. Such systems avoid some problems of the wiki-based systems; however, they require more development effort to implement data storage and the user interface. They also still need to address issues such as information validation, usability and community participation, even though the integrated genomic information they provide helps improve usability and their registered users provide a base for community participation.
The exponential data growth problem has also been encountered in the Integrated Microbial Genomes (IMG) system. As of January 2016, IMG has more than 38,000 archaeal, bacterial and eukaryotic genomes, with more than 140 million genes in those genomes. Many genes are simply annotated as hypothetical proteins without more specific information. There is therefore an urgent need to support community annotation and user collaboration in IMG. To meet this need, MyIMG annotation, a tool that is tightly integrated into IMG and is not wiki-based, has been developed.
There are several reasons why we follow the integrated system approach rather than the wiki-based approach to support annotation and collaboration. First, wiki-based approaches require identifying and removing duplications so that there won’t be redundant and confusing web pages. All the above mentioned wiki-based systems only have thousands or tens of thousands of objects (genes, proteins, etc.), and it is not too difficult to identify duplications for removal. In contrast, IMG has more than 140 million isolate genes, and it is not trivial to identify non-redundant genes. Continuous loading of new data, and marking of older and redundant genome versions in IMG as obsolete, further complicate the problem. Second, IMG already provides many comparative analysis tools to aid users in finding additional information or locating potential “missing” genes that were overlooked by the gene calling pipelines. That is, IMG not only offers a place for users to add and share annotations but also provides tools to help users perform annotations. In addition, IMG provides tools for users to save genomes, genes, scaffolds and functions into Workspace datasets, and users can also use these Workspace functions to reorganize various objects (e.g., genes, functions) to suit their research needs. By incorporating annotation operations into IMG, we provide an integrated environment for users to perform extended data analysis and annotation in a single system that can lead to publications and community knowledge sharing as illustrated in the rest of this paper.
The integrated system approach does not automatically resolve all problems encountered by the wiki-based approach. We still need to address issues such as usability, authorship recognition and tracing, user incentive, and reliability even though existing IMG features already provide at least partial solutions. Moreover, IMG has more than 14,000 registered users from 93 countries as of January 2016, which provides a solid base for community participation. Involvement of DOE Joint Genome Institute (JGI) experts also helps with the reliability and certain usability issues.
The IMG data warehouse is a hybrid system consisting of Oracle 11g databases, SQLite databases and file systems. Access to the data warehouse is through the IMG user interface (UI).
More information regarding the IMG system can be found at: http://img.jgi.doe.gov/.
IMG user groups
Biologists analyzing genomic sequences usually work in groups. They wish to share private genomic annotations with collaborators until the research results are accepted for publication. After that, the annotations can be made available to the general public. Each scientist can participate in multiple research groups, and a group can consist of colleagues in the same field or in different fields, or of mentors and students. This effectively captures all four previously described collaboration types: peer-to-peer, mentor-student, interdisciplinary, and producer-consumer.
IMG started supporting user groups in 2007. Groups were created for JGI internal research groups, university professor and student groups, collaborative annotation jamborees, etc. Initially groups were created through email requests, and each user could belong to only one group. A recent extension allows users to create their own groups using the IMG user interface, and a user can now belong to multiple groups. There are currently more than 70 user groups (as of January 2016).
New features allow users to actively manage their groups and to share information for collaboration. A user can hold one of three roles in a group: (1) owner, (2) co-owner, who also has administrative privileges, and (3) member. Owners and co-owners can update the group description, add members to a group, or remove members from a group. Even though members cannot add or remove other members, they can decide to withdraw from a group.
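The role rules above can be summarized in a few lines of code. The following Python sketch is only an illustration of the described permissions; the names `Role`, `ADMIN_ROLES`, `can_manage_group` and `can_remove_member` are hypothetical and do not correspond to any IMG code or API.

```python
from enum import Enum


class Role(Enum):
    """The three group roles described in the text."""
    OWNER = "owner"
    CO_OWNER = "co-owner"
    MEMBER = "member"


# Assumption: owners and co-owners share the same administrative privileges.
ADMIN_ROLES = {Role.OWNER, Role.CO_OWNER}


def can_manage_group(role):
    """True if the role may update the description or add members."""
    return role in ADMIN_ROLES


def can_remove_member(actor_role, actor, target):
    """Admins may remove members; a plain member may only withdraw (remove self)."""
    if actor_role in ADMIN_ROLES:
        return True
    return actor == target
```

For example, `can_remove_member(Role.MEMBER, "carol", "carol")` models a member withdrawing from a group, while the same call with a different target is refused.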
All group members can post news to share with other group members. News can include notifications, new publications, links to shared documentation (e.g., Google Doc) for collaboration, etc. Members can also grant access permission of their private genomes to other group members. Group members can also share annotations and workspace data sets (to be described below).
IMG gene annotations
IMG has supported gene annotations since IMG ER 2.0, and new features are continuously being added. There are two types of gene annotations:
MyIMG gene annotation allows users to add additional information to existing genes or to make the genes obsolete. Each MyIMG gene annotation includes the following fields that can be manually edited: gene product name, gene symbol, description, enzyme EC number, Pubmed ID, notes, and whether the gene is marked deleted.
Missing gene annotation allows users to add new genes that have been missed by gene calling pipelines. Each missing gene annotation includes the following fields: gene product name, gene symbol, locus type (protein coding gene, tRNA, rRNA, etc.), locus tag, coordinates on a scaffold, strand, and enzyme EC number.
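To make the two record types concrete, the fields listed above can be written down as simple data structures. This is a minimal sketch: the class names, attribute names, the `gene_oid` and `scaffold_oid` identifiers, and the validation rules are assumptions made for illustration, not the IMG database schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MyIMGGeneAnnotation:
    """User-editable fields attached to an existing gene."""
    gene_oid: int  # hypothetical identifier of the annotated gene
    product_name: Optional[str] = None
    gene_symbol: Optional[str] = None
    description: Optional[str] = None
    ec_number: Optional[str] = None
    pubmed_id: Optional[str] = None
    notes: Optional[str] = None
    is_deleted: bool = False  # the "marked deleted" flag


@dataclass
class MissingGeneAnnotation:
    """A new gene that the gene calling pipeline missed."""
    product_name: str
    locus_type: str  # e.g. "protein coding gene", "tRNA", "rRNA"
    scaffold_oid: int  # hypothetical identifier of the scaffold
    start: int
    end: int
    strand: str
    gene_symbol: Optional[str] = None
    locus_tag: Optional[str] = None
    ec_number: Optional[str] = None

    def __post_init__(self):
        # Assumed sanity checks; the real system may validate differently.
        if self.strand not in ("+", "-"):
            raise ValueError("strand must be '+' or '-'")
        if self.start < 1 or self.start > self.end:
            raise ValueError("coordinates must satisfy 1 <= start <= end")
```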
Tools for finding candidate gene product names using function comparison, finding missing enzymes using KEGG pathways, and finding missing genes using Phylogenetic Profiler have already been described. Since then, new tools have continued to be added. Here, we describe comprehensive gene annotation methods using both existing and recently developed tools.
Sequence similarity based annotation
The most common way to acquire additional gene information is a sequence similarity search such as BLAST. If a gene g1 is found to match a better annotated gene g2, information on g2 such as gene product name, gene symbol, enzyme and protein information can be transferred to g1. Since it is very time consuming to check each gene of a genome using BLAST, IMG provides an analysis tool for large-scale gene comparison, the Phylogenetic Profiler, as shown in Fig. 1.
Phylogenetic Profiler allows users to find genes of a target genome with or without homologs in one or more closely related reference genomes. For those genes with homologs, functional annotation of homolog genes can be transferred to the genes. Potential missing genes can be identified for genes in reference genomes that do not have homologs in the target genome. After users identify potential missing genes, they can then go to MyIMG to add missing gene annotations of those genes. Similar to MyIMG annotations, missing gene information is private by default and can be shared among group members.
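The partition described above reduces to set operations once homolog pairs are known. The sketch below assumes the homolog pairs have already been computed (for example from precomputed similarity hits); the function name `profile_homologs` and its inputs are hypothetical and are not how IMG implements the Phylogenetic Profiler.

```python
def profile_homologs(target_genes, reference_genes, homolog_pairs):
    """Split genes by homolog presence.

    target_genes, reference_genes: sets of gene identifiers.
    homolog_pairs: set of (target_gene, reference_gene) tuples.
    Returns (target genes with homologs, target genes without homologs,
    reference genes without a homolog in the target genome).
    """
    target_hits = {t for t, _ in homolog_pairs}
    reference_hits = {r for _, r in homolog_pairs}
    with_homologs = target_genes & target_hits
    without_homologs = target_genes - target_hits
    # Reference genes with no counterpart hint at potential missing genes.
    candidate_missing = reference_genes - reference_hits
    return with_homologs, without_homologs, candidate_missing
```

The third returned set is only a list of candidates; as the text notes, each one still needs manual investigation before a missing gene annotation is added.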
The sequence similarity based approach, though simple, has its limitations. It relies on the availability of closely related reference genomes with better annotations. In addition, even though Phylogenetic Profiler provides a list of potential missing genes for investigation, it is still tedious and time consuming to go through the list. For better results, the sequence similarity based approach can be combined with the additional approaches described below.
Function based annotation
Users can annotate a gene with a more meaningful name (i.e., other than “hypothetical protein”) simply by checking the functional annotation of the same gene. For example, genes without a product name but with evidence of potential functional annotation, or with a product name but without any evidence of functional annotation, are candidates for product name review and curation.
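The review criterion above is a mismatch test between "has a meaningful name" and "has functional evidence". The following sketch expresses that test; treating an empty name or "hypothetical protein" as "no product name" is an assumption made here for illustration, and the function name is hypothetical.

```python
def needs_product_name_review(product_name, function_ids):
    """Flag a gene whose name and functional evidence disagree.

    product_name: str or None.
    function_ids: collection of function identifiers assigned to the gene
    (for example COG, Pfam or KO identifiers).
    """
    has_name = (
        bool(product_name)
        and product_name.strip().lower() != "hypothetical protein"
    )
    has_evidence = len(function_ids) > 0
    # A candidate is a gene where exactly one of the two is present.
    return has_name != has_evidence
```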
Function Profile is a widely used tool to check whether a set of functions is present in closely related genomes. Users can take advantage of various function categories in IMG to help gather a set of functions for running a profile. For example,
COG Category: A COG category consists of a set of COG functions.
Pfam Clan: A Pfam clan consists of a set of Pfam functions.
TIGRfam Role: A TIGRfam role consists of a set of TIGRfam functions.
KEGG Pathway: A KEGG pathway consists of a set of enzymes.
KEGG Module: A KEGG module consists of a set of KO terms.
IMG Pathway: An IMG pathway consists of a set of ordered reactions; each reaction is linked to one or more IMG terms.
IMG Parts List: An IMG parts list consists of a set of IMG terms of related function.
Figure 2 shows an example of using the IMG Parts List “Nodulation factor biosynthesis, export and regulation”, which contains 22 IMG terms for enzymes, transporters and regulators participating in biosynthesis and export of nodulation factors, to find missing IMG terms in Bradyrhizobium genomes.
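Conceptually, a function profile of this kind compares one function set against the functions observed in each genome and reports what is absent. The sketch below illustrates that comparison; the function name, the dictionary layout and the identifiers used in the usage note are placeholders, not real IMG term IDs or an IMG interface.

```python
def missing_functions(function_set, genome_functions):
    """Report which functions of a set are absent from each genome.

    function_set: set of function identifiers (e.g. the terms of a parts list).
    genome_functions: dict mapping genome name -> set of functions found in it.
    Returns a dict mapping genome name -> sorted list of missing functions.
    """
    return {
        genome: sorted(function_set - found)
        for genome, found in genome_functions.items()
    }
```

With a placeholder set `{"term_a", "term_b", "term_c"}` and a genome in which only `term_a` was found, the result for that genome is `["term_b", "term_c"]`, which is the list a curator would then examine for missed term assignments or missed genes.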
IMG also provides tools for users to investigate possible missing enzymes based on KEGG pathways, as shown previously. The tool uses both sequence similarity search and pre-computed gene-KO (KEGG Orthology) information in the database, which includes a list of genes not annotated with enzymes because the association did not meet the strict cutoff determined by the IMG data processing pipeline. Users can review the list and decide whether to add MyIMG gene-enzyme annotations using their professional judgment.
Even though the function for finding missing enzymes was introduced in 2009, it has not been widely used. We realize that with more than 38,000 archaeal, bacterial and eukaryotic genomes and 474 KEGG pathways in IMG, trying to find missing enzymes using the above tool is like finding a needle in a haystack. Therefore, we recently added functions (at the bottom of the View Map for Selected Genomes page) to show all genomes participating in the selected KEGG pathway, as well as potential genomes with missing enzymes, to help narrow down candidate genomes (see Fig. 3).
For many researchers, KEGG pathways are often too broad, and they’d rather rely on KEGG modules with more restricted focus. Therefore, we recently introduced colored KEGG module maps and finding missing functions using KEGG modules similar to what we have done for KEGG pathways. An example of finding genes missing KO terms is shown in Fig. 4.
IMG phenotype prediction and pathway assertion also provide a way for users to identify genes missing IMG term assignment. It has been shown previously that Burkholderia sp. SJ98 contains genes for chorismate synthesis. However, the genome does not have IMG Pathway 146 Chorismate synthesis asserted. The pathway assertion status is unknown due to missing IMG term 335 shikimate dehydrogenase (EC 1.1.1.25), even though there are ortholog genes annotated with this term. After a sequence similarity search, 2 genes were found to be potential candidates for the missing term assignment.
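The chorismate example can be mimicked with a simplified assertion check. The three-way rule below (all terms present, none present, some present) is an illustrative assumption and is not IMG's actual pathway assertion logic; only IMG term 335 comes from the text, and the other term numbers in the usage note are placeholders.

```python
def assertion_status(required_terms, genome_terms):
    """Return a simplified assertion status for one pathway in one genome."""
    found = required_terms & genome_terms
    if found == required_terms:
        return "asserted"
    if not found:
        return "not asserted"
    # Some, but not all, required terms are assigned in the genome.
    return "unknown"


def missing_terms(required_terms, genome_terms):
    """List the required terms that have no gene assignment in the genome."""
    return sorted(required_terms - genome_terms)
```

If a pathway requires terms `{335, 9001, 9002}` (the last two being placeholders) and a genome has only `{9001, 9002}` assigned, the status is `"unknown"` and the missing list is `[335]`, which points the curator to the term that needs a candidate gene.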
Another new tool in the gene detail page allows users to find the function distribution of other public genes in IMG with the same functional association of a particular gene. Users can then view those public genes with selected functional assignment to find a more meaningful name of the candidate gene (see Fig. 5).
Gene neighborhood based annotation
Gene neighborhood is another common tool used for gene annotations. Simply by looking at the gene neighborhood diagram, a user can sometimes tell whether a gene is too long or too short, and whether there are overlapping genes. A long intergenic region, or the presence of genes in reference genomes shown in the gene neighborhood, can also suggest the existence of missing genes. Expert users often rely on sequence visualization and analysis tools such as Artemis to identify missing genes.
An example of using gene neighborhood to aid MyIMG gene annotation is shown in Fig. 6.
Since microbial genes of related functions tend to be located close together on a scaffold, the gene neighborhood method can be combined with the function based method to find missing genes. Following the nodulation factor example in Fig. 2, when a genome is missing a function as shown in Function Profile, there are two possibilities: (i) one or more genes of the genome should have been annotated with this function, or (ii) the gene calling pipeline missed calling gene(s) for the function. In case (ii) a user can investigate intergenic regions near genes with related functions to spot potential missing genes (Fig. 2(iv)).
Spurious genes can be introduced by gene calling programs. When a gene neighborhood search with the same top COG hit returns no result, there is a possibility that the gene may not be real. When the gene neighborhood shows overlapping genes, it is also a good indication that one or more genes are incorrectly called. Domain experts can also identify erroneous genes by checking the sequence data. Spurious genes can be genes that are too long, too short, with an incorrect start codon, etc. IMG users can mark genes for deletion by creating MyIMG gene annotations with the “Remove Gene from Genome?” field set to Yes.
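Two of the neighborhood signals discussed above, overlapping genes and long intergenic regions, can be found by a single pass over gene coordinates. The sketch below is an illustration only: the function name, the 1-based inclusive coordinate convention and the 300 bp default threshold are assumptions, not IMG parameters.

```python
def scan_neighborhood(genes, min_gap=300):
    """Find overlapping genes and long intergenic regions on one scaffold.

    genes: list of (gene_id, start, end) with 1-based inclusive coordinates.
    min_gap: assumed minimum intergenic length (bp) worth inspecting.
    Returns (overlaps, gaps): overlaps is a list of (gene_id, gene_id) pairs,
    gaps is a list of (gap_start, gap_end) coordinate pairs.
    """
    ordered = sorted(genes, key=lambda g: (g[1], g[2]))
    overlaps, gaps = [], []
    if not ordered:
        return overlaps, gaps
    # Track the gene that extends furthest to the right so far.
    cover_id, _, cover_end = ordered[0]
    for gene_id, start, end in ordered[1:]:
        if start <= cover_end:
            overlaps.append((cover_id, gene_id))
        elif start - cover_end - 1 >= min_gap:
            gaps.append((cover_end + 1, start - 1))
        if end > cover_end:
            cover_id, cover_end = gene_id, end
    return overlaps, gaps
```

Overlaps are reported against the gene reaching furthest right, so a long gene that spans several shorter ones is paired with each of them. Strand is ignored here for brevity, and genes on opposite strands may legitimately overlap.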
The review of genes and their functional annotations may lead to the identification of missing genes. For every gene marked as deleted, it may be possible to identify one or more genes in the neighborhood. For example,
Gene too short: There may be a longer gene.
Gene too long: There may be one or more shorter genes.
Incorrect start codon: There can be a real gene downstream or upstream.
Sharing and comparing MyIMG annotations
If a user belongs to one or more IMG groups, then the user can view all MyIMG annotations by group members with the following restrictions:
The user must have access permission to the genomes. All MyIMG annotations on private genomes will only be visible to other group members that have access permission to the private genomes. Shared or public MyIMG annotations on public genomes are not restricted.
Those MyIMG annotations must be either public or shared by the authors of the annotations. (An author can selectively share MyIMG annotations with different groups; e.g., sharing with Group 1, but not with Group 2. In this case, Group 1 members can view the MyIMG annotations, but Group 2 members cannot.)
The user can only view, but not modify, MyIMG annotations by other group members.
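The three restrictions above combine into a small visibility rule. The sketch below is a hedged illustration: the `Annotation` class, its attributes and the two functions are hypothetical names, and `accessible_genomes` is assumed to contain every genome the user may access, public genomes included.

```python
from dataclasses import dataclass, field


@dataclass
class Annotation:
    author: str
    genome: str
    is_public: bool = False
    shared_groups: set = field(default_factory=set)  # groups the author shared with


def can_view(user, user_groups, accessible_genomes, annotation):
    """Apply the genome-access rule first, then the public/shared rule."""
    if annotation.genome not in accessible_genomes:
        return False
    if annotation.author == user or annotation.is_public:
        return True
    return bool(annotation.shared_groups & user_groups)


def can_modify(user, annotation):
    """Group members can view but not modify each other's annotations."""
    return annotation.author == user
```

The sketch does not model expert review by JGI staff, which is handled separately from group sharing.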
The “View Group Annotations” option in MyIMG allows a user to view shared annotations by group members. Many IMG users have used this feature for group annotations with colleagues. For users who belong to multiple IMG groups, there will be a dropdown selection for users to switch groups.
A recently introduced “Update Group Sharing” section allows users to change the group sharing option of selected annotations. There are two options for a user to share his/her MyIMG annotations:
sharing all MyIMG annotations of selected genome(s);
sharing individually selected MyIMG annotations.
MyIMG annotations are private by default. However, IMG users can release any of their MyIMG annotations to the public. Public MyIMG annotations are visible to all IMG users who have access permission to the corresponding genomes; public MyIMG annotations on public genomes can therefore be viewed by all users. Moreover, since IMG ER is an “Expert Review” site, all public missing gene annotations can be reviewed and modified by JGI experts.
The “Show All User Annotation” function in a Gene Detail page allows a user to view all MyIMG annotations available to him/her on this particular gene. All gene annotations together with curator names are listed in a table for easy comparison.
Some research areas require profound domain knowledge and are best left to expert annotation. Therefore, many systems have restricted certain editing capabilities to experts only. Poor annotations can lead to multiple dubious entries. This is especially a concern in IMG, where pathway assertion results are used to predict phenotypes. Hence, only JGI experts and a few external users with special permissions are allowed to perform IMG pathway curation (Fig. 7). (Interested users can contact us to request the pathway curation privilege.)
Due to the tremendous effort required to define a pathway, IMG currently has only 900+ IMG pathways in the database. All IMG pathways are public to all users.
Biosynthetic cluster annotations
Biosynthetic clusters and secondary metabolites (or natural products) are recent additions to the IMG system. Research interest in biosynthetic clusters and natural products is increasing. However, experimental data in this area are scarce. Among more than 1 million experimentally verified and predicted biosynthetic clusters in IMG, less than 0.2 % are associated with any secondary metabolites.
It is possible to associate biosynthetic clusters with secondary metabolites using sequence similarity search and pathway structures, as described in a previously published case study. Users can also use additional pathway or KEGG module information to discover or predict secondary metabolites as shown in an example in Fig. 8.
Users can use the “Add SM Annotation” function to annotate their discovery. Each MyIMG SM annotation includes a compound selection, NCBI accession and taxon information (if any) and free text comments. The annotation will remain private until the owner decides to release the information to the community at large. We hope that with the introduction of IMG-ABC and the new MyIMG SM annotation features, more community users will collaborate and contribute to advances in this research area.
Workspace for annotation and collaboration
IMG Workspace allows users to store their work in progress in four types of datasets: genome sets, scaffold sets, gene sets and function sets. Unlike analysis carts, which are transient and are deleted after each session, workspace datasets are stored permanently until users explicitly delete them. Workspace can be considered a form of electronic notebook for genomic data.
Besides data storage, workspace can also be used for additional user annotation and collaboration that cannot be achieved using MyIMG annotations alone. These additional functions are described below.
Workspace scaffold sets for organizing new genomes or metagenomes
Metagenome “binning” involves isolating certain scaffolds from a metagenomic dataset in order to remove contamination or to extract isolate genomes or single cells from the metagenome. Many IMG analysis tools, such as Phylogenetic Distribution, Kmer Frequency and Function Profile, have been widely used for metagenome binning. Isolated or de-contaminated scaffolds can be saved into workspace scaffold sets, which can be further investigated using additional analysis tools provided by IMG. Users can also export the nucleotide sequences of scaffolds in a particular dataset and resubmit them to IMG as a new genome or metagenome.
Workspace gene sets for creating new biosynthetic clusters
The IMG-ABC system includes more than 1 million experimentally verified and predicted biosynthetic clusters. Each biosynthetic cluster detail page provides additional information showing secondary metabolites associated with the cluster and pathway participation of genes in the cluster. The KEGG Map display of a biosynthetic cluster shows not only genes of this cluster but also other genes in the genome that are not in the cluster. In this way, a user can see clearly how well a cluster covers a pathway. An example in Fig. 9 shows a biosynthetic cluster that covers only a portion of a pathway, while a new cluster with additional genes upstream and downstream would cover the entire pathway.
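The coverage question that the KEGG Map display answers visually can also be phrased as a set computation. The following sketch is an assumption-laden illustration: the function name, the input layout and the identifiers in the usage note are placeholders, and it is not how IMG renders or computes the map.

```python
def pathway_coverage(pathway_functions, gene_functions, cluster_genes):
    """Split pathway functions by where they are found in a genome.

    pathway_functions: set of function identifiers that make up the pathway.
    gene_functions: dict mapping gene id -> set of functions, for the genome.
    cluster_genes: set of gene ids that belong to the biosynthetic cluster.
    Returns (covered by cluster genes, covered only by genes outside the
    cluster, not found in the genome at all).
    """
    in_cluster, outside = set(), set()
    for gene, functions in gene_functions.items():
        hits = functions & pathway_functions
        if gene in cluster_genes:
            in_cluster |= hits
        else:
            outside |= hits
    outside_only = outside - in_cluster
    absent = pathway_functions - in_cluster - outside
    return in_cluster, outside_only, absent
```

Functions in the second returned set mark genes that a user might consider adding when defining an extended cluster, provided they lie on the same scaffold near the existing cluster.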
A workspace gene set can be used as a tool for users to annotate their own biosynthetic clusters. A user can start by loading genes of an experimentally verified or predicted biosynthetic cluster into Gene Cart. In addition to biosynthetic clusters that are already in IMG, it is also possible for users to find genes mentioned in the literature but not included in any IMG gene clusters. Additional genes on the same scaffold can be added based on analysis results from various IMG tools such as the KEGG Map display example described above. It is also possible to detect genes that should have been excluded. The final analysis result can be saved as a workspace gene set with a meaningful name. A Genbank-format file can be generated to include all genes in the final result, and the file can be submitted to IMG as a new genome fragment. Alternatively, a user can obtain a portion of the scaffold based on gene coordinates and then submit the sequence to IMG as a new genome fragment.
Workspace function sets for defining pathways
For IMG users who are not interested in active pathway curation or who do not have the curation permission, it is still possible to “make their own pathway” using workspace function sets. A user can start by studying an IMG pathway, a MetaCyc pathway, a KEGG pathway or a KEGG Module to collect all or some of the functions (IMG terms, KO terms or enzymes). The user can then check pathway assertion of various genomes using the Genome Set Function Profile in Workspace or using the Function-Genome Profile provided in Analysis Cart (see Fig. 10). This tool enables users to try out pathway construction that is not limited to a single type of function (e.g., IMG terms or enzymes only) and can contribute to future pathway curation.
Shared workspace for collaboration
Ever since the introduction of Workspace to the IMG system in 2011, more than 8.7 % of all IMG registered users have used Workspace to store more than 77,000 datasets for their analysis (as of January 2016). We have described how IMG users can use workspace scaffold sets or gene sets to perform “above gene level” annotations to form new genomes or biosynthetic clusters, and use workspace function sets to make their own pathways. To perform group annotation, users can share their workspace datasets with their colleagues. Previously workspace dataset sharing was achieved through exporting and importing datasets, which is not only cumbersome but also does not support interactive analysis. We have recently introduced dynamic workspace dataset sharing within IMG groups to overcome the obstacle.
If a user belongs to one or more IMG groups, then the dataset list in Workspace will have 2 additional columns:
Owner: the owner of a dataset (either “me” or name of a group member);
Shared with Group: IMG group(s) having access to this dataset.
All Workspace datasets, including genome, gene, scaffold and function sets, can be shared. Users can not only view the content of a shared dataset but also use shared datasets to perform analysis. However, users do not have editing privileges on others’ datasets; that is, they cannot edit or delete a dataset that they do not own. Instead, a user can copy the content of a shared dataset and then edit his/her own copy.
With the introduction of workspace dataset sharing, users can now work together with colleagues on metagenome binning, biosynthetic cluster and pathway studies. They can dynamically compare results using workspace profile functions or set operations to reach a group consensus and then resubmit the final datasets to IMG, where they can be shared with the community at large.
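The set operations mentioned above can be illustrated with a short comparison of gene sets from several group members. This is a minimal sketch; the function name `compare_gene_sets` and the dictionary layout are hypothetical and do not describe the Workspace implementation.

```python
def compare_gene_sets(member_sets):
    """Compare gene sets contributed by different group members.

    member_sets: dict mapping member name -> set of gene identifiers.
    Returns (consensus, disputed): consensus is the set of genes selected by
    every member; disputed maps each remaining gene to the sorted list of
    members who selected it.
    """
    all_sets = list(member_sets.values())
    union = set().union(*all_sets)
    consensus = set.intersection(*all_sets) if all_sets else set()
    disputed = {
        gene: sorted(m for m, genes in member_sets.items() if gene in genes)
        for gene in union - consensus
    }
    return consensus, disputed
```

The disputed genes are the ones a group would discuss before agreeing on a final dataset for resubmission.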
Results and discussion
For many years IMG users have used the community annotation and user collaboration features in IMG for their research, leading to information sharing and publications. We describe three selected use cases in this section. None of the cases uses all the features provided in IMG because certain features are irrelevant to the research and/or some new features were not available at the time of the work (e.g., shared workspace was not available until June 2015). Therefore, we also present a mock scenario at the end of this section to showcase how new IMG users can benefit the most.
Neha Potnis (University of Florida) and 11 colleagues formed an IMG group for their Xanthomonas research. Three Xanthomonas genomes were submitted to IMG in 2009. The group used Xanthomonas campestris pv. vesicatoria 85-10 as a reference for annotation and used the gene neighborhood feature to add MyIMG annotations. They also identified genes with incorrect start codons, and those genes were marked obsolete. The function for finding missing genes was used frequently to add genes that the gene calling pipeline had failed to identify. They also found actual genes coded on opposite strands. The group commented that the IMG platform allowed them to work collaboratively, with scientists with expertise in different virulence systems annotating the respective genes/clusters of their interest. After the three genomes had been properly annotated, new Genbank files were generated and submitted to NCBI. The new versions were added back into IMG in 2011:
Xanthomonas gardneri PDDCC 1620, ATCC 19865 (IMG Taxon OID: 651324109)
Xanthomonas perforans 91-118 (IMG Taxon OID: 651324110)
Xanthomonas vesicatoria Maraite, ATCC 35937 (IMG Taxon OID: 651324111)
Their research led to a publication, which is also listed in the genome detail page of the above three genomes.
Zhe Lyu and his advisor (China Agricultural University) formed a group to study three Methanocella genomes. Their study focused on annotating genes identified by gene calling rather than adding missing genes. They used gene neighborhood analysis to find operons, and used gene similarity search to find top homologs, orthologs and paralogs. Zhe also used third party tools to build phylogenetic trees using results from IMG. The results from these methods assisted his MyIMG annotations. Zhe’s MyIMG annotations on the following two genomes were released to the public and can be viewed by all IMG users:
Methanocella arvoryzae MRE50 (reannotation) (IMG Taxon OID: 2505679073)
Methanocella paludicola SANAE (reannotation) (IMG Taxon OID: 2505679075)
Ann Hirsch (UCLA), colleagues and students have formed one of the largest IMG groups so far (more than 20 people). Their annotation effort has been running since 2009 and is still continuing. Because of the long involvement, this group also used the most community annotation and user collaboration features. Many IMG group features were inspired by their needs.
The group has used BLAST, homolog searches, and various comparative analysis tools provided by IMG to assist MyIMG annotations. Gene neighborhood analysis was used to check conservation of genes among different bacteria. Additional enzyme information was found using the finding missing enzyme function. The results aided their MyIMG annotations.
In addition to annotating existing genes, some genes were marked for removal from the genomes when there’s good evidence, mostly resulting from not finding similar genes using neighborhood searches. New genes were added using the finding missing gene function provided by IMG as well as by checking gene neighborhoods and using sequence similarity searches.
Their research results have been accepted for publication in many papers, including one listed as a genome publication of Burkholderia tuberum STM678 (Burkholderia tuberum STM678T (IHQD assembly)) (IMG Taxon OID: 2512047030).
Recently the group also started experimenting with additional new features such as workspace in assisting group collaboration. There is also a plan to release MyIMG annotations upon the acceptance of their papers.
We present here a mock scenario to showcase how users can benefit the most from the available IMG community annotation and user collaboration features.
A principal investigator (PI) first creates an IMG group to include all collaborators. Some collaborators can be assigned the role of co-owner to help with group administration. All relevant genomes and metagenomes for the research project can be saved in one or more workspace genome sets to be shared with all group members. The PI can also use the grant genome permission feature in the IMG group to grant group members access to private genomes and metagenomes. A news item can be posted to inform members of the shared workspace genome sets to work on. A hyperlink to shared documents (e.g., Google Doc) can also be included in the news.
Group members then start researching on the genomes listed in the shared workspace genome datasets. They can save genes of interest to various workspace gene sets to be shared with other group members. Users can load contents of private or shared gene sets into Gene Cart and use a plethora of tools provided in IMG for analysis. They can add MyIMG gene annotations for gene product names, missing enzymes and additional protein information. They can also use sequence visualization tools and phylogenetic profiler to discover potential missing genes in an isolate genome. Gene neighborhood search results can be used to add additional annotations, to spot spurious genes, or to find new missing genes. Workspace gene sets and MyIMG annotations can be shared among group members so that users can compare results. Users can also view MyIMG gene annotations highlighted in a KEGG pathway map. If the research involves biosynthetic clusters and secondary metabolites (or natural products), then MyIMG SM annotations can also be added.
Once the research is complete, the PI can consolidate all MyIMG annotations. IMG provides a function for users to include MyIMG gene annotation to generate a Genbank file of an isolate genome. A user can then review and revise the Genbank file for new submission (e.g., to the IMG submission system). After the research result is accepted for publication, PI can make private genomes and corresponding MyIMG gene annotations public to be shared with community at large. New publications can be added to corresponding projects in the Genomes OnLine Database (GOLD) , and the publication information will be available to all IMG users from the genome detail pages.
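The mock scenario can be summarized as a small data model. The following is only an illustrative sketch in Python, not IMG's implementation or database schema: the class names (`Group`, `Annotation`), the method names, and the rule that only the group owner releases annotations are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Annotation:
    """A MyIMG-style gene annotation tied to its author (hypothetical model)."""
    gene_id: str
    author: str
    product_name: str
    is_public: bool = False  # private until released


@dataclass
class Group:
    """Hypothetical stand-in for an IMG group; not the actual IMG schema."""
    owner: str  # the PI
    co_owners: Set[str] = field(default_factory=set)
    members: Set[str] = field(default_factory=set)
    shared_sets: Dict[str, Set[str]] = field(default_factory=dict)
    annotations: List[Annotation] = field(default_factory=list)

    def _in_group(self, user: str) -> bool:
        return user == self.owner or user in self.members

    def add_member(self, user: str, co_owner: bool = False) -> None:
        self.members.add(user)
        if co_owner:
            self.co_owners.add(user)

    def share_set(self, user: str, name: str, ids: List[str]) -> None:
        # Any group member may share a workspace genome or gene set.
        if not self._in_group(user):
            raise PermissionError(f"{user} is not in the group")
        self.shared_sets[name] = set(ids)

    def annotate(self, user: str, gene_id: str, product_name: str) -> None:
        if not self._in_group(user):
            raise PermissionError(f"{user} is not in the group")
        self.annotations.append(Annotation(gene_id, user, product_name))

    def visible_annotations(self, user: str) -> List[Annotation]:
        # Group members see everything; others see only released annotations.
        in_group = self._in_group(user)
        return [a for a in self.annotations if a.is_public or in_group]

    def release(self, user: str) -> None:
        # Assumption for this sketch: only the PI releases annotations.
        if user != self.owner:
            raise PermissionError("only the group owner can release")
        for a in self.annotations:
            a.is_public = True


# Illustrative usage with made-up user, genome and gene identifiers.
group = Group(owner="pi")
group.add_member("alice", co_owner=True)
group.add_member("bob")
group.share_set("pi", "project_genomes", ["genome_A", "genome_B"])
group.annotate("bob", "gene_1", "example product name")
group.release("pi")  # after the paper is accepted
```

The sketch keeps annotations private to the group by default and flips them to public in a single release step, mirroring the order of events in the scenario: share data, annotate, then release on publication.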
In this paper we present IMG features that support community annotation and user collaboration. IMG users can create IMG user groups to share genomes, user annotations and workspace datasets. They can also use various analysis and annotation tools in IMG to assist their research as described in detail in the implementation section. Case studies in the results section show that annotation can be part of the research process leading to knowledge sharing and academic publications.
We are able to address various issues encountered by other genomic annotation systems as follows:
Usability: IMG provides integrated genomic information and various analysis tools to help users with their research and investigation. IMG users can not only add annotations to genes, but also perform metagenome binning and form new gene clusters using tools provided by IMG, which is difficult to achieve using only a wiki-based system.
Authorship recognition and tracing: IMG provides author recognition by linking annotations to users. Users can compare gene annotations of different authors in a list display, which is much easier and clearer than tracing through many versions of document editing. Publications (information obtained through GOLD) are listed in genome detail pages.
User incentive: IMG annotation can be part of a research process as described in this paper. Users can incorporate their annotation results in the new version of genomes or simply release their existing private annotations upon the acceptance of their research paper. Since it requires minimal additional effort, we believe that users are more willing to participate.
Reliability: Annotations are linked to real users. Genome detail pages list publications that have gone through strict peer review. Moreover, JGI experts are closely involved with public annotations, which greatly improves the reliability of the information. Help from JGI experts was acknowledged in many user publications.
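The authorship point above, comparing annotations of the same gene by different authors in a single list, can be illustrated with a short sketch. This is an assumption-laden example rather than IMG code: the function names, the record layout `(gene_id, author, product_name)`, and the sample identifiers are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Record = Tuple[str, str, str]  # (gene_id, author, product_name), hypothetical


def compare_by_author(records: Iterable[Record]) -> Dict[str, Dict[str, str]]:
    """Group annotation records by gene, keeping one entry per author."""
    by_gene: Dict[str, Dict[str, str]] = defaultdict(dict)
    for gene_id, author, product_name in records:
        by_gene[gene_id][author] = product_name
    return dict(by_gene)


def conflicting_genes(by_gene: Dict[str, Dict[str, str]]) -> List[str]:
    """Return the genes whose authors proposed different product names."""
    return sorted(
        gene_id
        for gene_id, names in by_gene.items()
        if len(set(names.values())) > 1
    )


# Made-up records: two authors agree on gene_1 and disagree on gene_2.
records = [
    ("gene_1", "alice", "product X"),
    ("gene_1", "bob", "product X"),
    ("gene_2", "alice", "product Y"),
    ("gene_2", "bob", "hypothetical protein"),
]
table = compare_by_author(records)
disagreements = conflicting_genes(table)
```

Because every record carries its author, a side-by-side view like `table` attributes each proposal directly, without reconstructing authorship from a document's edit history.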
Unlike most genomic annotation systems, which focus on only one type of data (e.g., gene or pathway), IMG provides an integrated environment with genomes, genes, functions, pathways, etc. So far we have only provided user annotation features for genes (product name, protein information and enzyme), pathways (for experts only), biosynthetic clusters and secondary metabolites. Apart from resource limitations, nothing prevents us from extending annotation capabilities to more gene features and/or to other types of objects in IMG. This will be an area for future improvement.
Consent to publish
Availability of data and material
IMG/ER (Expert Review) is available at http://img.jgi.doe.gov/mer. Browser requirements are described in the implementation section of this paper. IMG is committed to providing scientists worldwide with free support for genome and metagenome data annotation and integration, and with open access comparative analysis of integrated genomes and metagenomes. IMG users need to register at JGI Single Sign On (JGI SSO) to obtain a login and password for access to IMG's data content and analysis tools (free of charge). Logins/passwords allow users to (i) submit their own genomes/metagenomes and keep them "private" for up to two years while they review and revise annotations; (ii) employ IMG's curation tools for identifying and correcting annotation anomalies, such as protein products, for both private and public genomes, with annotation revisions recorded/saved in user-specific "MyIMG" files on IMG's file system; (iii) employ IMG's Workspace, which supports a persistent version of IMG's "Carts" and long-running analysis computations; (iv) download IMG genome and metagenome datasets via JGI's Portals.
COG: Clusters of Orthologous Groups
IMG: Integrated Microbial Genomes system
JGI: DOE Joint Genome Institute
KEGG: Kyoto Encyclopedia of Genes and Genomes
Kyrpides NC, Ouzounis CA. Whole-genome sequence annotation: ‘Going wrong with confidence’. Mol Microbiol. 1999;32(4):886–7.
Kyrpides NC. Fifteen years of microbial genomics: meeting the challenges and fulfilling the dream. Nat Biotechnol. 2009;27(7):627–32. doi:10.1038/nbt.1552.
Stephens ZD, Lee SY, Faghri F, Campbell RH, Zhai C, Efron MJ, Iyer R, Schatz MC, Sinha S, Robinson GE. Big Data: Astronomical or Genomical? PLoS Biol. 2015;13(7):e1002195. doi:10.1371/journal.pbio.1002195.
Howe D, Costanzo M, Fey P, Gojobori T, Hannick L, Hide W, Hill DP, Kania R, Schaeffer M, St Pierre S, Twigger S, White O, Rhee SY. Big data: The future of biocuration. Nature. 2008;455:47–50.
Huss III JW, Orozco C, Goodale J, Wu C, Batalov S, Vickers TJ, Valafar F, Su AI. A Gene Wiki for Community Annotation of Gene Function. PLoS Biol. 2008;6(7):e175. doi:10.1371/journal.pbio.0060175.
Hoffmann R. A wiki for the life sciences where authorship matters. Nat Genet. 2008;40(9):1047–51. doi:10.1038/ng.f.217.
Daub J, Gardner PP, Tate J, Ramskold D, Manske M, Scott WG, Weinberg Z, Griffiths-Jones S, Bateman A. The RNA WikiProject: Community annotation of RNA families. RNA. 2008;14:2462–4. doi:10.1261/rna.1200508.
Kozomara A, Griffiths-Jones S. miRBase: integrating microRNA annotation and deep-sequencing data. Nucleic Acids Res. 2010:1–6. doi:10.1093/nar/gkq1027.
Mons B, Ashburner M, Chichester C, van Mulligen E, Weeber M, den Dunnen J, van Ommen GJ, Musen M, Cockerill M, Hermjakob H, Mons A, Packer A, Pacheco R, Lewis S, Berkeley A, Melton W, Barris N, Wales J, Meijssen G, Moeller E, Roes PJ, Borner K, Bairoch A. Calling on a million minds for community annotation in WikiProteins. Genome Biol. 2008;9:R89. doi:10.1186/gb-2008-9-5-r89.
Weekes D, Krishna SS, Bakolitsa C, Wilson IA, Godzik A, Wooley J. TOPSAN: a collaborative annotation environment for structural genomics. BMC Bioinformatics. 2010;11:426. doi:10.1186/1471-2105-11-426.
Kelder T, van Iersel MP, Hanspers K, Kutmon M, Conklin BR, Evelo CT, Pico AR. WikiPathways: building research communities on biological pathways. Nucleic Acids Res. 2012;40(Database issue):D1301–1307. doi:10.1093/nar/gkr1074.
Griffith OL, Montgomery SB, Bernie B, Chu B, Kasaian K, Aerts S, Mahony S, Sleumer MC, Bilenky M, Haeussler M, Griffith M, Gallo SM, Giardine B, Hooghe B, Loo PV, Blanco E, Ticoll A, Lithwick S, Portales-Casamar E, Donaldson IJ, Robertson G, Wadelius C, De Bleser P, Vlieghe D, Halfon MS, Wasserman W, Hardison R, Bergman CM, Jones SJM and The Open Regulatory Annotation Consortium. ORegAnno: an open-access community-driven resource for regulatory annotation. Nucleic Acids Res. 2008;36(Database issue):D107–13. doi:10.1093/nar/gkm967.
Wu C, Orozco C, Boyer J, Leglise M, Goodale J, Batalov S, Hodge CL, Haase J, Janes J, Huss III JW, Su AI. BioGPS: an extensible and customizable portal for querying and organizing gene annotation resources. Genome Biol. 2009;10:R130. doi:10.1186/gb-2009-10-11-r130.
Kouzes RT, Myers JD, Wulf WA. Collaboratories: Doing Science on the Internet. IEEE Computer. 1996;29:40–6.
Markowitz VM, Mavromatis K, Ivanova NN, Chen IMA, Chu K, Kyrpides NC. IMG ER: a system for microbial annotation expert review and curation. Bioinformatics. 2009;25(17):2271–8.
Tatusov RL, Fedorova ND, Jackson JD, Jacobs AR, Kiryutin B, Koonin EV, Krylov DM, Mazumder R, Mekhedov SL, Nikolskaya AN, Rao BS, Smirnov S, Sverdlov AV, Vasudevan S, Wolf YI, Yin JJ, Natale DA. The COG database: an updated version includes eukaryotes. BMC Bioinformatics. 2003;4:41.
Finn RD, Bateman A, Clements J, Coggill P, Eberhardt RY, Eddy SR, Heger A, Hetherington K, Holm L, Mistry J, Sonnhammer ELL, Tate J, Punta M. Pfam: the protein families database. Nucleic Acids Res. 2013:1–9. doi:10.1093/nar/gkt1223.
Selengut JD, Haft DH, Davidsen T, Ganapathy A, Gwinn-Giglio M, Nelson WC, Richter AR, White O. TIGRFAMs and Genome Properties: tools for the assignment of molecular function and biological process in prokaryotic genomes. Nucleic Acids Res. 2007;35:D260–4.
Kanehisa M, Goto S, Sato Y, Kawashima M, Furumichi M, Tanabe M. Data, information, knowledge and principle: back to metabolism in KEGG. Nucleic Acids Res. 2014;42:D199–205.
Anderson I, Chen IA, Markowitz VM, Kyrpides NC, Ivanova NN. SOP for pathway inference in Integrated Microbial Genomes (IMG). Stand Genomic Sci. 2011;5(3):420–3.
Kumar S, Vikram S, Raghava GPS. Genome Sequence of the Nitroaromatic Compound-Degrading Bacterium Burkholderia sp. Strain SJ98. J Bacteriol. 2012;194(12):3286.
Rutherford K, Parkhill J, Crook J, Horsnell T, Rice P, Rajandream MA, Barrell B. Artemis: sequence visualization and annotation. Bioinformatics. 2000;16(10):944–5.
Pati A, Ivanova NN, Mikhailova N, Ovchinnikova G, Hooper SD, Lykidis A, Kyrpides NC. GenePRIMP: a gene prediction improvement pipeline for prokaryotic genomes. Nat Methods. 2010;7(6):455–7.
Chen IMA, Markowitz VM, Chu K, Anderson I, Mavromatis K, Kyrpides NC, Ivanova NN. Improving microbial genome annotations in an integrated database context. PLoS One. 2013;8(2), e54859.
Hadjithomas M, Chen IMA, Chu K, Ratner A, Palaniappan K, Szeto E, Huang J, Reddy TBK, Cimermancic P, Fischbach MA, Ivanova NN, Markowitz VM, Kyrpides NC, Pati A. IMG-ABC: A knowledge base to fuel discovery of biosynthetic gene clusters and novel secondary metabolites. mBio. 2015;6(4):e00932–15. doi:10.1128/mBio.00932-15.
Markowitz VM, Chen IMA, Chu K, Szeto E, Palaniappan K, Pillay M, Ratner A, Huang J, Pagani I, Tringe S, Huntemann M, Billis K, Varghese N, Tennessen K, Mavromatis K, Pati A, Ivanova NN, Kyrpides NC. IMG/M 4 version of the integrated metagenome comparative analysis system. Nucleic Acids Res. 2013;42(D1):D568–573.
Potnis N, Krasileva K, Chow V, Almeida NF, Patil PB, Ryan RP, Sharlach M, Behlau F, Dow JM, Momol MT, White FF, Preston JF, Vinatzer BA, Koebnik R, Setubal JC, Norman DJ, Staskawicz BJ, Jones JB. Comparative genomics reveals diversity among xanthomonads infecting tomato and pepper. BMC Genomics. 2011;12:146. doi:10.1186/1471-2164-12-146.
Lu Z, Lu Y. Complete genome sequence of a thermophilic methanogen, Methanocella conradii HZ254, isolated from Chinese rice field soil. J Bacteriol. 2012;194(9):2398–9. doi:10.1128/JB.00207-12.
Lyu Z, Lu Y. Comparative genomics of three Methanocellales strains reveal novel taxonomic and metabolic features. Environ Microbiol Rep. 2015;7(3):526–37. doi:10.1111/1758-2229.
Angus AA, Agapakis CM, Fong S, Yerrapragada S, Estrada-de los Santos P, Yang P, Song N, Kano S, Caballero-Mellado J, de Faria SM, Dakora FD, Weinstock G, Hirsch AM. Plant-Associated Symbiotic Burkholderia Species Lack Hallmark Strategies Required in Mammalian Pathogenesis. PLoS One. 2014;9(1):e83779. doi:10.1371/journal.pone.0083779.
Reddy TBK, Thomas AD, Stamatis D, Bertsch J, Isbandi M, Jansson J, Mallajosyula J, Pagani I, Lobos EA, Kyrpides NC. The Genomes OnLine Database (GOLD) v.5: A metadata management system based on a four level (meta)genome project classification. Nucleic Acids Res. 2014;43:D1099–1106.
We would like to thank Neha Potnis (University of Florida), Zhe Lyu (China Agricultural University) and Ann Hirsch (UCLA) for agreeing to participate in the use case studies. They shared their experiences by providing detailed information on how they annotate genes, spot spurious genes, and identify missing genes. Their help greatly improved the content of this paper.
Director, Office of Science, Office of Biological and Environmental Research, Life Sciences Division, U.S. Department of Energy (Contract No. DE-AC02-05CH11231); Office of Science of the U.S. Department of Energy (Contract No. DE-AC02-05CH11231, resources of the National Energy Research Scientific Computing Center). Funding for open access charge: University of California.
The authors declare that they have no competing interests.
NCK, NNI and VMM conceived the original idea of supporting user annotations in IMG. IAC and KC proposed the idea of supporting workspace in IMG. IAC, KC, ES, JH and AR implemented the IMG UI features and workspace. KP designed the database schema. IAC, KP, ES, MP and MHu developed and maintained the IMG data warehouse and data loading pipelines. NCK, NNI, NM and GO contributed to gene and pathway annotations, and MHa and IAC contributed to biosynthetic cluster and secondary metabolite annotations. IAC conducted the user case studies and wrote the paper with improvement suggestions and revisions from VMM, KP and NCK. VMM and NCK supervised the IMG project. All authors read and approved the manuscript.