Overview of the CMRegNet system
The CMRegNet system is a database for transcriptional gene regulatory interactions of 20 different strains (2 model organisms, 18 target organisms) of the genera Corynebacterium and Mycobacterium. The system integrates several bioinformatics data analysis procedures and information from different sources in order to provide the user with all relevant information on regulatory interactions, including the binding sites, protein sequences, gene annotations, and the genomic context of the regulation. The database itself runs on a MySQL 5.5 community server. The web service of CMRegNet is written in PHP (version 5.2.1) and delivered by an Apache web server (version 2.2.22).
As mentioned in the introduction, known regulations are scarce and limited to a handful of model organisms. Thus, one key aspect of CMRegNet is the automated transfer of evolutionarily conserved regulations from these model organisms to the so-called target organisms. For CMRegNet we use the same transfer pipeline that was already successfully applied in CoryneRegNet [8–12] and MycoRegNet [13] to predict evolutionarily conserved regulations. CMRegNet may be regarded as the successor of the discontinued MycoRegNet [13], but it was significantly extended: (1) we utilized ChIP-Seq data of Mycobacterium tuberculosis H37Rv to obtain a comprehensive list of binding sites, allowing M. tuberculosis to act as a model organism for the network transfer, and (2) in contrast to comparable systems, CMRegNet bases the transfer of evolutionarily conserved regulations on two model organisms (one from Corynebacterium and one from Mycobacterium). This improves the predicted regulations both quantitatively and qualitatively.
In order to transfer a regulation from a model organism to a target organism, we defined a simplified model of a gene regulation: a regulation requires three main components, namely the transcription factor, the target gene, and the corresponding binding site in the upstream region of the target gene. We consider a regulation evolutionarily conserved if, in the target organism, (1) the transcription factor is conserved, (2) the target gene is conserved, and (3) the target gene possesses the binding site for the transcription factor in its upstream region.
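The three criteria can be illustrated with a minimal Python sketch. It is not the pipeline used for CMRegNet; the class `Regulation`, the function `transfer_regulation`, the homolog lookup table, and the binding-site predicate are all hypothetical names introduced here for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Regulation:
    regulator: str  # transcription factor gene in the model organism
    target: str     # regulated target gene in the model organism


def transfer_regulation(reg, homolog_in_target, has_binding_site):
    """Return the predicted (regulator, target) pair in the target organism, or None.

    homolog_in_target: dict mapping a model-organism gene to its homolog in
                       the target organism (key absent if not conserved).
    has_binding_site:  callable (model_regulator, target_gene) -> bool that
                       reports whether the upstream region of target_gene
                       contains a binding site for the regulator.
    """
    tf = homolog_in_target.get(reg.regulator)   # criterion (1)
    gene = homolog_in_target.get(reg.target)    # criterion (2)
    if tf is None or gene is None:
        return None
    if not has_binding_site(reg.regulator, gene):  # criterion (3)
        return None
    return tf, gene
```

A regulation is transferred only when all three checks succeed; failing any single one discards the candidate.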
In order to detect conserved (i.e., homologous) genes, we performed a homology detection based on clustering of the protein sequences, so that the reported clusters form the groups of homologous proteins. In general, homology detection with a clustering tool requires a similarity measure between the proteins and a meaningful parameter setting for the clustering tool. For CMRegNet we used transitivity clustering (TransClust) and followed the approach described in [14], which suggests using a BLAST all-vs.-all run on the protein sequences with an E-value cut-off of 10 as similarity function. The threshold (the parameter of TransClust) was selected following the suggestions in [15]. In that study, the authors developed a measure for judging the quality of a clustering for homology detection by evaluating two aspects of the cluster-size distribution: (1) the number of genes in the core genome (genes shared by all organisms) and (2) the number of unrealistically large clusters (which most likely contain false positives). The idea is to find the threshold that maximizes (1) while minimizing (2). In the original study, the authors suggest picking a threshold between 34 and 61 for mycobacteria and between 27 and 53 for corynebacteria. For CMRegNet, we decided to use a rather non-stringent threshold of 30, which lies roughly midway between the lower bounds of the two suggested ranges (27 and 34). We did so because (1) we have proteins from both genera, Corynebacterium and Mycobacterium, and (2) the homology detection is only one of three criteria (as described above) for the prediction of an evolutionarily conserved regulation. Thus, we are convinced that this threshold does not increase the false-positive rate while providing a large basis of potentially homologous proteins for the regulation transfer.
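The two quantities evaluated on the cluster-size distribution can be computed as in the following sketch. It does not reproduce the quality measure of [15]; it only counts the two aspects named above. The function name `cluster_summary` and the size limit `max_size` that defines an "unrealistically large" cluster are assumptions made for this example.

```python
def cluster_summary(clusters, organisms, max_size):
    """Count core-genome clusters and oversized clusters.

    clusters:  iterable of collections of (organism, protein_id) pairs
    organisms: collection of all organism names in the analysis
    max_size:  hypothetical size limit above which a cluster is counted
               as unrealistically large (an assumption of this sketch)
    """
    all_organisms = set(organisms)
    core = 0
    oversized = 0
    for cluster in clusters:
        present = {organism for organism, _ in cluster}
        if all_organisms <= present:   # every organism contributes a protein
            core += 1
        if len(cluster) > max_size:    # suspiciously large cluster
            oversized += 1
    return core, oversized
```

Running this for several candidate thresholds would show the trade-off: a lower threshold tends to enlarge the core genome but also produces more oversized clusters.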
We used PoSSuMsearch [16] with a p-value cut-off of 10 to identify possible binding sites in the upstream region (-540 bp … +40 bp relative to the start codon) of the potential target gene. With that information, we can identify evolutionarily conserved regulations in the target organisms.
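How the search window maps onto genome coordinates can be sketched as follows. The function `upstream_window` is hypothetical, and several details are assumptions: coordinates are 1-based and inclusive, offsets are counted from the first base of the start codon, the genome is treated as linear, and the window is clipped at the sequence ends.

```python
def upstream_window(start_codon_pos, strand, genome_length,
                    upstream=540, downstream=40):
    """Return the (low, high) genomic interval searched for binding sites.

    start_codon_pos: 1-based position of the first base of the start codon
    strand:          "+" or "-"
    """
    if strand == "+":
        low = start_codon_pos - upstream
        high = start_codon_pos + downstream
    elif strand == "-":
        # On the reverse strand, "upstream" lies at higher coordinates.
        low = start_codon_pos - downstream
        high = start_codon_pos + upstream
    else:
        raise ValueError("strand must be '+' or '-'")
    # Clip to the sequence boundaries (linear genome assumed).
    return max(1, low), min(genome_length, high)
```

For a forward-strand gene whose start codon begins at position 1000, the window spans positions 460 to 1040; for a reverse-strand gene at the same position it spans 960 to 1540.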
The transferred regulations undergo an additional refinement step that uses operon predictions obtained from MicrobesOnline [17]. A regulation is only considered conserved if the target gene is also the first gene in an operon. If this condition holds, all genes in that operon of the target organism are consequently predicted to be regulated by the transcription factor in question.
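The operon-based refinement can be expressed as a short sketch. The function `refine_with_operons` and its input layout are hypothetical; in particular, the sketch assumes that operons are given as gene lists in transcription order and that genes transcribed alone appear as single-gene operons.

```python
def refine_with_operons(predictions, operons):
    """Keep predictions whose target leads an operon and expand them.

    predictions: iterable of (regulator, target_gene) pairs
    operons:     list of gene lists, each in transcription order
    Returns a list of (regulator, gene) pairs covering whole operons.
    """
    operon_by_first_gene = {operon[0]: operon for operon in operons if operon}
    refined = []
    for regulator, target in predictions:
        operon = operon_by_first_gene.get(target)
        if operon is None:
            continue  # target is not the first gene of any operon: discard
        refined.extend((regulator, gene) for gene in operon)
    return refined
```

A prediction for the second gene of an operon is dropped, whereas a prediction for the first gene is propagated to every gene of that operon.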
If the same regulation is predicted by the network transfer from both model organisms, we store and display two pieces of in silico evidence for this regulation and refer to the two experimentally validated regulations in the model organisms.
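One simple way to keep track of such double evidence is to group the transferred regulations by regulator and target. The sketch below is illustrative only; the function `merge_evidence` and the tuple layout are assumptions and do not describe the actual CMRegNet database schema.

```python
from collections import defaultdict


def merge_evidence(transfers):
    """Group transferred regulations and collect their source organisms.

    transfers: iterable of (regulator, target, model_organism) tuples
    Returns a dict mapping (regulator, target) to a sorted list of the
    model organisms whose network transfer predicted that regulation.
    """
    evidence = defaultdict(set)
    for regulator, target, model_organism in transfers:
        evidence[(regulator, target)].add(model_organism)
    return {pair: sorted(models) for pair, models in evidence.items()}
```

A regulation predicted from both model organisms ends up with two entries in its evidence list, one per source.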
To sum up, for the target organisms we require gene annotations and operon predictions. For the model organisms we additionally need information on the regulatory interactions, including the binding sites of the involved transcription factors. In the following, we describe all data sources used for CMRegNet.
Target organisms
For the 18 target organisms included in CMRegNet, the publicly available sequences and annotation data were retrieved from the National Center for Biotechnology Information (NCBI) [18]. The operon prediction data were provided by MicrobesOnline [17], an integrated portal for comparative and functional genomics.
Model organisms
For both model organisms, we obtained the operon predictions as well as the gene annotations from the same sources as for the target organisms. In addition, information on the regulatory interactions had to be derived for the model organisms. The reconstruction of both regulatory networks is mainly based on experimental data derived from the literature. In the following section, we describe the additional data sources used.
Corynebacterium glutamicum ATCC 13032
With CoryneRegNet [12], a reference database and analysis platform for corynebacterial gene regulatory networks already exists. The biological content of CoryneRegNet comprehensively covers transcriptional regulations in the model organism C. glutamicum ATCC 13032 and provides all information necessary for CMRegNet, including TFBS and regulations. We extracted a total of 1,441 known regulatory interactions, 520 TFBS, 97 regulators, and their respective target genes. The data in CoryneRegNet are derived from various wet-lab experiments such as ChIP-chip, ChIP-Seq, and microarrays, with microarray experiments contributing the largest share [8].
Mycobacterium tuberculosis H37Rv
For M. tuberculosis H37Rv, despite its being a well-established model organism, no comparable database supporting transcriptional gene regulatory networks exists. However, the Tuberculosis Database (TBDB) collects tuberculosis-related research resources for M. tuberculosis H37Rv, e.g., expression data, metabolomic networks, relevant publications, and many more. In particular, TBDB hosts several omics data sets from multiple strains of M. tuberculosis, as well as data related to the genus Mycobacterium [16, 19]. In contrast to C. glutamicum, we do not have TFBS information for each mapped regulator in the genome of M. tuberculosis H37Rv. However, TBDB provides, for every regulator, the upstream regions of the target genes that most likely contain the TFBS. In order to extract the actual binding sites required for CMRegNet, we applied the following strategy.
Retrieving the binding sites for M. tuberculosis
Through its “Search Regulatory Binding Sites” option, the TBDB provides a table of possible regulatory genes for a given gene of interest (Fig. 1a). The information is based on ChIP-Seq experiments. We processed the following core information: (1) the regulator gene, (2) the distance to the start codon of the target gene, and (3) the start and stop coordinates of a region containing possible TFBS.
However, when using these regions to predict TFBS, we found some inconsistencies within the operons, concerning overlap, distance, size, and peaks (Fig. 1a and 1c). Although there are some reported cases in M. tuberculosis where the TFBS is found in regions with a high overlap and at larger distances [20], we followed a more stringent criterion to reduce the number of false positives. We limited the peak regions to an area between +40 bp and -540 bp relative to the target gene. A Perl script was used to filter the data obtained from TBDB. For each regulator, a FASTA file consisting of all sequences possibly containing the TFBS was created (Fig. 1d). These FASTA files formed the input for a subsequent TFBS prediction using MEME-ChIP [21].
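The filtering and FASTA-generation step can be sketched as follows. This is a Python illustration, not the Perl script used for CMRegNet. The function `peaks_to_fasta`, the record keys, and the FASTA header layout are hypothetical, and the sketch assumes that a peak region is kept only if it lies entirely within the window.

```python
from collections import defaultdict


def peaks_to_fasta(peaks, upstream=540, downstream=40):
    """Filter peak regions and build one FASTA text per regulator.

    peaks: iterable of dicts with the keys
           "regulator", "target", "rel_start", "rel_end", "sequence",
           where rel_start/rel_end are the peak boundaries relative to
           the start codon of the target gene.
    Returns a dict mapping each regulator to its FASTA-formatted text.
    """
    entries = defaultdict(list)
    for peak in peaks:
        low, high = sorted((peak["rel_start"], peak["rel_end"]))
        # Keep only regions fully inside -upstream .. +downstream.
        if low < -upstream or high > downstream:
            continue
        header = f">{peak['regulator']}|{peak['target']}|{low}..{high}"
        entries[peak["regulator"]].append(f"{header}\n{peak['sequence']}\n")
    return {regulator: "".join(records) for regulator, records in entries.items()}
```

Each value of the returned dictionary can be written to its own file and passed to the motif search as the input for one regulator.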
MEME-ChIP is a tool for discovering sequence motifs in large sequence sets. We performed a MEME run on each FASTA file using the default parameters. In addition, an extensive literature search was performed to find experimental data on TFBS. Whenever experimental evidence for a TFBS was available, we used this information to set the “Maximum width motif” parameter more stringently, according to the motif reported in the literature. An overview of the analysis pipeline is depicted in Fig. 2.
At this point, we have acquired all required data for the model organisms, namely the set of regulators and target genes with their corresponding TFBS. With this information we are able to run the previously described network transfer pipeline and transfer all evolutionarily conserved regulations from both model organisms to all 18 target organisms.