Experimental data annotation
The main objective during the article annotation phase for RegTransBase was to collect experimental evidence of transcriptional regulation and experimentally characterized TF binding sites. The main steps of the data collection, described in detail in our previous article, are the following: search for relevant articles in PubMed, entry of data through a specialized annotator interface, quality control, mapping of sites and genes to genomes, additional manual corrections (if necessary) and presentation of the data in the final format. Entry quality is controlled by a number of consistency and completeness checks. The genomic location of a specific feature (site or gene) is recorded by the annotator as a signature (a DNA sequence fragment of sufficient length), which is then used to map all the features in the database to a wide range of NCBI RefSeq genomes [26, 27].
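The signature-based mapping step can be illustrated with a minimal sketch. It assumes exact string matching on both strands, which the text does not specify; the function names `reverse_complement` and `map_signature` are hypothetical and are not part of the RegTransBase pipeline.

```python
# Illustrative sketch only: exact matching is an assumption, not the
# documented RegTransBase mapping procedure.

def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.translate(str.maketrans("ACGTacgt", "TGCAtgca"))[::-1]


def map_signature(signature, genome):
    """Find all exact occurrences of a signature on either strand.

    Returns a list of (start, end, strand) tuples with 0-based,
    end-exclusive coordinates on the forward strand.
    """
    sig = signature.upper()
    genome = genome.upper()
    queries = [("+", sig)]
    rc = reverse_complement(sig)
    if rc != sig:  # a palindromic signature is reported once, on "+"
        queries.append(("-", rc))

    hits = []
    for strand, query in queries:
        start = genome.find(query)
        while start != -1:
            hits.append((start, start + len(query), strand))
            start = genome.find(query, start + 1)
    return hits
```

In this sketch a signature is of "sufficient length" when `map_signature` returns exactly one hit for the target genome; zero or several hits would call for manual correction.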
Each database entry describes a single experiment, that is, an experimentally determined relationship between several database elements. A single entry may describe an experiment and its control, identical results obtained by different methods or the results of applying one technique to several similar objects. Only original results are recorded, normally from the ‘Results’ or ‘Discussion’ sections of an article.
The types of experimental techniques form a controlled vocabulary. The following categories of experiments were accepted: (i) regulation of gene expression by a known regulator; (ii) demonstration that a gene encodes a regulatory protein (excluding proteins that do not directly bind DNA, e.g. protein kinases); (iii) experimental mapping of DNA binding sites for known regulators; (iv) identification of mutations in regulatory genes influencing expression of regulated genes; (v) computational prediction of binding sites.
The classes of elements in the database are: regulators (regulatory proteins and RNAs directly binding to DNA, with a well-defined binding site); effectors (molecules not binding DNA, or physical effects such as stress); and positional elements. The latter are described as regions in DNA sequences. Positional elements form a hierarchy: locus > operon > transcript > gene and site; an element may be a sub-element of other elements of the same or higher levels (e.g., a site and a gene may each be a sub-element of an operon).
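The positional-element hierarchy can be sketched as a small data structure. This is a minimal illustration, not the RegTransBase schema: the class name `PositionalElement`, its fields and the numeric `LEVEL` ranks are assumptions introduced here.

```python
from dataclasses import dataclass, field
from typing import List

# Ranks follow the hierarchy in the text: locus > operon > transcript > gene and site.
# Gene and site share the lowest rank. The numeric values are an assumption.
LEVEL = {"locus": 3, "operon": 2, "transcript": 1, "gene": 0, "site": 0}


@dataclass
class PositionalElement:
    """A region in a DNA sequence (hypothetical model)."""
    name: str
    kind: str   # one of the keys of LEVEL
    start: int  # region start coordinate
    end: int    # region end coordinate
    children: List["PositionalElement"] = field(default_factory=list)

    def add_child(self, child: "PositionalElement") -> None:
        # A sub-element must belong to a parent of the same or a higher level.
        if LEVEL[child.kind] > LEVEL[self.kind]:
            raise ValueError(
                f"{child.kind} cannot be a sub-element of {self.kind}"
            )
        self.children.append(child)
```

For example, a gene and a site can both be added to an operon, whereas adding an operon to a gene raises `ValueError`.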
All elements are linked to the corresponding experiments and together they are linked to the original article. As mentioned above, positional elements are mapped to genomes; thus, if two independent articles describe regulation of the same gene, the data contained in these articles will be interlinked via this gene, but sites and other experimental data will be reported as independent entries.
Our original publication on RegTransBase and the Help pages at http://regtransbase.lbl.gov provide more details on the procedure of experimental data annotation.
Putative regulons from experimental data
The Putative Regulons section of RegTransBase provides a list of experimental sites along with a non-redundant list of target genes for each regulator. The process we undertook in developing this list of putative regulons from the manually curated data includes three steps.
First, we selected a subset of experiments using the following criteria: (i) the experiment describes a single regulator, (ii) a regulator and its regulated genes belong to the same genome, (iii) no computational predictions are included.
Second, from this subset we extracted the ‘regulator-regulated gene’ pairs for each genome, taking into account operon structure, which extends the list of regulated genes by adding the other members of a particular operon. In some cases a particular pair of a regulator and a regulated gene appears in multiple entries in RegTransBase. We removed such redundant pairs from the list of regulator-regulated genes based on positional mapping.
Third, we compiled a list of putative regulons by unifying all ‘regulator-gene’ pairs with the same regulator.
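The three steps above can be sketched as follows. This is a simplified illustration under stated assumptions: the dictionary keys (`regulators`, `regulator_genome`, `genes`, `computational`), the `operon_members` lookup and the function name `build_putative_regulons` are hypothetical, and gene identifiers stand in for the positional mapping used to detect redundant pairs.

```python
from collections import defaultdict


def build_putative_regulons(experiments, operon_members):
    """Derive putative regulons from curated experiments (illustrative sketch).

    experiments: list of dicts with hypothetical keys
        'regulators'       - list of regulator names
        'regulator_genome' - genome identifier of the regulator
        'genes'            - list of (genome, gene) tuples
        'computational'    - True for computational predictions
    operon_members: dict mapping (genome, gene) to all genes of its operon
    """
    pairs = set()  # a set removes redundant regulator-gene pairs
    for exp in experiments:
        # Step 1: keep experiments with a single regulator, regulator and
        # genes in the same genome, and no computational predictions.
        if len(exp["regulators"]) != 1 or exp["computational"]:
            continue
        genome = exp["regulator_genome"]
        if any(gene_genome != genome for gene_genome, _ in exp["genes"]):
            continue
        regulator = exp["regulators"][0]
        # Step 2: extract pairs and extend them with other operon members.
        for _, gene in exp["genes"]:
            for member in operon_members.get((genome, gene), [gene]):
                pairs.add((genome, regulator, member))

    # Step 3: unify all pairs that share the same regulator.
    regulons = defaultdict(set)
    for genome, regulator, gene in pairs:
        regulons[(genome, regulator)].add(gene)
    return {key: sorted(genes) for key, genes in regulons.items()}
```

Note one design choice in this sketch: an experiment is discarded entirely if any of its genes lies in a different genome from the regulator, which is one possible reading of criterion (ii).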
Manually curated position weight matrices
Each record in the Manually Curated PWM section of the database comprises a TFBS training set (alignment) created by an expert curator using published experimental data and manual in silico analyses. The curator first gathered information about a known transcription factor for which a set of binding sites was known, summarized the description of this transcription factor from published articles, and recorded its genomic location. The curator then annotated the binding sites and their sequence, downstream gene, location in a published genome, and any published experimental evidence. In addition, curators supplied groups of organisms that they believed could be used when searching for homologous binding sites, based on the phylogenetic distance between organisms and the presence of a conserved transcription factor. Lastly, the curator recorded default scores and the expected distance of a binding site from the start of a gene, based on examination of the existing binding sites.
A PWM is automatically created in the RegTransBase database from the TFBS alignment. We then searched all recommended bacterial genomes using MAST. We recorded in the RegTransBase database all hits that met the following criteria: an e-value of 1e-5 or better, no overlap with coding regions and a location upstream of a predicted gene.
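How a PWM can be derived from an alignment and how the three hit criteria can be applied is sketched below. The sketch makes several assumptions that the text does not state: a log-odds PWM with a pseudocount of 1 and a uniform background of 0.25, ungapped sites over A, C, G and T, and hypothetical hit fields (`evalue`, `overlaps_cds`, `upstream_of_gene`). It does not reproduce the RegTransBase implementation or call MAST.

```python
import math


def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds PWM from aligned, ungapped binding sites.

    Pseudocount and uniform background are assumptions of this sketch.
    Returns one dict per alignment column, mapping base -> log2 odds.
    """
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all aligned sites must have the same length")
    pwm = []
    for i in range(length):
        counts = {base: pseudocount for base in "ACGT"}
        for site in sites:
            counts[site[i].upper()] += 1  # assumes A/C/G/T only
        total = sum(counts.values())
        pwm.append(
            {base: math.log2((counts[base] / total) / background) for base in "ACGT"}
        )
    return pwm


def score(pwm, seq):
    """Sum the per-column log-odds for a candidate site of PWM length."""
    return sum(column[base] for column, base in zip(pwm, seq.upper()))


def passes_filters(hit, max_evalue=1e-5):
    """Apply the three recording criteria to a hit (hypothetical dict keys)."""
    return (
        hit["evalue"] <= max_evalue
        and not hit["overlaps_cds"]
        and hit["upstream_of_gene"]
    )
```

A candidate that matches the consensus of the training set scores higher than an unrelated sequence, and only hits that satisfy all three criteria would be recorded.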
With each record, we provide the binding site location with a reference to a published sequence (usually NCBI RefSeq), the sequence, the gene affected by the binding site, the evidence for the binding (if any), any relevant articles pertaining to that site, and the transcription factor that binds the site. We also provide for download the sequence logo for the alignment, profiles and alignments in many different formats, and recommended options for using the profiles to search other genomes (cut-off scores, distance from gene, taxonomy).
As of November 2012, RegTransBase contains information on 666 bacterial species from 224 genera. This resource provides access to information on 19000 different experiments from about 7200 articles published from as far back as 1977 until the present day (more details in Table 1).