PairWise Neighbours database: overlaps and spacers among prokaryote genomes
© Pallejà et al. 2009
Received: 14 January 2009
Accepted: 25 June 2009
Published: 25 June 2009
Although prokaryotes live in a variety of habitats and differ in metabolic and genomic complexity, they have several genomic architectural features in common. Overlapping genes are a common feature of prokaryote genomes. The overlaps tend to be short because longer overlaps carry a higher risk of deleterious mutations. The spacers between genes also tend to be short because of the tendency to reduce non-coding DNA in prokaryotes. However, they must be long enough to maintain essential regulatory signals such as the Shine-Dalgarno (SD) sequence, which is responsible for efficient translation.
PairWise Neighbours is an interactive and intuitive database used for retrieving information about the spacers and overlapping genes among bacterial and archaeal genomes. It contains 1,956,294 gene pairs from 678 fully sequenced prokaryote genomes and is freely available at the URL http://genomes.urv.cat/pwneigh. This database provides information about the overlaps and their conservation across species. Furthermore, it allows broad analysis of the intergenic regions, providing useful information such as the location and strength of the SD sequence.
Some experiments and bioinformatic analyses rely on correct annotation of the initiation site. A database for studying the overlaps and spacers among prokaryotes is therefore desirable. The PairWise Neighbours database permits reliability analysis of the overlapping structures and the study of SD presence and location among adjacent genes, which may help to check the annotation of initiation sites.
The availability of fully sequenced genomes has grown exponentially over the past few years. Prokaryote species occupy a huge variety of environments and differ in metabolic and genomic complexity. However, prokaryote genomes share common architectural principles. They contain protein-coding genes, structural RNAs and spacers between genes, which are thought to typically contain regulatory signals and the origin of replication sequence. These spacers tend to be short because of the selective pressure to minimize the non-functional DNA in prokaryotes [2, 4]. It is a consistent feature of these genomes that genes often overlap their coding sequences. Under this scenario of genomic compactness due to their physically small environments, the overlapping genes follow the rules imposed by the structure of the genetic code, and the spacers between genes must adapt their lengths to the requirements of the regulatory signals.
One of the regulatory signals found between genes is the Shine-Dalgarno (SD) sequence. The SD sequence is a motif, 5'-GGAGG-3', located 5' of the initiation codon and complementary to the sequence 5'-CCUCC-3' located at the 3' end of the 16S rRNA. The ribosome does not need a perfect distance between the SD sequence and the start codon for the initiation of translation. However, it has been shown that when the SD lies within 4 nucleotides of the initiation codon, or as far as 13 nucleotides from it, gene expression decreases drastically [7–9]. Prokaryote species seem to have preferred distances between the SD and the start codon, and these distances vary among species, although the sequence has mostly been found from the 7th to the 12th base upstream of the start codon [10–12]. The location of the SD can help to correct gene annotations and could influence the spacing length and the stop codon usage.
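The complementarity between the SD motif and the 16S rRNA sequence can be checked with a short reverse-complement calculation. This is only an illustrative sketch: the function name is ours, and it considers Watson-Crick pairs only, ignoring G-U wobble pairing.

```python
# Watson-Crick pairing for RNA (assumption: wobble pairs are ignored).
RNA_COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def reverse_complement(rna):
    """Return the reverse complement of an RNA sequence written 5' to 3'."""
    return "".join(RNA_COMPLEMENT[base] for base in reversed(rna))


# The anti-SD sequence of the 16S rRNA pairs with the SD motif on the mRNA.
print(reverse_complement("CCUCC"))   # GGAGG
print(reverse_complement("ACCUCC"))  # GGAGGU
```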
Among the prokaryote genomes there are a great many examples of overlapping genes [15–19]. The overlapping lengths tend to be short because of the selective pressure against long overlaps: long overlapping reading frames increase the risk of deleterious mutations. Co-directional overlaps are the most common, reflecting that this is the most common orientation for a gene pair, due to the tendency of genes to be grouped in operons in prokaryote genomes [20–22]. Among the co-directional overlaps the 4 bp overlap is extremely common [5, 15, 23, 24]; it lets the upstream stop codon and the downstream start codon overlap, and the gene pair is thought to be translationally coupled. Co-directional and divergent overlapping genes can arise by 5'-end elongation, when the downstream gene adopts a new start codon within the upstream coding sequence, while co-directional and convergent overlapping genes can arise by 3'-end extension after a stop codon loss event. Overlaps in prokaryotes have been hypothesized to be involved in reducing the genome size in order to increase the density of genetic information [17, 24, 26, 27, 28], and in regulating gene expression through translational coupling of functionally related polypeptides [5, 24, 26, 29, 30]. In addition, other authors have used the overlapping pairs as genetic markers for phylogenetic inferences due to their high conservation [31, 32]. Overlapping genes are better conserved across species than non-overlapping genes. The extent of conservation of the overlapping pairs correlates with the evolutionary distances between the pairs of species.
Overlapping genes, as a common structure of prokaryote genomes, and the spacers between genes are structural features worth studying in prokaryotes. However, the analysis of both is often affected by genome annotation errors [33–35]. Accurate annotation would facilitate experiments as well as the bioinformatic analysis of gene regulation and gene structure. This interactive database stores all the overlapping genes and spacers of 678 fully sequenced prokaryote genomes. Its aim is to provide users with useful information about the overlapping genes and the spacing lengths between adjacent genes. The conservation of the overlaps across species and the presence and location of the SD within the intergenic regions or the overlapping sequences can be analysed. The quality of the information given depends on the quality of the genome annotations. Conversely, the database can be used to analyse suspected genome annotation errors such as wrong initiation sites or false gene predictions.
The complete genome sequences of 678 prokaryote genomes were downloaded from the NCBI ftp site ftp://ftp.ncbi.nlm.nih.gov/genomes/. Perl scripts were written to extract and analyse the spacers and the overlaps between adjacent genes and all the related information (spacing and overlapping lengths, spacing and overlapping sequences, gene orientations, phases, protein functions, gene COGs, and stop and start codons of the genes). The internal gene ids in this database are formed by joining the GenBank Accession Number with the gene name. For instance, the gene id for the HI0038 gene from Haemophilus influenzae Rd KW20 is NC_000907.HI0038. Furthermore, each overlap and spacer between adjacent genes has an internal id. The spacers and the overlapping genes have been classified into three types according to their transcriptional direction [2, 16, 26]: i) unidirectional (genes on the same strand, with the 3'-end of the upstream gene overlapping the 5'-end of the downstream gene), ii) convergent (genes on opposite strands overlapping at their 3'-ends) and iii) divergent (genes on opposite strands overlapping at their 5'-ends). In this database we use the term co-directional instead of unidirectional. In order to study the phases between adjacent genes, as other authors have previously done [5, 19, 23], we defined three overlapping phases: (i) phase 0, where the downstream gene is in frame with the upstream gene (lengths n = ..., -12, -9, -6, -3, 0, 3, 6, 9, 12, ...), (ii) phase 1, where the downstream gene is in the reading frame +1 relative to the upstream gene frame (lengths n = ..., -11, -8, -5, -2, 1, 4, 7, 10, ...) and (iii) phase 2, where the downstream gene is in the reading frame +2 relative to the upstream gene frame (lengths n = ..., -10, -7, -4, -1, 2, 5, 8, 11, ...).
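The id, orientation and phase conventions described above can be summarised in a few lines of Python. This is a sketch rather than the Perl code used to build the database: the function names are ours, gene pairs are assumed to be ordered by chromosome coordinate with strands given as '+' or '-', and the phase is taken as the signed length modulo 3, which reproduces the three series of lengths listed in the text.

```python
def internal_gene_id(accession, gene_name):
    """Join a GenBank accession number and a gene name, e.g. NC_000907.HI0038."""
    return f"{accession}.{gene_name}"


def orientation(left_strand, right_strand):
    """Classify an adjacent gene pair from the strands of its two genes.

    Assumption: the 'left' gene has the lower chromosome coordinate.
    """
    if left_strand == right_strand:
        return "co-directional"
    if left_strand == "+":
        # '+' followed by '-': the two 3'-ends face each other.
        return "convergent"
    # '-' followed by '+': the two 5'-ends face each other.
    return "divergent"


def phase(n):
    """Phase (0, 1 or 2) of a signed length n.

    Python's % returns a non-negative result for a positive divisor,
    so negative lengths fall into the series given in the text.
    """
    return n % 3


print(internal_gene_id("NC_000907", "HI0038"))  # NC_000907.HI0038
print(orientation("+", "-"))                    # convergent
print(phase(-4))                                # 2
```

Under this reading, the very common 4 bp co-directional overlap (n = -4) falls in phase 2.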
We extracted the 16S rRNAs from the NCBI ftp site ftp://ftp.ncbi.nlm.nih.gov/genomes/. For each 16S rRNA sequence of each organism we searched from the 3' end in the 5' direction for the first instance of the three-letter motif 5'-GAU-3', which was found consistently in the 3' end tails of the 16S rRNAs with known structures. The location of this motif was used to define, up to the end of the 3' tail, the 16S rRNA tail of each organism. For species that have two or more copies of the 16S rRNA gene, we calculated the consensus sequence of all the tails. If the different tails observed did not follow a consensus, we used the tail shared by the majority of the 16S rRNA genes. All the 16S rRNA tails of the 678 organisms were examined manually.
The SD sequences for the 678 prokaryote genomes were predicted by computing the base-pairing free energy between translation initiation regions and the 16S rRNA 3' tail. The method used was developed by Starmer and co-workers; the scripts to calculate the free energies were downloaded from http://sourceforge.net/projects/free2bind/ and included in our Perl scripts. We located the SD sequence at the position of the lowest ΔG° value calculated from 35 bp upstream of the initiation codon to 35 bp downstream of it. The gene was assumed not to have an SD sequence if ΔG° > -3.4535 kcal/mol and to have an SD sequence if ΔG° ≤ -3.4535 kcal/mol. This threshold is based on the work of Ma and co-workers. The gene was assumed to have a strong SD sequence if ΔG° ≤ -8.4 kcal/mol, which is the value obtained from optimal base pairing between the 16S rRNA and the original SD sequence 5'-GGAGGU-3'. To pinpoint the exact SD position we used the relative spacing parameter: the distance between the first residue of the start codon and the 5' A of the rRNA sequence 5'-ACCUCC-3', calculated at each position around the start codon. If the SD motif is located before the start codon the relative spacing is negative, while if it is located after the start codon the relative spacing is positive. Regardless of the gene pair orientation, the SD information and the graph of the ΔG° values are given for the upstream and the downstream gene.
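The tail extraction and the SD call described above might be sketched as follows. This is not the authors' Perl code and it does not compute free energies (that is the job of the free2bind scripts): the function names are hypothetical, the ΔG° values are assumed to be supplied already as a mapping from position to kcal/mol, the tail is assumed to include the GAU motif itself, and only the majority fallback for multi-copy genomes is shown, not the consensus calculation.

```python
from collections import Counter

NO_SD_THRESHOLD = -3.4535   # kcal/mol, threshold quoted in the text
STRONG_SD_THRESHOLD = -8.4  # kcal/mol, threshold quoted in the text


def rrna_tail(seq_16s, motif="GAU"):
    """Return the 3' tail, from the motif occurrence nearest the 3' end onwards.

    Assumption: the sequence is written 5' to 3' in the RNA alphabet.
    """
    start = seq_16s.rfind(motif)
    if start == -1:
        raise ValueError("motif not found in 16S rRNA sequence")
    return seq_16s[start:]


def majority_tail(tails):
    """Pick the most frequent tail among the 16S rRNA copies of one genome."""
    return Counter(tails).most_common(1)[0][0]


def call_sd(delta_g_by_position):
    """Find the position of the lowest ΔG° and classify the gene.

    delta_g_by_position maps a position around the start codon to ΔG°.
    """
    position = min(delta_g_by_position, key=delta_g_by_position.get)
    delta_g = delta_g_by_position[position]
    if delta_g > NO_SD_THRESHOLD:
        label = "no SD"
    elif delta_g <= STRONG_SD_THRESHOLD:
        label = "strong SD"
    else:
        label = "SD"
    return position, delta_g, label


# Made-up ΔG° values for illustration only.
print(call_sd({-12: -1.0, -9: -5.2, -6: -2.0}))  # (-9, -5.2, 'SD')
```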
We have developed an interactive and intuitive database that currently contains 1,956,294 gene pairs from 678 fully sequenced microbial genomes. The database is freely available at the URL http://genomes.urv.cat/pwneigh. It provides information about the overlapping genes and the spacers between genes among the prokaryote genomes. Users can access the information through three browsers and an advanced search engine, which are described below. They can find information about the overlaps and the spacers with the species name or the GenBank Accession Number, with the gene id (the gene name, the short gene name or the PID) or with the internal gene id (described above in the Construction and Content section). While users are typing the species name or any gene id, the search engine helps to complete the name or the id. By clicking on the "TagClouds", the user can get a list of the species contained in the database, which can be sorted by the number of overlaps in a genome or by genome length in order to see at a glance the genomes with more overlaps or the longer genomes. Furthermore, the database can provide users with reports in TSV format at every step of their consultation by clicking the Download TSV Data buttons. In addition, the raw data and a Database Schema (Figure 1) can be downloaded in the Downloads section.
With this browser, users can find general information about the genomes and connect to the overlapping genes or the spacers between genes contained in the genome. They can access this information by typing the name of the species (by tax name) or the GenBank Accession Number (by genbank). If they do not remember the species name or the GenBank Accession Number, by clicking on "Genome (List)" they can consult an exhaustive list of the species contained in this database and their GenBank Accession Numbers. Once the user has made a genome search, the first page gives basic features of the genome, including the species name, the GenBank Accession Number, the TaxID, the genome length, the number of ORFs in the chromosome, the number of overlaps and spacers in the genome, the overlaps between ORFs ratio in the chromosome and the number of co-directional, convergent and divergent overlaps, tabulated and represented graphically. Clicking the number of overlaps displays a list of the overlaps contained in the genome on a new page, while clicking the number of spacers displays a list of the spacers contained in the genome on another new page.
The users can analyse the overlapping genes in a genome or a particular overlap of interest (by gene or by internal id). Once the user has made a genome search, the first page shows a list of the overlaps with the overlapping genes and their orientations, together with a graphical representation of the distribution of the overlapping lengths. This distribution gives a general idea of the most common overlaps and the most common overlapping phases in the genome. Each overlap id leads to a detailed new page for the overlap with five labels that provide: overlap information, upstream gene information, upstream gene sequence, downstream gene information and downstream gene sequence. The overlap information label (General Info label) provides the internal id, the chromosome name, the orientation, the overlapping phase, the overlapping length and the overlapping sequence. The upstream and downstream gene information labels (Upstream Gene and Downstream Gene labels, respectively) show the gene name, the gene function, the gene COG, the stop codon and the start codon. These labels also give information related to the SD location (the minimal ΔG° value and its position), and the ΔG° values in the translation initiation region are represented graphically. The SD-related information is given in the upstream or in the downstream label depending on the gene pair orientation. The labels Fasta Up and Fasta Down contain the upstream and the downstream gene sequences in FASTA format. Above the sequences there is a BLAST button. Clicking on it pastes the gene sequence directly into the local BLAST search engine, so that the conservation of an overlap across species can be analysed. In the PairWise Neighbours database, the user can set the Expect threshold of the BLAST search engine, among other features, and can therefore decide the threshold used to study the similarity among orthologous genes when analysing the conservation of the overlapping pair. In the BLAST results, clicking on any hit displays the information of the overlap on a new page.
The users can analyse the spacers between adjacent genes in a genome or a particular spacer of interest (by gene or by internal id). If the user makes a genome search, a bar chart of the spacing lengths of the genome is shown, giving a first view of the most common spacers in the genome. Below the chart, a list of all the spacers in the genome is displayed, providing the internal id, the genes separated by the spacer and their orientation. Clicking any internal id displays all the information about the spacer on a new page. On this page there are three labels that give information about the spacer, the upstream gene and the downstream gene. The fields on the general information label (General Info label) are the same as those on the General Info label of an overlap, except that the user finds the Spacing length instead of the Overlapping length and the Spacer sequence instead of the Overlapping sequence. The information provided on the Upstream and Downstream Gene labels is the same as that on the overlap labels, and the SD-related information is also given depending on the gene pair orientation.
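The length distributions that the overlap and spacer browsers display as charts can be approximated with a simple tally. The following is an illustrative sketch with made-up lengths, not the database's own plotting code; the function names are ours, and the phase is taken as the signed length modulo 3, following the series of lengths given in the Construction and Content section.

```python
from collections import Counter


def length_distribution(lengths):
    """Count how many gene pairs have each signed length."""
    return Counter(lengths)


def phase_distribution(lengths):
    """Count gene pairs per phase (signed length modulo 3)."""
    return Counter(n % 3 for n in lengths)


def text_bar_chart(counts):
    """Render one text bar per length, sorted by length."""
    return "\n".join(f"{n:>5} {'#' * counts[n]}" for n in sorted(counts))


# Hypothetical lengths for a handful of gene pairs.
lengths = [-4, -4, -4, -1, -8, 12, 50]
print(text_bar_chart(length_distribution(lengths)))
print(phase_distribution(lengths))
```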
In this Advanced Search it is possible to study the functionality of the genes more widely. The user can make correlations between the COG classes and the gene orientations or between the COG classes and the overlapping or spacing lengths among the prokaryote genomes. Furthermore, the user can retrieve the gene set of each organism without SD sequence, with SD sequence and with a strong SD sequence by just selecting the organism and the corresponding energy threshold (the energy thresholds are explained above in the Construction and Content section).
In this Discussion section we give a few examples that illustrate possible uses of the PairWise Neighbours database.
Genes with or without SD in E. coli K12
Gene set | Number of genes | Percentage of genes with SD | Percentage of genes without SD
All E. coli genes | | |
Highly expressed genes (HEG) from E. coli(1) | | |
Horizontally transferred genes (HGT) from E. coli(2) | | |
Mean and standard deviation of 100 sets of 300 genes randomly selected from E. coli | | 69.04 ± 2.58 | 30.96 ± 2.58
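The last row of the table summarises 100 random sets of 300 genes. A resampling of that kind might be sketched as below. The function name, the fixed seed, sampling without replacement and the use of the sample standard deviation are all our assumptions; the source does not state how the sets were drawn.

```python
import random
import statistics


def resample_sd_percentage(has_sd, n_sets=100, set_size=300, seed=0):
    """Mean and standard deviation of the SD percentage over random gene sets.

    has_sd: list of booleans, one per gene (True if the gene has an SD sequence).
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    percentages = []
    for _ in range(n_sets):
        sample = rng.sample(has_sd, set_size)  # draw genes without replacement
        percentages.append(100.0 * sum(sample) / set_size)
    return statistics.mean(percentages), statistics.stdev(percentages)


# Hypothetical genome: 700 genes with an SD sequence and 300 without.
genes = [True] * 700 + [False] * 300
mean_pct, sd_pct = resample_sd_percentage(genes)
print(f"{mean_pct:.2f} ± {sd_pct:.2f}")
```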
Studies of the translation initiation mechanism, gene regulation and gene structure (such as operon predictions) rely on correct annotations. With the growing number of fully sequenced prokaryote genomes, databases that help the annotation process are very desirable. PairWise Neighbours is an interactive and intuitive database for retrieving information about the spacers and overlapping genes among bacterial and archaeal genomes. On the one hand, with this information it is possible to study the reliability of an overlap as well as its conservation across species with a local BLAST system, which lets users apply their desired Expect threshold. On the other hand, with the information related to the SD sequence and the ΔG° values along the translation initiation region, users can analyse the intergenic regions broadly. They can check the reliability of the initiation site prediction, the SD location and the SD strength, or the relationship between SD location and the spacing lengths. In addition, it is possible to analyse the gene functions using the COG classes and the SD predictions.
Project name: pwneigh
Project home page: http://genomes.urv.cat/pwneigh/
Operating systems: Platform independent
Programming language: Python and SQL
Other requirements: Python 2.5, MySQL 5.0, Apache 2.0 and TurboGears 1.0.7
Licence: Content by Creative Commons and source code by GNU GPL
Any restrictions to use by non-academicians: None
This work has also been supported by projects BIO02003-07672 and AGL2007-65678/ALI of the Spanish Ministry of Education and Science. We would also like to thank Richard Tuby for his help in writing the manuscript, and the anonymous reviewers for their useful suggestions. Finally, we would like to thank Joshua Starmer and co-workers for making available their programs for detecting Shine-Dalgarno motifs, with special thanks to Joshua Starmer for his kind assistance.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.