Identification of genomic islands
The SeqWord Gene Island Sniffer (SWGIS) algorithm [19] was modified to SWGIS v2.0 and used to predict GI locations in chromosomal sequences. An oligonucleotide usage pattern (OUP) was denoted as a matrix of deviations Δ[ξ1…ξN] of observed from expected counts of all possible tetranucleotide permutations:
$$ {\varDelta}_{\left[{\xi}_1\dots {\xi}_N\right]}=\left({C}_{\left[{\xi}_1\dots {\xi}_N\right]\mid obs}-{C}_{\left[{\xi}_1\dots {\xi}_N\right]\mid e}\right)/{C}_{\left[{\xi}_1\dots {\xi}_N\right]\mid 0} $$
(1)
where ξn is any nucleotide A, T, G or C in the N-long word (N = 4 for tetranucleotides); C[ξ1…ξN]|obs is the observed count of the word [ξ1…ξN]; C[ξ1…ξN]|e is the expected count; and C[ξ1…ξN]|0 is a standard count estimated from the assumption of an equal distribution of words in the sequence (C[ξ1…ξN]|0 = Lseq × 4^(−N)).
Expected counts of words C[ξ1…ξN]|e were calculated in accordance with the applied normalization scheme. Thus, C[ξ1…ξN]|e = C[ξ1…ξN]|0 if OU is not normalized, or C[ξ1…ξN]|e = C[ξ1…ξN]|n if OU is normalized by empirical frequencies of shorter words of length n using n-order Markov chain normalization.
Two approaches to normalization by GC content were used: the GC content was calculated either for the sliding window sequence (local normalization) or for the complete reference sequence (generalized normalization).
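As an illustration of eq. 1, the following minimal sketch computes the deviations for the non-normalized case only, where the expected count equals the standard count (C[ξ1…ξN]|e = C[ξ1…ξN]|0). It is written in current Python 3 rather than the Python 2.5 used for this work, the function name `oup_deviations` is hypothetical, and counting overlapping words on a single strand is an assumption, not a description of the SWGIS v2.0 code.

```python
from collections import Counter
from itertools import product


def oup_deviations(seq, n=4):
    """Return {word: deviation} for all 4**n words (eq. 1, with C_e = C_0)."""
    seq = seq.upper()
    # Observed counts of overlapping n-long words along the sequence.
    observed = Counter(seq[i:i + n] for i in range(len(seq) - n + 1))
    # Standard count under an equal distribution of words: C_0 = L_seq * 4**(-n).
    c0 = len(seq) * 4.0 ** (-n)
    # Only words over A, C, G, T are reported; words with other symbols are ignored.
    words = ("".join(p) for p in product("ACGT", repeat=n))
    return {w: (observed.get(w, 0) - c0) / c0 for w in words}
```

For a tetranucleotide pattern the returned dictionary has 256 entries, one per word, and a word that never occurs receives a deviation of −1.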
The distance (D) between two OUPs was calculated as the sum of absolute distances between ranks of identical words (w, in total 4^N different words, i.e. 256 for tetranucleotides) after ordering the words by their Δ[ξ1…ξN] values (eq. 1) in two patterns i and j:
$$ D\left(\%\right)=100\times \frac{\sum \limits_w^{4^N}\left|{\operatorname{rank}}_{w,i}-{\operatorname{rank}}_{w,j}\right|-{D}_{\mathrm{min}}}{{D}_{\mathrm{max}}-{D}_{\mathrm{min}}} $$
(2)
Pattern skew (PS) is a particular case of D where patterns i and j were calculated for the same DNA molecule but for the direct and reversed strands, respectively. Dmax = 4^N × (4^N − 1)/2 and Dmin = 0 when calculating D; in the case of PS calculation, Dmin = 4^N if N is an odd number or Dmin = 4^N − 2^N if N is an even number.
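The rank-based distance of eq. 2 and the pattern skew can be sketched as below. All function names are hypothetical. Two points are assumptions rather than details taken from the text: ties in deviation values are broken alphabetically, and the pattern of the reversed strand is obtained by assigning each word the deviation of its reverse complement on the direct strand, which holds for non-normalized counts.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def rank_words(deviations):
    """Map each word to its rank after ordering by deviation (ties: alphabetical)."""
    ordered = sorted(deviations, key=lambda w: (deviations[w], w))
    return {w: r for r, w in enumerate(ordered)}


def oup_distance(dev_i, dev_j, d_min=0.0):
    """Percentage distance D between two patterns (eq. 2)."""
    ranks_i, ranks_j = rank_words(dev_i), rank_words(dev_j)
    n_words = len(ranks_i)
    d_max = n_words * (n_words - 1) / 2.0
    total = sum(abs(ranks_i[w] - ranks_j[w]) for w in ranks_i)
    return 100.0 * (total - d_min) / (d_max - d_min)


def ps_d_min(n):
    """D_min for pattern skew: 4**n for odd n, 4**n - 2**n for even n."""
    return 4 ** n if n % 2 else 4 ** n - 2 ** n


def reverse_complement(word):
    return word.translate(_COMPLEMENT)[::-1]


def pattern_skew(dev, n=4):
    """PS: distance between the direct-strand and reversed-strand patterns."""
    # Assumption: the reversed-strand deviation of a word equals the
    # direct-strand deviation of its reverse complement.
    dev_rev = {w: dev[reverse_complement(w)] for w in dev}
    return oup_distance(dev, dev_rev, d_min=ps_d_min(n))
```

Under this reading, the even-N value of Dmin follows naturally: the 2^N palindromic words keep their rank on both strands, while every other word differs by at least one rank position from its reverse complement.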
Variance of an OU pattern was calculated by the following equation:
$$ V=\frac{\sum \limits_w^{4^N}{\varDelta}_w^2}{\left({4}^N-1\right)\times {\sigma}_0} $$
(3)
where N is the word length; Δw^2 is the square of the count deviation of word w (eq. 1); and σ0 is the expected standard deviation:
$$ {\sigma}_0=\sqrt{0.02+\frac{4^N}{L_{seq}}} $$
(4)
where Lseq is the sequence length and N is the word length.
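Equations 3 and 4 translate directly into code. The sketch below assumes the deviations are supplied as a dictionary of word to Δ value, as produced by any implementation of eq. 1; the function names `expected_sd` and `oup_variance` are hypothetical.

```python
import math


def expected_sd(seq_len, n=4):
    """Expected standard deviation sigma_0 (eq. 4)."""
    return math.sqrt(0.02 + 4.0 ** n / seq_len)


def oup_variance(deviations, seq_len, n=4):
    """Variance V of an OU pattern (eq. 3)."""
    # Sum of squared word count deviations over all words of the pattern.
    sum_sq = sum(d * d for d in deviations.values())
    return sum_sq / ((4 ** n - 1) * expected_sd(seq_len, n))
```

Because σ0 grows as the sequence gets shorter, the same sum of squared deviations yields a smaller V for a short window than for a long one.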
SWGIS v2.0 calculates two types of variance: one for patterns normalized by the GC content of a sliding window (relative variance, RV) and one for patterns normalized by the GC content of the whole reference sequence (generalized relative variance, GRV). The ratio RV/GRV is then used for GI prediction. These parameters were described in more detail in previous publications [25, 26], where cut-off values for GI predictions were established empirically as follows: D larger than 1.5, PS smaller than 55 and RV/GRV larger than 1.5.
The principal improvement in SWGIS v2.0 was in calculating a reference OUP for a 300 kbp sliding window and recalculating the OU every 100 kbp. The original SWGIS algorithm calculated a reference OUP for an entire bacterial chromosome, which is not representative of the more heterogeneous chromosomal fragments in larger eukaryotic chromosomes. Also, in SWGIS v2.0, operons of genes encoding ribosomal RNA (rrn) were filtered out by high PS values as well as by BLASTN against the SILVA database of rrn sequences of both eukaryotes and prokaryotes [27].
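One possible reading of the local reference scheme is a 300 kbp reference window that advances in 100 kbp steps along the chromosome. The generator below is only a sketch of that reading: the function name, the half-open coordinates and the clipping of the last window at the chromosome end are assumptions, not documented behaviour of SWGIS v2.0.

```python
def reference_windows(chrom_len, window=300_000, step=100_000):
    """Yield (start, end) coordinates of successive reference windows."""
    if chrom_len <= 0:
        return
    start = 0
    while True:
        # Clip the last window at the chromosome end.
        end = min(start + window, chrom_len)
        yield start, end
        if end == chrom_len:
            break
        start += step
```

A reference OUP would then be computed for the sequence slice of each yielded window and compared with the patterns of the shorter local windows inside it.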
Test chromosomes used for artificial insertions of genomic islands
To estimate rates of false positive and false negative GI predictions, chromosomes with artificial GI insertions were created. These test chromosomes were chosen based on a preliminary run of SWGIS v2.0 to identify chromosomes that are naïve (those with no predicted GIs) and those that are non-naïve (containing other predicted GIs). From there, the relevant chromosomes were chosen to adequately represent the different kingdoms available in the database. The following naïve chromosomes were used: Candida albicans (NW_139454, NW_139474), Thalassiosira pseudonana (NC_012068, NC_012069), Torulaspora delbrueckii (NC_016501, NC_016504), Phaeodactylum tricornutum (NC_011690, NC_011693). The following non-naïve chromosomes were used: Aspergillus fumigatus (NC_007194, NC_007194), Fusarium oxysporum (CM000593, CM000594), Saccharomyces cerevisiae (BK006941, BK006942), Cryptococcus neoformans (NC_026749, NC_026750), Theileria parva (NC_007344, NC_007345), Plasmodium falciparum (NC_004329, NC_004330), Drosophila melanogaster (NC_004353, NC_00454), Caenorhabditis elegans (NC_003279, NC_003280).
False negative estimation of SWGIS v2.0
The Pre_GI [28] database was inspected for GIs that are also contained within the pathogenicity islands database (PAIDB) [29]. Of these, GIs that contained rrn sequences were discarded, leaving a total of 194 GIs. The sequences of these 194 GIs were inserted into arbitrary locations of different test chromosomes using a randomization simulation. Each simulation inserted a single GI into an arbitrary location, ran the SWGIS v2.0 algorithm and determined whether the algorithm identified the artificially inserted PAIDB GI. Thus, 194 simulations were performed on each naïve or non-naïve test chromosome, each with a different PAIDB GI. For each test chromosome, a false negative ratio was determined from the frequencies of correctly detecting and incorrectly not detecting the inserted PAIDB GI.
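The simulation loop described above can be sketched as follows. Here `predict_gis` is a hypothetical callable standing in for SWGIS v2.0 that returns a list of predicted (start, end) intervals. Counting an inserted GI as detected when any prediction overlaps it is an assumption, since the matching criterion is not specified here.

```python
import random


def insert_island(chromosome, island, rng):
    """Insert an island at a random position; return the new sequence and position."""
    pos = rng.randrange(len(chromosome) + 1)
    return chromosome[:pos] + island + chromosome[pos:], pos


def overlaps(interval, start, end):
    """True if a half-open interval overlaps [start, end)."""
    return interval[0] < end and start < interval[1]


def false_negative_ratio(chromosome, islands, predict_gis, seed=0):
    """Fraction of inserted islands not detected, one island per simulation."""
    rng = random.Random(seed)
    missed = 0
    for island in islands:
        test_seq, pos = insert_island(chromosome, island, rng)
        predictions = predict_gis(test_seq)
        # Assumed criterion: a hit is any prediction overlapping the insert.
        if not any(overlaps(p, pos, pos + len(island)) for p in predictions):
            missed += 1
    return missed / len(islands)
```

With the 194 PAIDB GIs as `islands`, one call corresponds to the 194 simulations performed on a single test chromosome.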
False positive estimation of SWGIS v2.0
Random genomic fragments, acting as artificial GIs, were arbitrarily transferred between two chromosomes of the same organism with a randomization simulation. The assumption was made that the OU of two chromosomes of the same organism would be similar, so that detection of an artificially transferred segment can be considered a false positive. Each simulation transferred a single arbitrary genomic fragment of 28,173 bp (the average length of the 194 GIs from PAIDB used for false negative estimation) from one chromosome to an arbitrary location on the other chromosome, ran the SWGIS v2.0 algorithm and determined whether the artificially transferred segment was identified as a GI. For non-naïve chromosomes, the simulation ensured that the transferred fragments did not already contain GIs and were not inserted into locations that already contained GIs. For each test chromosome, a total of 100 simulations were performed, each with a single arbitrary genomic fragment inserted into an arbitrary location. A false positive ratio was determined from the frequencies of correctly not detecting and incorrectly detecting the transferred fragment.
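The constraint for non-naïve chromosomes, that neither the transferred fragment nor the insertion site may coincide with an already predicted GI, can be sketched with simple rejection sampling. The function names, the retry limit and the half-open (start, end) representation of predicted GIs are assumptions made for illustration.

```python
import random

FRAGMENT_LEN = 28_173  # average length of the 194 PAIDB GIs


def pick_clean_interval(seq_len, length, known_gis, rng, max_tries=1000):
    """Pick a random [start, end) of the given length that overlaps no known GI."""
    for _ in range(max_tries):
        start = rng.randrange(seq_len - length + 1)
        end = start + length
        if all(end <= s or start >= e for s, e in known_gis):
            return start, end
    raise RuntimeError("no GI-free fragment found")


def pick_clean_position(seq_len, known_gis, rng, max_tries=1000):
    """Pick a random insertion point that does not fall inside a known GI."""
    for _ in range(max_tries):
        pos = rng.randrange(seq_len + 1)
        if all(not (s < pos < e) for s, e in known_gis):
            return pos
    raise RuntimeError("no GI-free insertion point found")


def transfer_fragment(donor, recipient, donor_gis, recipient_gis, rng,
                      length=FRAGMENT_LEN):
    """Copy a GI-free donor fragment into a GI-free position of the recipient."""
    start, end = pick_clean_interval(len(donor), length, donor_gis, rng)
    pos = pick_clean_position(len(recipient), recipient_gis, rng)
    return recipient[:pos] + donor[start:end] + recipient[pos:], pos
```

For naïve chromosomes the GI lists are simply empty, so every fragment and every position is accepted on the first try.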
Case studies
SWGIS v2.0 was used to identify GIs in the genomes of Aspergillus fumigatus and Drosophila ananassae as two case studies. Firstly, we compared GIs identified by SWGIS v2.0 in A. fumigatus with previously predicted atypical regions in this organism, where the variation in OUP across the genome was also used to predict GIs [23]. In the cited work, GI identification was performed with a parametric method based on local variations of genomic signatures, which makes that study useful for benchmarking SWGIS v2.0. Secondly, we used BLASTN (E-value cut-off of 1e-20) to test for sequence similarity between GIs predicted in D. ananassae and the Wolbachia endosymbiont (NZ_AAGB00000000) of this species, as previous reports have shown that the entire genome of Wolbachia has been transferred to the D. ananassae genome [30].
Genome sequences for database construction
Complete sequences of 1062 chromosomes of 66 eukaryotic organisms were obtained from the RefSeq database in GenBank format using the NCBI FTP server (ftp://ftp.ncbi.nih.gov/genomes/refseq). The RefSeq database was chosen to ensure that only high-quality assemblies in chromosome format were used and to limit the identification of potential bacterial contaminants, especially in smaller contigs of incompletely sequenced genomes. The EuGI web-resource contains the genome accession numbers of all the sequences used (http://eugi.bi.up.ac.za/eugi_source.php).
Database software and programming
MySQL package v5.1.73 for Linux was used for database creation. All programming was performed in Python 2.5.