Server construction, supported organisms and genome assemblies
The iTAR web server runs on Linux and is implemented in JSP and Java. ChIP-seq data analysis is provided for seven different organisms: human, mouse, fly, worm, chicken, zebrafish and yeast, with multiple genome assemblies available for human (UCSC hg 18 and hg19) and mouse (UCSC mm8, mm9, and mm10). In addition, support for genome assemblies is extensible to enable further genome assembly availability going forward.
Annotation files for RefSeq genes are downloaded from the UCSC Genome Browser [14]. These files provide the chromosome, strand, transcriptional start site, transcription terminal sites, structures and other genomic information for all RefSeq genes for an organism, based on the corresponding genome assembly.
The server is freely available to the public and does not require login. Analysis of a ChIP-seq dataset for target gene identification using the server is straightforward and requires only three steps: (i) uploading input files, (ii) choosing parameters, and (iii) receiving the output files.
Input files
The iTAR web server takes as input a ChIP-seq/ChIP-chip signal track file and optionally a binding peak file for a TF. The signal track file contains the binding signal of a TF at each nucleotide position of the genome as measured by sequencing in ChIP-seq or hybridization in ChIP-chip data. The server supports three signal track formats: BedGraph, Wiggle, and BigWig. We note that each format may have multiple variants. For example, Wiggle files have two main formatting options--fixedStep and variableStep, designed for data with regular and irregular intervals between data points, respectively. The server contains a module to process different formats of signal track files to ensure the best usability.
The optional peak file is in bed format, containing the chromosome, start position, end position and other information of TF binding peaks. Users can run a peak calling program, e.g. MACS or PeakSeq [9, 11], using the uploaded ChIP-seq data as input to generate this peak file.
To improve the uploading time, the signal track files must be compressed into either.rar or.gz format before uploading. Alternatively, user can support download links for track file and peak file, and then iTAR will download the input files automatically.
Parameter settings
After the input file(s) have been uploaded, users need to select the organism and the genome assembly version from the drop-down list (see Fig. 1a). The selected organism/assembly must match with the uploaded signal track file and an optional peak file to ensure accurate results in subsequent analysis. In addition, users need to specify three other parameters: the length of considering binding signals around TSS, the FDR (false discovery rate) threshold for TF targets and the P-value threshold for GO enrichment analysis. The length of promoter region will be used to select the regions of the putative promoters, the FDR threshold will be used to select a subset of target genes of the TF for subsequent GO enrichment analysis, and the P-value threshold will be used to select significantly enriched GO categories for displaying in the output webpage. After all parameters have been specified, users can click the “Submit” button to initiate online data analysis. A waiting page will be displayed and updated every 5 s to report the progress of the data analysis (Fig. 1b). Depending on the organism and the signal track file format, analysis may take 10–25 min to finish.
Output files
After data analysis has completed, the results will be displayed in an output webpage. The main output consists of five panels that include (i) a table summarizing the parameter and input settings, (ii) a characteristic binding profile of the TF, (iii) the distribution of normalized regulatory scores for all genes, (iv) a list of significant genes and (optionally) their associated TF peak binding sites, and (v) a list of significantly enriched GO terms (see Fig. 1 for an example). Significant genes in (iv) are sorted by decreasing order of their regulatory scores. For each gene, the p-value and the multiple-testing corrected FDR value are calculated, using both single normal and mixture normal models. To enable parameter setting adjustment and iterative analyses based on the output results, a “Rerun the program” option is included that accepts parameter setting modifications (to change statistical stringency requirements for GO analysis, for example) without the need for re-uploading input files.
TIP extensions
As mentioned above and previously described, TIP builds a characteristic, averaged binding profile for a TF around the TSS of all genes and then uses this profile to weight the sites associated with a given gene, providing a continuous-valued regulatory score of a TF for each gene. It then normalizes the regulatory scores into z-scores and estimates their p-values by referring to a standard normal distribution [12]. However, using a standard normal distribution to estimate the significance of regulatory scores is highly conservative, giving rise to a confident but small target gene set. This is because the distribution of the regulatory scores is typically not normal but positively skewed (Fig. 1d) or bimodal (Additional file 1: Figure S1), reflecting the fact that binding scores encompass two distinct groups: background, non-target genes with low regulatory scores, and genuine target genes with higher regulatory scores. By not taking this non-normal distributional shape into account, statistical resolution is lost and p-value estimates are conservative.
Motivated by these observations, here we refine TIP’s p-value calculation by applying a two-component Gaussian mixture model to estimate the significance of normalized regulatory scores. Our approach is as follows: suppose that each regulatory score y
1, …, y
n
is instead derived from a two-component (i.e., non-target and genuine target genes, respectively) Gaussian mixture distribution. The log-likelihood for data consisting of n observations y = (y
1,…,y
n
), assuming a normal mixture model with two components, is then given by
$$ \ell \left(\varTheta \Big|\boldsymbol{y}\right)={\displaystyle \sum_{i=1}^n} \log \left({\displaystyle \sum_{k=1}^2}{w}_kN\left({\mathrm{y}}_i\Big|{\mu}_k,{\sigma}_k\right)\right) $$
(1)
where Θ = (w, Ψ) represents all unknown parameters and N(⋅|μ
k
, σ
k
) denotes a Gaussian density function with mean μ and standard deviation σ. Here the vector w = (w
1, w
2) consists of the mixing proportions and subjects to \( {\displaystyle \sum_{k=1}^2}{w}_k=1 \), Ψ = (μ
1, μ
2, σ
1, σ
2). The maximum likelihood estimation of Θ can then be solved by
$$ \widehat{\varTheta}= \arg \underset{\varTheta }{ \max}\ell \left(\varTheta \Big|\boldsymbol{y}\right) $$
(2)
To choose the best mixture model with optimal parameters of the two Gaussian distributions, we use an expectation-maximization (EM) algorithm [15] (See Fig. 1d). EM is an iterative method for finding maximum likelihood estimation of parameters by using the following iterative algorithm,
$$ {\widehat{w}}_k=\frac{{\displaystyle {\sum}_{i=1}^n}{\widehat{z}}_{ik}}{n},{\widehat{\mu}}_k=\frac{{\displaystyle {\sum}_{i=1}^n}{\widehat{z}}_{ik}{\mathrm{y}}_i}{{\displaystyle {\sum}_{i=1}^n}{\widehat{z}}_{ik}},\kern0.5em {\widehat{\sigma}}_k^2=\frac{{\displaystyle {\sum}_{i=1}^n}{\widehat{z}}_{ik}{\left({\mathrm{y}}_i-{\widehat{\mu}}_k\right)}^2}{{\displaystyle {\sum}_{i=1}^n}{\widehat{z}}_{ik}} $$
(3)
with the posterior probabilities
$$ {\widehat{z}}_{ik}=\frac{{\widehat{w}}_kN\left({\mathrm{y}}_i\Big|{\widehat{\mu}}_k,{\widehat{\sigma}}_k\right)}{{\displaystyle {\sum}_{h=1}^2}{\widehat{w}}_hN\left({\mathrm{y}}_i\Big|{\widehat{\mu}}_h,{\widehat{\sigma}}_h\right)} $$
(4)
Once the optimal parameters are determined, we estimate p-values by comparing each z-score to the left Gaussian distribution (with the smaller mean value).
Gene ontology enrichment analysis
Given the target genes, we performed enrichment analysis with GO terms using the Fisher exact test based on a hypergeometric distribution [16]. To exclude non-informative general GO terms, we restricted our analysis to those with gene numbers < 1000. The web server shows significant GO terms ranked by p-value and provides the predicted target genes from them.