DFI: gene feature discovery in RNA-seq experiments from multiple sources
© Ozer et al.; licensee BioMed Central Ltd. 2012
Published: 17 December 2012
Skip to main content
© Ozer et al.; licensee BioMed Central Ltd. 2012
Published: 17 December 2012
Differential expression detection for RNA-seq experiments is often biased by normalization algorithms due to their sensitivity to parametric assumptions on the gene count distributions, extreme values of gene expression, gene length and total number of sequence reads.
To overcome limitations of current methodologies, we developed Differential Feature Index (DFI), a non-parametric method for characterizing distinctive gene features across any number of diverse RNA-seq experiments without inter-sample normalization. Validated with qRT-PCR datasets, DFI accurately detected differentially expressed genes regardless of expression levels and consistent with tissue selective expression. Accuracy of DFI was very similar to the currently accepted methods: EdgeR, DESeq and Cuffdiff.
In this study, we demonstrated that DFI can efficiently handle multiple groups of data simultaneously, and identify differential gene features for RNA-Seq experiments from different laboratories, tissue types, and cell origins, and is robust to extreme values of gene expression, size of the datasets and gene length.
High-throughput RNA-sequencing (RNA-seq) enables researchers to quantify genome-wide gene expression with high resolution . At the same time, it raises many new challenges for data processing and analysis. One major challenge is how to effectively combine, compare and contrast samples to identify differential gene features. The common sense answer to this question is to apply an effective inter-sample normalization procedure before starting any type of comparative analysis on the samples from different sites, as well as on the samples from the same dataset [2–4]. On the other hand, it has been shown that the choice of normalization method itself could be a major factor that determines estimates of differential expression .
After the alignment of high throughput short sequence reads to the reference genome, expression levels can be quantified in terms of total number of reads that are aligned to the genes. Then, generally, a proper normalization algorithm is used to estimate expression levels for comparative analyses. One of the problems with high throughput sequencing is longer genes are sequenced more and have larger gene counts . The first and most commonly used normalization method RPKM (reads per kilobase of exon per million mapped reads)  addresses this bias by simply scaling counts by the gene length. Later studies have shown that more sophisticated weighting methods are needed to lessen this bias [5, 8]. Another challenge with sequencing is modelling the distribution of the gene counts, as differences in relative distributions of the samples would affect the detection of differential expression . Poisson  and negative binomial distributions [9, 10] are the most commonly used ones to model the gene count data. These models are parametric i.e. require assumptions on the distribution of the data. However, in the real scenario, these distribution assumptions might not always hold true  and estimation of the model parameters can be very difficult .
Here, we introduce Differential Feature Index (DFI) to identify distinctive features across a large set of diverse experiments using read counts without any direct inter-sample normalization. The DFI method is non-parametric (i.e. calculations of DFI do not require any assumptions on the distribution of the data) and unsupervised (i.e. does not require group information to identify differential features). In this study, first, we compared DFI to currently accepted methods  such as EdgeR , DESeq  and Cuffdiff , as well as the classical t-test. Then, we evaluated the efficiency of DFI in comparing multiple groups of data from different research groups at the same time. We found that DFI was effective and robust for selecting differential gene features for RNA-Seq experiments from different laboratories, tissue types, and cell origins.
A large DFI implies that the gene varies substantially across all experiments and can be considered as a feature to differentiate them, while a small DFI means expression of this gene is quite stable across all experiments. Thus, one can order the gene features based on DFI values and identify differential features. In this paper, we selected top one percent of the gene features as differentially expressed.
mRNA-seq experiments from NCBI SRA.
Number of experiments
Illumina sequencing human kidney RNA samples to study mRNA expression levels
RNASeq expression profiling for ENCODE project
Hep2G (Liver Carcinoma)
Deep surveying of alternative splicing complexity in the human transcriptome by high-throughput sequencing
Liver female 1
Sex-specific and lineage-specific alternative splicing in primates
Liver male 1
Evaluation of Statistical Methods for Normalization and Differential Expression in mRNA-Seq Experiments
We demonstrated that DFI is a highly effective and robust method for selecting gene features for RNA-seq experiments from multiple groups of samples. First, DFI is robust to the variation in total number of sequence reads across experiments. A recent study  suggested that incorporation of total number of sequence reads in normalization may impact comparative results. Furthermore, another study  has shown that sequencing depth itself affects the identification of genes as differentially expressed. DFI formulation is independent of total number of sequence reads, since calculations starts with pair-wise ratios of the gene counts within each experiment (Figure 1). This approach circumvents the need to normalize total number of reads. Out of 4.4 million to 54.8 million total number of sequence reads in above 86 RNA-seq experiments 0.3 million to 16.3 million reads were aligned to the reference genome (Additional file 3: Figure S1). This large variation did not affect the DFI calculations and the clustering of the experiments. For example, 16 liver experiments from 3 different studies clustered very tightly (Figure 5) although total number of sequences reads in the samples varied between 2.8 million and 14 million.
Second, DFI is robust to low and high gene counts. Two main problems occur with normalization methods and differential expression statistics : genes with high counts are more likely to be discovered as differentially expressed and genes with low counts tend to affect differential expression statistics. To test the robustness of DFI, EdgeR and DESeq to extreme values of gene counts, we eliminated genes with high or low expression and compared differential expression rankings to the full gene set. We demonstrated that DFI is the least affected by extreme values of gene counts compared to EdgeR and DESeq (Additional file 3: Figure S2).
Third, DFI is robust even when the number of samples is small. In 100 tests, we randomly selected three samples from each of the two groups in MAQC dataset and compared the DFI ranking with the gold standard for each test. ROC curves demonstrated that when testing low numbers of datasets, the accuracy of the DFI ranking only changed slightly between these 100 tests with the area under ROC curve of 0.952 +/- 0.004 (Additional file 3: Figure S3).
This adjustment will only add a constant term to the elements of array G jk . Although the mean of the array will be shifted, the standard deviation calculation will not be affected from this adjustment. Therefore, DFI calculation is theoretically independent of the gene lengths. Also, no association was observed between gene lengths and DFI values calculated in this study (Additional file 3: Figure S4).
It is conceivable that application of DFI can be extended to a wide spectrum of high throughput data. In fact, a similar metric had been applied to select normalization factor in qRT-PCR experiments  and to compare different ChIP-seq datasets . Further investigation on DFI features may lead to effective methods for integrating datasets from multiple modalities (e.g., microarray and RNA-seq).
In summary, we have developed a Differential Feature Index that allows one to accurately and effectively identify the genes that change expression in multiple RNA-seq datasets. This index obviates the need to normalize samples and can accommodate any number of datasets with multiple sizes.
This article considers 86 mRNA-seq experiments from 5 different projects in NCBI Sequence Read Archive (SRA) (Table 1): 1) 42 RNA-seq experiments from a study that evaluates the effect of flowcell and library preparation on the results of transcriptome sequencing using the Illumina Genome Analyzer . This study includes 14 experiments on Ambion's human brain reference RNA and 28 experiments on Stratagene's universal human reference (UHR) RNA which is composed of total RNA isolated from 10 different cell lines including adenocarcinoma of mammary gland, hepatoblastoma of liver, adenocarcinoma of cervix, embryonal carcinoma of testis, glioblastoma of brain, melanoma, liposarcoma, histiocytic lymphoma (macrophase, histocyte), lymphoblastic leukemia and plasmacytoma (myeloma, B lymphocyte) (SRA Project ID: SRP001847). 2) 12 RNA-seq experiments from a comparative study on sex-specific and lineage-specific alternative splicing . This study includes 6 experiments on 3 female liver samples and 6 experiments on 3 male liver samples (SRA Project ID: SRP001558). 3) 20 RNA-seq experiments from RNA-seq expression profiling study for ENCODE project common cell lines . This study includes 9 RNA-seq experiments on K562 cell line produced from a female patient with chronic myelogenous leukemia (CML), 4 RNA-seq experiments on Hep2G cell line produced from a male patient with liver carcinoma, and 7 RNA-seq experiments on GM12878 cell line produced from the blood of a female donor with northern and western European ancestry by EBV transformation (SRA Project ID: SRP000228). 4) 6 RNA-seq experiments from an assessment study on technical reproducibility of RNA-seq and its comparison with gene expression arrays . This study includes 3 RNA-seq experiments on liver and 3 RNA-seq experiments on kidney samples of a single human male (SRA Project ID: SRP000225). 5) 6 RNA-seq experiments from a study on human tissue alternative splicing complexity . This study includes 1 RNA-seq experiments on each of the brain, cerebral cortex, heart, skeletal muscle, lung and liver samples (SRA Project ID: SRP000626).
Sequence read files were downloaded from NCBI SRA in FASTQ format. Raw sequence reads were aligned to the human reference genome (UCSC hg18, NCBI build 36) using TopHat  (Version 1.0.13) that runs on Bowtie (Version 0.12.7). Only unique alignments to the reference were considered.
R/Bioconductor package Rsamtools was used to read sequence alignment results in SAM/BAM format. R/Bioconductor package GenomicRanges was used to download NCBI RefSeq gene annotations and to count total number of sequence reads on each annotated region, gene counts. A large matrix of gene counts (number of transcripts by number of experiments, 28,005 by 86) was constructed and saved as a simple text file.
The code for DFI calculation is developed in Matlab. In order to avoid infinite values in log calculations, genes with 0 counts are replaced with 1. Runtime for the algorithm is O(n2m) where n is the number of genes and m is the number of samples.
R/Bioconductor packages EdgeR  and DESeq , and R t.test function were used to calculate differentially expressed genes between 2 samples of MAQC dataset. Same gene counts table as in DFI calculations were employed for EdgeR and DEseq methods, while normalized counts (RPKM) were employed for t-test function. Cuffdiff function of Cufflinks (Version 1.0.3)  was used to identify differentially expressed genes between 2 samples. Alignment results for 42 experiments of MAQC dataset were directly given as an input to the Cufflinks software. FDR adjusted p-values and fold changes reported by these methods were used in all of the calculations.
The quantitative real-time polymerase chain reaction (qRT-PCR) data on Ambion's human brain reference RNA and Stratagene's UHR RNA samples were downloaded from Gene Expression Omnibus (GEO), GSE5350 Series, 4 Brain and 4 UHR Taqman assays . Out of 997 genes in TaqMan assay, 976 were common to NCBI RefSeq gene annotations. Then, 41 genes with no expression in any of the samples were eliminated. Expression of the remaining 935 genes were considered as gold standard to evaluate accuracy of DFI ranking for RNA-seq experiments (SRA Project SRP001558) on the same samples.
This work was supported by an award from the NIH U01 GM092655 as part of the NIH Pharmacogenomics Research Network. We thank W. Sadee for critical reading of the manuscript.
This article has been published as part of BMC Genomics Volume 13 Supplement 8, 2012: Proceedings of The International Conference on Intelligent Biology and Medicine (ICIBM): Genomics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcgenomics/supplements/13/S8
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.