Two-way AIC: detection of differentially expressed genes from large scale microarray meta-dataset
© Tsuyuzaki et al.; licensee BioMed Central Ltd. 2013
Published: 15 February 2013
Detection of significant differentially expressed genes (DEGs) from DNA microarray datasets is a routine task in biomedical research, and numerous methods have been proposed for it. Conventional methods generally detect DEGs from a single dataset consisting of a control group and a treatment group. However, some DEGs are easily detected under any experimental condition. To detect DEGs that are more specific to an experimental condition, each measured gene expression level should be compared in two dimensions, that is, with other genes and with other datasets simultaneously. For this purpose, we retrieve as much gene expression data as possible from public databases and construct a "meta-dataset" that summarizes the expression changes of all genes under various experimental conditions. Herein, we propose the "two-way AIC" (Akaike Information Criterion) method for simultaneous detection of significant genes and experiments on a meta-dataset.
As a case study on Pseudomonas aeruginosa, we evaluate whether the two-way AIC method can detect test data consisting of experimental-condition-specific DEGs. Operon genes are used as the test data. Compared with other commonly used statistical methods (t-rank/F-test, RankProducts and SAM), the two-way AIC shows the highest specificity in detecting operon genes.
The two-way AIC achieves high specificity for operon gene detection on the microarray meta-dataset. This method can also be applied to the estimation of mutual gene interactions.
In such meta-datasets, direct application of widely used conventional statistical methods is not suitable for detecting two-dimensional DEGs, because such methods are intended to find special genes across all the experiments being analyzed.
For example, ANOVA [11–14] is very widely used for multi-group analysis, but it concludes only whether the differences between groups (genes) are significant. Therefore, ANOVA cannot simultaneously detect specific genes in specific experiments as two-dimensional DEGs.
Outlier detection methods, such as Shannon entropy or Sprent's non-parametric method, are also widely used to detect DEGs. Unlike ANOVA, these methods can detect either special genes or special experimental conditions, but not both simultaneously. They are therefore one-dimensional, like ANOVA.
Multiple testing (multiple comparisons, such as the Bonferroni correction, the Tukey-Kramer method, and the Games-Howell method) also produces results as limited as those of outlier detection. For example, in a dataset consisting of N genes and E experiments, even when multiple testing shows independently that the i-th gene (a vector of size E) is significantly different from other genes and that the j-th experiment (a vector of size N) is significantly different from other experiments, it does not follow that the i-th gene in the j-th experiment is a DEG. This is because most multiple testing methods are designed to ascertain differences between the mean values of groups.
Herein, we propose the "two-way AIC" (Akaike Information Criterion) method for simultaneous detection of significant genes and experiments on meta-datasets. This method detects specific genes that are differentially expressed under specific experimental conditions. We compare the performance of our method with that of other widely used statistical methods and show that the two-way AIC method has high specificity for detecting test data that tend to be expressed under specific experimental conditions.
A meta-dataset is a set of multiple datasets. Each dataset consists of two kinds of measurement groups: control and treatment. Both the control and the treatment group consist of one or more DNA microarray measurements. Genes (probes) are common to all microarrays (Figure 2).
The log-FC value of the i-th gene in the j-th dataset is computed from the arithmetic mean values of its treatment and control measurements (Figure 2). We define the row direction of the matrix of log-FC values (log-FC matrix) as the "gene side" and the column direction as the "experiment side".
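The construction of the log-FC matrix can be sketched as follows. This is a minimal illustration and not the authors' script: it assumes that the log-FC is the base-2 logarithm of the ratio of the treatment mean to the control mean, and that intensities are positive and on a linear scale. The function name `log_fc_matrix` and the input layout are hypothetical. If the measurements are already log-transformed, a difference of means would take the place of the ratio.

```python
from math import log2
from statistics import mean


def log_fc_matrix(datasets):
    """Build a genes x datasets matrix of log fold changes.

    `datasets` is a list of dicts with keys "treatment" and "control"
    (hypothetical layout). Each value is a list of arrays, and each array
    is a list of positive intensities, one per gene, in a common gene order.
    Assumption: log-FC = log2(mean(treatment) / mean(control)).
    """
    n_genes = len(datasets[0]["treatment"][0])
    matrix = []
    for i in range(n_genes):  # gene side (rows)
        row = []
        for ds in datasets:  # experiment side (columns)
            treat = [array[i] for array in ds["treatment"]]
            ctrl = [array[i] for array in ds["control"]]
            row.append(log2(mean(treat) / mean(ctrl)))
        matrix.append(row)
    return matrix


# Toy input: 2 genes, 2 datasets (values are illustrative only).
toy = [
    {"treatment": [[8.0, 2.0], [8.0, 2.0]], "control": [[2.0, 2.0]]},
    {"treatment": [[1.0, 4.0]], "control": [[4.0, 1.0], [4.0, 1.0]]},
]
toy_matrix = log_fc_matrix(toy)
```

Each row of `toy_matrix` corresponds to one gene and each column to one dataset, matching the gene side and experiment side defined above.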
where n is the number of outliers, and σ and s respectively denote the standard deviation and the number of non-outlier samples. The outliers are estimated as the candidate set of outliers that minimizes U. In this paper, the search range is restricted to at most 25 percent of the number of data points.
When the U-value method is applied in the gene side direction, specific experiments are detected as outliers for each gene. Similarly, when the U-value method is applied in the experiment side, specific genes are detected for each experiment. The detected outliers are described as 1 (positive outlier) or -1 (negative outlier).
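The two passes described above each produce a matrix of marks (1, -1, or 0 for no outlier). One way to merge them into two-way detections is sketched below. It assumes that a cell counts as detected only when both the gene-side pass and the experiment-side pass flag it with the same sign; this merging rule and the function name `combine_two_way` are assumptions made for illustration. The U-value computation itself is not shown.

```python
def combine_two_way(gene_side, experiment_side):
    """Merge two mark matrices of equal shape (values in {1, 0, -1}).

    Assumption: a cell is kept only when both passes agree on a
    non-zero mark (both 1 or both -1); otherwise it is set to 0.
    """
    combined = []
    for g_row, e_row in zip(gene_side, experiment_side):
        combined.append(
            [g if (g == e and g != 0) else 0 for g, e in zip(g_row, e_row)]
        )
    return combined


# Toy marks for 2 genes x 3 experiments (illustrative only).
gene_marks = [[1, 0, -1], [0, 1, -1]]
experiment_marks = [[1, 1, 0], [0, -1, -1]]
two_way_marks = combine_two_way(gene_marks, experiment_marks)
```

Under this rule, a cell flagged in only one direction, or flagged with opposite signs, is discarded.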
The two-way AIC method is applied to a prokaryote gene expression meta-dataset to demonstrate its detection performance. It is compared with other widely used statistical methods in terms of the specificity of detecting test data (operon genes) [21, 22], which generally tend to be expressed simultaneously under specific experimental conditions.
A meta-dataset is set up by calculating the log-FC matrix from P. aeruginosa DNA microarray measurements taken under diverse experimental conditions. The DNA microarray datasets are retrieved from two public databases: the Gene Expression Omnibus (GEO) and ArrayExpress. The measurement platform is the Affymetrix GeneChip® Pseudomonas aeruginosa Genome Array (registered as GPL84 in GEO and A-AFFY-30 in ArrayExpress), which consists of 5883 probes (5549 protein coding genes of the PAO1 strain, 18 tRNA and rRNA genes of PAO1, 117 genes from other strains and 199 intergenic sequences). We extract the 5549 coding genes from 289 datasets (282 from GEO and 7 from ArrayExpress) that contain no null values (NA or missing values) or zeros. RMA normalization is applied to the microarray datasets in each study, and then the log-FC matrix is calculated.
We use test data to evaluate our method. We assess how well the method detects the data that should be detected, and we evaluate its selectivity. We focus on operon genes, which reflect a known biological mechanism. In prokaryotes, the genes of an operon are transcribed at the same time and correspond to a common function [26, 27]. Therefore, we expect these genes to be co-expressed under specific experimental conditions, because they must be expressed together to carry out their function. We identify 93 operons among the 5549 coding genes using the Operon Database at Kyoto University and the Pseudomonas Genome Database at the University of British Columbia. When pairs of two genes are chosen from within each operon, the number of all possible gene pairs for these 93 operons is 857. Indeed, on the log-FC matrix, the Pearson's correlation coefficient of these 857 operon gene pairs is 0.734, a strong positive correlation, whereas that of randomly chosen gene pairs is 0.182. Therefore, we use operon genes as objective test data. Operon genes need not be expressed under every experimental condition. However, once some genes belonging to an operon are expressed, all genes in that operon should be expressed simultaneously. Therefore, we regard operon genes whose expression levels change under a specific experimental condition as correct data for that condition, and non-operon genes as incorrect data. We compare all methods by evaluating how specifically they detect these operon genes.
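The pair counting and the correlation check described above can be sketched as follows. This is an illustration and not the authors' script: the operon names, gene identifiers and log-FC rows are toy values, the helper names are hypothetical, and averaging the coefficient over all pairs is an assumption about how a single summary value could be obtained.

```python
from itertools import combinations
from math import sqrt
from statistics import mean


def pearson(x, y):
    """Pearson's correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def operon_gene_pairs(operons):
    """All within-operon gene pairs; `operons` maps operon name -> gene list."""
    pairs = []
    for genes in operons.values():
        pairs.extend(combinations(genes, 2))
    return pairs


def mean_pair_correlation(log_fc_rows, pairs):
    """Assumption: summarize the pairs by the mean of their correlations."""
    return mean([pearson(log_fc_rows[a], log_fc_rows[b]) for a, b in pairs])


# Toy data (hypothetical identifiers and values).
toy_operons = {"opA": ["g1", "g2", "g3"], "opB": ["g4", "g5"]}
toy_rows = {
    "g1": [1.0, 2.0, 3.0],
    "g2": [2.0, 4.0, 6.0],
    "g3": [1.0, 2.0, 3.5],
    "g4": [0.5, -1.0, 0.0],
    "g5": [1.0, -2.0, 0.0],
}
toy_pairs = operon_gene_pairs(toy_operons)
toy_summary = mean_pair_correlation(toy_rows, toy_pairs)
```

An operon with m genes contributes m(m - 1)/2 pairs, so the toy operons of sizes 3 and 2 yield 3 + 1 = 4 pairs.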
Table 1. Results of comparisons of each method's performance
1. two-way AIC: 2.721 × 10^-5
2. t-rank/F-test: 7.901 × 10^-3
3. RankProducts: 1.123 × 10^-2
4. SAM: 9.034 × 10^-4
5. U-value (gene side): 2.085 × 10^-1
6. U-value (experiment side): 5.325 × 10^-4
7. 2σ: 5.202 × 10^-3
8. 3σ: 4.030 × 10^-4
The judgment criterion for the t-rank with F-test, the RankProducts method and SAM is set to the rank that makes the sensitivity of each method closest to that of the two-way AIC. For the t-rank with F-test, we first test the equality of variances with the F-test (p = 0.05); when the variances are equal we calculate Student's t-statistic, and otherwise Welch's t-statistic, with the threshold set to the upper 245 genes. The RankProducts method is a non-parametric, FC-based DEG detection method; we use it with the threshold set to the upper 312 genes. SAM is a non-parametric, t-statistic-based DEG detection method; we use it with the threshold set to the upper 96 genes.
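The choice between Student's and Welch's t-statistic, followed by rank-based selection, can be sketched as below. This is a minimal illustration under stated assumptions: the outcome of the F-test is passed in as a flag (`equal_variance`) and is not computed here, genes are ranked by the absolute value of the statistic, and all function names are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def student_t(a, b):
    """Student's t-statistic with pooled variance (equal variances assumed)."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(pooled * (1 / na + 1 / nb))


def welch_t(a, b):
    """Welch's t-statistic (unequal variances allowed)."""
    return (mean(a) - mean(b)) / sqrt(variance(a) / len(a) + variance(b) / len(b))


def t_statistic(a, b, equal_variance):
    """Pick the statistic from an externally supplied F-test decision."""
    return student_t(a, b) if equal_variance else welch_t(a, b)


def top_ranked(stats, n_top):
    """Return the n_top genes with the largest |t| (ranking rule is assumed)."""
    return sorted(stats, key=lambda g: abs(stats[g]), reverse=True)[:n_top]


# Toy example (illustrative values only).
treatment = [1.0, 2.0, 3.0]
control = [4.0, 5.0, 6.0]
t_equal = t_statistic(treatment, control, equal_variance=True)
t_unequal = t_statistic(treatment, control, equal_variance=False)
selected = top_ranked({"g1": -5.0, "g2": 1.0, "g3": 3.0}, 2)
```

With equal group sizes and equal sample variances, as in the toy example, the two statistics coincide; they differ once the variances or group sizes differ.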
In the 2σ and 3σ methods, log-FC values that exceed the threshold in both directions are detected as DEGs. The threshold is the standard deviation σ multiplied by 2 (2σ method) or 3 (3σ method), and σ is calculated for each direction.
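The σ-based two-way rule can be sketched as follows. This illustration makes several assumptions that the text does not settle: the absolute log-FC value is compared directly with k times σ (no centering on the mean), σ is the sample standard deviation of each row (gene side) and each column (experiment side), and the sign of the value determines the mark. The function name `sigma_two_way` is hypothetical.

```python
from statistics import stdev


def sigma_two_way(matrix, k=2.0):
    """Flag cells whose |log-FC| exceeds k * sigma in both directions.

    Returns a matrix of marks: 1 (positive), -1 (negative) or 0 (not flagged).
    Assumptions: sample standard deviation, uncentered comparison.
    """
    row_sd = [stdev(row) for row in matrix]        # gene side
    col_sd = [stdev(col) for col in zip(*matrix)]  # experiment side
    marks = []
    for i, row in enumerate(matrix):
        mark_row = []
        for j, x in enumerate(row):
            hit = abs(x) > k * row_sd[i] and abs(x) > k * col_sd[j]
            mark_row.append((1 if x > 0 else -1) if hit else 0)
        marks.append(mark_row)
    return marks


# Toy 6 x 6 log-FC matrix with a single strong value (illustrative only).
toy = [[0.0] * 6 for _ in range(6)]
toy[2][3] = 5.0
marks_2sigma = sigma_two_way(toy, k=2.0)
marks_3sigma = sigma_two_way(toy, k=3.0)
```

In the toy matrix the single value 5.0 passes the 2σ rule but not the 3σ rule, because with only six values per direction the outlier itself inflates σ.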
where N is the number of operons in which the belonging genes were detected as DEGs at least once (0 ≤ N ≤ 93), M is the number of experiments in which belonging genes were detected as DEGs at least once (0 ≤ M ≤ 289), O_{k,j} is the number of detected operon gene pairs, T_k is the number of all possible operon gene pairs in the k-th operon, A_{k,j} is the number of never-detected non-operon gene pairs, P_{k,j} is the p-value in the k-th operon and j-th experiment calculated using Fisher's exact test, F is the number of all possible combinations of non-operon gene pairs (C(5549, 2) - 857 = 15392069), G is the total number of genes (5549), E is the total number of experimental conditions (289), and n_j is the number of DEGs in the j-th experiment.
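The value of F quoted above can be checked with a short calculation. The function name `non_operon_pair_count` is hypothetical; the inputs are the gene count and the operon pair count stated in the text.

```python
from math import comb


def non_operon_pair_count(total_genes, operon_pairs):
    """All unordered gene pairs minus the within-operon pairs."""
    return comb(total_genes, 2) - operon_pairs


all_pairs = comb(5549, 2)                      # 5549 * 5548 / 2
f_value = non_operon_pair_count(5549, 857)     # value of F in the text
```

Here `all_pairs` is 15392926, and subtracting the 857 operon gene pairs gives 15392069, in agreement with the stated value of F.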
The results show that the two-way AIC is superior to all other methods in both p-value and specificity, which means that it produces the fewest false positives. Among the other widely used methods (t-rank/F-test, RankProducts and SAM), SAM shows the highest specificity, but the specificity of our method is much higher than that of SAM. This suggests the effectiveness of the two-way approach. Compared with the other two-way methods (2σ and 3σ), the specificity of the two-way AIC is also the highest, which means that the U-value is superior to the standard deviation as a criterion in this case. Therefore, the two-way AIC method can detect operon genes with less noise, even though the genes in an operon are not always expressed proportionally.
Detection sensitivity is generally lower than specificity for all the methods we tested. Compared with the U-value method (gene side and experiment side), the sensitivity of the two-way AIC is not high. In general, one-way methods (the U-value methods in Table 1) detect more operon genes than two-way methods, because one-way methods act as one-pass outlier filtering, whereas two-way methods act as double filtering. However, the results show that double filtering yields far fewer false positives and selects the genes that should be detected.
In general, any statistic, including the t-statistic, can be applied in a two-way approach to meta-datasets. However, how to set the detection criterion or outlier threshold is a major concern in such approaches. Introducing the model selection criterion AIC removes the need for trial and error in finding the optimal threshold.
Supplemental material, such as the P. aeruginosa meta-dataset and the R script used in this paper, is available on the web (http://www.ps.noda.tus.ac.jp/2way-aic/).
The publication costs for this article were funded by the corresponding author's institution.
This article has been published as part of BMC Genomics Volume 14 Supplement 2, 2013: Selected articles from ISCB-Asia 2012. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcgenomics/supplements/14/S2.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.