A study of inter-lab and inter-platform agreement of DNA microarray data
© Wang et al; licensee BioMed Central Ltd. 2005
Received: 30 July 2004
Accepted: 11 May 2005
Published: 11 May 2005
As gene expression profile data from DNA microarrays accumulate rapidly, there is a natural need to compare data across labs and platforms. Comparisons of microarray data can be quite challenging due to data complexity and variability. Different labs may adopt different technology platforms. One may ask about the degree of agreement we can expect from different labs and different platforms. To address this question, we conducted a study of inter-lab and inter-platform agreement of microarray data across three platforms and three labs. The statistical measures of consistency and agreement used in this paper are the Pearson correlation, intraclass correlation, kappa coefficients, and a measure of intra-transcript correlation. The three platforms used in the present paper were Affymetrix GeneChip, custom cDNA arrays, and custom oligo arrays. Using the within-platform variability as a benchmark, we found that these technology platforms exhibited an acceptable level of agreement, but the agreement between two technologies within the same lab was greater than that between two labs using the same technology. The consistency of replicates in each experiment varies from lab to lab. When there is high consistency among replicates, different technologies show good agreement within and across labs using the same RNA samples. On the other hand, the lab effect, especially when confounded with the RNA sample effect, plays a bigger role than the platform effect on data agreement.
Diversity of microarray data poses some unique and interesting questions on cross-experiment comparisons and the analysis tools needed for such comparisons. Since the invention of microarray technology in 1995, statistical methods and data mining techniques specific to microarray data have mushroomed, many of which have been packaged into commercial software such as GeneSpring and Spotfire. Such tools are useful for handling individual experiments, including quality control, significance testing, and clustering. However, researchers have questioned whether studies across different labs and technology platforms will have an acceptable level of agreement.
Possible incompatibility of results between similar microarray experiments is a major challenge that needs to be addressed, even though the data produced within a single experiment may be consistent and easy to analyze. Different labs produce microarray data in different ways using different technology platforms, such as Affymetrix GeneChip, spotted cDNA arrays, and spotted oligo arrays. Affymetrix GeneChip uses one fluorescent dye, while spotted arrays use two. Direct comparison of raw data obtained from different technologies may not be meaningful. Instead, the final form of the data is often presented as relative expression levels, mostly ratios of intensities, after statistical treatments including filtering, normalization, and model-based estimation. Experiments using different technologies require different protocols for deriving the ratios from the raw data. Scientists have published microarray data in a variety of formats, including raw intensities and ratios of intensities. Does it make a difference which technology platform is chosen? Can we make use of studies from different platforms and labs? To answer these questions and provide some guidance for platform comparisons, we report a comparative study of three different platforms. The experiment is a simple two-tissue comparison between mouse liver and spleen. We used previously published data sets from two different sources as well as new data sets produced in house. Several similar studies have been published in recent years [3–8]. A notable difference between this study and earlier ones is that we treated the lab as a major factor in the comparison. In addition, we compared three major types of technology platforms, namely Affymetrix GeneChip, spotted cDNA arrays, and spotted long oligo arrays.
This study aims to provide a basis for further development of methodologies for comparing microarray data across different experiments and for the integration of microarray data from different labs.
Summary of data collection
- CI 15K cDNA
- Li and Wong
- Riken16K cDNA by Agilent
- Riken16K Oligo by Agilent
Consistency of replicates
For the KC data, two of the pairwise comparisons gave slightly higher agreement than the other four. We believe this is because each gene is spotted twice on the array: the two comparisons with higher agreement are those between replicate spots within the same slides.
Pairwise comparisons among data sets
Correlation coefficients for pairwise comparisons between data sets: Pearson correlation coefficients (PCC), kappa coefficients (Kappa), intraclass correlation coefficients (ICC), and intra-transcript correlation coefficients (ITC), together with the number of matched UniGene IDs, for each pair: GNF vs. KC, GNF vs. CC, GNF vs. CO, GNF vs. KAV, GNF vs. KLW, GNF vs. KRMA, KC vs. CC, KC vs. CO, KC vs. KAV, KC vs. KLW, KC vs. KRMA, CC vs. CO, CC vs. KAV, CC vs. KLW, CC vs. KRMA, CO vs. KAV, CO vs. KLW, CO vs. KRMA, KAV vs. KLW, KAV vs. KRMA, and KLW vs. KRMA.
Sensitivity check of the overall comparison among all the data sets
The overall comparison was computed before leaving out any data set and after leaving out each of GNF, KLW, KC, CC, and CO in turn.
In a recent study, Jarvinen et al. compared Affymetrix GeneChip, a commercial cDNA array, and a custom cDNA array using the same RNA samples from human cancer cell lines. They found that the data were more consistent between the two commercial platforms and less consistent between the custom and commercial arrays. Their conclusion is consistent with our findings. In our study, KC is a custom cDNA array, whereas CC and CO are commercial cDNA and oligo arrays, respectively. Setting aside comparisons involving GNF, we found that the KC_CC (shorthand for KC versus CC) and KC_CO comparisons ranked at the bottom. The samples used in KC, CC, and CO were identical, so biological variability is not an issue here. The variability among those data sets was mainly due to technical factors, such as the platforms and the labs conducting the experiments. Jarvinen et al. analyzed experiments conducted in one lab, so their study was mostly concerned with the platform difference. Another study by Culhane et al. used co-inertia analysis to compare overall expression profiles across different platforms. Their analysis could be applied to matched genes as well as to all the data from different platforms. While they considered the agreement of overall expression profiles, we focused on agreement in expression at the gene level. Using the within-platform variability as a benchmark, we found that these technology platforms exhibited an acceptable level of agreement. For example, the overall comparison of all five data sets using KLW for the in-house Affymetrix data gives ICC = 0.662, as compared to ICCs around 0.8 for replicates within each data set. These results indicate reasonably good agreement across technologies.
UniGene has been widely used to match genes on different microarrays. Using UniGene and sequence similarity, Thompson et al. reported a number of gene markers that showed platform-independent expression profiles. When we matched genes from different arrays using mouse UniGene IDs, we found that multiple gene IDs on an array could correspond to one UniGene ID. We considered those genes "duplicate" genes, which made the cross-matching of genes more complicated. A common approach is to average the expression of the "duplicate" genes; instead, we treated them as replicates in the technology and lab comparisons. One observation we should make is that the variability among these "duplicate" genes can be large. For example, the CC arrays contain 11,301 genes (after filtering), corresponding to 8,318 UniGene IDs, of which 1,708 have "duplicate" genes. For the 12 UniGene IDs with 10 or more "duplicate" genes, the ICC of the replicates decreased from 0.99 to 0.86 as a result of matching genes by UniGene ID. Note that gene matching was performed only for the comparisons across data sets; we did not use UniGene IDs in measuring the consistency within the same data set. This means the agreement measures we obtained across different data sets were expected to be slightly lower than those from the replicates, even if the actual agreement was the same within and between data sets. Earlier studies, such as the one reported by Jarvinen et al., have shown that the use of different clones on different arrays is a major factor in the discrepancies between platforms. Using UniGene IDs to match the clones on different platforms can be problematic and result in biased comparisons. Sequence validation of the clones on different arrays may help resolve some of these problems and make the data from different platforms and labs more comparable.
From the present study, we showed that the GNF data set had the lowest agreement with the other data sets. This difference is confounded with two facts: the consistency of replicates within the GNF data set is the lowest, and the sample used to generate the data differs from those used by the other labs. We believe that the technology platform plays a relatively minor role in the disagreement, while the variation introduced by sample differences is one of the major factors. It has been shown that expression levels can vary significantly between genetically identical mice. Variation among individuals can therefore be a significant source of sample differences. Our analysis also indicates that data generated by different labs may differ in quality even among replicates, so quality control is important.
Obviously, the present study has limitations. The results were generated from a very limited number of data sets. Using UniGene IDs for gene matching across data sets can also be questioned. Further research is clearly needed to address these limitations.
In this paper, we aimed to address several issues in comparing microarray data across different platforms and labs. We demonstrated that the consistency of replicates in each experiment varied from lab to lab. With high consistency among replicates, different technologies showed good agreement within and across labs using the same RNA samples. A closer look at the results indicated that the variability between two labs using the same technology was higher than that between two technologies within the same lab. The source of RNA samples can make a difference in microarray data; however, our present study does not show conclusive results pertaining to possible sample or lab effects, because we did not have data collected from two different samples within one lab.
For the spotted arrays (KC, CC, and CO), we used raw intensity data from both the Cy5 and Cy3 channels. We filtered out non-expressive data points using the median plus three times the median absolute deviation (MAD) of the negative control genes as the cutoff. We then performed global lowess normalization on each slide. For the KC data, we also performed paired-slide normalization following the method of Yang et al. because of the dye swap in the experiment.
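The filtering criterion above can be sketched as follows; the intensity values are made-up numbers for illustration, and the lowess normalization step is omitted:

```python
import numpy as np

def mad(x):
    """Median absolute deviation, a robust estimate of spread."""
    return np.median(np.abs(x - np.median(x)))

def filter_cutoff(neg_controls):
    """Cutoff for non-expressive spots: median + 3 * MAD of the
    negative-control intensities, as described in the text."""
    return np.median(neg_controls) + 3.0 * mad(neg_controls)

# hypothetical raw intensities (one channel)
neg = np.array([80.0, 95.0, 100.0, 105.0, 110.0, 120.0])   # negative controls
genes = np.array([90.0, 150.0, 400.0, 1200.0])             # gene spots

cutoff = filter_cutoff(neg)        # 102.5 + 3 * 7.5 = 125.0
expressed = genes[genes > cutoff]  # spots kept for normalization
```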
For the in-house Affymetrix data, we used three summarization methods to generate the probe set level signals. KAV and KRMA were computed with the R package affy from Bioconductor, using the average difference (AV) between Perfect Match (PM) and Mismatch (MM) probe pairs and the Robust Multi-array Average (RMA) expression measure developed by Irizarry et al., respectively; KLW uses the model-based expression indexes developed by Li and Wong. The GNF Affymetrix data were available to us only in the format of average differences. For both the GNF and in-house Affymetrix probe set signals, we performed global lowess normalization for the three pairwise combinations of the liver and spleen slides, and used the averaged lowess adjustment for the normalization. Genes with probe set signals lower than 20 in either liver or spleen tissue were filtered out.
To stabilize the variance of the data across the full range of gene expression, we also applied the generalized log transformation of Durbin et al. to all the data sets.
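A minimal sketch of one common form of the generalized log transform of Durbin et al., assuming the tuning constant c has already been estimated from the data (the value below is made up):

```python
import numpy as np

def glog(x, c):
    """Generalized log transform: glog(x) = ln(x + sqrt(x^2 + c^2)).
    For x >> c it behaves like ln(2x); near zero it remains finite,
    stabilizing the variance of low-intensity measurements."""
    x = np.asarray(x, dtype=float)
    return np.log(x + np.sqrt(x**2 + c**2))

vals = np.array([0.0, 10.0, 1000.0])   # hypothetical intensities
out = glog(vals, c=50.0)               # finite even at zero intensity
```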
Gene matching across arrays
There are five different data sets in this study, and the genes represented vary across them. In order to study inter-lab agreement, we first had to identify the common genes represented on the different arrays. Based on the annotation of each data set, we found that we could maximize the number of cross-matched genes by using mouse UniGene IDs (Build 107). Based on the common UniGene IDs, we found 551 common genes across all five data sets; in the pairwise comparisons, the number of common genes ranged from 1,374 to 5,018 (see Table 2). All comparisons between data sets were made on the matched genes.
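The matching step can be sketched as follows; the probe and UniGene IDs below are made up for illustration:

```python
# Hypothetical annotation tables mapping each array's probe IDs to
# mouse UniGene IDs; several probes mapping to one UniGene ID are
# the "duplicate" genes discussed in the text.
array_a = {"probeA1": "Mm.1000", "probeA2": "Mm.2000", "probeA3": "Mm.1000"}
array_b = {"probeB1": "Mm.2000", "probeB2": "Mm.3000", "probeB3": "Mm.1000"}

def by_unigene(annot):
    """Group probe IDs by their UniGene ID."""
    groups = {}
    for probe, uid in annot.items():
        groups.setdefault(uid, []).append(probe)
    return groups

ga, gb = by_unigene(array_a), by_unigene(array_b)
# UniGene IDs represented on both arrays drive the pairwise comparison
common = sorted(set(ga) & set(gb))
```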
Statistical procedures for inter-platform and inter-lab comparisons
In the analyses, the ratio is defined as the normalized and transformed intensity from liver samples relative to that from spleen samples.
Agreement of two-fold changes using kappa coefficients
An intuitive measure of agreement is the percentage of genes falling into the same categories (two-fold up-regulated, no change, and two-fold down-regulated). However, this percentage can be high even when the data obtained from different platforms are not very compatible: the ratios for the great majority of genes do not show a two-fold change, so the percentage of agreement can be high merely by chance. To adjust for this excess agreement expected by chance, we prefer the kappa coefficient, a popular measure of inter-rater agreement in many other areas of science. The kappa coefficient was first proposed by Cohen for analysing dichotomous responses, and was later extended to more than two categories of responses. We applied this measure to three categories (two-fold up-regulated, no change, and two-fold down-regulated), and computed the kappa coefficients between two data sets from 3 by 3 frequency tables. For a study of q categories, the kappa coefficient is calculated by kappa = (p_o - p_e) / (1 - p_e), where p_o = (1/n) * sum_{i=1..q} n_ii is the overall agreement probability, p_e = (1/n^2) * sum_{i=1..q} n_{i+} * n_{+i} is the agreement expected by chance, n_ij is the number of subjects in the (i, j) cell, n_{i+} is the sum of the i-th row, n_{+j} is the sum of the j-th column, and n is the total number of subjects.
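The kappa computation described above can be sketched as follows; the frequency table below is hypothetical:

```python
import numpy as np

def cohens_kappa(table):
    """Cohen's kappa from a q x q frequency table:
    kappa = (p_o - p_e) / (1 - p_e), where
    p_o = sum_i n_ii / n is the observed agreement, and
    p_e = sum_i n_i+ * n_+i / n^2 is the agreement expected by chance."""
    t = np.asarray(table, dtype=float)
    n = t.sum()
    p_o = np.trace(t) / n
    p_e = (t.sum(axis=1) * t.sum(axis=0)).sum() / n**2
    return (p_o - p_e) / (1.0 - p_e)

# hypothetical 3x3 table; rows/columns are the categories
# (two-fold down, no change, two-fold up) for two data sets
table = [[20,   5,  0],
         [10, 100, 15],
         [ 0,   5, 25]]
kappa = cohens_kappa(table)   # about 0.62
```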
Frequency table for KAV and KC
Correlation coefficients of the ratios
We introduced the ITC for pairwise comparisons as described below. For each gene i, we defined rho_i to be the square root of the ratio of the within-data-set sum of squares (SSW) to the total sum of squares (TSS). A common SSW was used in comparing lab pairs to avoid seemingly higher ITCs arising from unusually large within-lab variability at some lab. We applied the logit transformation to each rho_i to obtain gamma_i, and then calculated the average, gamma-bar. Converting gamma-bar back to the correlation scale yielded the overall ITC, rho-bar.
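Given per-gene rho_i values (assumed here to be already computed as defined above, strictly between 0 and 1), the logit-averaging step can be sketched as:

```python
import numpy as np

def overall_itc(rhos):
    """Average per-gene ITC values on the logit scale, then convert the
    average back to the correlation scale, as described in the text."""
    r = np.asarray(rhos, dtype=float)
    gammas = np.log(r / (1.0 - r))      # logit of each rho_i
    gbar = gammas.mean()                # average gamma
    return 1.0 / (1.0 + np.exp(-gbar))  # inverse logit: overall ITC

rhos = [0.6, 0.7, 0.8]   # hypothetical per-gene values
itc = overall_itc(rhos)  # about 0.71
```

Averaging on the logit scale keeps the summary within (0, 1) and tempers the influence of values near the boundaries.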
This study was supported in part by the NIH Grant No. 2 P30 AR41940-10. We would like to acknowledge Al Bari for his work on printing the mouse cDNA array at the Keck Center, and thank the referees for helpful comments that led to improvements of the present paper.
- Schena M, Shalon D, Davis RW, Brown PO: Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science. 1995, 270: 467-470.
- Quackenbush J: Computational analysis of microarray data. Nat Rev Genet. 2001, 2 (6): 418-427. 10.1038/35076576.
- Culhane AC, Perriere G, Higgins DG: Cross-platform comparison and visualisation of gene expression data using co-inertia analysis. BMC Bioinformatics. 2003, 4: 59. 10.1186/1471-2105-4-59.
- Jarvinen A, Hautaniemi S, Edgren H, Auvinen P, Saarela J, Kallioniemi O, Monni O: Are data from different gene expression microarray platforms comparable? Genomics. 2004, 83: 1164-1168. 10.1016/j.ygeno.2004.01.004.
- Kuo WP, Jenssen T, Butte AJ, Ohno-Machado L, Kohane IS: Analysis of matched mRNA measurements from two different microarray technologies. Bioinformatics. 2002, 18: 405-412. 10.1093/bioinformatics/18.3.405.
- Yuen T, Wurmbach E, Pfeffer RL, Ebersole BJ, Sealfon SC: Accuracy and calibration of commercial oligonucleotide and custom cDNA arrays. Nucleic Acids Res. 2002, 30: e48. 10.1093/nar/30.10.e48.
- Kothapalli R, Yoder SJ, Mane S, Loughran TP: Microarray results: how accurate are they? BMC Bioinformatics. 2002, 3: 22. 10.1186/1471-2105-3-22.
- Li J, Pankratz M, Johnson JA: Differential gene expression patterns revealed by oligonucleotide versus long cDNA arrays. Toxicol Sci. 2002, 69: 383-390. 10.1093/toxsci/69.2.383.
- Su AI, Cooke M, Ching KA, Hakak Y, Walker JR, Wiltshire T, Orth AP, Vega RG, Sapinoso LM, Moqrich A, Patapoutian A, Hampton GM, Schultz PG, Hogenesch JB: Large-scale analysis of the human and mouse transcriptomes. Proc Natl Acad Sci. 2002, 99 (7): 4465-4470. 10.1073/pnas.012025199.
- Thompson KL, Afshari CA, Amin RP, Bertram TA, Car B, Cunningham M, Kind C, Kramer JA, Lawton M, Mirsky M, Naciff JM, Oreffo V, Pine PS, Sistare FD: Identification of platform-independent gene expression markers of cisplatin nephrotoxicity. Environmental Health Perspectives. 2004, 112: 488-494.
- Pritchard CC, Hsu L, Delrow J, Nelson PJ: Project normal: defining normal variance in mouse gene expression. Proc Natl Acad Sci. 2001, 98: 13266-13271. 10.1073/pnas.221465998.
- Huber PJ: Robust Statistics. 1981, New York: John Wiley & Sons.
- Yang YH, Dudoit S, Luu P, Speed TP: Normalization for cDNA microarray data. Microarrays: Optical Technologies and Informatics. 2001, SPIE BiOS, San Jose, CA.
- Irizarry RA, Hobbs B, Collin F, Beazer-Barclay YD, Antonellis KJ, Scherf U, Speed TP: Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics. 2003, 4: 249-264. 10.1093/biostatistics/4.2.249.
- Li C, Wong WH: Model-based analysis of oligonucleotide arrays: expression index computation and outlier detection. Proc Natl Acad Sci. 2001, 98: 31-36. 10.1073/pnas.011404098.
- Durbin BP, Hardin JS, Hawkins DM, Rocke DM: A variance-stabilizing transformation for gene-expression microarray data. Bioinformatics. 2002, 18: S105-S110.
- Cohen J: A coefficient of agreement for nominal scales. Educational and Psychological Measurement. 1960, 20: 37-46.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.