A model system for assessing and comparing the ability of exon microarray and tag sequencing to detect genes specific for malignant B-cells
© Kloster et al.; licensee BioMed Central Ltd. 2012
Received: 25 May 2012
Accepted: 11 October 2012
Published: 5 November 2012
Malignant cells in tumours of B-cell origin account for 0.1% to 98% of the total cell content, depending on disease entity. Recently, gene expression profiles (GEPs) of B-cell lymphomas based on microarray technologies have contributed significantly to improved sub-classification and diagnostics. However, the varying degrees of malignant B-cell frequencies in analysed samples influence the interpretation of the GEPs. Based on emerging next-generation sequencing technologies (NGS) like tag sequencing (tag-seq) for GEP, it is expected that the detection of mRNA transcripts from malignant B-cells can be supplemented. This study provides a quantitative assessment and comparison of the ability of microarrays and tag-seq to detect mRNA transcripts from malignant B-cells. A model system was established by eight serial dilutions of the malignant B-cell lymphoma cell line, OCI-Ly8, into the embryonic kidney cell line, HEK293, prior to parallel analysis by exon microarrays and tag-seq.
We identified 123 and 117 differentially expressed genes between pure OCI-Ly8 and HEK293 cells by exon microarray and tag-seq, respectively. There were thirty genes in common, and of those, most were B-cell specific. Hierarchical clustering from all dilutions based on the differentially expressed genes showed that neither technology could distinguish between samples with less than 1% malignant B-cells from non-B-cells. A novel statistical concept was developed to assess the ability to detect single genes for both technologies, and used to demonstrate an inverse proportional relationship with the sample purity. Of the 30 common genes, the detection capability of a representative set of three B-cell specific genes - CD74, HLA-DRA, and BCL6 - was analysed. It was noticed that at least 5%, 13% and 22% sample purity respectively was required for detection of the three genes by exon microarray whereas at least 2%, 4% and 51% percent sample purity of malignant B-cells were required for tag-seq detection.
A sample purity-dependent loss of the ability to detect genes for both technologies was demonstrated. Taq-seq, in comparison to exon microarray, required slightly less malignant B-cells in the samples analysed in order to detect the two most abundantly expressed of the selected genes. The results show that malignant cell frequency is an important variable, with fundamental impact when interpreting GEPs from both technologies.
KeywordsExon microarray Tag-seq Gene expression Detection limit Sample purity
Malignant cells in tumour tissues of B-cell origin account for 0.1% to 98% of the total cell content depending on the specific disease entity [1, 2]. Varying content in the B-cell tumours of normal leucocytes, lymphocytes and stromal cells with their own individual GEPs poses a challenge for identification of specific malignant B-cell signatures . Although several studies have successfully identified gene expression patterns of tumour samples, their interpretation is often confounded by a lack of information about the varying presence of normal cells within the tumour biopsies. In recent years, microarray technology has been the default technology for detection of global GEP in different cancer types . However, it is difficult to quantify expression of low-abundant mRNA transcripts by microarrays, since low hybridisation intensity signals are difficult to distinguish from background levels arising from non-specific hybridisation. Due to an absence of background signals, NGS technologies, including tag-seq, have the potential to out-perform or at least supplement the exon microarray [5, 6]. However, there is a need to explore the technical limits of the exon microarray and tag-seq for identification and interpretation of mRNA transcript profiles, in particular malignant B-cells in a pool of non-malignant cells.
The Affymetrix GeneChip Human Exon 1.0 ST Array is a dense microarray designed for GEP, featuring 6.5 million probes corresponding to known targets and predicted exons spanning the entire human genome . The probes interrogate different regions of the same mRNA transcript showing variation in probe-target hybridisation intensities across the transcript. This variation is sequence-dependent and affected by probe-target binding strength. The probe-target binding strength is affected by competing formation of probe-probe dimers and secondary structures in probes, resulting in background noise from which low hybridisation intensity signals can be difficult to distinguish [5, 8]. Gene expression levels are measured as probe-target hybridisation intensities, and all values are background corrected, with the inter-array quantile normalised to remove systematic biases, providing relative rather than absolute gene expression levels . The next-generation sequencing-based tag-seq technology uses two restriction enzymes, NlaIII and MmeI, to generate cDNA tags that are reverse transcribed from mRNA transcripts, by cutting from the 3′-end CATG to the poly(A)-tail and 17–19 bp downstream of the first restriction enzyme site [6, 10, 11]. Clonal copies of each cDNA tag are generated on a solid surface flow cell using a bridge amplification approach without any predefined array attached probes before sequencing. During sequencing, the numerical frequencies of each tag are recorded and provide absolute gene expression values .
The present study is based on a model system, where malignant B-cells were serially diluted into embryonic kidney cells at eight different cellular frequencies illustrating varying levels of malignant B-cell sample purities. Total RNA from each of the cell populations was tested in parallel to generate data, both by exon microarrays using the Affymetrix GeneChip Human Exon 1.0 ST platform and tag-seq using the Illumina Genome Analyzer platform, as well as RT-qPCR validation of three selected genes. The goal for this work was to develop a quantification method that could be used for assessing and comparing the ability to detect malignant B-cell transcripts based on exon microarray and tag-seq as a function of sample purity.
Detection ability of exon microarray and tag-seq
Library and annotation results for tag-seq data
Differentially expressed genes by exon microarray and tag-seq
The 30 overlapping common expressed mRNA transcripts between the OCI-Ly8 and HEK293 cell lines identified by exon microarray and tag-seq
Hierarchical clustering of all samples
For both exon microarray and tag-seq, hierarchical clustering of all samples based on the 123 and 117 differentially expressed genes resulted in three individual concordant clusters (Figures 1C and 1D). One cluster contained the samples with frequencies of 0%, 0.5% and 1% malignant B-cells, a second cluster of 5%, 10%, 20% and 30% malignant B-cells, and a cluster of the sample containing 100% malignant B-cells. Samples with less than 1% malignant B-cells were not distinguishable by hierarchical clustering from the non-B-cells of HEK293 in either exon microarray or tag-seq. Hierarchical clustering of all samples based on all genes are shown in Additional file 5.
Comparison of background level and instrument detection limit (IDL)
Of the 123 genes differentially expressed between the 100% and 0% samples detected by exon microarray, we identified 79 B-cell specific genes that were expressed more highly in the 100% B-cell sample. To see whether these 79 B-cell specific mRNA transcripts were present in the background of the non-B-cells, the proportion of B-cell specific mRNA transcripts with DABG was plotted as a function of sample purity (Figure 1B), indicating an increased ability to detect B-cell specific mRNA transcripts as a function of sample purity. Only 4% (3 out of 79) of the B-cell specific mRNA transcripts had DABG for the pure HEK293 cell line sample. The expression values for the 3 transcripts with DABG were 6.21, 7.05, and 7.02 (HIST1H3I, ST6GAL1, and ELF) and therefore only borderline expressed. This indicates that very few false-positive B-cell specific mRNA transcripts were detectable in the pure HEK293 cell line sample. Of the 117 differentially expressed genes between 100% and 0% samples detected by tag-seq, 61 genes were expressed more highly in the 100% B-cell sample. All had counts equal to zero except for 4 genes (PHYH, DNAJC2, TCF4 and ST6GAL1) that had counts of 2, 1, 3 and 1 respectively in the pure HEK293 cell lines, demonstrating very few, if any, false-positive B-cell specific mRNA transcripts using the tag-seq technology.
Comparison of method detection limit (MDL)
The ability to detect genes was quantified by plotting the MDL as a function of sample purity (Figure 2). For both exon microarray and tag-seq, the decrease in the ability to detect mRNA expression levels was inversely proportional to sample purity. The MDL met the IDL at 100% sample purity.
Validation of gene expression levels by RT-qPCR
Spearman’s rank correlation coefficient of mRNA expression levels between exon microarray, tag-seq, and RT-qPCR
The presence of tumour-infiltrating non-malignant cells is expected to mask the detection capacity of GEPs of malignant B-cells . This study, however, suggests a novel quantitative tool for the assessment and comparison of the ability of microarrays and next generation sequencing to detect mRNA transcripts from malignant B-cells in a pool of non-malignant cells. Several studies have compared exon microarrays and next generation technologies [5, 13–16]. To the best of our knowledge, no studies have developed quantitative methods which are able to assess the ability of exon microarray and tag-seq to detect transcripts as a function of sample purity. We deliberately chose to make a model system that ensured distinct differences in cellular origin in order to observe clearly differentially expressed genes, enabling us to identify possible difficulties within such a model system.
A comparable number of differentially expressed genes between the pure samples of OCI-Ly8 and HEK293 cell lines was identified by the two technologies (Figure 1A). Although the number of differentially expressed genes in common between the two technologies was small across a number of FDR settings, it was not caused by pure chance (Additional file 2). One factor for this could be false negatives entering due to the lack of replicates at 0% and 100% sample purity. One could also speculate that the small number of commonly expressed genes may be explained by different shortcomings of the platforms, as probes on the exon microarray detect differentially expressed genes that either contain or do not contain the NlaIII restriction enzyme site, whereas tag-seq only catches differentially expressed genes with the unique CATG sequence. Therefore, some mRNA transcripts may not be detected by tag-seq due to absence of the CATG sequence, and some mRNA transcripts may not be detected by exon microarrays due to inadequate probe design for exon microarrays . However, we only found 1 gene differentially expressed by exon microarray that did not contain the NlaIII restriction site.
A majority of the differentially expressed genes that overlap between exon microarray and tag-seq were B-cell specific mRNA transcripts, including CD20, CD74, CD79A, HLA-DRA, BCL6, BANK1, C13orf18, and TCL1A (Table 2). Most of the B-cell specific genes were related to cell surface-expressed antigens, which is consistent with the importance of interactions with external environment in defining the characteristics of B-cells.
Uniform patterns of relatedness of samples between exon microarray and tag-seq were observed by hierarchical clustering. Even though only 30 of the differentially expressed genes were common between exon microarray and tag-seq, the underlying expression patterns of mRNA transcripts were sufficient to ensure similar results on the relatedness between samples by exon microarray and tag-seq. Samples with ≤1% malignant B-cells were indistinguishable from the pure non-B-cell sample for exon microarray and tag-seq. Thus, >1% malignant B-cells should be presented in biopsies for detection of a malignant B-cell profile by exon microarray and tag-seq, given the model system and data at hand (Figure 1C and 1D).
Based on concepts from analytical chemistry, it was possible to show how the ability to detect single genes (MDL) increases with sample purity (Figure 2). Both exon microarray and tag-seq showed limitations when studying low-abundant mRNA transcripts. A topic for future work is to define the precision estimates of the IDLs, MDLs and background levels as a function of dilution density and replicates. These results are important for designing future detection ability studies. When the precision estimates have been improved with other data sets it will be possible to establish guidelines on how low expression levels of mRNA transcripts are detectable in the original sample for a given sample purity, and thus, give advice on the detection abilities of e.g. low-abundant transcription factors and stem cell genes.
Variance inhomogeneity was observed for the residuals of the linear model used for the exon microarray (Additional file 6, Figure A and B). We noticed that, for fitted values below the IDL, there is a clear tendency to measurements being below the regression line. This is probably due to the measurements being below background in the region and a horizontal regression line would be more appropriate, suggesting a piecewise regression model. This is an important observation as the usual convolution models used in the de-convolution of signals from measurement in array data does not take an IDL into account. This will, however, lead to more complicated IDL and background correction calculations and is left for future research. Over-dispersion of gene count data from NGS data is well documented see e.g [17, 18]. In this paper we resolved this by using the quadratic mean variance negative binomial model, i.e. NB2, when detecting differentially expressed genes, whereas we used the linear mean variance negative binomial model, i.e. NB1, when analysing the IDL. NB1 is not supported in the edgeR package but we discovered that NB1 was sufficient for the negative binomial regression by residual plots (Additional file 6, Figures C & D, and E & F) and comparison of deviances. We found that resolving the problem of finding the most appropriate dispersion model estimates for NGS data was outside the scope of the present paper, but it is an important topic for future research.
The ability to detect single genes by exon microarray and tag-seq was exemplified by analysing the mRNA expression levels as a function of sample purity for three different B-cell mRNA transcripts. The mRNA expression levels of CD74, HLA DRA, and BCL6 followed a linear relationship as a function of sample purity for exon microarray, whereas larger fluctuations were observed for tag-seq (Figure 3). Feng et al. demonstrated an increased coefficient of variation when detecting low-abundant mRNA transcripts by tag-seq . With increased sequencing depth, detection of low-abundant mRNA transcript will probably become more accurate . Feng et al. described the relationship between total sequence volume and information of mRNA expression levels as a sigmoid function. This means the deeper the sequencing, the greater the information . The present work demonstrates that the difference in the ability to detect the three genes above depends on the sample purity and the initial abundance in the pure B-cell sample. Tag-seq showed a tendency to have higher ability to detect highly abundant transcripts compared to exon microarray, supporting previous observations [5, 14].
Based on the analysis of the three selected single genes, we demonstrated that the best detection ability corresponds to the highly abundant malignant B-cell transcripts, and there was an actual loss in the ability to detect low-abundant mRNA transcripts in samples with low purity. For BCL6, as a low-abundant gene, it was evident that a malignant B-cell frequency above 20-50% was required for reliable transcript detection. This supports the observed proportional relationship between the ability to detect genes and increasing sample purity.
The next obvious step for a follow-up study within this theme is to make serial dilutions of malignant B-cells into normal B-cells, T-cells and macrophages. Our results show that basic requirements of detection abilities can be identified and detection capabilities can be studied in a formal framework. Other directions for future studies include RNA-Seq on the entire poly-A fraction. This may help to detect even very low-abundant transcript, possibly helping to identify malignant B-cell specific transcript, i.e. alternative splicing isoforms, which are missed by tag-seq or more interestingly the fusion transcripts, which are cancer-specific and may dramatically help to create the conditions for establishing a future model system.
In the present study, it was demonstrated that the developed methodological tools are valuable in assessing the ability to detect genes by high-throughput platforms capable of screening samples for tumour-specific mRNA transcripts, but more extensive studies are needed to give general advice on the choice of optimal technology. An important future perspective when assessing and comparing the detection capabilities of high-throughput platforms is to validate the usefulness of the proposed model system and analysis setup, by diluting pure malignant lymphoma samples with normal lymph node tissue.
Samples of cell lines
The cell line, OCI-Ly8, provided by Dr. Hans Messer and Dr. Andreas Rosenwald was established at Ontario Cancer Institute (OCI). OCI-Ly8 was established from a patient with diffuse large B-cell lymphoma . This cell line was maintained in RPMI1640 medium (GIBCO Invitrogen, NY, USA) including 10% foetal bovine serum (FBS, GIBCO Invitrogen, NY, USA). Cell line HEK293 (ATCC) was generated by transformation of normal human embryonic kidney cells with sheared adeonivrus DNA . This cell line was maintained in Eagle’s Minimum Essential Medium (GIBCO Invitrogen, NY, USA) supplemented with 10% FBS. Both cell lines were cultured at 37°C in a CO2-incubator (BINDER, NY, USA) in a humidified atmosphere supplemented with 5% CO2. Cells were harvested in the exponential growth phase and sorted by a BD FACSAria II Cell Sorter (BD Bioscience, San Jose, CA, USA) to obtain eight cell populations, diluting OCI-Ly8 into HEK2932 in the cellular concentrations 0%, 0.5%, 1%, 5%, 10%, 20%, 30%, and 100% of OCI-Ly8 and subsequently stored in Trizol Reagent (GIBCOBRL Invitrogen, NY, USA). Cell line identity was analysed using 0.2 ng/μl extracted RNA from the cell lines as template in a multiplex PCR using the AmpFISTR Identified PCR amplification kit (Applied Biosystems, CA, USA). A fragment analysis of the amplified PCR product was performed by capillary electrophoresis by Eurofins Medigenomix GmbH, Applied Genetics, Germany. The resulting FSA files were analysed using the Osiris software (http://www.ncbi.nlm.nih.gov/projects/SNP/osiris/) given the following STR profiles: OCI-Ly-8: D5S818-4 (10, 13), D13S317-2 (11, 11), D7S820-1 (11, 12), D16S539-2 (9, 11), VWA-3 (17, 19), TH01-2 (6, 9.3), AMEL-4 (X, X), TPOX-3 (9, 11), CSF1PO-1 (11, 12), and HEK293: D5S818-4 (8, 9), D13S317-2 (12, 12), D7S820-1 (11, 11), D16S539-2 (9, 13), VWA-3 (16, 19), TH01-2 (7, 9.3), AMEL-4 (X, X), TPOX-3 (11, 11), CSF1PO-1 (11, 12).
Total RNA extraction
Total RNA was extracted from the eight cell populations using the Trizol Reagent and purified using the mir Vana miRNA Isolation Kit (AMBION INC., NY, USA) according to the manufacturer’s protocol. The quantity of extracted total RNA was measured on a NanoDrop ND-1000 spectrophotometer (Thermo Scientific, Denmark), and the quality of the total RNA was assessed using the Agilent RNA 6000 Nano kit and Nano Series II assay in combination with the Agilent 2100 Bioanalyzer (RIN values >9, Agilent Technologies, Denmark). Total RNA was stored at −80°C until use.
Affymetrix exon array
Exon microarray data were generated using 100 ng of total RNA. Ambion WT Expression Kit (AMBION INC, NY, USA) generated amplified and biotylated sense-strand cDNA from the total RNA according to the manufacturer’s protocol. Afterwards, single-strand cDNA was fragmented by UDG (uracil glycosylase) and APE1 (apurinic/apyrimidinic endonuclease) and biotin-labelled using the GeneChip Terminal Labeling and hybridisation Kit (Affymetrix, CA, USA) according to the manufacturer’s protocol. 5 μg of biotin-labelled cDNA were incubated with the Affymetrix GeneChip Human Exon 1.0 ST array at 45°C for 17 hours for hybridisation. Following hybridisation, non-specific bound material was removed by washing with the GeneChip Hybridisation, Wash and Stain kit (Affymetrix, CA, USA) and the GeneChip Fluidics Station 450 (Affymetrix, CA, USA). Detection of bound material was performed by scanning the array using the GeneChip Scanner 7G (Affymetrix, CA, USA).
Tag preparation and sequencing
For tag preparation, 2 μg of total RNA were incubated with magnetic Oligo-(dT) beads to capture the polyadenylated RNA fraction. Bound to the beads, total RNA was reverse transcribed into double-stranded cDNA using the SuperScript II kit (Invitrogen, NY, USA). The double-stranded cDNA was digested with the restriction enzyme NlaIII (New England Biolabs, MA, USA) to generate cDNA fragments from the most 3′-end CATG to the poly(A) tail. cDNA without the NlaIII recognition site was removed by washing. A biotinylated adaptor containing the recognition site for the restriction enzyme MmeI was linked to the 5′-end of the digested double stranded cDNA. Next, tags of 17 bp of double stranded cDNA were released from the Oligo-(dT)25 beads by MmeI (New England Biolabs, MA, USA). Tags were purified from a polyacrylamide gel by phenol-chloroform extraction and bound to Streptavindin beads. A second adapter containing an identification key of 3 bp unique for each sample was ligated to the tags. The adaptor-ligated tags from all samples were pooled and amplified by PCR. The PCR products were purified by excision from a 6% polyacrylamide gel (Invitrogen, NY, USA), followed by elution, ethanol precipitation and re-suspension (Illumina, MA, USA). The amount of adaptor-ligated tags was quantified using the NanoDrop ND-1000 spectrophotometer (Thermo Scientific, Denmark). The sample was stored at −20°C until sequencing. The sequencing procedure was carried out according to the manufacturer’s protocol. Adaptor-ligated tags from all samples were sequenced in an individual lane of a flow cell by the Solexa/Illumina Whole Genome Sequencer 1G analyser. Clustering and sequencing were performed with reagents from the cluster generation kit and sequencing kit (Illumina, MA, USA).
RT-qPCR data were generated using total RNA reverse transcribed into cDNA by the SuperScript III First-Strand Synthesis Supermix (Invitrogen, NY, USA ) and TaqMan reagents and assays (Applied Biosystems, CA, USA) according to the manufacturer’s protocol. 500 ng total RNA was mixed with 50 μM Oligo(dT) primer, 50 ng/μl random hexamer primer, and Annealing buffer. The sample was incubated in the thermal cycler at 65°C for 5 minutes to unfold loops together with secondary structures. Following incubation, 2x First-Strand Reaction mix and SuperScript III/RNaseOUT Enzyme mix were added to the sample, and the sample was incubated in the thermal cycler at 25°C for 5 minutes, 50°C for 50 minutes, and then 80°C for 5 minutes. The cDNA was used directly in RT-qPCR or stored at −20°C.
For RT-qPCR, 2x TaqMan Universal Master mix without AmpErase UNG (Applied Biosystems, CA, USA) and 20x TaqMan gene expression assay was mixed. The cDNA product was diluted 10 times, and 4 μl of the cDNA product was used as template. The cDNA product was amplified using a two-step thermal profile: 95°C for 10 minutes, followed by 50 cycles at 95°C for 15 seconds and 60°C for 1 minute.
The RT-qPCR was performed using cDNA product and the TaqMan gene expression assays (Applied biosystems, CA, USA) specific for CD74 (Hs00959496_m1), HLA-DRA (Hs00219576_m1), and BCL6 (Hs00153368). To correct for sample-to-sample variations in efficiencies and errors in quantification, the TaqMan Pre-Developed endogeneous control assay (Applied Biosystems, CA, USA), TBP (333769F), were used for normalisation. Amplified mRNA was detected by the Mx3000P QPCR instrument (Stratagene, CA, USA).
Data processing and analysis
Generation of .CEL-files and gene summarization
CEL-files were generated by the Affymetrix GeneChip Command Console Software (AGCC) and deposited at the NCBI Gene Expression Omnibus (GEO) under accession number GSE40311. The .CEL-files were imported into Bioconductor using the aroma.affymetrix package. The .CEL-files were background corrected and quantile normalised by the RMABackgroundCorrection and QuantileNormalization functions from the aroma.affymetrix package . Core probe intensities were summarised to gene expression levels by means of the ExonRMAPlm(, mergeGroups=TRUE) function using the customised cdf-file for core transcripts downloaded from http://bcgc.lbl.gov/cdfFiles/HuEx-1_0-st-v2,A20071112,EP/HuEx-1_0-st-v2,coreR3,A20071112,EP.cdf.
Extraction, annotation and gene summarization of tag sequences
The Illumina Pipeline software version 1.3 was used for image analysis and base calling, omitting chastity filtering, otherwise using default settings. Data were imported into the CLC Genomics Workbench 4.5 and sequence error corrected by the SAGEScreen method. The first 17 bp of the tag sequences were extracted from the output file according to the second adaptor containing the identification key for each sample. The unique tags were sorted and counted for each sample using the CLC Genomics Workbench 4.5 and exported to .txt-files. These text files were imported into Bioconductor. Due to a large library-to-library variation in the amount of obtained tags the set of tags in the libraries were randomly sub-sampled without replacement to equal the size of the smallest library.
Each unique tag sequence was annotated by mapping the reads with the anchoring NlaIII restriction site to the ENSEMBL Homo Sapiens cDNA reference transcriptome (release 63) downloaded from ftp://ftp.ensemble.org on both the coding and non-coding strands using Bowtie version 0.12.7 . The setting –v1 was used allowing Bowtie to accept a maximum of one mismatch. ENSEMBL identifiers were via Entrez identifiers, mapped to gene symbols by use of the R-package Biomart and the org.hs.eg.db database. The total expression for each gene was calculated by summing all tags mapped to the same gene. Tag-seq data were deposited at GEO under accession number GSE40311.
All statistical analyses were done with R Version 2.13.2 (64 bit Windows version) and a number of Bioconductor packages  as described below. Detailed session information is contained in Additional file 7. P-values were false discovery rate (FDR) corrected by the Benjamini-Hochberg procedure and 0.05 was used as significance level.
Differentially expressed genes
this means that P-values of the z-scores z g = (Yg 1 − Yg 2)/s can be used to test the hypothesis of no difference in gene expression levels. The P-values of the z-scores can be FDR corrected by the Benjamini-Hochberg procedure as implemented in the Bioconductor package fdrtool.
where Mi is the library size, pgi is the relative abundance of gene g, and ϕg is the dispersion parameter. When the negative binominal distribution was fitted, a common dispersion for all tags was estimated using the Cox-Reid profile-adjusted likelihood method. FDR corrected P-values for differentially expressed genes were determined using an exact test.
Detection above background
In order to determine genes detected above background, the DABG algorithm of Affymetrix Power Tools (APT)  version 188.8.131.52 with the instruction “apt-probeset-summarize –a dabg” was used to find P-values for the hypothesis that each exon belongs to the background. The .PGF and .CLF files were used for the exon array, and for the background correction, the antigenomic background probes were used. Summarising to gene level was performed as follows : For each gene in each sample, the number of present exons was counted – presence is assumed whenever the P-value was below 0.005. If half or more than half of the exons were present, the gene was assumed to be present in that sample, or otherwise it was absent.
A hierarchical cluster analysis was performed on the differentially expressed genes. The hierarchical cluster analysis was based on the R-function hclust using complete linkage as the agglomeration method and Eucledian distance as dissimilarity measure.
Instrumental detection limit and method detection limit
Exon microarray measurement model
where ε ki ~ N(0, σ2). This is done by means of least squares regression as implemented in the lm-function in R. Model control is performed by plotting the standardised residuals vs. the fitted values and normal probability plots of the standardised residuals. Standardised residuals above 3 are considered outliers and plotted on a log-transformed scale.
Tag-seq measurement model
For tag-seq, concentration, t, should be understood as the expected count number of a particular mRNA transcript. It is normally assumed that the expected count number is directly proportional to the number of mRNA transcripts in the sample. The coefficient of proportionality depends on e.g. the sequencing method and coverage depth. Hence, for tag-seq, it is assumed that b = 0. In the following we calculate the IDL for two models incorporating over-dispersion.
This implies that in order to reliably observe an mRNA transcript, the expected count for the mRNA transcript should be more than θ/log20(1 + θ) counts in the specific experimental setup.
This implies that in order to reliably observe an mRNA transcript, the expected count for the mRNA transcript should be more than (20 θ − 1)/θ counts in the specific experimental setup.
where a k = log t k and x = 1, 2. This is done by means of maximum likelihood estimation of the parameters as implemented with the gamlss-function in R . Model control can be performed by plotting the quantile residuals vs. the fitted values and normal probability plots of the quantile residuals. Models can be compared by the deviance.
Cross platform reproducibility analysis
We assessed the cross platform reproducibility by means of correspondence curves . A correspondence curve is constructed by ranking the genes according to their significance for each of the two experiments. Then the number of genes in common is plotted against the number of significant genes. Under perfect concordance this plot will behave like y = x and under random sampling it will behave like y=x 2 . The acceptance region at a given level of significance for the hypothesis of random sampling can be constructed by the hypergeometric distribution .
For determination of mRNA expression levels by RT-qPCR measurements were performed in triplicates, and the average cycle thresholds (Cq) were used to determine fold-change. The mRNA expression levels were determined by the ∆∆Cq method  using TBP as reference gene and the pure OCI-Ly8 cell line as reference sample. No reverse transcription and no template control samples were used as controls.
This work was primarily funded by SEQNET, Grant number 09–063764, a programme developing software tools for next generation sequencing supported by the Danish Agency for Science, Technology and Innovation. Additional funding was obtained from the EU 6th FP MSCNET (LSHC-CT-2006-037602, coordinator HEJ), the Danish Research Agency to CHEPRE (#2101-07-0007, coordinator KD) and the KE Jensen Foundation (2006–2010). The excellent technical assistance of Ann-Maria Jensen, Louise Hvilshøj Madsen, and Helle Høholt is gratefully appreciated.
- Kuppers R, Rajewsky K, Zhao M, Simons G, Laumann R, Fischer R, et al: Hodgkin disease: Hodgkin and Reed-Sternberg cells picked from histological sections show clonal immunoglobulin gene rearrangements and appear to be derived from B cells at various stages of development. Proc Natl Acad Sci USA. 1994, 91: 10962-10966. 10.1073/pnas.91.23.10962.PubMed CentralView ArticlePubMedGoogle Scholar
- Swerdlow SH, Campo E, Harris NL, Jaffe ES, Pileri SA, Stein H, et al: WHO Classification of Tumours of Haematopoietic and Lymphoid Tissues. 2008, Lyon: IARCGoogle Scholar
- Garnis C, Buys TP, Lam WL: Genetic alteration and gene expression modulation during cancer progression. Mol Cancer. 2004, 3: 9-10.1186/1476-4598-3-9.PubMed CentralView ArticlePubMedGoogle Scholar
- Hermeking H: Serial analysis of gene expression and cancer. Curr Opin Oncol. 2003, 15: 44-49. 10.1097/00001622-200301000-00006.View ArticlePubMedGoogle Scholar
- 't Hoen PA, Ariyurek Y, Thygesen HH, Vreugdenhil E, Vossen RH, de Menezes RX, et al: Deep sequencing-based expression analysis shows major advances in robustness, resolution and inter-lab portability over five microarray platforms. Nucleic Acids Res. 2008, 36: e141-10.1093/nar/gkn705.PubMed CentralView ArticlePubMedGoogle Scholar
- Morrissy AS, Morin RD, Delaney A, Zeng T, McDonald H, Jones S, et al: Next-generation tag sequencing for cancer gene expression profiling. Genome Res. 2009, 19: 1825-1835. 10.1101/gr.094482.109.PubMed CentralView ArticlePubMedGoogle Scholar
- Lockhart DJ, Dong H, Byrne MC, Follettie MT, Gallo MV, Chee MS, et al: Expression monitoring by hybridization to high-density oligonucleotide arrays. Nat Biotechnol. 1996, 14: 1675-1680. 10.1038/nbt1296-1675.View ArticlePubMedGoogle Scholar
- Held GA, Grinstein G, Tu Y: Relationship between gene expression and observed intensities in DNA microarrays – a modeling study. Nucleic Acids Res. 2006, 34: e70-10.1093/nar/gkl122.PubMed CentralView ArticlePubMedGoogle Scholar
- Irizarry RA, Bolstad BM, Collin F, Cope LM, Hobbs B, Speed TP: Summaries of Affymetrix GeneChip probe level data. Nucleic Acids Res. 2003, 31: e15-10.1093/nar/gng015.PubMed CentralView ArticlePubMedGoogle Scholar
- Nielsen KL, Hogh AL, Emmersen J: DeepSAGE–digital transcriptomics with high sensitivity, simple experimental protocol and multiplexing of samples. Nucleic Acids Res. 2006, 34: e133-10.1093/nar/gkl714.PubMed CentralView ArticlePubMedGoogle Scholar
- Nielsen KL: DeepSAGE: higher sensitivity and multiplexing of samples using a simpler experimental protocol. Methods Mol Biol. 2008, 387: 81-94. 10.1007/978-1-59745-454-4_6.View ArticlePubMedGoogle Scholar
- Fury W, Batliwalla F, Gregersen PK, Li W: Overlapping Probabilities of Top Ranking Gene Lists, Hypergeometric Distribution, and Strigency of Gene Selection Criterion. 2006, New York City, USA: Proceedings of the 28th IEEE EMBS Annual International Conference, 5531-5534.Google Scholar
- Bradford JR, Hey Y, Yates T, Li Y, Pepper SD, Miller CJ: A comparison of massively parallel nucleotide sequencing with oligonucleotide microarrays for global transcription profiling. BMC Genomics. 2010, 11: 282-10.1186/1471-2164-11-282.PubMed CentralView ArticlePubMedGoogle Scholar
- Asmann YW, Klee EW, Thompson EA, Perez EA, Middha S, Oberg AL, et al: 3' tag digital gene expression profiling of human brain and universal reference RNA using Illumina Genome Analyzer. BMC Genomics. 2009, 10: 531-10.1186/1471-2164-10-531.PubMed CentralView ArticlePubMedGoogle Scholar
- Feng L, Liu H, Liu Y, Lu Z, Guo G, Guo S, et al: Power of deep sequencing and Agilent microarray for gene expression profiling study. Mol Biotechnol. 2010, 45: 101-110. 10.1007/s12033-010-9249-6.View ArticlePubMedGoogle Scholar
- Malone JH, Oliver B: Microarrays, deep sequencing and the true measure of the transcriptome. BMC Biol. 2011, 9: 34-10.1186/1741-7007-9-34.PubMed CentralView ArticlePubMedGoogle Scholar
- Robinson MD, Smyth GK: Small-sample estimation of negative binomial dispersion, with applications to SAGE data. Biostatistics. 2008, 9: 321-332.View ArticlePubMedGoogle Scholar
- Oberg AL, Bot BM, Grill DE, Poland GA, Therneau TM: Technical and biological variance structure in mRNA-Seq data: life in the real world. BMC Genomics. 2012, 13: 304-10.1186/1471-2164-13-304.PubMed CentralView ArticlePubMedGoogle Scholar
- Mehra S, Messner H, Minden M, Chaganti RS: Molecular cytogenetic characterization of non-Hodgkin lymphoma cell lines. Genes Chromosomes Cancer. 2002, 33: 225-234. 10.1002/gcc.10025.View ArticlePubMedGoogle Scholar
- Graham FL, Smiley J, Russell WC, Nairn R: Characteristics of a human cell line transformed by DNA from human adenovirus type 5. J Gen Virol. 1977, 36: 59-74. 10.1099/0022-1317-36-1-59.View ArticlePubMedGoogle Scholar
- Bengtsson H, Simpson K, Bullard J, Hansen K: Aroma.affymetrix: a generic framework in R for analyzing small to very large Affymetrix data sets in bounded memory. 2008, CA, USA: Technical report 745Google Scholar
- Langmead B: Aligning short sequencing reads with Bowtie. Curr Protoc Bioinformatics. 2010, Chapter 11: Unit-PubMedGoogle Scholar
- Gentleman RC, Carey VJ, Bates DM, Bolstad B, Dettling M, Dudoit S, et al: Bioconductor: Open software development for computational biology and bioinformatics. Genome Biol. 2004, 5: R80-10.1186/gb-2004-5-10-r80.PubMed CentralView ArticlePubMedGoogle Scholar
- Livak KJ, Schmittgen TD: Analysis of relative gene expression data using real-time quantitative PCR and the 2−ΔΔCT method. Methods. 2001, 25: 402-408. 10.1006/meth.2001.1262.View ArticlePubMedGoogle Scholar
- Affymetrix: Statistical algorithms reference guide. 2001, Santa Clara, CA: AffymetrixGoogle Scholar
- Affymetrix: Identifying and validating alternative splicing events. An introduction to managing data provided by GeneChip Exon arrays. 2006, Santa Clara, CA: AffymetrixGoogle Scholar
- Mocak J, Bond AM, Mitchell S, Scollary G: A statistical overview of standard (IUPAC and ACS) and new procedures for determing the limit of detection and quantification to voltammetric and stripping techniques. Pure Appl Chem. 1997, 69: 297-328. 10.1351/pac199769020297.View ArticleGoogle Scholar
- Evans DA: Experimental evidence concerning contagious distributions in ecology. Biometrika. 1953, 40: 186-211.View ArticleGoogle Scholar
- Stasinopoulos DM, Rigby RA: Generalized additive models for location scale and shape (GAMLSS) in R. J Stast Softw. 2007, 23: 1-46.Google Scholar
- Li Q, Brown JB, Huang H, Bickel PJ: Measuring reproducibility of high-throughput experiments. Ann App Stat. 2011, 5: 1752-1779. 10.1214/11-AOAS466.View ArticleGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.