- Research article
- Open Access
Increased sensitivity of next generation sequencing-based expression profiling after globin reduction in human blood RNA
© Mastrokolias et al; licensee BioMed Central Ltd. 2012
- Received: 15 July 2011
- Accepted: 18 January 2012
- Published: 18 January 2012
Transcriptome analysis is of great interest in clinical research, where significant differences between individuals can be translated into biomarkers of disease. Although next generation sequencing provides robust, comparable and highly informative expression profiling data, with several million tags per blood sample, reticulocyte globin transcripts can constitute up to 76% of total mRNA, compromising the detection of low-abundance transcripts. We removed globin transcripts from 6 human whole blood RNA samples with a human globin reduction kit and compared them with the same non-reduced samples using deep Serial Analysis of Gene Expression.
Globin tags comprised 52-76% of total tags in our samples. Out of 21,633 genes only 87 genes were detected at significantly lower levels in the globin reduced samples. In contrast, 11,338 genes were detected at significantly higher levels in the globin reduced samples. Removing globin transcripts allowed us to also identify 2112 genes that could not be detected in the non-globin reduced samples, with roles in cell surface receptor signal transduction, G-protein coupled receptor protein signalling pathways and neurological processes.
The reduction of globin transcripts in whole blood samples constitutes a reproducible and reliable method that can enrich data obtained from next generation sequencing-based expression profiling.
- Digital Gene Expression Profile
- Sage Library
- Globin Reduction
- Globin Transcript
- Globin Sequence
Transcriptomics technologies have successfully been used for biomarker discovery and the study of physiological and pathophysiological mechanisms. Recently, advances in sequencing technology have allowed for direct identification of transcript-specific sequences (tags) that are digitally counted, and for the analysis of differences in gene expression with unprecedented accuracy. One specific application of next generation gene expression analysis is DeepSAGE, a tag sequencing method on the Illumina high-throughput sequencing platform that is analogous to LongSAGE. Such sequencing-based technologies offer distinct advantages over expression microarrays, such as a higher dynamic range, increased power for the detection of low-abundance transcripts, and the detection of novel transcripts and transcript variants [4, 5]. Furthermore, next generation sequencing technologies show less variation between different study sites than microarray technology and are not content-limited.
Transcriptome analysis of peripheral blood is of great interest for clinical research, where significant differences between samples obtained in a minimally invasive and cost-effective manner can be translated into gene signatures of disease stage, drug response and toxicity. Blood comes into contact with almost every tissue and organ of the human body and, owing to its cellular composition, can reflect both physiological and pathogenic stimuli. Gene expression differences in peripheral whole blood have been used to determine signatures related to acute myeloid leukaemia, and also to neuropsychiatric disorders and Huntington's disease, where a significant correlation was found between blood and brain gene expression [9, 10]. Disease signatures can also be robust across tissues and experiments: in a large meta-analysis, Dudley and colleagues demonstrated that gene expression profiles in various tissues from the same disease were more similar than gene expression profiles from identical tissues from different diseases. Furthermore, 80% of genes expressed in peripheral blood cells are shared with other important tissues. These results suggest that transcriptome analysis of the blood is often informative even when the originating pathology stems from a different tissue.
Blood is composed of three main cell types. The main component is red blood cells (95%), including the immature erythrocytes called reticulocytes, followed by platelets (5%) and white blood cells (< 1%). Although white blood cells are the minority, they are the most informative, and many past studies have been performed on the isolated white blood cell fraction. However, white blood cell isolation kits can induce a technical bias that confounds the original expression profile of the samples, and delayed sample processing can affect gene expression profiles [13, 14]. White blood cell separation protocols therefore require quick and accurate processing, which can be difficult in a clinical setting. The expression profile of a sample is better preserved by a whole blood collection method, such as the PAXgene Blood RNA System. Another advantage of this system is that samples can be frozen for up to 2 years without affecting the expression profile. However, cell sorting and counting are not possible because blood cells are lysed directly after collection. Therefore, abundant transcripts in abundant cell types may conceal the more interesting, less abundant transcripts from less abundant cell types. In particular, the presence of globin transcripts originating from reticulocytes in whole blood samples may limit the sensitivity of gene expression profiling experiments, since globin transcripts can constitute up to 70% of the total whole blood mRNA population. In microarray experiments, the presence of globin transcripts may reduce the amount of fluorescent label available for other transcripts but otherwise just results in a saturated spot on the microarray. The high abundance of globin transcripts is more of a concern in sequencing-based expression profiling studies: because sequencing measures absolute abundance, globin transcripts will be sequenced over and over again, limiting the coverage of other transcripts.
To deal with the abundance of globin transcripts in whole blood mRNA, several globin reduction protocols have been successfully used in gene expression studies [18–21]. The removal of alpha and beta globin mRNA can be achieved by selective hybridization of biotinylated, globin sequence-specific oligonucleotides to the globin transcripts, followed by their depletion from the total mRNA population using magnetic beads. Another approach, developed by Affymetrix and PreAnalytiX, uses 3'-specific peptide nucleic acids (PNAs) to inhibit reverse transcription of globin transcripts during cDNA synthesis. Alternatively, a high abundance transcript depletion protocol exploits the properties of the Kamchatka crab duplex-specific nuclease to selectively reduce the most abundant RNA transcripts.
Our aim in the present study is to investigate the effect of globin transcript reduction on the specificity and sensitivity of digital gene expression profiling (deep SAGE).
Globin transcripts can successfully be reduced in whole blood samples
[Table: Sequence statistics and RNA qualities; surviving column header: non-globin reduced]
To examine the reproducibility of globin reduction we performed a logarithmic transformation on the count data from the 6 globin reduced samples and calculated the correlation. Correlation values ranged from 0.88 to 0.94, showing good reproducibility after globin reduction (Additional file 2).
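The reproducibility check described above can be sketched as follows. This is a minimal illustration, not the authors' script: the text does not state the logarithm base, how zero counts were handled, or the type of correlation, so the base-2 logarithm, the pseudocount of 1, and the Pearson coefficient used here are all assumptions, and the toy counts are invented.

```python
import math
from itertools import combinations


def log_transform(counts, pseudocount=1.0):
    """Return log2(count + pseudocount) per gene (base and pseudocount are assumptions)."""
    return [math.log2(c + pseudocount) for c in counts]


def pearson(x, y):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def pairwise_correlations(samples):
    """samples maps a sample name to its gene counts (same gene order in every list)."""
    logged = {name: log_transform(counts) for name, counts in samples.items()}
    return {
        (a, b): pearson(logged[a], logged[b])
        for a, b in combinations(sorted(logged), 2)
    }


# Invented toy counts for three hypothetical globin-reduced samples.
toy = {
    "GR1": [0, 12, 150, 7, 2300],
    "GR2": [1, 9, 170, 5, 2100],
    "GR3": [0, 15, 120, 11, 2600],
}
for pair, r in pairwise_correlations(toy).items():
    print(pair, round(r, 3))
```

With six samples the same function returns 15 pairwise values, whose minimum and maximum correspond to the kind of range reported in the text.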
Gene Expression Differences between Globin Reduced and Non-Reduced Samples
[Table: Top ten most differentially expressed transcripts; surviving entry: TruB pseudouridine synthase]
To examine why 83 non-globin transcripts were reduced by the globin reduction procedure, we examined possible non-target-specific hybridization of the globin oligonucleotides. Using the sequences of the ten biotinylated oligonucleotides from the globin reduction kit, we performed a blastn search against the human mRNA reference genome, but only the 4 globin transcripts that were most significantly decreased in our globin-reduced samples showed significant hits. Moreover, the oligonucleotide sequences were analyzed with RNAhybrid, an online tool for determining the minimum free hybridization energy between a long and a short RNA, but no conclusive evidence was found for non-specific hybridization between oligonucleotides from the globin reduction kit and the 83 non-globin transcripts that were reduced in our globin-reduced samples. Finally, we investigated whether any of the 83 transcripts could have been co-captured with the globin transcripts by testing for sequence homology between the 83 reduced transcripts and the globin transcripts, but there was no indication of co-capturing based on sequence homology. Furthermore, more than half of the transcripts reported as non-specifically detected at lower levels were derived from pseudogenes or expressed at very low levels.
Transcripts detected in globin-reduced samples only
Functional enrichment and gene biotype analysis
[Table: GO term enrichment analysis using DAVID; surviving column header: % of genes; surviving terms: metal ion binding, transition metal ion binding, zinc ion binding, regulation of transcription, intracellular non-membrane-bound organelle, plasma membrane part]
Sequencing Depth vs. Globin Reduction - Genes Discovered
[Table: Number of transcripts identified across different sequencing depths; surviving column header: total number of sequences]
We performed globin transcript reduction on RNA from human peripheral blood to investigate whether this improved the sensitivity of SAGE digital gene expression profiling. In the non-reduced samples, globin percentages were highly variable between samples and ranged from 52% to 76%. These numbers match previous reports for globin transcript abundances in blood. The globin reduction process with biotinylated oligonucleotides complementary to globin transcripts was successful, since there was a > 99.6% reduction in globin transcripts. This was consistent with previous studies. We observed that after globin reduction there was a slight decrease in total RNA quality as well as an RNA yield loss of 5-9%. The slight reduction in RNA quality and yield had no effect on the quality of the data. Of the 21,633 transcripts that were detected across all 12 samples, 11,633 transcripts were detected at significantly higher levels in the 6 globin reduced samples, while only 87 transcripts were reported as detected at significantly lower levels. We cannot explain why transcripts other than the globin transcripts were found to be expressed at lower levels in the globin reduced samples. In silico analysis showed that this was not likely due to cross-hybridization with the globin transcripts or with the globin oligonucleotides from the globin reduction kit.
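To make the effect of these percentages concrete, the following back-of-the-envelope sketch estimates how much of a fixed number of tags becomes available for non-globin transcripts once globin is depleted. It is an idealized calculation, not an analysis from the study: it assumes that tag fractions scale directly with transcript abundance, that a 99.6% reduction applies uniformly to all globin tags, and that non-globin transcripts are unaffected. The function names are hypothetical.

```python
def non_globin_fraction_after_reduction(globin_fraction, reduction=0.996):
    """Fraction of tags that are non-globin after depleting globin by `reduction`."""
    remaining_globin = globin_fraction * (1.0 - reduction)
    non_globin = 1.0 - globin_fraction
    return non_globin / (non_globin + remaining_globin)


def fold_gain(globin_fraction, reduction=0.996):
    """Fold increase in non-globin tags at a fixed total sequencing depth."""
    before = 1.0 - globin_fraction
    after = non_globin_fraction_after_reduction(globin_fraction, reduction)
    return after / before


# The two extremes of the globin tag fraction reported for the non-reduced samples.
for g in (0.52, 0.76):
    print(f"globin fraction {g:.2f}: "
          f"non-globin share after reduction {non_globin_fraction_after_reduction(g):.3f}, "
          f"fold gain {fold_gain(g):.2f}")
```

Under these assumptions the non-globin share of the library rises from 24-48% to nearly all tags, which is the intuition behind the gain in sensitivity; real gains also depend on library complexity and on the RNA yield loss mentioned above.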
We detected robust expression of 2112 transcripts in globin reduced samples that could not be detected in non-globin reduced samples. This is similar to what has been previously reported using classical microarray gene expression platforms. The important advantage of more detectable transcripts is that there will be increased statistical power to detect differences between samples. Furthermore, this increased number of detected transcripts after globin reduction provides more information about low abundance transcripts that could give new insights into the underlying disease mechanism.
The reduction of globin transcripts provides a practical and reproducible method for improving the number of transcripts detected in human peripheral whole blood samples, and reduces the required sequencing capacity by a factor of two. Deep SAGE is a technique that detects short tags of 21-22 base pairs representing specific transcripts. For this reason, the data complexity is lower than that of other digital gene expression profiling techniques such as whole transcriptome sequencing (RNA-seq). For RNA-seq, where the aim is also to identify all transcript isoforms, saturation is definitely not reached at 25 million reads, which would make globin transcript reduction in blood even more advantageous for RNA-seq.
The reduction of globin transcripts together with whole blood collection methods, such as the PAX system, constitutes a reproducible and reliable method that increases the number of transcripts detected in next generation sequencing-based gene expression profiling. This will increase the statistical power to detect disease relevant signatures in patient-control studies.
Blood Collection & RNA isolation
Samples from six healthy individuals of Caucasian descent (4 females, 2 males) aged 39 to 70 years were collected after informed consent and with ethical approval. Whole blood was drawn into PAXgene tubes (Qiagen, Venlo, The Netherlands) and inverted 10 times. Samples were allowed to equilibrate at room temperature for 2 hours, placed at -20°C overnight and stored at -80°C until further processing. Before total RNA extraction, samples were thawed overnight at 4°C and total RNA was isolated using the PAX RNA isolation kit following the manufacturer's instructions, including DNase treatment. RNA quality and the RIN values were determined using the RNA Nano LabChip assay (Agilent Technologies, Santa Clara, CA, USA). The concentration of each sample was validated using the Nanodrop spectrophotometer (NanoDrop Technologies, Wilmington, DE, USA).
From each sample, 1.5 μg of total RNA was treated using the GLOBINclear™ Human Kit (Ambion, Austin, TX, USA) according to the manufacturer's instructions. The kit contains 10 globin sequence-specific biotinylated oligonucleotides (http://www.freepatentsonline.com/y2006/0257902.html). Transcripts hybridized to the oligonucleotides are removed using streptavidin magnetic beads. Concentration and quality of the samples were checked as described above.
Quantitative Real-Time PCR
cDNA was synthesized from 1 μg of total RNA using the Transcriptor First Strand cDNA Synthesis Kit (Roche, Mannheim, Germany) with random hexamer primers at 50°C for 1 hour. The primer sequences designed to target genes of interest and the reference gene ACTB are described in Additional file 4. The qPCR was performed using 1 μl of 4x diluted cDNA, 2x FastStart Universal SYBR Green Master mix (Roche), 2.5 pmol forward primer, 2.5 pmol reverse primer and PCR grade water to a total volume of 10 μl. The reaction was performed using the LightCycler 480 (Roche) with an initial denaturation of 10 min at 95°C, followed by 45 cycles of 10 sec denaturation at 95°C, 30 sec annealing at 60°C and 20 sec elongation at 72°C, and a final elongation of 5 min at 72°C. This was followed by melting curve analysis from 60°C to 98°C with a ramp rate of 0.02°C per sec. The primer efficiencies and relative expression of the transcript levels were determined using LinRegPCR_v11.3.
SAGE library production
SAGE libraries were produced according to the Illumina protocol. In short, after hybridization of 1 μg total RNA to poly-dT magnetic beads (Dynabeads, Invitrogen Life Technologies, Carlsbad, CA, USA), first and second strand synthesis was performed. The double stranded DNA attached to the beads was digested with the NlaIII restriction endonuclease to produce short double stranded constructs starting at the most 3' CATG of the transcript. After ligation of GEX adapter 1, the construct was digested with MmeI to create a 21 base pair fragment downstream of GEX adapter 1. This fragment was then ligated to GEX adapter 2 to complete the cassette. The adapter-ligated constructs were amplified by 15 cycles of PCR and the products were loaded on 6% Novex TBE precast acrylamide gels (Invitrogen). The 96 base pair band corresponding to the NlaIII construct was excised and purified from the gel slice using the soak and crush method followed by ethanol precipitation. Sample quality was checked on a DNA 1000 Lab-on-a-Chip (Agilent). Sequencing was performed at the Leiden Genome Technology Center on an Illumina GA2 sequencer (Illumina, San Diego, CA, USA). Purified samples were diluted to 10 nM and loaded on a single lane of the flow cell where, after cluster amplification, samples were put through an ultra-short 18 cycle sequencing run.
Illumina Pipeline Software version 1.5 was used for data sequence processing. The FASTQ files were analysed using the open source GAPSS_B(v2) pipeline http://www.lgtc.nl\GAPSS. All sequences were trimmed to 17 base pairs to remove the lower quality base at the 3' end of the sequences. After trimming, the NlaIII recognition site (CATG) was added to the 5' end of the sequence to create the complete 21-22mer nucleotide sequences. Sequences were aligned using the Bowtie short read aligner (version 0.12.7) against the UCSC hg19 reference genome, allowing for a maximum of one mismatch and a maximum of two possible positions in the genome (options: -k 1 -m 2 -n 1 --best --strata --solexa1.3-quals).
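The tag reconstruction step can be illustrated with a short sketch. It is not the GAPSS_B pipeline code; the function names are hypothetical, sequences are assumed to be given in the transcript sense orientation, and the in-silico helper simply returns nothing when the most 3' CATG lies too close to the end of the supplied sequence.

```python
ANCHOR = "CATG"          # NlaIII recognition site
TRIMMED_LENGTH = 17      # read length kept after trimming the 18-cycle read
TAG_LENGTH = len(ANCHOR) + TRIMMED_LENGTH  # 21 nt


def read_to_tag(read):
    """Trim a read to its first 17 bases and prepend CATG to rebuild the tag."""
    read = read.upper()
    if len(read) < TRIMMED_LENGTH:
        raise ValueError("read is shorter than 17 bases")
    return ANCHOR + read[:TRIMMED_LENGTH]


def expected_tag(transcript):
    """In-silico tag: the most 3' CATG plus the 17 bases downstream of it.

    Returns None if the transcript has no CATG or too few bases follow it.
    """
    seq = transcript.upper()
    pos = seq.rfind(ANCHOR)
    if pos == -1:
        return None
    tag = seq[pos:pos + TAG_LENGTH]
    return tag if len(tag) == TAG_LENGTH else None


# Invented toy transcript with two CATG sites; only the 3'-most one yields the tag.
toy_transcript = "AAACATGTTTTCATGACGTACGTACGTACGTAGGG"
toy_read = "ACGTACGTACGTACGTAG"  # 18 bases, as produced by an 18-cycle run

print(expected_tag(toy_transcript))
print(read_to_tag(toy_read))
```

In this toy case both calls give the same 21-base sequence, which is the property that lets sequenced tags be matched back to transcripts.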
A custom Perl script was used to create reference region files from the SAGE region files that were composed of the overlapping tags from all samples. A second Perl script was then used to link all individual region files to the reference region file, reporting the number of tags in each individual region of the reference region file. Finally, the reference region file was annotated with transcript information using BIOMART (Ensembl build 60). For all reported downstream statistical and biological analysis only sequences aligned to known exons were used. Analyses were performed at the gene level, and in case of multiple SAGE tags per gene, e.g. as a consequence of alternative polyadenylation, tags were summed. The data discussed in this publication have been deposited in NCBI's Gene Expression Omnibus and are accessible through GEO Series accession number GSE33701 http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE33701.
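The gene-level summation described above can be sketched as follows. The original analysis used custom Perl scripts that are not shown in the text, so this Python version is only an illustration of the idea: the data structures, the function name, and the toy tags and gene names are assumptions.

```python
from collections import defaultdict


def sum_tags_per_gene(tag_counts, tag_to_gene):
    """Sum tag counts per gene.

    tag_counts maps a tag (or region) identifier to its count in one sample.
    tag_to_gene maps a tag identifier to a gene; tags without an entry
    (for example, tags not aligned to a known exon) are skipped.
    """
    gene_counts = defaultdict(int)
    for tag, count in tag_counts.items():
        gene = tag_to_gene.get(tag)
        if gene is None:
            continue
        gene_counts[gene] += count
    return dict(gene_counts)


# Invented example: geneX has two tags, e.g. from alternative polyadenylation.
toy_counts = {"tag1": 40, "tag2": 15, "tag3": 7, "tag4": 3}
toy_annotation = {"tag1": "geneX", "tag2": "geneX", "tag3": "geneY"}

print(sum_tags_per_gene(toy_counts, toy_annotation))
```

Here tag4 has no exon annotation and is dropped, mirroring the restriction to sequences aligned to known exons.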
To identify genes detected at lower or higher levels, data files containing the count expression data were analyzed using the edgeR package (version 2.0.5) [30, 31] in R (version 2.12.0). Data were normalized by creating libraries of equal size (11 million tags). To determine differences in detection levels between the two groups, an exact test for the negative binomial distribution was used. The P-values were adjusted for multiple testing using the Benjamini and Hochberg approach for controlling the false discovery rate (FDR), and genes were considered to be different between reduced and non-reduced samples when the adjusted P-value was less than 0.01. For pathway enrichment analysis, the online tool DAVID was used. To identify gene biotypes, Ensembl gene IDs were uploaded to the BioMart site, which supplied gene biotype attributes (Ensembl build 60, Homo sapiens genome reference GRCh37.p2).
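Two of the steps in this paragraph are simple enough to sketch directly: scaling a library to a common size and the Benjamini and Hochberg adjustment. The sketch below is not edgeR code; the plain proportional scaling is an assumption about what "libraries of equal size" amounts to, the negative binomial exact test itself is not reproduced, and the function names and toy values are invented.

```python
def scale_to_library_size(counts, target=11_000_000):
    """Scale one sample's counts proportionally so that they sum to `target`."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("library has no tags")
    factor = target / total
    return {gene: count * factor for gene, count in counts.items()}


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Invented raw P-values for four hypothetical genes.
raw = [0.01, 0.04, 0.03, 0.005]
adj = benjamini_hochberg(raw)
significant = [p < 0.01 for p in adj]
print(adj, significant)
```

In this toy case none of the four genes passes the 0.01 cut-off after adjustment, even though one raw P-value is below it, which shows why the threshold is applied to adjusted values.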
To examine if transcripts are uniquely and consistently identified in the globin reduced or non-reduced samples only, we applied a tag count threshold of 5 tags or more and this threshold had to be met in 5 out of the 6 samples in the respective group.
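The detection criterion can be expressed compactly. The sketch below assumes that "detected in one group only" means the gene meets the 5-tags-in-5-of-6-samples criterion in that group and fails it in the other; the text does not spell out the rule for the second group, so that reading, the function names, and the toy counts are assumptions.

```python
def consistently_detected(counts, min_tags=5, min_samples=5):
    """True if at least `min_samples` samples have `min_tags` or more tags."""
    return sum(1 for c in counts if c >= min_tags) >= min_samples


def detected_only_in_first(group_a, group_b, min_tags=5, min_samples=5):
    """Genes that meet the criterion in group_a but not in group_b.

    Both arguments map a gene name to its list of per-sample tag counts.
    """
    return sorted(
        gene
        for gene, counts in group_a.items()
        if consistently_detected(counts, min_tags, min_samples)
        and not consistently_detected(group_b.get(gene, []), min_tags, min_samples)
    )


# Invented counts for six globin-reduced and six non-reduced samples.
reduced = {
    "geneA": [8, 12, 5, 9, 7, 3],        # 5 of 6 samples reach the threshold
    "geneB": [5, 4, 6, 2, 1, 0],         # only 2 samples reach the threshold
    "geneC": [40, 38, 51, 47, 36, 44],   # detected in both groups
}
non_reduced = {
    "geneA": [0, 1, 0, 2, 0, 0],
    "geneB": [0, 0, 1, 0, 0, 0],
    "geneC": [20, 18, 25, 22, 19, 21],
}

print(detected_only_in_first(reduced, non_reduced))
```

Swapping the two arguments gives the genes detected only in the non-reduced group under the same criterion.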
The blood samples were kindly provided by Prof Dr R.C. van der Mast and Dr E. van Duijn at the Psychiatry Department, LUMC, The Netherlands. This research was partly funded by the European Commission 7th Framework Program, Project no. 261123, GEUVADIS, Project no. 201413, ENGAGE, by the Centre for Medical Systems Biology within the framework of the Netherlands Genomics Initiative/Netherlands Organisation for Scientific Research and Dutch Centre for Biomedical Genetics.
- Le-Niculescu H, Kurian SM, Yehyawi N, Dike C, Patel SD, Edenberg HJ: Identifying blood biomarkers for mood disorders using convergent functional genomics. Mol Psychiatry. 2009, 14: 156-174. 10.1038/mp.2008.11.
- Shendure J, Ji H: Next-generation DNA sequencing. Nat Biotechnol. 2008, 26: 1135-1145. 10.1038/nbt1486.
- Nielsen KL, Hogh AL, Emmersen J: DeepSAGE digital transcriptomics with high sensitivity, simple experimental protocol and multiplexing of samples. Nucleic Acids Res. 2006, 34: e133. 10.1093/nar/gkl714.
- Feng L, Liu H, Liu Y, Lu Z, Guo G, Guo S: Power of deep sequencing and agilent microarray for gene expression profiling study. Mol Biotechnol. 2010, 45: 101-110. 10.1007/s12033-010-9249-6.
- van Iterson M, 't Hoen PA, Pedotti P, Hooiveld GJ, Den Dunnen JT, Van Ommen GJ: Relative power and sample size analysis on gene expression profiling data. BMC Genomics. 2009, 10: 439. 10.1186/1471-2164-10-439.
- 't Hoen PA, Ariyurek Y, Thygesen HH, Vreugdenhil E, Vossen RH, de Menezes RX: Deep sequencing-based expression analysis shows major advances in robustness, resolution and inter-lab portability over five microarray platforms. Nucleic Acids Res. 2008.
- Pahl A: Gene expression profiling using RNA extracted from whole blood: technologies and clinical applications. Expert Rev Mol Diagn. 2005, 5: 43-52. 10.1586/14737159.5.1.43.
- Valk PJ, Verhaak RG, Beijen MA, Erpelinck CA, Barjesteh van Waalwijk van Doorn-Khosrovani S, Boer JM: Prognostically useful gene-expression profiles in acute myeloid leukemia. N Engl J Med. 2004, 350: 1617-1628. 10.1056/NEJMoa040465.
- Borovecki F, Lovrecic L, Zhou J, Jeong H, Then F, Rosas HD: Genome-wide expression profiling of human blood reveals biomarkers for Huntington's disease. PNAS. 2005, 102: 11023-11028. 10.1073/pnas.0504921102.
- Sullivan PF, Fan C, Perou CM: Evaluating the comparability of gene expression in blood and brain. Am J Med Genet B Neuropsychiatr Genet. 2006, 141B: 261-268. 10.1002/ajmg.b.30272.
- Dudley JT, Tibshirani R, Deshpande T, Butte AJ: Disease signatures are robust across tissues and experiments. Mol Syst Biol. 2009, 5: 307.
- Liew CC, Ma J, Tang HC, Zheng R, Dempsey AA: The peripheral blood transcriptome dynamically reflects system wide biology: a potential diagnostic tool. J Lab Clin Med. 2006, 147: 126-132. 10.1016/j.lab.2005.10.005.
- Muller MC, Merx K, Weisser A, Kreil S, Lahaye T, Hehlmann R: Improvement of molecular monitoring of residual disease in leukemias by bedside RNA stabilization. Leukemia. 2002, 16: 2395-2399. 10.1038/sj.leu.2402734.
- Debey S, Schoenbeck U, Hellmich M, Gathof BS, Pillai R, Zander T: Comparison of different isolation techniques prior gene expression profiling of blood derived cells: impact on physiological responses, on overall expression and the role of different cell types. Pharmacogenomics J. 2004, 4: 193-207. 10.1038/sj.tpj.6500240.
- Rainen L, Oelmueller U, Jurgensen S, Wyrich R, Ballas C, Schram J: Stabilization of mRNA expression in whole blood samples. Clin Chem. 2002, 48: 1883-1890.
- Ovstebo R, Lande K, Kierulf P, Haug KB: Quantification of relative changes in specific mRNAs from frozen whole blood - methodological considerations and clinical implications. Clin Chem Lab Med. 2007, 45: 171-176. 10.1515/CCLM.2007.035.
- Field LA, Jordan RM, Hadix JA, Dunn MA, Shriver CD, Ellsworth RE: Functional identity of genes detectable in expression profiling assays following globin mRNA reduction of peripheral blood samples. Clin Biochem. 2007, 40: 499-502. 10.1016/j.clinbiochem.2007.01.004.
- Vartanian K, Slottke R, Johnstone T, Casale A, Planck SR, Choi D: Gene expression profiling of whole blood: comparison of target preparation methods for accurate and reproducible microarray analysis. BMC Genomics. 2009, 10: 2. 10.1186/1471-2164-10-2.
- Wright C, Bergstrom D, Dai HY, Marton M, Morris M, Tokiwa G: Characterization of globin RNA interference in gene expression profiling of whole-blood samples. Clin Chem. 2008, 54: 396-405. 10.1373/clinchem.2007.093419.
- Tian Z, Palmer N, Schmid P, Yao H, Galdzicki M, Berger B: A Practical Platform for Blood Biomarker Study by Using Global Gene Expression Profiling of Peripheral Whole Blood. PLoS ONE. 2009, 4.
- Debey S, Zander T, Brors B, Popov A, Eils R, Schultze JL: A highly standardized, robust, and cost-effective method for genome-wide transcriptome analysis of peripheral blood applicable to large-scale clinical trials. Genomics. 2006, 87: 653-664. 10.1016/j.ygeno.2005.11.010.
- Bogdanova EA, Shagin DA, Lukyanov SA: Normalization of full-length enriched cDNA. Mol Biosyst. 2008, 4: 205-212. 10.1039/b715110c.
- Rehmsmeier M, Steffen P, Hochsmann M, Giegerich R: Fast and effective prediction of microRNA/target duplexes. RNA. 2004, 10: 1507-1517. 10.1261/rna.5248604.
- Huang DW, Sherman BT, Lempicki RA: Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat Protoc. 2009, 4: 44-57.
- Human Genome Sequencing Consortium: Finishing the euchromatic sequence of the human genome. Nature. 2004, 431: 931-945. 10.1038/nature03001.
- Wright MW, Bruford EA: Naming 'junk': human non-protein coding RNA (ncRNA) gene nomenclature. Hum Genomics. 2011, 5: 90-98.
- Taft RJ, Pang KC, Mercer TR, Dinger M, Mattick JS: Non-coding RNAs: regulators of disease. J Pathol. 2010, 220: 126-139. 10.1002/path.2638.
- Liu J, Walter E, Stenger D, Thach D: Effects of globin mRNA reduction methods on gene expression profiles from whole blood. J Mol Diagn. 2006, 8: 551-558. 10.2353/jmoldx.2006.060021.
- Ruijter JM, Ramakers C, Hoogaars WMH, Karlen Y, Bakker O, van den Hoff MJB: Amplification efficiency: linking baseline and bias in the analysis of quantitative PCR data. Nucleic Acids Research. 2009, 37.
- Robinson MD, Smyth GK: Small-sample estimation of negative binomial dispersion, with applications to SAGE data. Biostatistics. 2008, 9: 321-332.
- Robinson MD, McCarthy DJ, Smyth GK: edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2010, 26: 139-140. 10.1093/bioinformatics/btp616.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.