Analysis of transcript and protein overlap in a human osteosarcoma cell line
© Klevebring et al; licensee BioMed Central Ltd. 2010
Received: 16 March 2010
Accepted: 2 December 2010
Published: 2 December 2010
An interesting field of research in genomics and proteomics is to compare the overlap between the transcriptome and the proteome. Recently, the tools to analyse gene and protein expression on a whole-genome scale have been improved, including the availability of the new generation sequencing instruments and high-throughput antibody-based methods to analyze the presence and localization of proteins. In this study, we used massive transcriptome sequencing (RNA-seq) to investigate the transcriptome of a human osteosarcoma cell line and compared the expression levels with in situ protein data obtained in-situ from antibody-based immunohistochemistry (IHC) and immunofluorescence microscopy (IF).
A large-scale analysis based on 2749 genes was performed, corresponding to approximately 13% of the protein coding genes in the human genome. We found the presence of both RNA and proteins to a large fraction of the analyzed genes with 60% of the analyzed human genes detected by all three methods. Only 34 genes (1.2%) were not detected on the transcriptional or protein level with any method. Our data suggest that the majority of the human genes are expressed at detectable transcript or protein levels in this cell line. Since the reliability of antibodies depends on possible cross-reactivity, we compared the RNA and protein data using antibodies with different reliability scores based on various criteria, including Western blot analysis. Gene products detected in all three platforms generally have good antibody validation scores, while those detected only by antibodies, but not by RNA sequencing, generally consist of more low-scoring antibodies.
This suggests that some antibodies are staining the cells in an unspecific manner, and that assessment of transcript presence by RNA-seq can provide guidance for validation of the corresponding antibodies.
Several studies have attempted to compare protein and transcript expression levels to investigate the central dogma of the cell, i.e. the relation between DNA, RNA and protein content in a cell [1–8]. Microarrays have been the prevalent platform to measure the abundance of transcripts in a sample, although other technologies such as SAGE have also been employed. The corresponding protein abundance estimates have frequently been obtained through mass spectrometry or protein arrays. The resulting correlation coefficients in these comparative analyses have varied significantly, from 0.3 to 0.9, comparing 10 s of genes up to 1000 s of genes. Sub-groups representing functionally different Gene Ontology groups could, however, display both higher and lower correlations depending on their role in the cellular machinery .
To improve an estimate of correlation between RNA and protein molecules a more unbiased approach combined with a digital gene expression profile is needed. Massive DNA sequencing technology offers a new possibility to achieve a comprehensive and quantitative view of all genes being transcribed in a sample [9–11]. Here, we compare global IHC and IF protein expression in a human osteosarcoma cell line, U-2 OS (from the Human Protein Atlas program, HPA), with massive DNA sequencing of the corresponding transcriptome (RNA-seq).
The aim of this study was to compare the transcriptome of human U-2 OS cells with presence of the corresponding proteome. The transcriptome was extensively surveyed close to saturation by performing massive SOLiD DNA sequencing [Additional file 1: Supplemental figure S1]. In total, approximately 15 million high quality 35-bp reads were obtained and mapped onto the human reference genome (hg18) and quantitative measures were computed on a per gene basis. Analysis of the transcription pattern demonstrated that the majority of all Ensembl genes (73.4%; 15536/21146 genes) were expressed in U-2 OS, i.e., a transcript being represented by at least one uniquely mapped read. The frequency distribution is presented in the additional information [Additional file 1: Supplemental figure S2].
Number of antibodies
Number of genes analyzed
Percentage of total number of genes analyzed
Number of genes present of common subset
Percentage present of common subset
Protein distribution and overlap with transcriptional data
Interestingly, 9.4% (222/2123+222) of the genes in the HPA subset that were detected on the transcript level are not detected on the protein level. For IF, this number is 24.4% (574/1771+574). It is not clear if this is caused by a subset of genes that are transcribed but not translated, or if this is due to a limited sensitivity in the protein measurements. The fact that this fraction is higher for IF than IHC indicates that unspecific antibody-protein interaction in IHC in combination with a group of transcripts that do not undergo translation is the major contributor to this affect, since the sensitivity of IF is generally higher than that of IHC (see below).
Comparison over three technology platforms
Since RNA-seq provides quantitative measures of gene expression levels, we investigated transcript levels for the genes in the subgroups defined in Figure 3A, where this was possible (Figure 3B). This showed that the groups 'C' and 'BC' (detected in RNA-seq only and detected in IF and RNA-seq positive, respectively) were expressed at significantly lower levels than all genes combined (Kolmogorov-Smirnov (KS) test, p = 3.8 × 10-4 and p = 4 × 10-4, respectively). This indicates that IF has a higher sensitivity to detect transcriptionally low expressed genes than IHC.
Overlap between RNA and protein based on RNA expression bins.
Percent of present genes
Antibodies with supportive Western blot
Fraction of genes based on transcript level
Gene Set Enrichment Analysis
DAVID [13, 14] is a tool that performs gene set enrichment analysis for several different categories (GO, KEGG, protein domains etc) and has the option to group similar categories into functional groups based on similarity. When the HPA subset (2749 genes) was analyzed for enrichment of gene categories against a background of all protein coding genes using DAVID some Gene Ontology categories emerged as over-represented. These include development, apoptosis, proteins related to direct protein sequencing, the cytoplasm and protein binding (data not shown).
DAVID enrichment for certain categories
1, 03 × 10-14
8,22 × 10-23
Glycosylation site:N-linked (GlcNAc...)
9,97 × 10-6
4,39 × 10-7
Western blot and IF validation score analysis
The Western blots performed within the HPA program are manually investigated and assigned a validation score based on the number of detected bands, approximate size of the bands etc [Additional file 1: Supplemental table S1] using a standardized protein lysate panel. We analyzed these scores for all groups defined in Figure 3A. We observe that antibodies raised against genes in the A group (positive only in IHC) or the AB group (IHC+, IF+, RNA-) generally contain more low-scoring antibodies (KS-test, p = 1.6 × 10-4 and KS-bootstrap test, p = 2 × 10-3) (Figure 4A). This suggests that some of these antibodies are staining the cells in an unspecific manner (false positives) and the RNA data can thus provide guidance for the validation of the corresponding antibodies. Western blot data was not available for our particular U-2 OS cell line analysed in this study, so a direct expression comparison between WB and other methods was not possible.
For IF images in the HPA program, a validation score is added after manual investigation. Comparative analysis of these scores could only be done for groups with staining according to IF, since a validation score of 7 is used to define absence. In the ABC group (present in all platforms), the fraction of antibodies receiving a supportive score [Additional file 1: Supplemental table S1] is about three times higher than that in the B group (present only in IF) and is confirmed significant (KS-bootstrap test, 2 × 10-3).
A quantitative comparison of the transcriptome and the proteome in a cell (or tissue) is instrumental in deciphering fundamental regulatory pathways and mechanisms. The relationship between these two sets of biomolecules also has wide implications for the identification of new biomarkers and classifiers in the treatment of disease. Previous efforts to compare the transcript and protein abundance, at single or multiple gene levels, have demonstrated a great variation in outcome, attributed to both biological and technical issues [1–8]. From a technical perspective, transcriptome monitoring has recently been greatly improved by RNA-seq, which provides a digital and comparably unbiased profile of all genes transcribed in a cell. We show that almost 75% of all genes are represented by one or more transcripts, which is in line with other recent studies . We also demonstrate that lowly expressed genes are less likely to be identified at the protein level.
The proteome has in most cases been represented by indirect methods such as arrays or by mass spectometry of a solubilized proteome. Here, we present a large-scale comparison of in-situ protein abundance in cells with the transcription levels provided by RNA-seq. The proteins are assessed by antibodies targeting the proteome, and the presence of a particular protein species is visualized by immunohistochemistry and immunofluorescence microscopy, from which an abundance is estimated. These are two semi-quantitative methods and we therefore performed the comparison at the level of presense/absense.
For 87% of all detected proteins, we were able to detect a corresponding transcript. For the remaining 13% of all detected proteins (where we found no transcript), it can indicate several different phenomena: 1) Some genes are very lowly expressed as transcripts, but efficiently translated into stable protein products or 2) Some antibodies are cross-reactive, yielding false positive protein detection. Both of there effected would yield result where proteins are detected but the transcript is not.
We observe approximately 2/3 overlap between the RNA and its corresponding protein species, irrespective of method. Even if one of the two methods for generation of protein expression data is considered false, only 3.2% - 5.2% of all investigated genes are not detected on the protein or transcript level. Overall, immunohistochemistry identifies slightly more proteins than immunofluorescence, and we suggest that this elevated detection is in part due to false IHC positives, as indicated by the fact that IF demonstrates higher sensitivity for the lowly expressed genes (Figure 3B). Interestingly, the transcriptome-negative but IF- and IHC-positive group (8%), contained an over-representaion of genes related to glycosylation and needs to be confirmed with alternative methods such as mass spectrometry and RT-PCR. The group of genes detected by RNA-seq and IF but not IHC is significantly enriched for secreted proteins. Of 69 genes in this GO category, 33 are present in this group (p = 4.4 × 10-7).
Our study demonstrates that RNA-seq is useful as a validation tool for the HPA program. In the case of previously uncharacterized proteins or conflicting data between protein array, Western blot, IHC and IF, we could use both the presence/absence call and the quantitative estimation of RNA molecules to decide on the quality of data. Indeed, presence of RNA does indicate that a corresponding protein can be expected in the analyzed sample. Furthermore, RNA-seq has been shown to efficiently identify splice variants , and this may also be used to discriminate between multiple antibodies directed towards different parts of the target protein.
We used the quantitative estimation of RNA molecules to show that the highest expressed genes are more likely represented by a corresponding protein. For the 50% highest expressed genes we detect the corresponding protein in more than 80% (IF) or 95% (IHC) of the cases. These numbers reveal a signficantly closer relationship between the presence of RNA and protein than what has been seen in previous studies. We believe that our data indicate that both RNA and proteins are expressed to a more significant extent than previously anticipated and that this means that cells are regulated at the level of protein abundance rather than on mere presence/absence. The next step in our understanding of the cell machinery will therefore require more sensitive and quantitative measures of proteins.
The osteosarcoma cell line U-2 OS (ATCC-LGC Promochem, Borås, Sweden) was cultivated in a 5% CO2 environment at 37°C in McCoy's 5A media, as suggested by the provider, with the addition of 10% Fetal Bovine Serum (FBS) and an antibiotic/antimycotic solution (both from Invitrogen).
RNA sample preparation and cDNA synthesis
Cells were harvested and RNA was extracted using the RNeasy extraction kit as instructed by the manufacturer (Qiagen) and quality-assessed using the RNA nano kit on a BioAnalyzer 2100 (Agilent). The BioAnalyzer can interpret the generated data and score it with a RNA Integrity Number (RIN) ranging from 1 (very degraded) to 10 (no degradation). 10 μg of high-quality (RIN >9.5) total RNA was used as input material for depletion of ribosomal fragments using RiboMinus (Invitrogen). 250 ng ribosome-depleted RNA was adjusted to 4.5 μl in nuclease-free water and fragmented in 95°C for 20 minutes, after which it was immediately transferred to ice. 1 μl of biotinylated tagged random hexamers (Biotin-TEG-CTTTCCTCTCTATGGGCAGTCGGTGATNNNNNN, 1 pmol/μl, Operon) was added and the mixture was denatured at 70°C for 10 minutes and on ice for 2 minutes. During incubation, a cDNA-synthesis master mix consisting of (per sample) 6 μl 5x First-strand buffer (Invitrogen), 3 μl 0.1 M DTT (Invitrogen), 7.5 μl dNTP mixture (2 mM/dNTP) and 6.5 μl nuclease-free water was assembled on ice. Of this master mix, 23 μl was added to the denatured RNA:hexamer mixture together with 2 μl SuperScript III (Invitrogen) on ice. The first strand cDNA synthesis reaction was incubated at 20°C for 10 minutes followed by 37°C for 10 minutes and 42°C for 45 minutes. First-strand cDNA was purified by addition of 70 μl nuclease-free water using a MinElute spin column following the manufacture's instructions (Qiagen). Elution was carried out twice, each in 10 μl of EB-buffer, with a 180° rotation of the column in the centrifuge between the elutions. A second strand synthesis master mix was assembled on ice. This consisted of (per sample) 79 μl nuclease-free water, 30 μl 5x second-strand buffer (Invitrogen), 15 μl dNTP mixture (2 mM/dNTP), 1 μl 10 U/μl E. coli DNA Ligase (Invitrogen), 4 μl 10 U/μl E. coli DNA polymerase I (Invitrogen) and 1 μl 2 U/μl RNase H (Invitrogen). Of this, 130 μl was added to the ≈18 μl eluate containing the RNA:cDNA hybrid. The reaction was incubated at 16°C for 2 hours, after which 1.5 μl 3 U/μl T4 DNA Polymerase (New England Biolabs) was added, and the reaction was incubated for 5 more minutes at 16°C.
Enrichment and SOLiD DNA sequencing
20 μl of Streptavidin Dynabeads M-270 (Invitrogen) were washed in 50 μl 1× BW-buffer (20 mM Tris, 2 mM EDTA, 1 M NaCl) and pelletized using a MPC-6 magnetic particle concentrator (Invitrogen). To the pellet, the 150-μl second-strand syntheis reaction and 150 2× BW buffer were added, mixed by gentle vortexing and incubated on gentle rotation using a RotaMix (Elmi) at room temperature (≈22°C) for 15 minutes. The bead-DNA complex was then washed three times in 100 μl sterile deionised water and pelletized. A PNK-master mix consisting of (per reaction) 15 μl sterile deionised water, 2 μl PNK buffer (Invitrogen), 2 μl 10 mM ATP and 1 μl 10 U/μl PNK was assembled on ice and 20 μl was added to the pelletized beads and mixed by pipetting. The reaction was incubated at room temperature for 15 minutes. To create a blunt-ended dsDNA adapter sequence suitable for ligation to the beads, a mixture consisting of 440 μl sterile deionised water, 5 μl 100 pmol/μl RDV primer (AACTGCCCCGGGTTCCTCATTCTCT, MWG-Biotech), 5 μl 100 pmol/μl aRDV primer (AGAGAATGAGGAACCCGGGGCAGTT, MWG-Biotech) and 50 μl PNK buffer (Invitrogen) was assembled and incubated at 95°C for 3 minutes and allowed to cool to room temperature on a lab bench for 30 minutes. A ligation master mix consisting of 4 μl 5× Ligase buffer (Invitrogen), 1 μl 1 pmol/μl RDV:aRDV duplex, 14 μl sterile deionised water and 1 μl 3 U/μl T4 DNA Ligase was assembled on ice, added to the pelletized beads and mixed by pipetting. The reaction was incubated for 16 hours on a RotaMix at room temperature. The ligation reaction was washed three times in sterile deionised water and the beads were resuspended in 20 μl of sterile deionised water.
An amplification master mix was assembled, consisting of 5 μl 5× HF buffer (Finnzymes), 10 μl sterile deionised water, 5 μl dNTP mix (2 mM/dNTP), 1 μl 10 pmol/μl RDV primer (AACTGCCCCGGGTTCCTCATTCTCT, MWG-Biotech), 1 μl 10 pmol/μl LAmpFDV (CCACTACGCCTCCGCTTTCCTCTCTATGGGCAGTCGGTGAT, MWG-Biotech) and 1 μl 2 U/μl Phusion polymerase (Finnzymes). 23 μl amplification master mix and 2 μl beads were mixed denatured at 95°C for 30 seconds and cycled as follows: 30 seconds at 95°C, 30 seconds at 55°C, 30 seconds at 72°C for 16 cycles. After a final extension at 72°C for 10 minutes, the PCR product was purified using a MinElute column, following the manufacturers instructions with the elution step as described earlier.
The PCR product was subjected to emPCR and SOLiD sequencing following the manufacturers instructions (Life Technologies/Applied Biosystems). Fifteen million 35-base pair reads passed quality filters including filtering against adaptors.
Mapping of DNA reads and defining the transcriptome
The reads were mapped to the human genome (hg18) following the manufacturers instructions. A read was considered to be unique if is mapped to a single location with N mismatches and nowhere with N+1 or N+2 mismatches (i.e. a clear zone of 2). In total, 6 442 847 reads aligned uniquely to EnsEMBL genes. Only unique reads with a maximum of three mismatches were used to calculate expression values. A gene was considered present if at least one read fell entirely inside an exon of the gene. Raw sequence data has been deposited to the NCBI Short Read Archive with accession number SRA023713.1.
Production of Antibodies
The antibodies used in this study have been generated within the Human Protein Atlas program http://www.proteinatlas.org. For each application using the generated antibodies, a standard set of categories has been established, and these are grouped into three main validation scores: (i) supportive, (ii) uncertain, or (iii) non-supportive [Additional file 1: Supplemental table S1] .
Western blot analysis
Prior to immunohistochemistry and immunofluorescence the HPA antibodies were analysed by Western blot as previously described . All membranes were incubated with the primary antibodies diluted 1:500. The secondary HRP-conjugated antibody (Swine Anti-Rabbit Immunoglobulin/HRP, DakoCytomation) was diluted 1:3000 and detection was carried out using a CCD-camera. Seven categories (1-87) for the Western blot validation have been proposed in which three (grade 1-3) are supportive, two (grade 4-5) are uncertain, and two (grade 6-7) are non-supportive.
Immunohistochemistry was performed on cell microarrays (CMA) where U-2 OS cells were represented, as previously described . In brief, cells were harvested, fixed in formalin and dispersed into agarose. After histoprocessing and paraffin embedding of the cell pellets resulting in donor blocks, duplicate 0.6 mm punches were sampled and put into one recipient CMA. 4 micron sections were subsequently cut and immunohistochemically stained using an Autostainer Plus instrument (Dako, Glostrup, Denmark). An automated slide-scanner system, Scanscope T2 (Aperio Technology, Vista, CA, USA) was used to image the stained sections, resulting in digital images representing separated cell spots from the CMA.
The images were analyzed by an automated image analysis software, TMAx (Beecher instruments, Sun Praire, WI, USA), which identifies cells, measures the immunostaining and counts the fraction of stained cells, all according to processing logic previously described . The measured staining intensity was further categorized into a IHC graded scale (negative, weak, moderate and strong). Here, the analyzed protein was considered "absent" when the staining intensity score was negative and "present" when the staining intensity score was weak, moderate or strong.
Immunofluorescent stainings were performed as previously described . Briefly, cells were seeded in 96-well glass bottom plates, fixed with paraformaldehyde and permeabilized with Triton X-100 before immunofluorescently stained. The entire procedure was automated using a pipetting robot. Besides the HPA antibody staining, organelle markers for microtubules, endoplasmic reticulum and nuclei were included. Image acquisition was performed manually using a LSM 510 Meta confocal laser-scanning microscope equipped with a 63x oil-immersion objective (Carl Zeiss GmbH, Jena, Germany). For each sample, two representative four-channel images were acquired. The laser power and detector gain were adjusted for each sample to obtain as good signal to noise ratios as possible and to use the entire dynamic range of the detector.
The images were visually inspected and annotated in terms of staining intensity and subcellular localization. A graded scale (negative, weak, moderate and strong) was used to categorize the staining intensity based on the used laser power and detector gain. For each antibody a validation score was set based on how well the observed subcellular localization agreed with information in the UniProt database. The IF validation scores consist of a nine-graded scale that can be merged into three main categories: supportive, uncertain or not supportive [Additional file 1: Supplemental table S1]. Here, the analyzed protein was considered "absent" when the validation score was 7 (no staining, see [Additional file 1: Supplemental table S1]) and "present" otherwise.
The authors with to thank Dr. Johan Lindberg and Dr. Magnus Bjursell for valuable insights during data analysis. We would also like to thank the Uppsala Genome Center for SOLiD sequencing. This work was supported by the Knut and Alice Wallenberg Foundation, the Swedish Research Council and the Swedish Cancer Society.
- Fu, et al: Comparison of protein and mRNA expression evolution in humans and chimpanzees. PLoS ONE. 2007, 2 (2): e216-10.1371/journal.pone.0000216.PubMed CentralPubMedView ArticleGoogle Scholar
- Greenbaum, et al: Analysis of mRNA expression and protein abundance data: an approach for the comparison of the enrichment of features in the cellular population of proteins and transcripts. Bioinformatics. 2002, 18 (4): 585-96. 10.1093/bioinformatics/18.4.585.PubMedView ArticleGoogle Scholar
- Greenbaum, et al: Comparing protein abundance and mRNA expression levels on a genomic scale. Genome Biol. 2003, 4 (9): 117-10.1186/gb-2003-4-9-117.PubMed CentralPubMedView ArticleGoogle Scholar
- Gry, et al: Correlations between RNA and protein expression profiles in 23 human cell lines. BMC Genomics. 2009, 10: 365-10.1186/1471-2164-10-365.PubMed CentralPubMedView ArticleGoogle Scholar
- Gygi, et al: Correlation between protein and mRNA abundance in yeast. Mol Cell Biol. 1999, 19 (3): 1720-30.PubMed CentralPubMedView ArticleGoogle Scholar
- Lu, et al: Absolute protein expression profiling estimates the relative contributions of transcriptional and translational regulation. Nat Biotechnol. 2007, 25 (1): 117-24. 10.1038/nbt1270.PubMedView ArticleGoogle Scholar
- Shankavaram, et al: Transcript and protein expression profiles of the NCI-60 cancer cell panel: an integromic microarray study. Mol Cancer Ther. 2007, 6 (3): 820-32. 10.1158/1535-7163.MCT-06-0650.PubMedView ArticleGoogle Scholar
- Trauger, et al: Correlating the transcriptome, proteome, and metabolome in the environmental adaptation of a hyperthermophile. J Proteome Res. 2008, 7 (3): 1027-35. 10.1021/pr700609j.PubMedView ArticleGoogle Scholar
- Nagalakshmi, et al: The Transcriptional Landscape of the Yeast Genome Defined by RNA Sequencing. Science. 2008, 320 (5881): 1344-1349. 10.1126/science.1158441.PubMed CentralPubMedView ArticleGoogle Scholar
- Mortazavi: Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Meth. 2008, 8-Google Scholar
- Cloonan: Stem cell transcriptome profiling via massive-scale mRNA sequencing. Nat Meth. 2008, 7-Google Scholar
- Berglund, et al: A genecentric Human Protein Atlas for expression profiles based on antibodies. Mol Cell Proteomics. 2008, 7 (10): 2019-27. 10.1074/mcp.R800013-MCP200.PubMedView ArticleGoogle Scholar
- Dennis, et al: DAVID: Database for Annotation, Visualization, and Integrated Discovery. Genome Biol. 2003, 4 (5): P3-10.1186/gb-2003-4-5-p3.PubMedView ArticleGoogle Scholar
- Huang, et al: Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat Protoc. 2009, 4 (1): 44-57. 10.1038/nprot.2008.211.View ArticleGoogle Scholar
- Wang, et al: Alternative isoform regulation in human tissue transcriptomes. Nature. 2008, 456 (7221): 470-6. 10.1038/nature07509.PubMed CentralPubMedView ArticleGoogle Scholar
- Nilsson, et al: Towards a human proteome atlas: high-throughput generation of mono-specific antibodies for tissue profiling. Proteomics. 2005, 5 (17): 4327-37. 10.1002/pmic.200500072.PubMedView ArticleGoogle Scholar
- Bjorling and Uhlen: Antibodypedia, a Portal for Sharing Antibody and Antigen Validation Data. Molecular & Cellular Proteomics. 2008, 7 (10): 2028-2037.View ArticleGoogle Scholar
- Andersson, et al: Analysis of protein expression in cell microarrays: a tool for antibody-based proteomics. J Histochem Cytochem. 2006, 54 (12): 1413-23. 10.1369/jhc.6A7001.2006.PubMed CentralPubMedView ArticleGoogle Scholar
- Strömberg: A high-throughput strategy for protein profiling in cell microarrays using automated image analysis. Proteomics. 2007, 2142-50.Google Scholar
- Barbe, et al: Toward a Confocal Subcellular Atlas of the Human Proteome. Molecular & Cellular Proteomics. 2007, 7 (3): 499-508.View ArticleGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (<url>http://creativecommons.org/licenses/by/2.0</url>), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.