As our ability to investigate molecular mechanisms in biology at finer resolutions improves, there is increasing interest in generating reliable gene expression profiles for smaller biological samples, down to the level of the single cell and potentially subcellular compartments. Single-cell gene expression profiling provides a powerful tool to analyze the composition of complex cell populations. There are many contexts in which the focus is shifting towards understanding the cellular networks of individual cells [2, 3] and the similarities and differences between individual cells at the transcriptional and translational level [4, 5].
Limitations to the sensitivity and resolution of current technologies for studying gene expression mean that when using samples as small as those generated from single cells we are inevitably faced with amplifying cellular mRNA. Although the most common method for evaluating large-scale gene expression is through microarray technology [6, 7], the problem will be the same for any experimental method that requires transcript amplification to produce useful quantities of material to be analyzed, including real-time PCR and serial analysis of gene expression (SAGE). The amplification stage may, however, introduce significant distortions in the measured gene expression levels, especially for genes with small numbers of transcripts in the material under study. This distortion is introduced by sampling effects that arise from inefficiencies in the processes of copying and amplifying the original mRNA pool.
In a complex mRNA population with small absolute numbers of individual transcripts, such as that from a single eukaryotic cell, sampling effects can result in only a subset of the population of starting RNA molecules being represented in the final amplified population. This is particularly problematic for low copy number transcripts in single cell samples: in the first step of the process, reverse transcription may fail for a small proportion of the original mRNA molecules, which would therefore be eliminated from subsequent amplification and detection. For genes with only a small number of transcripts in the starting material, this will create a variable (assuming the failures are random) distortion in the relative representation of transcript abundances in the final experimental sample, potentially leading to the absence of such low abundance transcripts in the final amplified population. The first round of PCR amplification will have a similar effect, and subsequent rounds will have effects of diminishing importance, in terms of complete dropout of low-abundance transcripts.
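The dropout argument above can be made concrete with a short calculation. The following is a minimal sketch, assuming that each mRNA molecule is copied independently and with the same per-molecule efficiency; the function names and the efficiency value of 0.9 are illustrative assumptions, not quantities taken from this study.

```python
import random


def dropout_probability(copies: int, efficiency: float) -> float:
    """Probability that none of `copies` molecules survives one copying step.

    Assumes independent failures with a shared per-molecule efficiency.
    """
    return (1.0 - efficiency) ** copies


def surviving_copies(copies: int, efficiency: float, rng: random.Random) -> int:
    """Draw the number of molecules copied successfully (a binomial sample)."""
    return sum(1 for _ in range(copies) if rng.random() < efficiency)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so repeated runs agree
    efficiency = 0.9  # illustrative value, not a measured efficiency
    for copies in (1, 2, 5, 20, 100):
        p_lost = dropout_probability(copies, efficiency)
        draws = [surviving_copies(copies, efficiency, rng) for _ in range(5)]
        print(f"{copies:>3} copies: P(dropout) = {p_lost:.3g}, example draws = {draws}")
```

Under these assumptions the probability of complete dropout falls geometrically with the starting copy number, which is why the effect is concentrated in low-abundance transcripts.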
The overall effect of random dropouts of low abundance transcripts from amplified single cell cDNA populations would be that random sets of transcripts would be called as absent in different cells. Observations consistent with such sampling effects in single cell expression analysis have been reported previously, leading to the proposal that there are limits to the reliable detection of gene expression from small samples. For example, one estimate is that there is a lower limit of 80 copies of a single mRNA per cell for detection of two-fold differences between samples. Despite these empirical predictions, the nature and significance of sampling effects for single cell expression profiling have not been systematically studied to date.
The magnitude of the overall sampling effect will, in theory, depend on two factors: the transcript abundance distribution, which is the variation of transcript number among genes being expressed in a cell (and in particular the relative numbers of genes with low transcript numbers); and the copying and amplification efficiencies for conversion of the original population of mRNA molecules into DNA or RNA detectable by the expression profiling platform in use. We have previously demonstrated that a global polyadenylation and PCR-based amplification technique generates reliable data from picogram amounts of RNA, although that study did not measure the efficiency of conversion of original mRNA transcripts into cDNA copies. The copying and amplification efficiencies can be estimated from experimental data. However, the estimation of the transcript abundance distribution poses two distinct problems: knowing the form of the distribution; and evaluating the shape and scale parameters for the distribution.
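How the two factors combine can be illustrated with a toy calculation. This is a sketch only: it assumes a single overall per-molecule conversion efficiency and independent failures, and the abundance list is a hypothetical, deliberately skewed example rather than a measured or fitted distribution. The function name is likewise an assumption made for illustration.

```python
def expected_dropout_fraction(abundances, efficiency):
    """Expected fraction of expressed genes with no surviving copy.

    `abundances` holds one transcript count per expressed gene. Each molecule
    is assumed to be converted independently with probability `efficiency`.
    """
    if not abundances:
        raise ValueError("abundances must contain at least one gene")
    lost = sum((1.0 - efficiency) ** copies for copies in abundances)
    return lost / len(abundances)


if __name__ == "__main__":
    # Hypothetical toy profile: many low-copy genes, few high-copy genes.
    toy_abundances = [1] * 50 + [5] * 30 + [50] * 15 + [1000] * 5
    for efficiency in (0.5, 0.8, 0.95):
        fraction = expected_dropout_fraction(toy_abundances, efficiency)
        print(f"efficiency {efficiency:.2f}: expected dropout fraction {fraction:.3f}")
```

In this simplified picture, the expected loss is driven almost entirely by the share of genes with very few transcripts, so the assumed form of the abundance distribution matters as much as the efficiency itself.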
There are conflicting reports of the transcript abundance distribution in a typical eukaryotic cell, ranging from a distribution with a median value for mRNA transcript copies per gene of less than one, to a distribution with a median of approximately 100 copies. The difficulty is that, in general, the transcript abundance distributions of real single cells are not known but are inferred from population measurements (for estimates from cDNA library and SAGE library sequencing of whole tissues, see references [8, 13, 14]). Based on published data [9, 12], a simple approximation is that the transcript abundance distribution is log-log-normal, as this distribution captures certain key features of our current understanding of the single cell transcript abundance distribution: there is a high number of genes with transcript abundances lower than 10–20 and relatively few genes with high transcript abundances (exceeding 1000 copies per cell). For the purposes of modeling single cell expression data we use that distribution for this work, with the additional assumption that such a population-based distribution is reflected in the underlying single cell transcript abundance distributions.
The purpose of this work was to systematically evaluate the presence and significance of sampling effects in single cell expression profiling that relies on PCR-based global amplification. We investigated whether observed variations in gene expression levels in single cell samples could be artifacts of the experimental method, how much sampling effects contribute to variability in single cell expression measurements, and, finally, whether global amplification techniques can be reliably used to detect differences in gene expression among single cells. We conclude that significant differences in gene expression levels exist between phenotypically identical cells in vivo, and that these differences exceed any noise contribution from global mRNA amplification.