Using Mixtures of Biological Samples as Genome-Scale Process Controls 1 2

Genome-scale “-omics” measurements are challenging to benchmark due to the enormous variety of unique biological molecules involved. Mixtures of previously-characterized samples can be used to benchmark repeatability and reproducibility using component proportions as truth for the measurement. We describe and evaluate experiments characterizing the performance of RNA-sequencing (RNA-Seq) measurements. The parameters of a model fit to a measured -omic profile can be evaluated to assess bias and variability of the genome-scale measurement of a mixture. A linear model describes the behavior of expression measures of mixtures and provides a context for performance benchmarking. Residuals from fitting the model to experimental data can be used as a metric for evaluating the effect an individual step in an experimental process has on the linear response function and precision of the underlying measurement while identifying signals affected by interference from other sources. Effective benchmarking requires well-defined mixtures, which for RNA-Seq requires knowledge of the messenger RNA (mRNA) content of the individual components. We demonstrate and evaluate an experimental method suitable for use in genome-scale process control and lay out a method utilizing spike-in controls to determine mRNA content. Genome-scale process controls can be derived from mixtures. These controls relate prior knowledge of individual components to a complex mixture, allowing assessment of measurement performance. The mRNA fraction accounts for differential enrichment of mRNA from varying total RNA samples. Spike-in controls can be utilized to measure this relationship between mRNA content and input total RNA. Analysis of mixtures can also be employed to determine the composition and proportions of an unknown sample, even when component-specific markers are not previously known, so long as pure components can be measured alongside the mixture.

also helpful if measured signal arises only from the single source, and not from interference from non-targets.Precision consists of repeatability and reproducibility, defined as the degree of closeness in multiple measurements made by a single user and the closeness between multiple labs, respectively.We show that mixtures can demonstrate that a measurement's response function is linear and of high specificity (free of interference) while measuring its variability and precision.Properly constructed mixture samples can be used to correct for systematic measurement errors, provide ongoing monitoring of performance, serve as a tool for interlaboratory comparison, and create a context for evaluating batch effects.
Two approaches to creating useful genome-scale standards include the creation of a limited number of external spike-in controls, such as the External RNA Control Consortium (ERCC) controls, which were created for microarrays and have been applied to next-gen sequencing [9][10][11].A second approach utilizes mixtures of previously characterized samples as well as prior knowledge of the ratio of the mixtures, and has also been applied to microarrays [12][13][14] but has not been utilized in other genome-scale measurements.Using standards in these measurements provides confidence in the ability of a test to detect both positive and negative results, including the limits of that detection.
Mixtures can serve as a test that applies to each of the tens of thousands of transcripts in a mixture's -omics profile.Linearity of the measurement response can be demonstrated using prior knowledge of the ratio of previously characterized mixtures, based on the fundamental understanding that a mixture is a linear combination of its components.Previous work with mixtures in microarrays [12][13][14] utilized an arbitrary 10-fold "selectivity" cutoff to evaluate the linear dynamic range of microarray measurements and understand the variability of these measurements.The arbitrary selectivity cutoff in previous work prevents the identification of interference, as any genes affected by interference would be filtered by the stringent selectivity cutoff.
Using known mixture compositions, predicted values can be calculated based on the assumption that the measurement response is linear.Deviation of the observed values from the modelpredicted value is an indication of bias in the measurement.Systematic biases could be introduced by sample preparation, signal processing, interference from other genes, or sampling peer-reviewed) is the author/funder.All rights reserved.No reuse allowed without permission.
The copyright holder for this preprint (which was not .http://dx.doi.org/10.1101/015107doi: bioRxiv preprint first posted online Feb. 11, 2015; variation.Signal arising from off-target molecules, such as a closely related transcript, can cause false positive results and result in a lowered specificity.Mixture samples can provide information about the measurement sensitivity, specificity, repeatability, reproducibility, dynamic range, and limit of detection.
Determining the relative contributions to gene expression of individual components within mixtures of biological states has received some attention in the clinical realm, where biopsies and other patient samples are often mixtures containing desired and undesired components.The process of resolving gene expression signals introduced by each individual component of a mixture [13][14][15][16][17][18][19][20][21][22][23] has been used to account for tumor heterogeneity and to separate whole blood samples into individual cell types.These procedures often separate mixture components based on a subset of "signature" genes that vary uniquely between components.These 'deconvolution' methods have been used [27][28][29][30] to develop high-resolution tumor expression signatures from imperfect biological samples [31,32] and differentiate between cell-type-frequency changes and per-cell gene expression changes [33,17].Many of these methods can determine mixture component types by using a linear model where mixture expression is treated as a combination of expression signatures.
One parameter notably absent from these methods is RNA content.Different cell types express different total amounts of RNA per cell, confounding estimates of cell type proportion made based on the quantification of total RNA [24].Others have introduced the concept of a biological scaling factor [25,26] to compensate for variation in the RNA content of cells, including the use of spike-in controls to calculate this factor.The filtering of mRNA out from total RNA adds a bias to the experiment due to the different abundance of mRNA between cell types.
We aimed to demonstrate the assessment of linear response, specificity, and accuracy of genomescale measurements using mixtures.In the process, we demonstrate that linear models can be used to separate these mixtures into the proper component signatures.We were mindful that while our mixtures were of total RNA, the measurement filters for mRNA, and that the relationship between these two values is an important factor when interpreting results.We anticipate that a mixture-based approach to measurement assurance is highly generalizable to many types of mixtures and can be extended to the wide variety of genome-scale measurements, including but not limited to proteomic, metabolomic, and transcriptomic systems.

Results:
To assess measurement parameters of genome-scale transcriptome data, we analyzed two RNAseq experiments measuring synthetic mixtures of commercially available human total RNA samples (Figure 1) [13,14,34].One experiment included a mixture of two reference total RNA samples, sequenced by 9 labs as a part of the Sequencing Quality Control Consortium (SEQC) [34][35].This study sequenced "Universal Human Reference RNA" (SEQC-A), "Human Brain Reference RNA" (SEQC-B) and two mixtures of the above (SEQC-C and SEQC-D) with mixture compositions C=3A+1B and D=1A+3B.These four samples were sequenced by 9 labs using either Illumina or Life Technologies sequencing instruments.
The second sample, called BLM, contains two mixtures (BLM-1 and BLM-2) composed of total RNA isolated from human brain (the same RNA as SEQC-B), liver, and muscle tissue.These two mixtures were made with component proportions of 1B:1L:2M and 1B:2L:1M.The total RNA of each individual tissue was also sequenced "neat" to provide an expression signature for each tissue.
These mixtures were designed to have a defined expression signal ratio between them.If the measurement response were linear and unbiased, the signal in the SEQC-C sample would be exactly 1/4 the signal of SEQC-B plus 3/4 the signal from SEQC-A due to the design of the mixture.However, these total RNA mixtures went through RNA-seq library preparation, which purposely filters ribosomal RNA out of the pool.The resulting sequence data reflects this filtration, which can be different between samples.A correction for this differential filtration and an upperquartile normalization [36] must be applied to accurately reflect the experimental process and allow the model to return the designed ratios of expression between mixtures (Supp.Figure 1).ERCC spike-in control RNAs were added to the components of these mixtures.ERCC controls [12] were designed as exogenous RNA sequences to function as control RNA.Two spike-in control pools were designed with ratiometric differences in the concentration of individual ERCC spike-ins.As expected based on the mixture designs, ERCCs spiked-in equally yielded equal expression signal, while signal from ERCCs spiked differentially into multiple subpools was at peer-reviewed) is the author/funder.All rights reserved.No reuse allowed without permission.
The copyright holder for this preprint (which was not .http://dx.doi.org/10.1101/015107doi: bioRxiv preprint first posted online Feb. 11, 2015; ratios corresponding to the designed fold changes.Poisson sampling at the lower expression levels results in increased dispersion about the expected ratio [49].

Linear model-based analysis of genome-scale gene expression
We posit that mixture expression is a linear combination of the component samples and the mixture proportions of each component.Equation 1 describes the relationship between signal in the mixtures and signal in the constituent samples.A mixture M -2 per dataset in this study -is composed of a number of named components "C" ("B","L", and "M" in the Brain/Liver/Muscle mixture or "A" and "B" in the SEQC dataset), with each component comprising a proportion of the mixture Φ C .i,M is the expression signal arising from a particular gene/transcript i in mixture M.
This study uses four mixtures of the same general form: These mixtures were made from total RNA, while the expression signal (sequencing reads) arises only from the mRNA.As the fraction of the total RNA mass that is mRNA varies between cells, the filtering of total RNA into mRNA introduces a bias.Supplemental Figure 1 shows the offset from the expected ratios of tissue-specific and ERCC RNA caused by this bias.We correct the specific equations for the mRNA fraction by multiplying each component by a factor ρ.This factor corresponds to the measured mRNA compared to the mass of total RNA in each mixture.ρ C is defined as the amount of measured RNA per unit total RNA in component C.
After adding this factor, the BLM1 mixture equation becomes i,BLM1 = i,B ×Φ B,1 ×ρ B + i,L ×Φ L,1 ×ρ L + i,M ×Φ M,1 ×ρ M There are a few approaches that have been described to measure ρ.One study directly measured the mRNA content between SEQC-A and SEQC-B samples [36] using qRT-PCR.Another described the use of trimmed mean of log expression ratios (TMM) [25] to measure mRNA content peer-reviewed) is the author/funder.All rights reserved.No reuse allowed without permission.
The copyright holder for this preprint (which was not .http://dx.doi.org/10.1101/015107doi: bioRxiv preprint first posted online Feb. 11, 2015; from RNA-seq data.TMM-derived factors have been shown to be an appropriate measure in cases where there is no global expression level change (such as the SEQC mixtures), but introduce bias if there are global expression changes (such as in the BLM mixtures) [26].Supp.Figure 2 demonstrates this.
The ρ factor can be determined using spiked-in RNA [26] as sample reads per microgram of total RNA divided by spike-in reads per microgram of spike-in RNA.This calculation emphasizes that the mRNA fraction is a correction for the differential enrichment between polyadenylated spike-in RNA and total RNA, which is only partly composed of mRNA.
The mRNA fraction ρ is a property of an individual RNA sample and is affected by any RNA manipulation -particularly ribosome elimination.For replicates within a single polyA-selected SEQC experimental run, the ρ of a mix varies slightly, likely due to fluctuations in efficiency of mRNA enrichment.(S.Table 1) It is also important to note that FPKM units should not be used to calculate mRNA fraction (Supp.Figure 3), as the FPKM derivation [6] includes a term which couples sample abundance to spike abundance.

Mixture analysis models recapitulate known mixture proportions
To demonstrate the accuracy of this analytical framework of mixture sequencing, the mixture proportions Φ BLM were recalculated for the BLM mixtures BLM-1 and BLM-2.The ρ values and the sequencing expression data were used to solve for the mixture proportions Φ BLM by linear regression to the mixture equation.Figure 2 shows the Φ BLM values at which residuals were minimized for the two mixtures for each replicate sample in each laboratory.Estimates of the three component proportions in the two mixtures are consistent with the designed 25:25:50 and 25:50:25 proportions in the two BLM mixtures.Supp. Figure 4 shows that the designed proportions of SEQC mixtures can also be calculated by this equation, returning the 75:25 and 25:75 proportions for mixes C and D, with some variability between labs.Equation 1, which lacks correction for mRNA fraction, does not return the designed ratios (Supp.Figure 5).

Linear model-predicted mixture counts are equivalent to replicate measures
peer-reviewed) is the author/funder.All rights reserved.No reuse allowed without permission.
The copyright holder for this preprint (which was not .http://dx.doi.org/10.1101/015107doi: bioRxiv preprint first posted online Feb. 11, 2015; In studies by the SEQC [34], differential expression between replicate samples was utilized to evaluate measurement performance based on the hypothesis that the control samples used in the study had no true differences between replicates.We created pseudo-replicate predicted count values from the 'neat' samples for use in benchmarking.These simulated mixtures were built based on the measured mixture expression and the true mixture proportions.

Discussion:
If the response function of a measurement is linear, mixtures of biological samples can be useful as genome-scale process controls for that measurement.When this condition is met, a mixture can be modeled simply as linear combination of its components.Two experimental datasets with known mixture parameters were used to verify these assertions.In the case of RNA-seq, the mRNA fraction of the total RNA mixture components must be accounted for in order to reflect the true values, when mixtures of RNA are calculated based on mass fractions of total RNA and the sequencing experiments measures only mRNA.
Mixtures with either known or unknown proportions can be analyzed.If mixture proportion information is known a priori, genome-scale data can be used as a process control to test the repeatability and sensitivity of measurements by comparing observed and expected measures.Alternatively, if the mixture proportions are an unknown and desired parameter, expression measures from the mixture in combination with the neat components can be used to experimentally determine the mixture proportions.This application can be valuable to un-mixing biological mixtures, including clinical mixtures, cell cultures, and xenografts [27][28][29][30][31][32].While the peer-reviewed) is the author/funder.All rights reserved.No reuse allowed without permission.
The copyright holder for this preprint (which was not .http://dx.doi.org/10.1101/015107doi: bioRxiv preprint first posted online Feb. 11, 2015; mRNA fraction correction is applicable only to RNA-sequencing measurements, the general mixture model is theoretically applicable to any measurement with a linear response function.
Mixtures can provide measurement process assurance to a sequencing experiment.Using mixture samples alongside pure samples, one can demonstrate the reproducibility and sensitivity of genome-scale RNA, protein, as well as metabolite measurements.The main goal of this type of mixture analysis is to create a known ratio value by which the measurement characteristics of an experiment can be assessed.While an experiment's measurement of this known ratio is not sufficient to prove the validity of the measurement, it is a necessary condition, and any deviations are indicative of bias.
While we demonstrate mixture analysis with two specific samples, the analysis is fully generalizable to any number or type of mixture components.Any mixture split into known individual components can be measured in this way.For example, a clinical researcher may have three samples of interest from healthy, chronically diseased and acutely diseased sources.A mixture of these three cell types would provide confidence in the measurements made on the three samples individually by verifying the repeatability of that measurement.It can also provide a benchmark sample to assess comparability over space and time.These mixtures can detect biases introduced by batch effects, operator effects, sample mislabeling, and technical artifacts while evaluating the variability of the measurement.Mixture samples with known proportions can help determine experimental reproducibility and discover technical artifacts introduced by the measurement process by comparison of the expected to observed proportions.
With this analytical model, end users and core facilities can use known mixtures as a process control to track changes in measurement quality whenever changes to the experimental process are made.By including a predefined mixture, cross-sample comparisons can be made to demonstrate the internal consistency of measurements made using any new experimental technique, kit, or downstream analysis tool.In this way, there is some assurance that changes in experimental protocol have not affected measurement reproducibility.Residuals from modeled counts can be used as a metric to evaluate the magnitude of effect an experimental process has on the linearity and precision of sequence measurements.peer-reviewed) is the author/funder.All rights reserved.No reuse allowed without permission.
The copyright holder for this preprint (which was not .http://dx.doi.org/10.1101/015107doi: bioRxiv preprint first posted online Feb. 11, 2015; In addition to gaining an understanding of the measurement process using the benchmarking workflow, unknown samples can be collected and studied to determine the relative proportion of known components.Proportions of components can be determined even in the absence of any type-specific markers, given measurable differences in expression between the cell types.
Resolving the composition of mixtures has proven useful in determining the purity of cell lines or proportions of heterogeneous cells, in identifying interesting cellular contaminants such as partially differentiated cells, and understanding clinical samples containing mixed cell types.In contrast to approaches using transgene expression [41], the mixture model described here can evaluate tissue sample purity without focusing on a handful of "tissue-specific", marker, or transgenes.We expect mixed-sample RNA to be useful in regulatory applications, where a demonstration that a therapeutic stem-cell mixture has a specific composition may be key to ensuring safety [48].

Spike-in controls correct for mRNA fraction-caused biases in linearity
In addition to providing limit of detection and cross-experiment comparison characterizations of a dataset, spike-in controls can be used in mixture samples to determine the mRNA fraction of cells.mRNA fraction is a critical parameter for comparing samples that do not have identical total RNA content.This is most relevant to cells with variable global expression [24], including comparisons across and within cell cycle, tissues, and developmental states [40].mRNA fraction is also critical in single cell gene expression studies, where lysis efficiency and total RNA content can vary greatly from cell to cell.
There are many methods used to determine component gene expression profiles from mixture samples.At present, only the one we describe here explicitly accounts for mRNA fraction.In RNA-seq experiments, mRNA fraction can be calculated with information obtained via spike-in controls.When comparing samples with variable mRNA content, bias arises when that variability is not accounted for.We describe a straightforward method for measuring the enrichment of mRNA in RNA-seq samples using spike-in RNA.We show that mRNA-corrected unmixing of two mixture datasets returns the known mixture proportions (Figure 2, Supp Figure 4), demonstrating suitability for solving unknown mixtures of known components.peer-reviewed) is the author/funder.All rights reserved.No reuse allowed without permission.
The copyright holder for this preprint (which was not .http://dx.doi.org/10.1101/015107doi: bioRxiv preprint first posted online Feb. 11, 2015; Previous methods used to determine the composition of RNA-seq mixtures make inaccurate estimates of mixture proportion in the BLM sample where the mRNA fractions vary substantially between mixture components.These methods are nearer to true values in the SEQC sample, where the mRNA fraction difference is less significant, but all estimates are improved by incorporating mRNA content (Supp.Figure 5).

Limitations
Technical artifacts identified by mixture modeling are differentially expressed between replicates and should not be confused with the conventional usage of differentially expressed genes, which are compared between samples.Truly differentially expressed genes, such as tissue-specific genes in BLM mixtures, fit well to the model.Technical artifacts would not be identified as differentially expressed by modern differential gene expression methods due to their extreme variance between replicates as a result of crosstalk or nonlinear response.This means the measurement is not sensitive to these genes, and they could be false negatives.If transcripts are identified as artifacts in a process, alternative preparations need to be employed to achieve an unbiased quantification of affected transcripts.
Although mean mixture proportion values returned from a linear combination of mixture components approximate the nominal mixture proportion in both measured samples, the increased variability of the muscle estimate in the BLM mixture (error bars, Figure 2) suggests that there is a lower limit to being able to determine low-abundance mixture components.Due to mRNA fraction, the muscle component of the BLM mix was as low as 10 percent of sequenced RNA in BLM-2.It may be possible to determine lower-proportion mixture components with confidence, but this study did not generate the required data to do so.
Our estimation of mRNA fraction is imperfect; an assumption of the model we build is that the mRNA fraction is constant between replicates of the same sample.Supplemental Table 1 shows that the mRNA fraction varies by as much as 5 percent from library to library.This variability is a source of error in our model.The variability in mRNA fraction is likely due to batch effects in the mRNA enrichment process.This hypothesis is reinforced by the prevalence of non-mRNA transcripts incorrectly called as differentially expressed between mixture replicates.peer-reviewed) is the author/funder.All rights reserved.No reuse allowed without permission.
The copyright holder for this preprint (which was not .http://dx.doi.org/10.1101/015107doi: bioRxiv preprint first posted online Feb. 11, 2015; The sequencing technology and library preparation methods used in these experiments added limitations to the experiments.These are described in supplemental note 1.

Conclusions
We demonstrate the linear response function and specificity of RNA-sequencing measurements using mixtures of biological samples.Such mixtures can be used as benchmarks to characterize the repeatability and reproducibility of experiments or separated to identify the relative proportion of their components.Spike-in controls can be used to calculate the mRNA content of total RNA mixtures, compensating for biases introduced by mRNA enrichment.Our method creates a framework for using mixtures in measurement process control and corrects for biases introduced by ribosomal depletion.Using an mRNA fraction correction improves the accuracy of mixture proportion determination in RNA-seq experiments.
Benchmarking genome-scale measurements using mixed samples will remain useful even after the era of short-read sequencing is over.Answering the biological question of "what types of cells are in the mixture I'm sequencing?" requires more information than even a perfect transcriptome reconstruction could provide.The biological and measurement value added by mixed samples are demonstrated here to be platform-independent.We anticipate that mixtures can provide the same measurement assurance to protein and metabolite measurements.Confidence in the reproducibility of measurement and understanding the components in complex biological samples will always be a staple of quality science.

Library Preparation:
For the BLM experiment, Human Brain Reference RNA, Human Liver Total RNA, and Human Skeletal Muscle Total RNA were purchased from Ambion.This purified RNA was quantified by absorbance on a NanoDrop 1000, mixed in the specified proportions, then spiked with ERCC RNA transcribed from NIST SRM 2374.For Illumina sequencing, the Illumina TruSeq protocol was followed.HiSeq runs generated 100+100bp paired-end reads.Solid 5500 sequencing followed the Life Technologies Whole Transcriptome protocol, yielding 75+35 bp paired-end reads.Spikein composition and amounts are included in the data submission to GEO. peer-reviewed) is the author/funder.All rights reserved.No reuse allowed without permission.

Figure 3
Figure 3 shows a dendogram of the distance between actual mixture expression and simulated expression counts of SEQC samples.The four base samples A, B, C and D are most different from one another, reflecting the biological differences between the samples.A and C are more closely related, as C consists of 75% A and 25% B. Modeled pseudo-replicate samples 'Cm' and 'Dm' across each of the six SEQC sites are no more different than cross-lab replicates of the C and D data, indicating that building the model for mixture C from components A and B does not introduce significant variability.This supports the treatment of modeled mixtures as replicate measurements expected to have no true differential expression from the mixture samples.

Figure 1 :
Figure Legends:Figure 1: RNA samples used in this study.RNA isolated from pure tissues is used to generate pairs of mixtures used in two separate experiments.(a): Two SEQC mixtures (SEQC-C and SEQC-D) are built from two components (SEQC-A and SEQC-B).(b): Two BLM mixtures (BLM-1 and BLM-2) are built from three components.The SEQC-B component (HBRR) is from the same source as the Brain BLM component.Per-sample target ratios of tissue proportion between mixtures are shown.

Figure 2 :
Figure 2: Accuracy of model-derived BLM mix estimates.The grey center point is the nominal 'truth' ratio in which the samples were mixed.Concentric circles with radius 0.025 and 0.05 are added to visually clarify distance from the center point.Colored points depict mixture proportion (Φ) estimates generated from measurements of 4 replicate libraries.Black points are the mean of the replicates.Error bars show one standard deviation of the four replicate measures

Figure 3 :Supplemental Figure 1 :
Figure 3: Clustering of Expression measures in 4 SEQC samples and 2 in-silico replicate samples across participating sites: The close agreement between modeled (Cm, Dm) counts and actual counts (A,B,C,D) at sites numbered 1-6 supports the validity of assumptions used to model Cm and Dm counts.Euclidian distance measures between samples show that the various samples are of greater distance from one another, while the in-silico modeled samples are most similar to the correct corresponding sample.
peer-reviewed) is the author/funder.All rights reserved.No reuse allowed without permission.