Skip to main content


Computational approach to discriminate human and mouse sequences in patient-derived tumour xenografts

Article metrics



Patient-Derived Tumour Xenografts (PDTXs) have emerged as the pre-clinical models that best represent clinical tumour diversity and intra-tumour heterogeneity. The molecular characterization of PDTXs using High-Throughput Sequencing (HTS) is essential; however, the presence of mouse stroma is challenging for HTS data analysis. Indeed, the high homology between the two genomes results in a proportion of mouse reads being mapped as human.


In this study we generated Whole Exome Sequencing (WES), Reduced Representation Bisulfite Sequencing (RRBS) and RNA sequencing (RNA-seq) data from samples with known mixtures of mouse and human DNA or RNA and from a cohort of human breast cancers and their derived PDTXs. We show that using an In silico Combined human-mouse Reference Genome (ICRG) for alignment discriminates between human and mouse reads with up to 99.9% accuracy and decreases the number of false positive somatic mutations caused by misalignment by >99.9%. We also derived a model to estimate the human DNA content in independent PDTX samples. For RNA-seq and RRBS data analysis, the use of the ICRG allows dissecting computationally the transcriptome and methylome of human tumour cells and mouse stroma. In a direct comparison with previously reported approaches, our method showed similar or higher accuracy while requiring significantly less computing time.


The computational pipeline we describe here is a valuable tool for the molecular analysis of PDTXs as well as any other mixture of DNA or RNA species.


Patient-Derived Tumour Xenografts (PDTXs) are emerging as the pre-clinical models that best represent the diversity of clinical tumours and intra-tumour heterogeneity [1,2,3]. PDTXs have been shown to be robust models to study tumour progression and evolution, test new cancer drugs and drug combinations, and unravel drug resistance mechanisms, contributing to the aim of reducing the high attrition rate in cancer drug development [4,5,6,7,8].

In the era of cancer genomics and precision medicine, the molecular analysis of PDTXs is a central component of their characterization. High-Throughput Sequencing (HTS) is used to profile these models at the genomic, epigenomic and transciptomic levels. We and others have observed that after implanting human cancer tissue fragments into immuno-compromised mice, the human stroma is rapidly lost and replaced by mouse stromal cells [2, 9, 10]. This results in an unknown proportion of mouse cells incorporated into the xenograft. As a consequence, a proportion of the sequencing reads obtained by HTS will be of mouse origin. Given the high homology of human and mouse genomes, mouse reads can be wrongly aligned to the human genome, hampering downstream analyses and data interpretation.

Previous studies have tried to address this issue. Conway et al. developed Xenome, a tool able to classify sequencing reads belonging to two different species; the output is a set of FastQ files that still need to be aligned to the appropriate genome [11]. More recently, Ahdesmäki and colleagues presented Disambiguate, that takes as input two bam files obtained by aligning the same FastQ file to the two relevant genomes and then classifies each read based on the alignment scores [12]. Currently, this approach can be used only in combination with a specific set of aligners.

Here we present a computational approach to distinguish human and mouse reads in HTS data based on the use of an In silico Combined human-mouse Reference Genome (ICRG) in the alignment step. We demonstrated the accuracy of the approach using control samples and a set of matched human breast cancers and derived PDTXs. In a direct comparison with Disambiguate and Xenome, our approach was quicker while showing similar or higher accuracy.


Optimizing sequence alignment using the ICRG

In PDTX samples, a proportion of HTS reads originated from mouse DNA could have high enough homology to be aligned to the HRG. We reasoned that using the ICRG as the reference should allow the alignment software to find, for those mouse DNA reads, a better alignment score on the mouse genome, since both genomes are available simultaneously to the aligner software. This could represent the basis to accurately distinguish human and mouse reads in HTS data originated from PDTX models.

To test this hypothesis, we performed WES in a dilution series containing known amounts of human and mouse DNA (Table 1). Sequencing data were aligned to both the HRG and the ICRG. In the 100% human DNA sample, a statistically significant decrease in alignment efficiency was observed when using the ICRG instead of the HRG (average efficiency = 90.48% for the HRG and 90.28% for the ICRG, paired t-test P = 0.009), however, the decrease in performance was negligible (0.2%). Aligning the WES data obtained by sequencing the 100% mouse DNA samples onto the HRG revealed that, on average, 6.9% of the reads were misaligned. As we show below, these misaligned reads are detrimental for downstream analyses. For example, a graphical representation of this misaligned mouse reads effect on a PTEN exon is shown in Fig. 1a. Mouse reads are wrongly aligned to the human genome if sequence similarity is high enough, but since identity is not 100%, all the mismatched bases could be called as false positive ‘somatic mutations’. In contrast, using the ICRG avoids this artefact altogether (Fig. 1a).

Table 1 Alignment of WES data from the human-mouse DNA dilution series
Fig. 1

Use of the ICRG in WES data. a IGV plot of PTEN exon 5 (WES data from 25% human/75% mouse DNA sample). Top panel: bam files after alignment to the HRG. Bottom panel: bam files after alignment to the ICRG. Mismatching bases are highlighted using the corresponding colour code (A = green, C = blue, G = orange, T = red). b Correlation plot between the percentage of reads mapped to the human genome and percentage of human DNA content in the sample. The solid line shows the calibration curve fitted to the data using penalized regression splines and grey dashed lines show the standard error. c Prediction of mouse DNA content in primary human and PDTX samples using the calibration curve in (b)

To study systematically the effectiveness of using the ICRG as a reference genome, we computed the percentage of reads mapping to the human and mouse genomes. As shown in Table 1, more than 99.9% of the reads from the pure human DNA sample and from the pure mouse DNA sample mapped to the correct genome. In diluted human-mouse DNA samples, the percentage of reads mapped to the correct genome did not match perfectly the percentage of input human and mouse DNA. Nevertheless, there was a non-linear relationship between the percentage of reads mapped to the human genome and the fraction of human DNA in the sample (Fig. 1b). We hypothesized this resulted from the enrichment step during WES library preparation, since the capture probes used have been designed for human exons. A careful look at the data revealed that using a generalised additive model was able to accurately estimate the human DNA content (Table 1). This model was used to estimate the mouse DNA content (as 100 - human DNA content) in a set of matched samples (normal/tumour/PDTX) for which WES data was available (Additional file 1). The model estimate of mouse DNA content for all primary human samples was negligible while the estimate ranged between 4.2 and 15.0% in PDTX samples (Fig. 1c).

Improvement of mutation calling in WES PDTX data using the ICRG

The analysis of PDTX WES data aims at the identification of somatic mutations. However, the presence of misaligned mouse reads is likely to increase the false positive mutation rate. Therefore, we quantified the problem and verified whether the use of the ICRG for sequence alignment could effectively overcome it.

For each pair of PDTX and its originating clinical tumour, we identified the somatic mutations and quantified how many were present only in the tumour, only in the PDTX or in both. The analysis was performed on WES data aligned to either the HRG or the ICRG. Results, reported in Fig. 2a, show that the number of tumour specific mutations identified across 10 PDTX-clinical tumour pairs was not significantly affected by the reference genome used for alignment (average = 31.9 for the HRG and 32.5 for the ICRG, paired t-test P = 0.140). Similarly, no significant difference was observed for the common mutations (average = 113.1 for the HRG and 109.9 for the ICRG, paired t-test P = 0.086). In contrast, the number of PDTX specific mutations was high when the HRG was used and decreased dramatically by using the ICRG in the alignment step (average = 306.1 using HRG and 68.8 using ICRG, paired t-test P = 0.004).

Fig. 2

Impact of mouse reads on somatic mutation calling. a Bar plots showing numbers of somatic mutations identified in clinical tumours and matched PDTXs after alignment against either the HRG or the ICRG. Within each pair of clinical tumour and PDTX (n = 10), mutations were classified as ‘tumour specific’ (i.e. present in the tumour but not in the matched PDTX), ‘PDTX specific’ (i.e. present in the PDTX but not in the originating clinical tumour) and common (present in both tumour and PDTX). b Bar plots showing VAFs for all ‘PDTX specific’ mutations identified in the 10 pairs in (a) in the 100% mouse sample. Left panel- data aligned to the HRG; Right panel- data aligned to the ICRG

To quantify the percentage of PDTX specific mutations caused by misalignment of mouse reads, we computed their VAF in one of the 100% mouse samples (replicate c in Table 1). As before, we repeated the analysis after alignment to the HRG or the ICRG (Fig. 2b). Of the 3123 PDTX specific mutations identified in the 10 PDTX-clinical tumour pairs aligned to the HRG, 2496 (79.9%) were present in the 100% mouse sample and are therefore false positives caused by the presence of mouse reads. Strikingly, only 712 PDTX specific mutations were identified in WES data aligned to the ICRG and only two of them were caused by mouse reads misalignment (Fig. 2b). In conclusion, the use of the ICRG was effective in removing >99.9% of false positive mutations caused by misaligned mouse reads.

The number of false positives caused by mouse reads is mostly a result of the mouse DNA content and sequencing depth. We noticed that even a small content of mouse DNA could generate a large number of false positive mutation calls. As an example, in Case_2_PDX1, sequenced with 69× coverage and with an estimated 7.2% mouse DNA content (Fig. 1c) a total of 405 false positive mutations specifically caused by misaligned mouse reads were identified when data were aligned to the HRG.

Using the ICRG in the analysis of RRBS data

We next tested the usefulness of the ICRG in RRBS data analysis. To this aim, we profiled a set of normal, tumour and PDTX samples (Additional file 1). As shown in Table 2, in clinical samples the alignment efficiency was not affected by the reference genome used (average efficiency = 72.2 for the HRG and 72.4 for the ICRG, paired t-test P = 0.319). In contrast, the overall alignment efficiency increased in PDTX samples when the ICRG was used (average efficiency = 67.8 for the HRG and 72.7 for the ICRG, paired t-test P = 0.010). Such increase can be explained by the fact that the aligner could map most of the mouse reads present in the samples when the ICRG was used. In clinical samples, >98.8% of the reads were correctly mapped to the human genome (at least 99.9% in all but one). In contrast, in PDTXs, 2.6–23.1% of the reads mapped to the mouse genome (Table 2).

Table 2 Alignment of RRBS data and CpG quantification

The number of human CpGs having coverage higher than 5 was slightly lower when using the ICRG, but the magnitude of this effect was negligible. Indeed, we observed an average 0.06% ± 0.01 reduction in the number of CpGs in normal and tumour samples and 0.10% ± 0.04 in PDTX samples. CpG coverage was very similar independently of the reference genome used for alignment (average correlation >0.999 for normal and tumour samples as well as for PDTX samples) and similar results were obtained looking at the percentage of methylation in each human CpG (average correlation >0.999 for normal and tumour samples as well as for PDTX samples).

All together, these results suggest that the use of the ICRG for sequence alignment accurately discriminates between human and mouse reads in RRBS data. Consequently, this approach can enable the analysis of mouse stroma specific methylation signals. Indeed, in three of the PDTX samples, more than 2 × 10^5 mouse CpGs could be queried, a reasonable number to derive an informative methylation profile. As noted above, the number of mouse CpGs available for analysis depends on both sequencing coverage and percentage of mouse stroma in the sample.

Using the ICRG allows dissecting expression of mouse stroma genes in PDTX-derived RNA-seq data

We tested the effect of using the ICRG in analysing PDTX RNA-seq data. First we evaluated the impact of the reference genome using the Human Reference RNA (HRR) and Mouse Reference RNA (MRR) samples for which RNA-seq data were obtained in triplicate (Table 3). In HRR samples the alignment efficiency was basically not affected by the reference genome used, although, statistically, a significant increase was observed (average efficiency = 79.05 for the HRG and 79.10 for the ICRG, paired t-test P = 0.019). In addition, using the ICRG, 99.8% of the reads from the HRR samples aligned to the human genome, and 98.9% of the reads from the MRR samples aligned to the mouse genome (Table 3). This reassured us that we could use the ICRG approach to distinguish between mouse and human transcripts in bulk RNA-seq data generated from PDTXs.

Table 3 Alignment of RNA-seq data from the human and mouse universal reference RNA

We observed that, in the MRR sample, an average of 5.2% of the reads mapped to the human genome if the HRG was used as reference for alignment (Table 3). To evaluate the impact of these reads on the quantification of human transcripts, we computed the read counts for all human genes in the HRR and MRR samples after alignment against either the HRG or the ICRG respectively. As shown in Fig. 3a-b, in the HRR samples, gene expression quantification was not affected by the reference genome used. Indeed, the Pearson correlation between read counts in data aligned to the HRG and read counts in data aligned to the ICRG was >0.999 for all replicates. In the MRR samples, the bulk of human genes already had read count 0 or close to 0 when using the HRG for alignment (Fig. 3c). However, an average of 1565 genes (across the three MRR replicates) had read counts higher than 100, but with the use of the ICRG for alignment the number dropped to 25 genes (Fig. 3c-d). Therefore, the presence of mouse reads could introduce some bias in the quantification of a subset of human genes and the use of the ICRG in the alignment step drastically reduce this artefact.

Fig. 3

Use of the ICRG in RNA-seq data. a-d Bar plots of log10 transformed read counts for 23,059 human genes (having a read count higher than 5 in at least one sample). HRR sample (a-b); MRR sample (c-d). Alignment: HRG (a, c); ICRG (b, d). e Percentage of reads mapped to mouse in the ICRG across samples analysed. f Principle Component Analysis scatter plot using FPKM values of 4275 mouse genes with median FPKM > 1 (15 PDTX samples; 5 models). Different colours represent the distinct 5 PDTX models

We generated RNA-seq data for a set of matched human breast cancer samples and PDTXs (Additional file 1). As expected, in primary human samples, the percentage of reads mapped to the mouse genome was as low as 0.1%. In contrast, in the matched PDTX samples, between 3 and 27.6% of the RNA-seq reads were aligned to the mouse genome (Fig. 3e). Consequently, the use of the ICRG for alignment enables the in silico dissection of the human and mouse transcriptomes, and hence the study of gene expression signals from the human tumour cells and the mouse microenvironment in PDTXs.

The number of mouse genes detected in the PDTX samples depends on the amount of mouse stroma in the sample as well as sequencing depth. In our cohort of PDTXs (n = 15) sequenced to an average depth of 21 million reads per sample, 4275 mouse genes had a FPKM >1 in at least 50% of the samples. Not surprisingly, fibroblast and extracellular matrix specific genes like Sparc, Bgn, cathepsins and collagens were among the top 50 most expressed mouse genes (Additional file 2: Table S2). Interestingly, in an unsupervised Principle Component Analysis using mouse gene expression values, different passages of the same PDTX model tended to cluster together, and apart from the other models [except for one outlier sample: C3_PDTX3] (Fig. 3f).

Comparison with other methods

As a final step, we wanted to compare our approach with previously described methods, in particular Disambiguate [12] and Xenome [11]. For this comparison, we selected the 100% human DNA samples and 100% mouse DNA samples WES data, as well as the HRR and MRR RNA-seq data.

WES data were aligned using BWA, compatible with all tested tools. We looked in particular at the total number of reads mapped to the correct genome and having MAPQ score > 0. As reported in Table 4, the ICRG and Disambiguate showed very comparable performances while the ICRG slightly outperformed Xenome, mapping 0.6% more reads to the human genome in the 100% human DNA samples and 2% more reads to the mouse genome in the 100% mouse DNA samples.

Table 4 Comparison of the number of reads assigned to the human and mouse genome using the ICRG, Disambiguate or Xenome

We also compared the CPU time required in an alignment pipeline that included either the ICRG, Disambiguate or Xenome. This was tested in the 90% human and 10% mouse DNA samples since they best approximate a real scenario. The three alignment pipelines are described in Additional file 3. The use of Xenome pipeline required 3–8% more CPU time than the ICRG and the use of Disambiguate required 35–42% more CPU time than the ICRG (Table 5).

Table 5 Comparison of the CPU time required by a WES alignment pipeline including either the ICRG, Disambiguate or Xenome

An alignment pipeline using STAR and either the ICRG, Disambiguate or Xenome was applied to the HRR and MRR RNA-seq samples. As before, we focused on the total number of reads that each method was able to align to the correct genome. Also in RNA-seq data the ICRG and Disambiguate showed equivalent performances while the ICRG mapped to the correct genome an average of 4% more reads than Xenome (Table 4).

Overall, the comparison with existing methods to discriminate reads from two different species highlighted that our approach achieved the same performance as Disambiguate but was significantly faster and outperformed Xenome in terms of accuracy.


The use of PDTXs as preclinical models is growing exponentially because they better resemble clinical tumours compared with cell lines. They are becoming the model of choice to study tumour progression and evolution, heterogeneity and pharmacogenomics [2, 13]. At the same time, sequencing based technology has become the standard for cancer molecular characterization at the genomic, transcriptomic and epigenomic levels [14,15,16]. As previously suggested [11, 17], we show that standard approaches for HTS data analysis based on the alignment of raw data to the human genome can significantly compromise results and data interpretation. The use of a combined reference genome has been informally suggested in the open source community, but we demonstrate here that the alignment to the ICRG is a simple and effective strategy to distinguish between human and mouse reads in PDTX samples, preventing the identification of hundreds of false positive mutations in WES data and enabling the study of transcriptomes and methylomes of both human cancer cells and mouse stroma.

For WES data, we developed a model able to predict the percentage of human/mouse DNA content in independent samples. We applied an earlier version of this model to a cohort of breast cancer PDTXs where the average mouse stroma content was 15% [2]. Such amount of mouse stroma is enough to generate hundreds of false positive mutations if the human reference genome is used for alignment. After alignment to the ICRG, some PDTX specific mutations (i.e. present in the PDTX but not in the matched clinical tumour) were still detected. Importantly, we excluded that these were caused by misaligned mouse reads. PDTX specific mutations have several explanations: spatial heterogeneity in the donor tumour, clonal selection/evolution upon engraftment [1, 2], coverage discrepancies between the human tumour sample and the PDTX, or false positive calls.

One of the solutions adopted in previous studies to limit the high false positive rate caused by misaligned mouse reads, was to obtain Whole Genome Sequencing data for the host mouse and mask all SNVs called after mapping the data against the human genome [1]. Although the method is valid, extra sequencing data need to be obtained and extra analyses need to be run. Moreover, the presence of masked regions (>2 × 106 SNVs) will increase the false negative rate.

The impact of mouse reads in RNA-seq data seems to be significantly smaller, however we still suggest aligning the data to the ICRG to avoid any bias. Moreover, this approach enables an in silico dissection of the tumour (human) and microenvironment (mouse) expression profiles. Obviously, the amount of genes that can be quantified in the mouse compartment depends on both the amount of stromal infiltration (biological variable that we probably want to capture) and the sequencing depth coverage (technical variable that we want to minimise). We therefore recommend that a higher and uniform number of reads is obtained in PDTX RNA-seq experiments. Using a sequencing depth of 21 million reads and with an average mouse read percentage of 8.5%, we found more than 4000 mouse genes with FPKM > 1 in at least 50% of the PDTX samples. Unsupervised analysis of these expression profiles grouped together different passages from the same model, suggesting that each PDTX model induces specific transcriptomic changes in the mouse microenvironment that can be explored using the ICRG approach. Our RNA-seq libraries were sequenced using the HiSeq 4000 Illumina instrument that has been reported to be affected by ‘index hopping’, consisting in around 1% of the reads being assigned to the wrong barcode (i.e. sample) [18]. Although this is unlikely to have a tangible impact in our experimental setting, some of the reads aligned to the wrong genome could be explained by this phenomenon.

It was reassuring to observe that ICRG alignment performed well with RRBS data. In this data type, the bisulphite treatment of samples will convert methylated cytosine bases to thymine and then for downstream analysis all cytosine bases are converted in silico to thymine for alignment purposes (three letter aligners) [19], reducing read complexity and, consequently, making multiple mapping or misalignment more likely. Although, similarly to RNA-seq data, the use of the ICRG for alignment is not strictly required, we would still recommend it since it enables the methylome profiling of the mouse stroma.

An important aspect of this work is that the experiments generated using controlled dilutions represent a relevant benchmark dataset for further investigations. All sequencing data generated in this study are available through the European Genome-Phenome Archive (EGA, under accession number EGAD00001003800.

Importantly, we compared our method with previously reported methods, namely Disambiguate [12] and Xenome [11]. Our method was able to recover a higher number of reads mapped to the correct genome than Xenome, while showing a comparable performance with Disambiguate. However, an alignment pipeline using the latter required around 40% more time to complete, a significant difference for what is the most time-consuming step in the analysis of HTS data. Moreover, the implementation of an ICRG-based pipeline is compatible with any alignment software and does not require any extra software to be installed and incorporated, but only the ‘one-off’ generation of aligner indices. To facilitate a smooth implementation of our method, all the relevant code is available at


In conclusion, we present here a straightforward strategy, based on the use of ICRG for read alignment, which is able to handle the presence of mouse reads in PDTX sequencing data. We demonstrate that this approach is efficient in removing mouse reads before performing somatic mutation calling and that it allows estimation of the human/mouse DNA content in the xenograft sample. In addition, the use of the ICRG enables human tumour and mouse stroma specific analysis of transcriptome and methylome profiles. In a direct comparison with previously reported methods we observed similar or higher performances in terms of accuracy and a significantly reduced computational time.


Sample description

We used a surgical tumour sample and mouse mammary fat pad as a source of pure human and mouse tissue. In this study, we also included 5 breast cancer cases and their matched PDTXs (Additional file 1) that are part of our previously reported biobank [2]. Signed consent was obtained from the patients whose tumour samples were used in this study and all research was conducted with the appropriate approval by the National Research Ethics Service [Cambridgeshire 2 REC reference number: 08/H0308/178]. Mice were bought from Charles River®. Animals were euthanised by cervical dislocation and death confirmed by a secondary method according to Schedule 1 of the Scientific Procedure Act (1986). Tumour tissue was removed in aseptic conditions and all animal experiments were conducted in compliance with the rigorous Home Office framework of regulations (Project License 707,679).

Pure human and mouse reference RNAs were purchased: Universal Human Reference RNA (HRR, Agilent Technologies Inc., USA, 740,000); and Universal Mouse Reference RNA (MRR, Agilent Technologies Inc., USA, 740,100).

Nucleic acid purification

DNA was extracted from all samples using the Qiagen Blood and Tissue kit (Cat ID, 69,504) as per manufacturer’s instructions. To generate a human-mouse DNA dilution series, human and mouse pure DNA concentration was normalised and then mixed in predefined proportions volumetrically.

RNA was extracted from all samples using the Qiagen miRNeasy kit (Cat ID, 217,004) as per manufacturer’s instructions.

Reference genomes for read alignment

Two reference genomes were used in our study. The first was the standard Human Reference Genome (hg19/GRCh37 decoy) hereafter called HRG. The second was the ICRG, generated by combining the aforementioned HRG with the mouse reference genome (mm10). Mouse chromosomes were renamed as “m.chr” and then the two fasta files (human and mouse) were concatenated. The concatenated fasta file was then indexed using the appropriate tool provided by each aligner.

Whole exome sequencing

WES libraries were prepared using Nextera Rapid Capture Exome (Illumina Inc., USA) following manufacturer’s instructions [2]. Sequencing was performed using 75 bp paired-end reads for the human/mouse dilution series and 125 bp paired-end reads for human and PDTX samples. Demultiplexing was performed using bcl2fastq2 v.2.17 software allowing 0 mismatches. Sequencing quality of raw fastq files was assessed using FastQC (v 0.11.5,

Raw data were processed according to the ITC approach described in [20]. Briefly, alignment was performed using BWA-MEM (v 0.7.12) and Novoalign (v 3.02) followed by mutation calling with Mutect2 and Strelka. Only the intersection of mutations called by the same caller after different alignment were retained. Then, mutations called by the two callers were merged to generate the final set of identified somatic mutations (SNVs and Indels). Alignment efficiency (i.e. the percentage of reads that aligned to the reference sequence) and statistics were derived from Novoalign-aligned bam files using Picard Tools (v 1.140) or custom scripts ( The same pipeline was applied using either the HRG or the ICRG in the alignment step. For each PDTX-clinical tumour pair, the variant allele frequencies (VAFs) of the mutations called in at least one sample were re-computed in both samples using GATK HaplotypeCaller (v 3.5). If the VAF was >1% in both samples the mutation was defined as common, otherwise it was defined as either tumour specific or PDTX specific.

In WES data from the human-mouse DNA dilution series aligned using the ICRG, we used custom bash code to compute the percentage of reads mapped to the human genome. Using the R package mgcv [21] these values and the known percentage of human DNA in the sample were used to derive a calibration curve applying penalized regression splines with a basis dimension of 3.


Libraries for Illumina sequencing were prepared using TruSeq Stranded mRNA HT kit (Cat ID, RS-122-2103, Illumina). 500 ng of total RNA with RNA Integrity Numbers (RINs) above 8 was used for library preparation. Samples were processed following manufacturer’s HS (High-Sample) instructions (part# 15031048 Rev. E, Illumina) with 12 cycles of PCR used at the Enrichment of DNA Fragments step. All libraries were quantified using KAPA Library Quantification Kit Illumina ROX Low (Cat ID, KK4873, KAPA Biosystems) and normalised. Libraries were pooled in equal volumes and pools were used for clustering on HiSeq4000 sequencing flow cell following manufacturer’s instructions. Sequencing was performed using 150 bp paired-end run type for dual-indexed libraries.

Demultiplexing was performed using bcl2fastq2 v.2.17 software allowing 0 mismatches. Sequencing quality of raw fastq files was assessed using FastQC (v 0.11.5, and alignment to HRG or ICRG was performed using STAR v2.5.2 in two-pass mode for splice-aware read alignment [22]. The resulting BAM file was then assessed using RNASeQC (v1.1.8) [23].

Counting of reads aligned over exonic features for the purpose of gene expression quantification was performed using the htseq-count script in the HTSeq package (v 0.6.1) in ‘Union’ overlap resolution mode [24]. The Gene Transfer Format (GTF) file used for the purposes of counting was a merged Homo sapiens and Mus musculus GTF file, both obtained from Ensembl (, and modified to ensure chromosomal compatibility with the ICRG. The resulting counts for all samples were then collated and FPKM calculations per gene per sample were performed using the rpkm() function in the edgeR R package [25].


DNA was quantified using Quant-iT High Sensitivity dsDNA Assay (Thermo Fisher, USA) and 200 ng was used as input for RRBS library preparation. DNA was subjected to an optimised protocol [26] and pooled prior to bisulphite conversion using Zymo Research EZ DNA Methylation gold kit (Cat ID, D5006). Pooled bisulphite converted samples were amplified with 15 cycles of PCR and purified twice with SPRI beads (Agencourt AMPure XP, Beckman Coulter, Cat ID A63880) for size selection, using 2X then 1.5X volume of the elute. Libraries were assessed for concentration and quality respectively using qPCR (KAPA Biosystems, KK4873) and DNA High sensitivity chip on Bioanalyser 2100 (Agilent Technologies Inc., USA). RRBS sequencing was performed using 125 bp paired-end reads. Demultiplexing was performed using bcl2fastq2 v.2.17 software allowing 0 mismatches. Sequencing quality of raw fastq files was assessed using FastQC (v 0.11.5, Bismark (version 0.13.1) was used for read alignment and to derive alignment stats. Only CpGs with at least 5× coverage were selected for subsequent analysis. The pipeline was run twice, using HRG and ICRG, respectively.

Comparison with other methods

In the comparison analysis, alignment was performed using BWA-MEM (v 0.7.12) for WES data and STAR v2.5.2 for RNA-seq data. Our approach was compared to Disambiguate [12] (C++ version downloaded from and Xenome [11] included in the Gossamer bioinformatics suite ( The number of reads mapping to each genome was computed using samtools and the grep function as detailed in the code available at For the 90% human and 10% mouse DNA samples, the CPU time was extracted from the pipeline log file and a pipeline description for each method is reported in Additional file 2.

Other analyses

Bam files were visualized using IGV ( Processed data mining and graphical representation of the results were performed using R/Bioconductor (v 3.2.2).



Human Reference Genome


Human Reference RNA


High Throughput Sequencing


In silico Combined human-mouse Reference Genome


Mouse Reference RNA


Patient-Derived Tumour Xenograft


Reduced Representation Bisulfite Sequencing


Single Nucleotide Variant


Variant Allele Frequency


Whole Exome Sequencing


  1. 1.

    Eirew P, Steif A, Khattra J, Ha G, Yap D, Farahani H, et al. Dynamics of genomic clones in breast cancer patient xenografts at single-cell resolution. Nature. 2015;518:422–6.

  2. 2.

    Bruna A, Rueda OM, Greenwood W, Batra AS, Callari M, Batra RN, et al. A biobank of breast cancer explants with preserved intra-tumor heterogeneity to screen anticancer compounds. Cell. 2016;

  3. 3.

    Gao H, Korn JM, Ferretti S, Monahan JE, Wang Y, Singh M, et al. High-throughput screening using patient-derived tumor xenografts to predict clinical trial drug response. Nat Med. 2015;21:1318–25.

  4. 4.

    Hidalgo M, Amant F, Biankin AV, Budinska E, Byrne AT, Caldas C, et al. Patient-derived Xenograft models: an emerging platform for translational cancer research. Cancer Discov. 2014;4:998–1013.

  5. 5.

    Cassidy JW, Batra AS, Greenwood W, Bruna A. Patient-derived tumour xenografts for breast cancer drug discovery. Endocr Relat Cancer. 2016;23:T259–70.

  6. 6.

    Ocana A, Pandiella A, Siu LL, Tannock IF. Preclinical development of molecular-targeted agents for cancer. Nat Rev Clin Oncol. 2011;8:200–9.

  7. 7.

    Tentler JJ, Tan AC, Weekes CD, Jimeno A, Leong S, Pitts TM, et al. Patient-derived tumour xenografts as models for oncology drug development. Nat Rev Clin Oncol. 2012;9:338–50.

  8. 8.

    Byrne AT, Alferez DG, Amant F, Annibali D, Arribas J, Biankin AV, et al. Interrogating open issues in cancer precision medicine with patient-derived xenografts. Nat Rev Cancer. 2017;17:254–68.

  9. 9.

    DeRose YS, Wang G, Lin Y-C, Bernard PS, Buys SS, Ebbert MTW, et al. Tumor grafts derived from women with breast cancer authentically reflect tumor pathology, growth, metastasis and disease outcomes. Nat Med. 2011;17:1514–20.

  10. 10.

    Reyal F, Guyader C, Decraene C, Lucchesi C, Auger N, Assayag F, et al. Molecular profiling of patient-derived breast cancer xenografts. Breast Cancer Res. 2012;14:R11.

  11. 11.

    Conway T, Wazny J, Bromage A, Tymms M, Sooraj D, Williams ED, et al. Xenome--a tool for classifying reads from xenograft samples. Bioinformatics. 2012;28:i172–8.

  12. 12.

    Ahdesmäki MJ, Gray SR, Johnson JH, Lai Z. Disambiguate: an open-source application for disambiguating two species in next generation sequencing data from grafted samples. F1000Res. 2016;5:2741.

  13. 13.

    Cassidy JW, Caldas C, Bruna A. Maintaining tumor heterogeneity in patient-derived tumor Xenografts. Cancer Res. 2015;75:2963–8.

  14. 14.

    Lawrence MS, Stojanov P, Mermel CH, Robinson JT, Garraway LA, Golub TR, et al. Discovery and saturation analysis of cancer genes across 21 tumour types. Nature. 2014;505:495–501.

  15. 15.

    Zack TI, Schumacher SE, Carter SL, Cherniack AD, Saksena G, Tabak B, et al. Pan-Cancer patterns of somatic copy number alteration. Nat Genet. 2013;45:1134–40.

  16. 16.

    E. Ellsworth R, J. Decewicz D, D. Shriver C, L. Ellsworth D. Breast Cancer In the personal genomics era. Curr Genomics 2010;11:146–161. doi:

  17. 17.

    Khandelwal G, Girotti MR, Smowton C, Taylor S, Wirth C, Dynowski M, et al. Genomics next-generation sequencing analysis and algorithms for PDX and CDX models. Mol Cancer Res. 1

  18. 18.

    Illumina. Effects of index Misassignment on multiplexing and downstream analysis. Accessed 16 Oct 2017.

  19. 19.

    Krueger F, Andrews SR. Bismark: a flexible aligner and methylation caller for Bisulfite-Seq applications. Bioinformatics. 2011;27:1571–2.

  20. 20.

    Callari M, Sammut S-J, De Mattos-Arruda L, Bruna A, Rueda OM, Chin S-F, et al. Intersect-then-combine approach: improving the performance of somatic variant calling in whole exome sequencing data using multiple aligners and callers. Genome Med. 2017;9:35.

  21. 21.

    Wood SN. Generalized additive models: an introduction with Boca Raton: R: Chapman & Hall/CRC; 2006.

  22. 22.

    Dobin A, Davis CA, Schlesinger F, Drenkow J, Zaleski C, Jha S, et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics. 2013;29:15–21.

  23. 23.

    DeLuca DS, Levin JZ, Sivachenko A, Fennell T, Nazaire M-D, Williams C, et al. RNA-SeQC: RNA-seq metrics for quality control and process optimization. Bioinformatics. 2012;28:1530–2.

  24. 24.

    Anders S, Pyl PT, Huber W. HTSeq--a python framework to work with high-throughput sequencing data. Bioinformatics. 2015;31:166–9.

  25. 25.

    Robinson MD, McCarthy DJ, Smyth GK. edgeR: a bioconductor package for differential expression analysis of digital gene expression data. Bioinforma. 2010;26:139–40.

  26. 26.

    Tufegdzic Vidakovic A, Rueda OM, Vervoort SJ, Sati Batra A, Goldgraben MA, Uribe-Lewis S, et al. Context-specific effects of TGF-β/SMAD3 in cancer are modulated by the Epigenome. Cell Rep. 2015;13:2480–90.

Download references


We are grateful to Cancer Research UK, the University of Cambridge and Hutchison Whampoa Limited for their support. The Human Research Tissue Bank is supported by the NIHR Cambridge Biomedical Research Centre. We thank the Cancer Research UK Cambridge Institute Core Facilities that supported aspects of this work: Genomics, Biological Resources, and Bio-repository.


This research was supported with funding from Cancer Research UK and from the European Union to the EUROCAN Network of Excellence (FP7; grant number 260791). M.C. has received funding from the European Union’s Horizon 2020 research and innovation program under the Marie Sklodowska-Curie grant agreement no. 660060 and was supported by the Department of Experimental Oncology and Molecular Medicine, Fondazione IRCCS Istituto Nazionale dei Tumori, Milan, Italy.

Availability of data and materials

All sequencing data generated in this study are available through the European Genome-Phenome Archive (EGA, under accession number EGAD00001003800. Relevant code is available at

Author information

MC, CH, ABR, OR, CC conceived the study; ABA performed nucleic acid extraction and generated the dilution series; ABA and SFC generated RRBS libraries; WG, ABA, ABR collected the clinical samples, derived and maintained the PDTX samples; MC developed the computational approach and performed the analyses; RB, SJS, HC, OR contributed to data analysis; MC drafted the manuscript, all authors revised and approved the final manuscript

Correspondence to Carlos Caldas.

Ethics declarations

Ethics approval and consent to participate

Signed consent was obtained from the patients whose tumour samples were used in this study and all research was conducted with the appropriate approval by the National Research Ethics Service [Cambridgeshire 2 REC reference number: 08/H0308/178] and animal experiments conducted in compliance with UK Home Office guidelines.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional files

Additional file 1:

Molecular data generated - Table detailing which molecular data (WES, RNA-seq or RRBS) were generated for each sample included in the study. (XLSX 39 kb)

Additional file 2:

WES alignment pipelines including either the ICRG, Disambiguate or Xenome - Schematic representation of the WES alignment pipelines developed to compare the ICRG method with Disambiguate and Xenome. (PPTX 41 kb)

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (, which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver ( applies to the data made available in this article, unless otherwise stated.

Reprints and Permissions

About this article

Verify currency and authenticity via CrossMark

Cite this article

Callari, M., Batra, A.S., Batra, R.N. et al. Computational approach to discriminate human and mouse sequences in patient-derived tumour xenografts. BMC Genomics 19, 19 (2018) doi:10.1186/s12864-017-4414-y

Download citation


  • In silico combined human-mouse reference genome
  • High throughput sequencing
  • Short-reads
  • Alignment
  • ICRG
  • Mouse stroma
  • Patient-derived tumour xenografts