Study design
Patients under medical evaluation for ILD who were 18 years of age or older and who were undergoing a planned, clinically indicated lung biopsy procedure to obtain a histopathology diagnosis were eligible for enrollment in a multi-center sample collection study (BRonchial sAmple collection for a noVel gEnomic test; BRAVE) [7]. Patients for whom a bronchoscopy procedure was not indicated, not recommended, or difficult were not eligible for participation in the BRAVE study. Patients were grouped based on the type of biopsy being performed for pathology: BRAVE-1 patients underwent surgical lung biopsy (SLB), BRAVE-2 patients underwent TBB, and BRAVE-3 patients underwent transbronchial cryobiopsy. The study was approved by institutional review boards, either centrally (Western IRB) or at each institution, and all patients provided informed consent prior to their participation.
During study accrual, 201 BRAVE patients were prospectively divided into a group of 113 considered for use in training (enrolled December 2012 to July 2015) and a group of 88 used in validation (enrolled August 2014 through May 2016). The training group ultimately yielded 90 patients with usable RNA sequence data and reference labels derived from pathology (described below) that were used to train and cross-validate the models. The independent test group yielded 49 patients who met prospectively defined inclusion criteria related to sample handling, sample adequacy, and the determination of reference truth labels. All clinical information related to the test set, including reference labels and associated pathology, was blinded to the algorithm development team until after the classifier parameters were finalized and locked. The test set was prospectively scored by an independent third party who was not involved in algorithm development, sequencing data generation, or reference label generation.
Pathology reviews and label assignment
Histopathology diagnoses were determined centrally by a consensus of three expert pathologists using biopsies and slides collected specifically for pathology, following processes described previously [7, 12]. The central pathology diagnoses were determined separately for each lung lobe sampled for pathology. A reference label was then determined for each patient from the aggregated lobe-level diagnoses according to the following rules: if any lobe was diagnosed as any UIP subtype, e.g. Classic UIP (all features of UIP present), Difficult UIP (not all features of classic UIP well represented), Favor UIP (fibrosing interstitial process with UIP leading the differential), or combinations of these, then ‘UIP’ was assigned as the reference label for that patient. If any lung lobe was diagnosed with a ‘non-UIP’ pathology condition [7] and the other lobe was non-diagnostic or diagnosed with unclassifiable fibrosis (e.g. chronic interstitial fibrosis, not otherwise classified (CIF, NOC)), then ‘non-UIP’ was assigned as the patient-level reference label. When all lobes were diagnosed with unclassifiable fibrosis or were non-diagnostic, no reference label could be assigned and the patient was excluded. This patient-level reference label process was identical between the training and test sets.
Molecular testing, sequencing pipeline, and data QC
Up to five TBB samples were collected from each patient; typically, two upper lobe and three lower lobe samples were obtained. TBB samples for molecular testing were placed into a nucleic acid preservative (RNAprotect, QIAGEN, Valencia, CA) and stored at 4 °C for up to 18 days, prior to and during shipment to the development laboratory, followed by frozen storage. Total RNA was extracted (AllPrep DNA/RNA Micro Kit, QIAGEN), quantitated (QuantiFluor RNA System, Promega, Madison, WI), pooled by patient where appropriate, and 15 ng was used as input to the TruSeq RNA Access Library Prep procedure (Illumina, San Diego, CA), which enriches for the coding transcriptome using multiple rounds of amplification and hybridization to probes specific to exonic sequences. Libraries that met in-process yield criteria were sequenced on NextSeq 500 instruments (2 × 75 bp paired-end reads) using the High Output kit (Illumina, San Diego, CA). Raw sequencing (FASTQ) files were aligned to the Human Reference assembly 37 (Genome Reference Consortium) using the STAR RNA-seq aligner [13]. Raw read counts for 63,677 Ensembl-annotated gene-level features were summarized using HTSeq [14]. Data quality metrics were generated using RNA-SeQC [15]. Library sequence data that met minimum criteria for total reads, mapped unique reads, mean per-base coverage, base duplication rate, percentage of bases aligned to coding regions, base mismatch rate, and uniformity of coverage within genes were accepted for use in downstream analysis.
Normalization
Sequence data were filtered to exclude any features that were not targeted for enrichment by the library assay, resulting in 26,268 genes. For the training set, expression count data for the 26,268 Ensembl genes were normalized by size factors estimated with the median-of-ratios method and transformed by a parametric variance-stabilizing transformation (VST), asymptotically equal to the logarithm to base 2 (DESeq2 package [16]). To mimic the processing of future clinical patients, the vector of per-gene geometric means and the VST parameters from the training set were frozen and reapplied to the independent test set for normalization.
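A minimal sketch of this frozen-parameter normalization using DESeq2 is shown below; `counts_train`, `counts_test`, `pheno_train`, and `pheno_test` are illustrative placeholders for gene-by-sample count matrices and sample annotation, and the code is a sketch of the general approach rather than the exact production pipeline.

```r
library(DESeq2)

## Training set: size factors (median-of-ratios) and the parametric VST.
dds_train <- DESeqDataSetFromMatrix(counts_train, colData = pheno_train, design = ~ 1)
dds_train <- estimateSizeFactors(dds_train)
dds_train <- estimateDispersions(dds_train, fitType = "parametric")
vst_train <- varianceStabilizingTransformation(dds_train, blind = FALSE)

## Freeze the per-gene geometric means (genes with any zero count get a
## geometric mean of 0 and are ignored in the size-factor calculation).
geo_means <- exp(rowMeans(log(counts(dds_train))))

## Test set: reapply the frozen parameters rather than re-estimating them.
dds_test <- DESeqDataSetFromMatrix(counts_test, colData = pheno_test, design = ~ 1)
dds_test <- estimateSizeFactors(dds_test, geoMeans = geo_means)
dispersionFunction(dds_test) <- dispersionFunction(dds_train)
vst_test <- varianceStabilizingTransformation(dds_test, blind = FALSE)
```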
For algorithm training and development, RNA sequence data were generated separately for each of 354 individual TBB samples from 90 patients. Eight additional TBB samples (‘sentinels’) were replicated in each of eight processing runs, from total RNA through to sequence data, to monitor potential batch effects. For independent validation, total RNA extracted from a minimum of three and a maximum of five TBBs per patient was mixed by equal mass within each patient prior to library preparation and sequencing. Each patient in the training set thus contributes up to five sequenced samples to training, whereas each patient in the test set is represented by a single sequenced sample, analogous to the planned testing of clinical samples.
Differential expression analysis
We first explored whether differentially expressed genes found using a standard pipeline [17] could be applied directly to classify UIP from non-UIP samples. Differentially expressed genes were identified using DESeq2, a Bioconductor R package [16]. Raw gene-level expression counts of the 26,268 genes in the training set were used for the differential analysis. An adjusted p-value < 0.05 and a fold change > 2 were used as cutoffs to select differentially expressed genes. Within the training set, we performed pairwise differential analyses between all non-UIP and UIP samples, and between UIP samples and each non-UIP pathology subtype with more than 10 samples available, including bronchiolitis (N = 10), hypersensitivity pneumonitis (HP) (N = 13), nonspecific interstitial pneumonia (NSIP) (N = 12), organizing pneumonia (OP) (N = 23), respiratory bronchiolitis (RB) (N = 16), and sarcoidosis (N = 11). Principal component analysis (PCA) plots of all the training samples were generated using the differentially expressed genes identified above.
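As an illustration, one such DESeq2 comparison might look like the sketch below; `counts_train` and `pheno_train` (with a two-level `label` factor coded UIP / non_UIP) are hypothetical placeholders, and the thresholds mirror those stated above.

```r
library(DESeq2)

dds <- DESeqDataSetFromMatrix(counts_train, colData = pheno_train, design = ~ label)
dds <- DESeq(dds)                               # size factors, dispersions, Wald tests
res <- results(dds, contrast = c("label", "UIP", "non_UIP"))

## Select genes at adjusted p < 0.05 and absolute fold change > 2 (|log2FC| > 1).
de_genes <- rownames(res)[which(res$padj < 0.05 & abs(res$log2FoldChange) > 1)]

## PCA of all training samples restricted to the differentially expressed genes.
vsd <- varianceStabilizingTransformation(dds, blind = TRUE)
plotPCA(vsd[de_genes, ], intgroup = "label", ntop = length(de_genes))
```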
Gene expression correlation heatmap
The correlation (r²) values between samples from six representative patients were computed using their VST gene expression, and a heatmap of the correlation matrix with patient order preserved was plotted to visualize intra- and inter-patient heterogeneity in gene expression. The six patients were selected to represent the full spectrum of within-patient heterogeneity, including two non-UIP and two UIP patients with the same or similar pathology subtypes in the upper and lower lobes, and two patients (one UIP, one non-UIP) with different pathology subtypes in the upper versus lower lobes. The heatmap was generated using the gplots R package.
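A sketch of this visualization, assuming `vst_mat` is a gene-by-sample VST matrix already restricted to the TBB samples of the six selected patients and ordered by patient (the object name and color scheme are illustrative):

```r
library(gplots)

r2 <- cor(vst_mat, method = "pearson")^2        # sample-by-sample r^2 matrix

heatmap.2(r2,
          Rowv = FALSE, Colv = FALSE,           # preserve the patient ordering
          dendrogram = "none",
          trace = "none",
          col = colorRampPalette(c("white", "steelblue"))(100),
          main = "Within- and between-patient expression correlation (r^2)")
```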
Classifier development
We summarize the development and evaluation of the classifier in Fig. 1b. Our specific goal is to build a robust binary classifier on TBB samples that provides accurate and reproducible UIP/non-UIP predictions. We designed a high-specificity test (specificity > 85%) to ensure a high positive predictive value. Thus, when the test predicts UIP, that result is associated with high confidence.
Feature filtering for classifier development
First, features that are not biologically meaningful, or that are uninformative because of low expression and little variation among samples, were removed. We filtered out genes annotated in Ensembl as pseudogenes, ribosomal RNAs, or individual exons of T-cell receptor or immunoglobulin genes, and excluded lowly expressed genes whose raw counts were below 5 across the entire training set or that had non-zero counts in fewer than 5% of training samples.
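The expression-level portion of this filter can be written as a short sketch, assuming `counts_train` is the raw gene-by-sample count matrix and `annot` is a hypothetical data frame of Ensembl biotypes (the biotype labels shown are illustrative, and the count threshold is read as a total across the training set):

```r
## Remove unwanted biotypes (pseudogenes, rRNA, TR/IG exon-level features).
drop_biotypes <- c("pseudogene", "rRNA", "TR_gene_exon", "IG_gene_exon")  # illustrative labels
keep_biotype  <- !(annot$biotype %in% drop_biotypes)

## Keep genes with a total raw count of at least 5 across the training set
## and non-zero counts in at least 5% of training samples.
keep_total    <- rowSums(counts_train) >= 5
keep_detected <- rowMeans(counts_train > 0) >= 0.05

counts_filt <- counts_train[keep_biotype & keep_total & keep_detected, ]
```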
We also excluded genes with highly variable expression in identical samples processed in multiple batches, as such variability suggests sensitivity to technical rather than biological factors. To identify these genes, we fitted a linear mixed-effect model on the sentinel TBB samples that were processed across multiple assay plates. We fitted this model for each gene separately:
$$ {g}_{ij}=\mu +\beta\,{sample}_{ij}+{batch}_{i}+{e}_{ij} $$
(1)
where \( g_{ij} \) is the gene expression of sample j and batch i, \( \mu \) is the average gene expression for the entire set, \( {sample}_{ij} \) is a fixed effect of biologically different samples, and \( {batch}_{i} \) is the batch-specific random effect. The total variation was used to identify highly variable genes; the top 5% of genes by this measure were excluded (see Additional file 1: Figure S1). Thus, 17,601 Ensembl genes remained as candidates for the downstream analysis.
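One way to carry out this per-gene variance screen is with a standard mixed-model package; the sketch below assumes a long-format data frame `sentinel_df` with columns `expr` (VST expression), `gene`, `sample_id` (the biologically distinct sentinel sample), and `batch` (all names illustrative), and reads the "total variation" as the sum of the batch and residual variance components.

```r
library(lme4)

## Fit model (1) per gene on the replicated sentinel samples and sum the
## batch and residual variance components as the measure of variation.
gene_variation <- sapply(split(sentinel_df, sentinel_df$gene), function(d) {
  fit <- lmer(expr ~ sample_id + (1 | batch), data = d)
  vc  <- as.data.frame(VarCorr(fit))     # random-effect and residual variances
  sum(vc$vcov)
})

## Exclude the top 5% most variable genes from the candidate feature set.
keep_genes <- names(gene_variation)[gene_variation <= quantile(gene_variation, 0.95)]
```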
In silico mixing within patient
The classifiers were trained and optimized on individual TBB samples to maximize sampling diversity and the information content available during the feature selection and weighting process. However, for patients in the test set and in the commercial setting, multiple TBB samples would ideally be pooled at the post-extraction stage, as RNA, and processed as pooled RNA in a single reaction through library prep, sequencing and classification [7]. To evaluate whether a classifier developed on individual samples could maintain high performance on pooled samples, we developed a method to simulate pooled samples “in silico” from individual sample data. First, raw read counts were normalized by size factors computed using per-gene geometric means across the entire training set. The normalized count \( C_{ij} \) for sample i = 1, …, n and gene j = 1, …, m is computed by
$$ {C}_{ij}={K}_{ij}/{s}_{i} $$
where \( {s}_i=\underset{j}{\mathrm{median}}\frac{K_{ij}}{{\left(\prod \limits_{v=1}^n{K}_{vj}\right)}^{1/n}} \) is the size factor of sample i and \( K_{ij} \) is the raw count for sample i and gene j. Then, for each training patient p = 1, …, P, the in silico mixed count \( {K}_{ij}^p \) is defined by
$$ {K}_{ij}^p=\frac{1}{n_p}\sum \limits_{i\in I(p)}{C}_{ij} $$
where I(p) is the index set of the individual samples i that belong to patient p and \( n_p \) is the number of such samples. The frozen variance-stabilizing transformation (VST) from the training set was applied to \( {K}_{ij}^p \).
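Translated directly into code, the normalization and mixing steps might look like the sketch below, where `counts_train` is a gene-by-sample raw count matrix and `patient_id` is a character vector giving the patient of each training sample (both illustrative names):

```r
## Per-sample size factors via the median-of-ratios method: the reference for
## each gene is its geometric mean across all training samples.
geo_mean <- exp(rowMeans(log(counts_train)))            # per-gene geometric mean
ok       <- is.finite(geo_mean) & geo_mean > 0          # skip genes with zero counts
s        <- apply(counts_train[ok, ], 2, function(k) median(k / geo_mean[ok]))

## Normalized counts C_ij = K_ij / s_i (each column divided by its size factor).
norm_counts <- sweep(counts_train, 2, s, "/")

## In silico patient-level mixing: average the normalized counts of the
## TBB samples belonging to each patient (K^p in the text).
mixed_counts <- sapply(unique(patient_id), function(p) {
  rowMeans(norm_counts[, patient_id == p, drop = FALSE])
})

## The frozen training-set VST is then applied to mixed_counts (see above).
```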
Training classifiers
As the test is intended to recognize and call a reference label defined by pathology [1], we defined the reference label to be the response variable in classifier training [4], and the exome-enriched, filtered and normalized RNA sequence data as the predictive features. We evaluated multiple classification models, including random forest, support vector machine (SVM), gradient boosting, neural networks, and penalized logistic regression [18]. Each classifier was evaluated based on 5-fold cross-validation and leave-one-patient-out cross-validation (LOPO CV) [19]. Ensemble models were also examined by combining individual machine learning methods via a weighted average of the individual models’ scores [20].
To minimize overfitting during training and evaluation, we stratified each cross-validation fold such that all data from a single patient were either entirely included in or entirely held out from a given fold. Hyper-parameter tuning was performed within each cross-validation split in a nested cross-validation manner [21]. We used random search and the one-standard-error rule [19] to select the best parameters from the inner CV, further minimizing potential overfitting. Finally, we repeated the hyper-parameter tuning on the full training set to define the parameters of the final, locked classifier.
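As an illustration of the patient-level cross-validation structure, the sketch below runs leave-one-patient-out CV with one of the evaluated model families (penalized logistic regression via glmnet); `x` (sample-by-gene VST matrix), `y` (two-level UIP/non-UIP factor), and `patient_id` are illustrative, and glmnet's internal CV stands in for the nested random search described above.

```r
library(glmnet)

patients   <- unique(patient_id)
lopo_score <- setNames(rep(NA_real_, length(patients)), patients)

for (p in patients) {
  train_idx <- patient_id != p                 # hold out all samples from one patient
  ## Inner CV (glmnet's internal CV) selects lambda; the one-standard-error
  ## rule corresponds to lambda.1se.
  fit  <- cv.glmnet(x[train_idx, ], y[train_idx], family = "binomial")
  ## Score the held-out patient's sample(s); average if there is more than one.
  pred <- predict(fit, newx = x[!train_idx, , drop = FALSE],
                  s = "lambda.1se", type = "response")
  lopo_score[p] <- mean(pred)
}
```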
Best practices for a fully independent validation require that all classifier parameters, including the test decision boundary, be prospectively defined; this must therefore be done using only the training set data. Since the test set would be composed of TBBs pooled at the patient level, an in silico mixing model was used to simulate the distribution of patient-level pooled scores within the training set. We simulated within-patient mixtures 100 times at each LOPO CV fold, with gene-level technical variability added to the VST gene expression. The gene-level technical variability was estimated using the mixed-effect model (Eq. (1)) on the TBB samples replicated across multiple processing batches. The final decision boundary was chosen to optimize specificity (> 0.85) without severely compromising sensitivity (≥ 0.65). Performance was estimated using patient-level LOPO CV scores from the replicated in silico mixing simulations. To be conservative on specificity, we required an average specificity of greater than 90% when choosing the final decision boundary. Among decision boundaries with similar estimated performance in simulation, the boundary with the highest specificity was chosen (see Additional file 1: Figure S2).
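The boundary search itself amounts to scanning candidate thresholds over the simulated patient-level scores; the sketch below assumes a patients-by-simulations score matrix `sim_scores` and patient-level reference labels `truth` (both illustrative names), with the criteria stated above.

```r
candidate_thr <- seq(min(sim_scores), max(sim_scores), length.out = 200)

perf <- t(sapply(candidate_thr, function(thr) {
  calls <- sim_scores > thr                                        # TRUE = UIP call
  sens  <- mean(colMeans(calls[truth == "UIP", , drop = FALSE]))   # averaged over simulations
  spec  <- mean(colMeans(!calls[truth == "non-UIP", , drop = FALSE]))
  c(threshold = thr, sensitivity = sens, specificity = spec)
}))

## Keep thresholds with average specificity > 0.90 and sensitivity >= 0.65,
## then take the one with the highest specificity.
ok  <- perf[, "specificity"] > 0.90 & perf[, "sensitivity"] >= 0.65
thr <- perf[ok, , drop = FALSE]
decision_boundary <- thr[which.max(thr[, "specificity"]), "threshold"]
```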
Evaluation of batch effects and monitoring scheme for future samples
To ensure that classification performance extends to a future, unseen clinical patient population, it is crucial that no severe technical factors, referred to as batch effects, cause significant shifts, rotations, compressions, or expansions of score distributions over time. To quantify batch effects in existing data and to evaluate the robustness of the candidate classifiers to observable batch effects, we scored nine different TBB pools, run in triplicate within each batch and processed across three different processing batches, and used a linear mixed-effect model to evaluate the variability of scores for each classifier. The model most robust against batch effects, as indicated by low score variability in the linear mixed model, was chosen as the final model for independent validation.
To monitor batch effects, we processed UIP and non-UIP control samples in each new processing batch. To capture a potential batch effect, we compared the scores of these replicated control samples and monitored whether estimated score variability remains smaller than the pre-specified threshold, \( \sigma_{sv} \), determined in training using the in silico patient-level LOPO CV scores (see Additional file 1 for details regarding how acceptable variability levels were determined).
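In practice this monitoring reduces to re-estimating the score variability of the replicated control samples and comparing it with the frozen threshold; a minimal sketch, assuming a data frame `ctrl` with columns `score` and `batch` and a pre-specified limit `sigma_sv` (names illustrative; total score variability is used here as the monitored quantity):

```r
library(lme4)

## Variability of the control-sample classifier scores across batches.
fit       <- lmer(score ~ 1 + (1 | batch), data = ctrl)
vc        <- as.data.frame(VarCorr(fit))
sigma_hat <- sqrt(sum(vc$vcov))          # total (batch + residual) score SD

## Flag the batch if variability exceeds the pre-specified threshold.
batch_ok <- sigma_hat <= sigma_sv
```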
Independent validation
The final candidate classifier was prospectively validated on a blinded, independent test set of pooled TBB samples from 49 patients. Classification scores on the test set were derived using the locked algorithm and compared against the pre-set decision boundary to produce binary UIP vs. non-UIP calls: classification scores above the decision boundary were called UIP, and those equal to or below the decision boundary were called non-UIP. The continuous classification scores were compared against the histopathology reference labels to construct the receiver operating characteristic (ROC) curve and calculate the area under the curve (AUC). The binary classification predictions were compared against the histopathology reference labels to calculate sensitivity and specificity.
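These validation summaries can be computed directly from the locked scores; a sketch assuming vectors `test_score` (continuous classifier scores) and `test_label` (histopathology reference labels) and the frozen `decision_boundary` (all names illustrative):

```r
library(pROC)

## ROC curve and AUC from the continuous scores.
roc_obj <- roc(response = test_label, predictor = test_score,
               levels = c("non-UIP", "UIP"), direction = "<")
auc(roc_obj)

## Binary calls against the locked decision boundary.
calls       <- ifelse(test_score > decision_boundary, "UIP", "non-UIP")
sensitivity <- mean(calls[test_label == "UIP"] == "UIP")
specificity <- mean(calls[test_label == "non-UIP"] == "non-UIP")
```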