The EGFR TKI erlotinib was shown in previous clinical trials to increase survival when used as monotherapy in previously treated patients with advanced NSCLC. The toxicity of erlotinib is markedly lower than that of many alternative pharmacologic treatments, and erlotinib would clearly be a preferred therapeutic option if survival were shown to be equivalent to or better than that achieved with other second-line agents. Since only a fraction of patients respond to such therapy, a priori identification of responders could have a large effect on survival. Many clinical parameters have been shown to correlate with response to EGFR TKIs, including smoking history, gender, ethnicity, and tumor histology. Additionally, EGFR expression levels, phosphorylation status of EGFR, and mutations within the kinase domain [22, 28, 31] also correlate with sensitivity to some degree. Although these predictors of response overlap to some extent, potential responders to EGFR-targeted therapeutics may still be overlooked. In the same vein, a significant number of patients selected for treatment with EGFR TKI will fail therapy. Therefore, we undertook this study with the hypothesis that a gene expression signature of response will capture more of the variability within the tumor and predict EGFR TKI sensitivity better than currently preferred methods. Furthermore, closer examination of the genes within this signature will allow for greater understanding of the effects of aberrant EGFR signalling, as well as potential elucidation of new drug targets.
Using NSCLC cell lines as tumor surrogates and previous findings as guidance, we sought to train our model by stratifying cell lines by drug sensitivity. Three sensitive cell lines were chosen as training data: H3255, PC9, and H1650. The A549 and UKY-29 cell lines were resistant to treatment and were also used as training data. The cell lines resistant to EGFR TKI harbour K-Ras mutations, while the sensitive cell lines used in the training set all harbour EGFR mutations, as previously reported; this finding is consistent with the hypothesis that K-Ras mutations and EGFR mutations are mutually exclusive in NSCLC.
Our hypothesis is anchored in the concept that while many factors correlate with sensitivity to EGFR inhibition, distinct combinations of signalling pathway deregulation may underlie the observed phenotype. Therefore, a gene expression signature capturing this complexity may be a more accurate predictor of response to EGFR TKI, and we defined a gene expression signature that utilizes our knowledge of signal transduction to model the phenotype of sensitivity.
Approximately 1500 genes were significantly different between our sensitive and resistant training cell lines, and while many of these genes may be important in our phenotype of response, we reasoned that a significant portion may be artifacts of two-dimensional growth and cell culture conditions. We filtered the 1500 differentially expressed genes based on ontological annotation, allowing us to focus our signature on those genes that are important for cell signalling and are more likely to influence response to inhibition of the EGFR signalling cascade. To our knowledge, this is a novel approach to feature selection within a predictive gene signature study. A limitation of this approach is that genes that may contribute to pharmacokinetic variability, such as transporters and metabolic enzymes, would be omitted from the signature. Furthermore, markers of epithelial to mesenchymal transition (EMT), which have been shown to correlate with sensitivity to EGFR TKI [12, 13], are not present in our final predictive signature because of the filtering by gene ontology. It is of note that the SAM analysis identified several EMT genes, such as vimentin, E-cadherin, and β-catenin, as differentially expressed within the 1500-gene training data set (data not shown).
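The filtering step, and the limitation it introduces, can be illustrated with a minimal sketch. This is not the annotation pipeline used in the study: the function name filter_by_annotation, the annotation terms, and the toy gene-to-term mapping are all hypothetical and serve only to show how restricting to signalling annotations drops transporter and EMT genes.

```python
from typing import Dict, List, Set


def filter_by_annotation(
    genes: List[str],
    annotations: Dict[str, Set[str]],
    keep_terms: Set[str],
) -> List[str]:
    """Keep genes annotated with at least one term in keep_terms, preserving order."""
    return [g for g in genes if annotations.get(g, set()) & keep_terms]


# Toy, made-up annotations for illustration only (not real ontology output).
toy_annotations = {
    "EGFR": {"signal transduction", "protein kinase activity"},
    "SRC": {"signal transduction"},
    "VIM": {"cytoskeleton organization"},
    "ABCG2": {"transmembrane transport"},
}

# Hypothetical list of differentially expressed genes.
toy_hits = ["EGFR", "VIM", "SRC", "ABCG2", "UNANNOTATED1"]

kept = filter_by_annotation(toy_hits, toy_annotations, {"signal transduction"})
dropped = [g for g in toy_hits if g not in kept]

print("kept:", kept)
print("dropped:", dropped)
```

In this toy example the EMT marker and the transporter are both removed even though they were differentially expressed, which mirrors the trade-off described above.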
We defined a set of 180 features representing genes that are differentially expressed between EGFR inhibition-sensitive and EGFR inhibition-resistant cell lines and that are enriched for signal transduction functions, including a number of previously identified oncogenes such as Src, B-Raf, and PI3K that function downstream of EGFR activation. EGFR itself was identified as significantly deregulated, which is consistent with the observation that EGFR expression may correlate with sensitivity.
GATHER allowed us to interrogate KEGG pathways in the analysis of the genes included in the 180-gene signature and identified deregulation within the PI3K and MAPK pathways between sensitive and resistant cell lines. Interestingly, both of these pathways are downstream of EGFR, providing further evidence of their importance in NSCLC. Consistent with this finding, several subunits of PI3K, including both the catalytic and regulatory subunits, were found to be highly expressed in the EGFR TKI-sensitive cells.
Analysis of transcription factor binding elements using GATHER also identified strong commonalities among the genes included in the signature. A high proportion of the genes are likely regulated by the E2F family of transcription factors and/or c-MYC/MAX transcription factors, suggesting that common regulatory mechanisms may lead to the phenotypic difference between EGFR TKI-sensitive and -resistant cells. Importantly, both activating E2Fs and Myc are recognized as essential cell cycle regulators and bind to promoters of genes important for driving cellular proliferation.
Many of the 180 features of our EGFR signature represent genes, described above, that were observed to have large differences with low variability in our system. Since our leave-one-out cross-validation yielded a 0% misclassification error, there may be concern that over-fitting of the model has occurred. A full leave-one-out cross-validation (i.e., features are reselected and model parameters are rebuilt at each iteration) is a stringent and relatively unbiased estimate of the error of the model-building algorithm [34, 35]. However, to ensure that the treatment of replicate cell line samples as independent samples in our model did not result in cross-validation bias, we performed additional internal validation experiments. Subsequent cross-validation was performed in which all data from each cell line were removed in turn (features were reselected and weights were recalculated based on the data from only 4 cell lines, and the samples from the 5th cell line were predicted using the new model). This method of cross-validation also yielded a high degree of accuracy, in that all cell lines were predicted correctly with the exception of 3 of 8 A549 samples (data not shown). We also constructed a second predictive model of EGFR TKI sensitivity using balanced numbers of replicates in both training classes. We found that although 111 genes of the resulting 169-gene model were common to the 180-gene signature, the resulting model did not exactly replicate the classifications of the 10-, 50-, and 180-gene models. The differences could be due to a lack of statistical power in the second model or to the use of all of the replicate measurements for the training cell lines in the original model. Thus, we may observe an artificial increase in our statistical power by using the 180-gene predictive model of EGFR TKI sensitivity.
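The cell-line-level cross-validation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the study's code: features are ranked by the absolute difference in class means as a stand-in for the SAM analysis, a nearest-centroid rule stands in for the classifier, and the cell line names and expression values are invented. The point is only that all replicates of one cell line are held out together and that features are reselected from the remaining lines at every iteration.

```python
from typing import Dict, List, Sequence, Tuple

# (cell line name, class label, expression values)
Sample = Tuple[str, str, Sequence[float]]
LABELS = ("sensitive", "resistant")


def class_means(samples: List[Sample], label: str) -> List[float]:
    """Per-feature mean over all samples carrying the given label."""
    rows = [x for _, y, x in samples if y == label]
    return [sum(col) / len(rows) for col in zip(*rows)]


def select_features(train: List[Sample], k: int) -> List[int]:
    """Rank features by |mean difference| between classes (stand-in for SAM)."""
    sens = class_means(train, "sensitive")
    res = class_means(train, "resistant")
    ranked = sorted(range(len(sens)), key=lambda j: abs(sens[j] - res[j]), reverse=True)
    return ranked[:k]


def predict(train: List[Sample], features: List[int], x: Sequence[float]) -> str:
    """Nearest-centroid prediction restricted to the selected features."""
    def dist(label: str) -> float:
        centroid = class_means(train, label)
        return sum((x[j] - centroid[j]) ** 2 for j in features)
    return min(LABELS, key=dist)


def leave_one_cell_line_out(samples: List[Sample], k: int) -> Dict[str, int]:
    """Hold out every replicate of one cell line; reselect features each time."""
    errors: Dict[str, int] = {}
    for held_out in sorted({name for name, _, _ in samples}):
        train = [s for s in samples if s[0] != held_out]
        test = [s for s in samples if s[0] == held_out]
        features = select_features(train, k)  # reselected without the held-out line
        errors[held_out] = sum(predict(train, features, x) != y for _, y, x in test)
    return errors


# Invented toy data: 5 cell lines, 2 replicates each, 3 features.
toy_samples: List[Sample] = [
    ("S1", "sensitive", [5.0, 1.0, 2.0]), ("S1", "sensitive", [5.2, 1.1, 2.1]),
    ("S2", "sensitive", [4.8, 0.9, 2.2]), ("S2", "sensitive", [5.1, 1.2, 1.9]),
    ("S3", "sensitive", [5.3, 1.0, 2.0]), ("S3", "sensitive", [4.9, 1.1, 2.0]),
    ("R1", "resistant", [1.0, 1.0, 2.1]), ("R1", "resistant", [1.2, 0.9, 2.0]),
    ("R2", "resistant", [0.9, 1.1, 1.9]), ("R2", "resistant", [1.1, 1.0, 2.2]),
]

errors = leave_one_cell_line_out(toy_samples, k=1)
print(errors)  # misclassified replicates per held-out cell line
```

Holding out whole cell lines prevents replicates of the same line from appearing in both the training and test folds, which is the source of bias the additional validation was designed to rule out.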
We assessed the ability of this model to predict additional sets of gene expression data. To independently validate the signature, we used DLDA to classify cell lines that were not included in training the models. Additionally, we assessed the variability in predictive strength using multiple models. We found that predictions based on the most statistically significant 10 or 50 genes were similar to those made with the full data set. However, the 10-gene model resulted in misclassification of both the UKY-29 and H1975 samples. This finding underscores the importance of including enough features in the model to account for the variability found in the biological system of interest, a lung adenocarcinoma. Interestingly, the H1975 sample is seemingly misclassified in the 50- and 180-gene models as well, as this cell line harbours a second mutation in exon 20 that has been shown to confer resistance to the EGFR TKIs gefitinib and erlotinib. Importantly, however, recent reports have shown that irreversible inhibitors of EGFR such as CL-387,785 overcome this resistance. Therefore, the double-mutant H1975 cell line, although insensitive to gefitinib and erlotinib, retains reliance on EGFR signalling pathways, providing an explanation for its classification using our models. Furthermore, when compared to predictions based on mutational status alone, the genomic predictors (50- and 180-gene models) perform better in determining a priori sensitivity (Table 3).
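For readers unfamiliar with DLDA, the classification rule can be sketched as follows. This is a generic, minimal illustration of diagonal linear discriminant analysis with equal class priors, not the implementation used in the study; the function names fit_dlda and predict_dlda and the toy expression values are hypothetical. Each class is summarized by its per-gene mean, a pooled within-class variance is estimated for each gene, and a new sample is assigned to the class with the smallest variance-scaled squared distance.

```python
from typing import Dict, List, Sequence, Tuple

Model = Tuple[Dict[str, List[float]], List[float]]


def fit_dlda(X: Sequence[Sequence[float]], y: Sequence[str]) -> Model:
    """Estimate class means and pooled per-feature variances (diagonal covariance)."""
    labels = sorted(set(y))
    n, p = len(X), len(X[0])
    means: Dict[str, List[float]] = {}
    pooled = [0.0] * p
    for label in labels:
        rows = [x for x, lab in zip(X, y) if lab == label]
        mu = [sum(col) / len(rows) for col in zip(*rows)]
        means[label] = mu
        for row in rows:
            for j in range(p):
                pooled[j] += (row[j] - mu[j]) ** 2
    # Floor the variance to avoid division by zero for constant features.
    variances = [max(s / (n - len(labels)), 1e-12) for s in pooled]
    return means, variances


def predict_dlda(model: Model, x: Sequence[float]) -> str:
    """Assign x to the class with the smallest variance-scaled squared distance."""
    means, variances = model

    def score(label: str) -> float:
        return sum((xj - mj) ** 2 / vj for xj, mj, vj in zip(x, means[label], variances))

    return min(means, key=score)


# Invented toy training data: 2 genes, 3 samples per class.
X_train = [
    [5.0, 1.0], [5.4, 1.2], [4.6, 0.8],   # sensitive
    [1.0, 1.1], [1.4, 0.9], [0.6, 1.0],   # resistant
]
y_train = ["sensitive"] * 3 + ["resistant"] * 3

model = fit_dlda(X_train, y_train)
print(predict_dlda(model, [4.8, 1.0]))
print(predict_dlda(model, [1.2, 1.0]))
```

Because only per-gene variances are used and covariances between genes are ignored, the rule remains estimable when genes far outnumber samples, which is one reason DLDA is commonly applied to microarray data.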
We carefully selected the cell lines used as a validation set to ensure that our model was predictive of EGFR TKI sensitivity and not mutational status alone. The H358 adenocarcinoma cell line harbours a K-Ras mutation and no EGFR mutations, yet our predictor and the data of others identify this cell line as sensitive to EGFR inhibition. Furthermore, the A431 cell line was not derived from a lung adenocarcinoma, has wild-type EGFR and K-Ras alleles, and is exquisitely sensitive to EGFR inhibition. In contrast, the K562 cell line is derived from a CML blast crisis patient, is wild-type for both EGFR and K-Ras, and is highly resistant to EGFR TKI. All three of these cell lines classify correctly and consistently among the 10-, 50-, and 180-gene predictors.
To strengthen confidence in our 180-gene model, we tested an independently derived set of NSCLC cell line microarray data that thus far is unpublished (Girard, GEO # GSE4824). Our signature correctly classified 64–71% of the cell lines, depending on the IC50 threshold selected to define resistance to EGFR TKI, as determined in Bunn et al. Of the four cell lines from the Girard set that were incorrectly predicted using our model, two were not of adenocarcinoma origin: H1299 (large cell carcinoma) and H157 (squamous cell carcinoma). Our predictor of sensitivity was trained using cell lines of adenocarcinoma origin and may therefore be more accurate when applied to similar data. That said, utilizing additional training data from cell lines of varied NSCLC histologies will likely improve the model for clinical use.
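The dependence of reported accuracy on the IC50 cut-off can be made concrete with a small sketch. The cell line names, IC50 values, predictions, and thresholds below are invented for illustration and do not correspond to the Girard data set; the function accuracy_at_threshold is likewise hypothetical.

```python
from typing import Dict


def accuracy_at_threshold(
    ic50_um: Dict[str, float],
    predicted: Dict[str, str],
    threshold_um: float,
) -> float:
    """Fraction of cell lines whose predicted class matches the IC50-derived class."""
    hits = 0
    for name, ic50 in ic50_um.items():
        observed = "resistant" if ic50 > threshold_um else "sensitive"
        hits += int(observed == predicted[name])
    return hits / len(ic50_um)


# Invented IC50 values (micromolar) and signature predictions.
toy_ic50 = {"line_1": 0.5, "line_2": 3.0, "line_3": 8.0, "line_4": 15.0}
toy_pred = {
    "line_1": "sensitive",
    "line_2": "sensitive",
    "line_3": "resistant",
    "line_4": "resistant",
}

for threshold in (1.0, 5.0, 10.0):
    print(threshold, accuracy_at_threshold(toy_ic50, toy_pred, threshold))
```

Cell lines with intermediate IC50 values change observed class as the cut-off moves, so the same set of predictions yields a range of accuracies rather than a single figure.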
Finally, we assessed the ability of the predictive models to classify lung adenocarcinoma tumors. In the absence of clinical outcome or survival data from a prospective trial, we identified two datasets for which reasonable proxies for EGFR signalling and TKI sensitivity were available. These data included a set of 19 adenocarcinomas for which phosphorylated EGFR (pEGFR) was assessed using IHC and a set of 40 adenocarcinomas for which both pEGFR and EGFR mutational status were assessed. Classification based on 50 or 180 genes remained relatively constant, demonstrating robust predictive power. Furthermore, classification of the tumors using the 50- and 180-gene models identified a majority of the pEGFR-positive samples in both datasets and captured 5 of 6 EGFR mutants in the Duke tumor dataset.
We identified several tumors in both the Moffitt and Duke datasets that demonstrate no detectable expression of pEGFR but classify as EGFR TKI sensitive using the predictive gene expression model. It is possible that IHC analysis is less sensitive than classification using the gene expression profile; IHC also depends on the sections stained and the phospho-specific antibody used. Alternatively, the tumors with low levels of pEGFR that are predicted to be sensitive to EGFR TKI might possess deregulation of parallel signalling pathways that results in a gene expression phenotype closely resembling activation of EGFR, and accordingly, these tumors classify as sensitive to EGFR TKI.
We classified 83% (5/6) of the tumors in the Duke cohort that were EGFR mutants as sensitive to EGFR TKI by the gene expression signature. While the predictor seems to have misclassified one tumor that harbours mutant EGFR, we note that others have reported that cell lines with activating EGFR mutations can also be insensitive to EGFR TKI, and our predictive models may have identified a tumor that will not respond to treatment. Additionally, in non-Japanese populations screened by EGFR mutational status prior to treatment with gefitinib, the response rate among those patients with either deletion or point mutation of EGFR was found to be 75%, suggesting that mutation of EGFR is not sufficient for EGFR TKI sensitivity. Thus, our tumor classifications accommodate the proportion of responders found in previous studies, and while our approach may exceed those findings, future validation depends on comparing classification to response in a clinical study.
Because we did not have EGFR TKI response data for the Moffitt and Duke tumor specimens, we used pEGFR staining and mutation status as surrogates for EGFR signalling, as described above. Combining both of the tumor data sets, our predictor of EGFR TKI sensitivity suggests that 80% of the tumors may be sensitive. Previous studies found that nearly 50% of patients with advanced stage IV NSCLC who had previously received cytotoxic chemotherapy had clinical benefit with EGFR TKI, defined as either overt tumor response (shrinkage) or stable disease. Since all the Moffitt and Duke tumors were of adenocarcinoma histology, a known clinical predictor of benefit from EGFR TKI, it is possible that the genomic predictor accurately classifies sensitivity in this group of tumors. The difference in EGFR TKI sensitivity between early stage lung cancers and widely metastatic cancers that have previously received cytotoxic chemotherapy is also unclear. Studies are underway that address the sensitivity of early stage lung cancers to EGFR TKI. True assessment of the accuracy of our gene expression profiles in predicting the sensitivity of lung cancers to EGFR TKI will require prospective testing in patients.