DeSigN: connecting gene expression with therapeutics for drug repurposing and development

Background The drug discovery and development pipeline is a long and arduous process that inevitably hampers rapid drug development. Therefore, strategies to improve the efficiency of drug development are urgently needed to enable effective drugs to enter the clinic. Precision medicine has demonstrated that genetic features of cancer cells can be used for predicting drug response, and emerging evidence suggest that gene-drug connections could be predicted more accurately by exploring the cumulative effects of many genes simultaneously. Results We developed DeSigN, a web-based tool for predicting drug efficacy against cancer cell lines using gene expression patterns. The algorithm correlates phenotype-specific gene signatures derived from differentially expressed genes with pre-defined gene expression profiles associated with drug response data (IC50) from 140 drugs. DeSigN successfully predicted the right drug sensitivity outcome in four published GEO studies. Additionally, it predicted bosutinib, a Src/Abl kinase inhibitor, as a sensitive inhibitor for oral squamous cell carcinoma (OSCC) cell lines. In vitro validation of bosutinib in OSCC cell lines demonstrated that indeed, these cell lines were sensitive to bosutinib with IC50 of 0.8–1.2 μM. As further confirmation, we demonstrated experimentally that bosutinib has anti-proliferative activity in OSCC cell lines, demonstrating that DeSigN was able to robustly predict drug that could be beneficial for tumour control. Conclusions DeSigN is a robust method that is useful for the identification of candidate drugs using an input gene signature obtained from gene expression analysis. This user-friendly platform could be used to identify drugs with unanticipated efficacy against cancer cell lines of interest, and therefore could be used for the repurposing of drugs, thus improving the efficiency of drug development. Electronic supplementary material The online version of this article (doi:10.1186/s12864-016-3260-7) contains supplementary material, which is available to authorized users.


Background
The drug discovery and development pipeline is a long and arduous process, one that is resource-intensive and time-consuming, making these the main barriers for rapid drug development. Furthermore, the attrition rate is high, underscoring the need to improve strategies in drug development and in expanding the usage of already approved drugs [1]. Fortunately, the availability of a large pool of drugs provides convenient candidates for drug repurposing, which can contribute to reducing the time for finding new, effective chemotherapeutic strategies [2]. The current challenge is to develop discovery pipelines to prioritize testing of already approved drugs, particularly in cancers with limited chemotherapy options, such as oral cancer [3]. Lessons from laboratory and clinical studies have demonstrated that genetic features of tumours either in the form of mutational data or gene expression signatures could be used to predict response to targeted therapies, and this has formed the basis of precision medicine that is currently practised in the clinic [4][5][6]. To extend on the advancements in our ability to characterize the cancer genome to unprecedented depth, these information can be used to link genetic features to drug response, which affords an opportunity to systematize the testing of drug candidates for expanding the spectrum of available cancer drugs for treatment.
Since the late 1980s, the NCI-60 panel of cancer cell lines has been used to systematically identify anti-cancer compounds and more recently, to identify biomarkers of response [7,8]. In 2012, the repertoire of cancer cell lines used was expanded substantially with the inclusion of new data from the Genomics of Drug Sensitivity in Cancer (GDSC) and Cancer Cell Line Encyclopedia (CCLE) projects where 707 and 860 cancer cell lines respectively were assembled for anti-cancer drug testing. Uniquely, more than 13 cancer types are represented in these panels, and more importantly, these cell lines are well-characterised with respect to their gene expression and mutational information [9,10]. Additionally, more than 50% of these cell lines were subjected to high-throughput drug screening and their response to a large panel of drugs have been documented systematically [9,10].
The development of computational tools that could take advantage of the availability of high throughput gene expression data to mine patterns of association between drug sensitivity and gene expression signatures began with the seminal work by Lamb et al. who developed the Connectivity Map (CMap) algorithm [11]. Subsequently, other bioinformatics tools were developed. For example, NFFinder searches for relationships between drugs, diseases and a phenotype of interest using transcriptomic data as input [12]. Using the same concept, the drug-to-protein associations were evaluated by the DMAP tool that resulted in the formation of 438,004 drug-to-protein effect relationships [13]. The Functional Module Connectivity Map (FMCM), which extends CMap by constructing a functional network of a set of differentially expressed genes, showed validation results for four drugs that could affect cell viability in colorectal cancer cell lines [14].
While GDSC provides large amounts of drug response data from arrays of cell lines, additional analyses are needed to extrapolate drug efficacy to new datasets. For example, GDSC shows that the head and neck cancer cell lines FADU and HSC-3 are reported to respond to the heat shock protein 90 (Hsp90) inhibitor 17-AAG [9]. However, predicting which inhibitors are likely to be efficacious in new cell lines derived from cancer patients remains a challenge.
To exploit the GDSC data for predicting drug sensitivity, we developed DeSigN (Differentially Expressed Gene Signatures -Inhibitors), a CMap-inspired [11] bioinformatics pipeline that enables gene expression patterns from experimental data to be linked to gene expression patterns associated with drug response in a cancer cell line database. To demonstrate proof-of-concept of the practical usefulness of DeSigN, we conducted two validation experiments. The first involves the examination of reported efficacy of drug candidates against four different cancer cell lines that are prioritized by DeSigN. The second is an experimental validation of the sensitivity of a set of oral squamous cell carcinoma (OSCC) cell lines to bosutinib, a Src/Abl kinase inhibitor that is currently used for treating leukemia but predicted by DeSigN to be effective against OSCC cell lines.

Differentially Expressed Gene Signatures -Inhibitors (DeSigN) platform
DeSigN is a web-based bioinformatics tool for associating gene signatures with drug response phenotype based on IC 50 data, with the aim of identifying novel drugs that have good potential to be repurposed for cancer therapy. The DeSigN algorithm ( Fig. 1) consists of three key components: (i) a reference database that contains a set of pre-defined gene expression profiles associated with drug response data to 140 drugs; (ii) a set of differentially expressed gene (DEG) signatures as query input and (iii) a pattern-matching algorithm for evaluating similarity between the query gene signature and drug-associated gene expression profiles in the reference database.

Reference database
We built the reference database using baseline microarray data and drug sensitivity data obtained from the Genomics of Drug Sensitivity in Cancer (GDSC) project. We first downloaded the raw CEL microarray data files of solid tumour cell lines from GDSC [9] (normalized using the MAS5 algorithm). The probe sets were collapsed to gene symbols using Gene Set Enrichment Analysis [15] with HT HG-U133A chip as reference, this process resulted in 12,772 unique genes. For each drug, we classified the cancer cell lines' drug response phenotype (resistant or sensitive) in the following way. We first ranked the cell lines by their IC 50 values (lowest to highest). Cell lines with IC 50 that were U standard deviations larger than the median IC 50 of all cell lines were considered to be resistant; those that were L standard deviations smaller were considered to be sensitive. We chose the parameters U and L carefully on a case-by-case basis. These two cut-offs were generally values where sharp transitions in IC 50 were observed in the scatter plot of -log 10 (IC 50 ) against rank. About 20 cell lines each from the sensitive and resistant phenotype were thus defined. The list of sensitive and resistant cell lines defined for the 140 inhibitors in DeSigN is provided in Additional file 1: Table S1. An example for the drug Mitomycin-C is shown in Fig. 2.
Differential expression of microarray gene expression data for the sensitive and the resistant phenotype was done using the Linear Models for Microarray data (limma) algorithm [16]. The result from limma for each inhibitor was sorted and converted into ranked lists according to the gene's moderated t-statistic (rank 1 for largest value). This reference database was used to connect the queries and return rankordered list of inhibitors for a particular query (Fig. 1a).

Query signature
Differentially expressed genes (DEG) obtained from microarray or RNA-Seq gene expression data of cell lines of two different phenotype classes were used to query DeSigN. DEGs were selected using joint filtering of p-value and fold change [17], with threshold value set at log 2 fold change > 1 and p-value < 0.01 (Fig. 1b).

Pattern-matching algorithm
A pattern-matching algorithm based on the nonparametric Kolmogorov-Smirnov (KS) statistic [11] was used to associate query signatures to the drug-specific, rankordered gene expression profile database. The KS test is a rank-based pattern matching approach implemented in the Connectivity Map [11], and its goal is to correlate inhibitors in GDSC that enrich for similar DEG based on the IC 50 drug sensitivity profiles.
The Connectivity Score is computed according to [11] as follows. Let N be the total number of genes in the reference database, and T the number of genes in the query signature for up-or down-regulated genes. For every drug in the reference database, we compute the rank-ordered (using moderated t-statistic) list R for all N genes. Let j index the query genes in such a way that R(j), the rank of the j-th gene in the N total number of genes, is monotone Fig. 1 Workflow of DeSigN. a A reference database of cell lines that are sensitive and resistant to drugs available in the GDSC database was created. Version 1.0 contains 140 drugs with their unique ranked-based gene signatures. b Differential expressed gene signatures are generated from differential expression analysis of cell lines from two distinct experimental conditions, e.g. cell line gene expression data from tumour samples versus normal samples. The up and down-regulated genes (log 2 fold change > 1 and p-value < 0.01) thus selected will be used to query the DeSigN database. c A ranked-based list of inhibitors is generated, with Connectivity Score between 1 (maximal efficacy) and −1 (minimal efficacy). This allows users to prioritize the testing of these candidates increasing. For j = 1, 2, …, T, we compute the following two values for each up-and down-regulated gene signatures: Subsequently, for each inhibitor i, the KS-like statistics for up-and down-regulated query gene signature, ks up i and ks down i , are computed as (subscript omitted) The Enrichment Score (ES 1 ) for drug i in the reference database is set to zero if both ks up i and ks down i have the same sign; otherwise, ES i = ks up i − ks down i . The Connectivity Score (S i ) for non-zero instances is a normalized Enrichment Score computed as: where P = max i ES i and Q = min i ES i are the normalizing constants.
DeSigN returns a ranked list of inhibitors that have the highest Connectivity Score between the DEG and the ranked-order gene expression profiles in the reference database, with S ranging between 1 (maximal efficacy) and −1 (minimal efficacy) (Fig. 1c).
To evaluate the statistical significance of S i , we used a permutation approach to simulate the null distribution of S i . Thus, m random gene sets, each having the same size as the size of the input gene signature, were simulated. Each gene set then yields S random i (k), where k indexes the random gene set. The p-value was computed as where I A is the indicator function that takes the value 1 if event A occurs, and 0 otherwise. Here, we set m = 1000.

The DeSigN web interface
The DeSigN website is freely available at http://design.cancerresearch.my/. Its web interface is implemented in PHP (v7.0) with the support of jQuery (v1.4.2), and hosted using the Apache Server. The reference database is generated and managed using MySQL database (v5.5.49). DeSigN makes use of the AJAX feature to quickly load content without reloading the pages. All queries are sent to the Java-based computing cluster to perform parallel computation. A help document providing a guide for users to query and navigate DeSigN is available in the website, with examples given. Except the pattern-matching algorithm, which was programmed in Java and the Graphical User Interface (GUI), which was built using PHP, the other methods were implemented in R version 3.3.0.

NCBI Gene Expression Omnibus (GEO) datasets
To demonstrate how DeSigN could be used to predict candidate drugs, we used differentially expressed genes generated from ER-positive breast cancer versus normal tissue reported by Clarke et al. [18] that can also be accessed from the NCBI Gene Expression Omnibus (GEO) database under the accession number GSE42568. In addition, four drug sensitivity studies published in the NCBI GEO database were used to validate DeSigN ( Table 1). The microarray gene expression data from these five GEO studies were subjected to differential analysis using the GEO2R function provided by NCBI (version info: R 2.14.  [19] were used to validate bosutinib, a drug candidate predicted by DeSigN to be effective. The RNA-Seq data of these cells were subjected to differential analysis (OSCC versus NOK) using DESeq2 [19,20]. DEG generated from DESeq2 was used as the query signature in DeSigN to shortlist candidate drugs for experimental validation. All ORL cell lines and HSC-4 (sensitive control for response to bosutinib) were cultured in Dulbecco's Modified Eagle Medium (DMEM)/F12 (1:1) supplemented with 10% (v/v) heat-inactivated fetal calf serum (FBS), 100 IU Penicillin/Streptomycin and 0.5 μg/ml hydrocortisone as described previously [19]. NOK were cultured in keratinocyte serum-free media (KSFM; GIBCO, Carlsbad, CA, USA) supplemented with 25 μg/ml bovine pituitary extract, 0.2 ng/ml epidermal growth factor, 0.031 mM calcium chloride and 100 IU Penicillin/Streptomycin (GIBCO, Carlsbad, CA, USA) as described previously [19]. The breast cancer cell line MCF7 (resistant control for response to bosutinib) was cultured in RPMI 1640 medium (GIBCO, Carlsbad, CA, USA) supplemented with 10% (v/v) heatinactivated FBS and 100 IU Penicillin/Streptomycin. All cultures were incubated in a humidified atmosphere of 5% CO 2 at 37°C.
Viability assay using 3-(4,5-dimethylthiazol-2-yl)-2,5diphenyltetrazolium bromide (MTT) The effect of bosutinib on the selected OSCC cell lines was determined using MTT assay with 1.5-8 × 10 3 cells per well as described previously [19]. Cells were treated with 0.04-5 μM of bosutinib, and cell viability was measured after 72 h of treatment. DMSO (0.5%) served as vehicle control. The two-sample t-test was used to assess whether the difference in the sample mean of IC 50 between the tested cell lines was statistically significant (p-value < 0.05). Experiments were repeated at least three times.

Apoptosis assay
Apoptosis was quantified using a FITC Annexin V Apoptosis Detection Kit (BD Biosciences, San Jose, CA, USA) according to the manufacturer's instructions. Briefly, floating and attached cells were collected at 24, 48 and 72 h after bosutinib treatment at 1 μM, and then stained using FITC Annexin V/Propidium iodide (PI). Apoptosis detection was performed using BD FACSCANTO™ II flow cytometer and data was analyzed using the BD FACSDiva™ software (BD Biosciences, San Jose, CA, USA). For each of the three time points, the two-sample t-test was used to test whether the mean of total number of apoptotic events differed significantly (p-value < 0.05) between bosutinib-treated cells and the vehicle control (0.01% DMSO) cells. Experiments were repeated at least two times.

Proliferation assay
The anti-proliferative effect of bosutinib on the OSCC cell lines were examined using Click-iT EdU Cell Proliferation Assay Kit (Invitrogen, Carlsbad, CA, USA) as previously described [19]. The cell lines ORL-48, ORL-204 and ORL-196 were treated with 0.3-3 μM bosutinib, for 24 h and cell proliferation evaluation was based on 5-ethynyl-2′-deoxyuridine (EdU) incorporation according to the manufacturer's protocol. Images were captured from 4 to 11 different fields Saiki et al. [37] of each treatment concentration and further analyzed using EBImage [21]. The percentage of EdU-labelled cells was expressed as the percentage of red fluorescent nuclei over the total number cells reflected by DAPI-stained nuclei and the data is presented as relative percentage compared to control cells (0 μM). The two-sample t-test was used to test whether the difference in the relative percentage of EdU + cells differed significantly (p-value < 0.05) between treatment and vehicle control for the three cell lines. Experiments were repeated at least two times.

Running DeSigN
To demonstrate how DeSigN can be used to generate a list of prioritized candidate drugs, we tested differentially expressed genes (DEG) generated from ER-positive breast cancer cell line compared to normal tissues (GSE42568; Fig. 3a) [18]. From the database (Fig. 3b), DeSigN returned a list of 11 ranked inhibitors together with their target proteins (Fig. 3c). Of note, the two top-scoring drugs, AICAR and BIBW2992 are drugs that are actively being studied as therapeutics against ER-positive breast cancer. The drug AICAR, which targets AMPK, have shown to have anti-proliferative effects in ER-positive breast cancer cell lines [22]. Further, a Phase II clinical trial demonstrated that BIBW2992 was able to induce stable disease in more than 50% of ER-positive metastatic breast cancer that has progressed on letrozole monotherapy when used in combination with letrozole [23]. DeSigN also predicted resistance of ER-positive breast cancer cells against drugs with strong negative Connectivity Score such as dasatinib and midostaurin. The list of DEG from GSE42568 used to query DeSigN is provided in Additional file 2: Table S2.

Validation results
GSE4342 is a study that demonstrated the sensitive response of 17 non-small cell lung cancer (NSCLC) cell lines to gefitinib (EGFR-inhibitor) treatment [24]. By querying DeSigN using 205 up-and 137 down-regulated genes, two drugs -gefitinib and BIBW2992, were returned with positive Connectivity Score (p-value < 0.05). As expected, gefitinib was returned as the top-ranked inhibitor with Connectivity Score of 1.00 and p-value < 0.001 (Fig. 4). Interestingly, BIBW2992, also known as afatinib, a second generation EGFR inhibitor, is ranked second with a significant Connectivity Score of 0.93 (p-value = 0.021). For each of the four studies, DeSigN returned Connectivity Scores that correctly correlated drug response outcome that was consistent with the respective published GEO studies. In all these studies, DeSigN successfully associated input gene signatures with the right drugs, all with statistically significant p-values ( Table 2). The list of DEG of each study used to query DeSigN is provided in Additional file 3: Table S3; Additional file 4: Table S4; Additional file 5: Table S5 and Additional file 6: Table S6.

Using DeSigN to shortlist potentially efficacious inhibitors for OSCC cell lines
As we demonstrated that DeSigN could correctly predict drug response from published data, we next used DeSigN to identify inhibitors that could control the growth of OSCC cell lines. The gene signature for differential gene  Table S7). Nine potentially efficacious drugs were returned by DeSigN, with another five drugs were predicted to be resistant (Fig. 5), with p-values < 0.05. The ranking results corroborated well with recent findings. Two of the candidates, BIBW2992 (ranked fourth) and bosutinib (ranked eighth), have been recently reported to be effective against head and neck squamous cell carcinoma (HNSCC) cell lines [25]. We set out to further evaluate the efficacy of bosutinib, which targets Src and Abl, as it is a recently FDA-approved drug for treating BCR-ABL leukemic patients and have no known effects against HNSCC or OSCC, therefore the efficacy of bosutinib is unanticipated when used against OSCC cell lines.
For experimental validation of bosutinib's efficacy against OSCC, we tested it in three OSCC cell lines (ORL-196, ORL-204 and ORL-48). All three OSCC cell lines (Table 3, Additional file 8: Figure S8) were found to have significantly lower mean IC 50 value compared to their sensitive head and neck squamous cell carcinoma control (HSC-4, IC 50 : 1.82 μM). Against the resistant control, MCF-7, all three OSCC cell lines also had significant lower mean IC 50 (Table 3, Additional file 8: Figure S8). This finding is supported by fluorescence-activated cell sorting (FACS) analysis of the cells where bosutinib induced cell death in OSCC cell lines in a time-dependent manner (Fig. 6a, Additional file 9: Table S9). In particular, ORL-196 cells were found to be more responsive to bosutinib, as close to 35% of apoptotic cells were detected as early as 24 h of treatment, while ORL-48 and ORL-204 remained unaffected. By 72 h, a significant number of apoptotic cells (35-90%) were detected in all the OSCC cell lines (p-values < 0.01), indicating the cytotoxic effect of bosutinib in these OSCC cells.
Further confirmation from the Click-iT EdU cell proliferation assay showed clearly that bosutinib inhibited the proliferation of ORL-48, ORL-196 and ORL-204 cells as demonstrated by the significant reduction in the number of proliferating cells (red-stained cells) compared to the non-treated cells (Fig. 6b). ORL-196 and ORL-204 demonstrated growth inhibition of~70-80% (p-value = 0.03, n = 3; p-value = 0.049, n = 2 respectively) whilst ORL-48 showed growth inhibition of~40% following bosutinib treatment at 1 μM for 72 h (p-value = 0.04, n = 2) (Fig. 6c, Additional file 10: Table S10 and Additional file 11: Figure  S11). The level of inhibition in the OSCC cell lines corroborated well with their mean IC 50 value for bosutinib. Taken together, these biological observations demonstrated that bosutinib confers anti-proliferative and cytotoxic effects in the tested OSCC cell lines.

Discussion
We have developed DeSigN, a web-based bioinformatics tool that allows users to query large public database of cancer cell line gene expression and drug response data such as GDSC. We showed explicitly that querying DeSigN using differentially expressed gene signatures   [26][27][28] To date, many cases of successful drug repurposing studies have been reported, an exemplary study being that of methotrexate, a drug first developed for treating leukemia, and subsequently repurposed to treat a wide spectrum of cancers ranging from breast, ovarian, bladder to head and neck cancers [29,30]. Here, we demonstrated the success of DeSigN in guiding the selection of bosutinib as a candidate drug against OSCC (a subset of HNSCC) cell lines. Emerging evidence supports the possible use of bosutinib for the treatment of HNSCC. First, the molecular target of bosutinib, Src has been reported to be a frequently altered gene in HNSCC and has been identified as a promising drug target [31]. Second, an analysis of gene expression data from 42 HNSCC cell lines also predicted that bosutinib has anti-tumour effect on HNSCC [25]. To the best of our knowledge this is the first time bosutinib was shown experimentally to have potency in OSCC cell lines.
While tools such as NFFinder, DMAP and FMCM that adopted the CMap concept make use of large public databases such as GEO, DrugMatrix, STITCH and HAPPI as their reference, DeSigN has its uniqueness whereby it explicitly capitalizes on the large panel of 707 human cancer cell lines in GDSC that have well-characterized gene expression and drug response data (Table 4). Specifically, DeSigN constructs drug-associated gene expression profile of resistant and sensitive cell lines from these 707 cell lines, whereas CMap associates response to a drug by constructing gene expression profiles of pre-and post-treatment conditions using only four cell lines. DeSigN utilizes the cumulative gene expression effect of many genes rather than one or a handful of genes, in this case global baseline DEGs. We believe through pan-cancer approach as suggested by The Cancer Genome Atlas (TCGA) Research Network, inherent genetic similarities between human cancer cell lines could result in the identification of relevant candidate drugs that have hitherto not been tested [32].
The new leads derived from DeSigN are important for accelerating the discovery of new drugs for HNSCC treatment, which is currently limited to cetuximab, where this drug remains the only FDA-approved targeted therapy for advanced HNSCC [3]. Importantly, we would like to emphasize that all candidates with positive and significant Connectivity Score should be equally considered for validation instead of considering just the few top-ranked candidates, since factors such as cost of drug, ease of availability, method of administering, side  The current implementation of DeSigN uses differentially expressed genes as starting points to associate gene signatures with drug response phenotype. This input is not necessarily optimal, as genes that are involved in dysregulated pathways in the pathogenesis of cancer may not always have their expression substantially altered [33]. Since higher-order information such as network context and post-translational modification including reversible phosphorylation or acylation are not explicitly integrated in the current version, future improvements to DeSigN will focus on integrating these types of data.
For future work, we also intend to expand drug coverage in Version 1.0 of DeSigN by incorporating the gene expression and drug response data from Cancer Therapeutics Response Portal (CTRP) [34] and other largescale pharmacogenomics studies. We anticipate that De-SigN will evolve as more cell line gene expression and drug response data become available.

Conclusions
DeSigN provides proof-of-concept for the feasibility of using a computational approach to shortlist the most promising drug candidates for effective drug repurposing in cancer treatment. We expect that DeSigN will continue to evolve based on usage feedback from the community of cancer researchers, as well as improvements in methods for mining gene signatures that have strong network context.