Multivariate models from RNA-Seq SNVs yield candidate molecular targets for biomarker discovery: SNV-DA

Background It has recently been shown that significant and accurate single nucleotide variants (SNVs) can be reliably called from RNA-Seq data. These may provide another source of features for multivariate predictive modeling of disease phenotype for the prioritization of candidate biomarkers. The continuous nature of SNV allele fraction features allows the concurrent investigation of several genomic phenomena, including allele specific expression, clonal expansion and/or deletion, and copy number variation. Results The proposed software pipeline and package, SNV Discriminant Analysis (SNV-DA), was applied on two RNA-Seq datasets with varying sample sizes sequenced at different depths: a dataset containing primary tumors from twenty patients with different disease outcomes in lung adenocarcinoma and a larger dataset of primary tumors representing two major breast cancer subtypes, estrogen receptor positive and triple negative. Predictive models were generated using the machine learning algorithm, sparse projections to latent structures discriminant analysis. Training sets composed of RNA-Seq SNV features limited to genomic regions of origin (e.g. exonic or intronic) and/or RNA-editing sites were shown to produce models with accurate predictive performances, were discriminant towards true label groupings, and were able to produce SNV rankings significantly different from than univariate tests. Furthermore, the utility of the proposed methodology is supported by its comparable performance to traditional models as well as the enrichment of selected SNVs located in genes previously associated with cancer and genes showing allele-specific expression. As proof of concept, we highlight the discovery of a previously unannotated intergenic locus that is associated with epigenetic regulatory marks in cancer and whose significant allele-specific expression is correlated with ER+ status; hereafter named ER+ associated hotspot (ERPAHS). Conclusion The use of models from RNA-Seq SNVs to identify and prioritize candidate molecular targets for biomarker discovery is supported by the ability of the proposed method to produce significantly accurate predictive models that are discriminant towards true label groupings. Importantly, the proposed methodology allows investigation of mutations outside of exonic regions and identification of interesting expressed loci not included in traditional gene annotations. An implementation of the proposed methodology is provided that allows the user to specify SNV filtering criteria and cross-validation design during model creation and evaluation. Electronic supplementary material The online version of this article (doi:10.1186/s12864-016-2542-4) contains supplementary material, which is available to authorized users.


Background
Defining the molecular basis for complex disease at high resolution has become increasingly important for the discovery of actionable drug targets and the improvement of diagnosis and prognosis of cancer patients. To this end, a widespread approach has been through differential gene expression (DGE) analyses that utilize massively parallel high-throughput RNA sequencing (RNA-Seq). RNA-Seq provides a wealth of information beyond gene expression that can be used to characterize the transcriptome, such as alternative splicing via changes in isoform proportions. Notably, it has recently been shown that single nucleotide variants (SNVs) in the genome can be accurately and reliably called from RNA-Seq data as well [64]. This is significant as previously acquired RNA-Seq data can now be analyzed to determine genotype and provide more biological insight.
It is imperative that SNVs be studied as they molecularly underpin complex disease and phenotype [17]. For example, several SNVs, known to affect major regulatory pathways, have been associated with chemotherapy resistance and survival in lung cancer patients [39,88]. In addition to changes in protein structure and function due to mutations in coding sequences of transcripts, SNVs have a variety of functional effects on gene regulation and expression. For example, variants lying in intronic regions can have functional effects on expression by modulating alternative splicing [82].
Furthermore, models created from co-occurring SNV features have also been successful in predicting disease phenotype, such as susceptibility to breast or lung cancer [51,52]. These models, however, rely on SNVs that have already been found to be associated with the disease phenotype in question. Herein, we demonstrate that accurate multivariate predictive models which identify and prioritize small subsets of candidate biomarkers can be created from SNV features derived from RNA-Seq data and that a priori knowledge of their phenotypic associations is not necessary for their creation. In fact, because variants with unknown clinical associations are included, novel variants and/or genomic regions can be implicated as candidate biomarkers.
Our proposed methodology seeks to train accurate predictive models on SNV allele fraction (AF) values using sparse projections to latent structures discriminant analysis (sPLS-DA). This approach allows the identification of disease-associated SNVs located in coding regions as well other expressed locations of the genome, such as from intronic, intergenic, and 5' UTR regions, which are often under-represented in cancer biology literature. The continuous nature of SNV AF features from RNA-Seq data also allows the exploration of several genomic phenomena, mainly allele specific expression (ASE), where one allele is preferentially expressed over the other. However, it is important to note that the proposed methodology does not discriminate these events from other sources of allelic imbalance, such as from differential cell survival leading to clonal expansion or depletion and/or from copy number amplification and deletion. In our view, this inherent naïveté is a strength in that these events can be analyzed concurrently in a whole genome fashion, providing a shotgun approach to biomarker discovery. The identification of disease-associated SNVs can thus inform and limit regions of interest when using more comprehensive approaches, such as differential expression, as well as implicate novel unannotated regions that are often ignored using traditional approaches. Furthermore, SNV calling from RNA-Seq avoids using relatively more expensive technology, such as whole-exome and whole-genome sequencing (WES/WGS), and can be used to investigate variations due to RNA-editing, which have been shown to have prognostic value regarding outcomes in cancer [63].
We demonstrate the effectiveness of the proposed methodology and software pipeline, SNV Discriminant Analysis (SNV-DA), on two datasets. The first of which is relatively small dataset of non-small cell lung cancer (NSCLC) primary tumors from which we sought to classify future recurrence. Lung cancer is the leading cause of cancer-related deaths worldwide, with the subtype NSCLC compromising approximately 87 % of lung cancer cases in the United States and causing an estimated 500,000 deaths per year worldwide [26,32]. Despite advances in diagnosis and clinical treatments, NSCLC continues to be the highest cause of cancer-related deaths in major populations across the world [73]. Thus, it is imperative that a better understanding of the molecular events that drive indolent lung cancers into more aggressive tumors be reached to guide future clinical patient management.
The second dataset included in this analysis is composed of 42 estrogen receptor positive (ER+) and 42 triple negative (TR-) primary breast cancer tumors. Standard targeted therapies for breast cancer rely on the presence of either estrogen, progesterone, and/or Her2/neu receptors in the primary tumor sample [8]. Those tumors that lack these receptors, TR-, are thus resistant to standard approaches and require a combination of chemotherapeutic drugs for their treatment [60]. Compounding this issue, these tumors usually show a more aggressive and metastatic phenotype [28]. Therefore, it is necessary that targets be found that drive triple negative breast cancer to identify effective treatment of this subset of breast cancer.
We show that SNV-DA is able to create multivariate predictive models that accurately predict disease phenotype from variants called from RNA-Seq data and are significantly discriminant towards true sample groupings. We also show that the proposed software pipeline is able to identify and prioritize disease-associative SNVs. Importantly, the utility of SNV-DA is supported by the result that rankings produced by the methodology are significantly different than rankings produced from univariate tests and are enriched within genes with significant ASE. Lastly, we present as proof of concept, the discovery of a previously unknown highly mutated ER+ associated hotspot (ERPAHS), which is associated with epigenetic markers in cancer cell lines and whose expression is significantly upregulated in ER+ primary tumors as well as significantly correlated with identified SNV features.

Methods
First, SNVs are called from processed RNA-seq files using Genome Analysis Toolkit (GATK) [58]. Calls are then filtered by SNPiR tools [64] to remove SNVs that may result from sequencing noise and/or alignment errors. After data transformation, sPLS-DA models are trained on SNVs limited by region of origin. Following the empirical estimation of the optimal number of selected features to be included in the model, performance is evaluating using 10-fold cross-validation. Finally, top predictive SNV features are characterized to determine their relevance to the cancer phenotype in question.

Variant calling pipeline
The variant calling and filtering pipeline, SNPiR, has been shown to obtain accurate SNVs with minimal false-positives from RNA-Seq data [64]. For each sample, the pipeline consists of several steps: pre-and post-processing, filtering, alignment, and variant calling. Burrows-Wheelers Aligner (BWA) [48] is used with default parameters to map reads as single-end sequences to the human genome (hg19), which is concatenated with exons with known splice junctions as per SNPiR protocol. Samtools and Picardtools are used to remove duplicate and unmapped reads, while GATK [58] is used for indel realignment, base calibration and variant calling using the reference SNP database, dbSNP (NCBI hg19 build 141). SNPiR tools are then used to remove mismatches from the first 6 bp of aligned reads, as well as to remove variant calls from repetitive regions, intronic sites within 4 bp of splice junctions, homopolymer runs, and ambiguously mapped reads determined by BLAT [42].
The resulting output is a BED file containing SNVs with their genomic coordinates and allele fractions. RADAR is first used to determine if SNVs are located at RNAediting sites [67]. The SNV annotation program, ANNO-VAR (v2014jul14), is then used to annotate unique SNVs using default parameters [79]. For each SNV, ANNO-VAR provides information on the gene and region of origin, which include exonic, intronic, 5' or 3' UTR, intergenic, up/downstream, and non-coding RNA (ncRNA). ANNOVAR defines intergenic variants to those that are at least 2 kb distal from a coding sequence, whereas the ncRNA category contains variants that do not overlap coding transcript annotations and is used by ANNOVAR to encapsulate both annotated non-coding RNA, such as known miRNA and lncRNA, as well as unannotated loci in the genome. Lastly, Bedtools genomecov [66] is used to determine loci with adequate read coverage using hg19 as reference.

Data transformation and filtering
The total set of variants is transformed into a matrix SNVM, where SNVM i,j is the allele fraction of the i-th SNV in sample j. Allele fraction, or read-frequency, is defined as the amount of reads supporting the variant allele over the total amount of reads covering that nucleotide position. Read coverages are determined for every SNVM i,j . Those SNVM i,j values that do not reach the threshold read coverage (default 10) are given a nonavailable (NA) value. Sub-models can then be generated by limiting SNVs to those located in a region of interest, such as exonic positions, and/or by requiring a minimum number of non-zero features.

sPLS-DA and optimal number of features
Predictive models are created using sPLS-DA, which is implemented in the mixOmics R package [13,15]. PLS-DA is a supervised, multivariate modeling technique used to determine the variation within X, the SNV data, that is correlated to Y, the class labels (e.g. disease-free versus relapse). The sparse version of the technique, sPLS-DA, seeks to identify the best K features that provides the best discrimination between two classes, ignoring all other features. sPLS-DA thus provides a framework for both feature selection and classification.
Nested cross-validations are used to determine the amount of features, K, utilized by sPLS-DA that result in the best predictive performance. For every iteration of 10-fold cross-validation, sub-cross-validations are performed across a range of values for K. For each K, the model is trained on 10-fold sub-training sets and evaluated. The value of K with the best performance for each iteration of the parent cross-validation is then stored. This process is repeated 15 times to more accurately estimate the distribution of optimal Ks from 150 values. The optimal K is then determined as the rounded value of K that corresponds to the maximum of the estimated kernel density of the distribution of selected K's, as represented in Fig. 1.

Construction of gene expression models
To compare the performance of the proposed methodology with traditional gene expression classifiers, models were created using gene expression values as input. For the NSCLC dataset, Bowtie (v1.2.18) [46] and RSEM  [47] were used with default parameters to align reads to the transcriptome and quantify reads, respectively. For the breast cancer dataset, BWA (v0.7.12) [48] and featureCounts (v1.4.6) [49] was used with default parameters to align reads to the genome and quantify reads, respectively. For both datasets, read counts were normalized via DESeq2 (v1.10.0) [54]. Herein, adjusted p-values reported by DESeq2 will simply be referred to as p-values. Models were trained on subsequent gene expression matrices using the same parameters as those used in the creation of SNV models. For each dataset, the distribution of performance statistics are compared to that of the corresponding SNV model to identify the similarity of performance between the proposed methodology and the traditional approach.

Evaluation
After the empirical estimation of the optimal value of K, the model is then evaluated using fifteen 10-fold cross-validations to determine performance via its predictive accuracy, classification sensitivities, and area under the receiver operating characteristic curve (AUC), which seeks to quantify the relationship between true and false positive rates. Though sPLS-DA is able to train a model on features that include NA values, missing data in the test set is not compatible with the resulting model. Therefore, NA values are replaced with the mean of the means of the centered and standardized AF values for each feature within each group in the training set. For example, the mean of the normalized AF values for feature X in group A is averaged together with the mean of normalized AF values for feature X in group B disregarding samples from the test set. This value is then used as a proxy for the missing data in the test set.
To determine if the proposed methodology is discriminant towards the true grouping of disease phenotype, permutation tests are repeated 1000 times to construct the null distribution of model performance (i.e., no relation to phenotype) for each model. The true model performance is then compared to this null distribution to determine significance, with a significantly discriminant model outperforming the majority of permutation test models. Otherwise, it could be said that model performance is independent of the true grouping and is, thus, insignificant. For each test, one iteration of a 10-fold crossvalidation is used to train and test models with randomly permuted sample group labels using the optimal K that was used in the true model. The number of models with AUC greater than or equal to the true model AUC is divided by the number of tests to determine permutation test p-values.
Lastly, to obtain the final set of putative SNV features, the model is trained using all samples and the optimal value of K. The selected features are then ranked by the absolute values of their predictive coefficients (or loadings) as determined by sPLS-DA. In order to assay the utility of the proposed methodology, a Friedman rank sum test is used to compare the rankings of selected features to those of traditional approaches -the univariate nonparametric tests, Fisher's exact and Wilcoxon rank sum.
The Fisher's exact test is implemented by the production of a 2×4 table for each SNV locus, where each value corresponds to the number of samples in each group with detectable levels of each allele in (A, C, G, T), while disregarding samples with sub-threshold read coverage (<10) at that locus. As the presence of an allele is binary in this case, the test only takes into account the differential abundance of the alleles across groups. Whereas, Wilcoxon rank sum test p-values are produced by comparing the distributions of continuous allele fractions and do not directly include information on their differential abundance across samples.
To determine if the proposed methodology selects SNVs that lie in genes that have significant allele-specific expression, selected SNVs were analyzed using MBASED: a method that combines evidence across multiple SNVs to identify gene-level ASE [56]. Though the method was designed for the integration of expression data with exonic SNV calls from WES and/or WGS, we applied the methodology on SNVs selected during the creation of our SNV genic models: exonic, intronic, and 3'UTR. To determine if genes from which selected SNVs are located are enriched for ASE, we compared the number of significant ASE gene/sample pairs to those found in equally sized random subsets of genes from which the total set of SNVs were called. One thousand subsets were evaluated to determine the null distribution from which enrichment p-values can be computed.
Finally, the top 15 features selected by SNV-DA are characterized by their relevance to cancer phenotype and are analyzed via hierarchical clustering to visualize the co-occurrence of features.

Case studies Disease outcome in non-small cell lung cancer
NSCLC is the leading cause of cancer-related mortality in the US. Adenocarcinoma, the most frequent histological subtype, accounts for 40 % of such deaths [74]. RNA samples were collected from 21 different lung adenocarcinoma tumors with known clinical outcomes obtained from the American College of Surgery Oncology Group (ACOSOG). Since the RNA specimens were received from ACOSOG with no personal identifying information, the local IRB has considered the proposed project "not human subject research" after reviewing the protocol (IRB Pro00013739). Ten of the RNA samples were derived from patients who developed cancer recurrence within three years of their initial surgical resection (Relapse; R). The remaining eleven patients had remained disease free (DF) after three years. Using these samples, we sought to determine the ability of the proposed methodology to identify and prioritize candidate biomarkers that may help predict relapse phenotype in NSCLC.
RNA integrity was verified on an Agilent 2200 Bioanalyzer (Agilent Technologies, Palo Alto, CA). One hundred to two hundred ng of total RNA was used to prepare RNA-Seq libraries using the TruSeq RNA Sample Prep Kit following the protocol as described by the manufacturer (Illumina, San Diego, CA). Three samples per lane were clustered on a cBot as described by the manufacturer (Illumina, San Diego, CA). Clustered RNA-Seq libraries were paired-end sequenced with 2×100 cycles on a HiScanSQ. Demultiplexing was performed utilizing CASAVA to generate the Fastq files. Each sample produced approximately 25 million reads after sequencing. One sample from the relapse group was removed from subsequent analysis after being identified in our previous study as an outlier based on principle component analyses of expression and alternative splicing [2]. The removal of this sample is additionally supported by the iLOO outlier detection algorithm [27]. Using normalized counts from DESeq2 [54] of all relapse samples, the algorithm identified 567 outlying gene features in the suspect sample − 5.74 standard deviations greater than the distribution of the number of outlying features in the other samples (mean = 143.44, standard deviation = 73.82).

Hormone receptor status in breast cancer
To further validate our model, we obtained a dataset from the publicly available SRA database (SRP042620), which was provided by Varley et al., 2014 [78]. In their publication, the authors sought to identify read-through transcripts that are significantly correlated with breast cancer and/or hormone receptor status. RNA-Seq was obtained from 42 ER+ and 42 TR-primary tumors using poly-A capture and Tn-RNA-Seq for library construction. Libraries were sequenced on the Illumina HiSeq 2000 using 50 bp paired-end reads, which produced 50 million reads on average. Instead of trying to predict some future outcome of the patients from which these tumors were sampled, we sought to identify SNV features that co-occur with hormone receptor status. Selected SNVs may thus provide insight into molecular mechanisms differentiating these two subgroups of breast cancer.

Called SNVs
After variant calling and SNPiR post-processing, 96,025 and 213,020 unique variants with read coverages >= 10 were found in the NSCLC and breast cancer datasets, respectively. SNV matrices were created by limiting SNVs to those that had at least 3 non-zero values across samples in the NSCLC dataset and 6 non-zero values in the breast cancer dataset. Tables 1 and 2 show the distributions of SNVs for each dataset based on region of origin as determined by RefSeq annotations.   Table 3 contains measures of performance for models trained on different subsets of SNVs in the NSCLC dataset. What is immediately apparent is that the nonsynonymous exonic model had the best performance by a large margin. The model performed better than chance as seen by its AUC and predictive accuracy. Furthermore, permutation tests reveal that the performance of the model is dependent on the true label groupings (p = 0.016), thereby, suggesting that selected SNVs reflect a true biological phenomenon. With the addition of synonymous variants in the model, however, performance reflects that of a failed model. Interestingly, the distribution of AUCs from the nonsynonymous exonic model was significantly better than the distribution of AUCs from the gene expression model (Student's t-test, p < 0.001).

Disease outcome in non-small cell lung cancer
Though some of the other models have AUC values that seem to be better than chance, their performances are not significant based on permutation test values.

Predictive SNV features: disease outcome in non-small cell lung cancer
The rankings of selected nonsynonymous features in the NSCLC dataset were significantly different than univariate rankings from Fisher's exact and Wilcoxon rank sum tests as determined by Friedman rank sum tests (ps < 10 −16 ).  Not surprisingly, SNV-DA was able to prioritize features that segregate the two groups. The heatmap also visualizes co-occurring features, one example being the three SNV features lying in TACC3, which form their own cluster. Additional file 1 contains the list of nonsynonymous exonic SNVs selected in this dataset as well as their respective model loadings. Limiting analysis to genes where selected SNVS are located, 19.49 % of gene/sample pairs showed significant ASE -an enrichment compared to that of the null distribution (p < 0.001, 2.64X greater than the mean of the null distribution).    [9]; MTRR, in which variants have a well-documented association with increased risk of NSCLC [77]; and three SNVS in TACC3, whose high expression is associated with poor prognosis in NSCLC [37].

Predictive SNV features: breast cancer hormone receptor status
Except for up/downstream and 5'UTR models, the rankings of selected SNV features for each model were significantly different than univariate rankings from Fisher's exact test (ps < 10 −6 ). When comparing rankings to those produced by Wilcoxon rank sum test, all were significantly different (ps < 10 −7 ). The similarity to univariate rankings in the two models is likely a result of a small initial feature set size and/or the types of patterns seen in the data. For example, though the 5' UTR model produced rankings that were significantly different than rankings from Wilcoxon rank sum test, they were not significantly different than those from Fisher's exact test (p = 0.515), suggesting that predictive power of selected SNVs in this model result more from the differential abundance of AF values (number of nonzeros) than with the differential magnitude of AF values between groups, which Wilcoxon rank sum test seeks to quantify. The distribution of SNVs by region of origin selected during the training of the all-SNVs model is given in Fig. 4. Notice that the majority of selected SNVs are located in traditional coding regions.
Additional files 2, 3, 4, 5, 6, 7 and 8 contain the lists of SNV features selected during the creation of each model. In the following sections, the top 15 SNVs for selected models are highlighted to demonstrate that the genes in which they are reside are enriched for associations with cancer.  Figure 6 demonstrates the clustering of samples by hormone receptor status. Though not a perfect clustering, the top 15 (of 50) features adequately segregate the two groups. Genes where selected exonic SNVs are located showed an enrichment of significant ASE events −16.67 % of gene/sample pairs (p < 0.001, 1.75X greater than random mean).

Exonic
Ten of the top 15 SNV features lie in genes that have previous associations with cancer; 6, of which, are located in genes specifically associated with breast cancer (Table 6).
Several outstanding examples include: HRPAP20, a hormone regulated breast cancer oncogene that promotes  [76]; specific variant association with protection from coronary heart disease [83] malignant tumor growth [40]; ARHGEF16, which promotes migration and invasion of breast cancer cells [29]; FASN, whose upregulation is associated with HER2+ tumors and metastatic lesions [38] (confirmed in this dataset: FC = 3.6, p < 10 −5 , DESeq2); and USP35, amplification of which is associated with significantly worse prognosis in breast cancer patients and is associated with ER-tumors [21]. The latter demonstrates a seemingly paradoxical result as the SNV in question is largely abundant in ER+ tumors. In fact, in this dataset USP35 is significantly upregulated in ER+ tumors (FC=4.2, p < 10 −5 ), perhaps providing evidence that conflict with the    HDAC7, which has been shown to promote breast cancer cell survival and resistance to therapy [80]; 2 SNVs lying in CTBP1 and CTBP2, both of which are associated with breast cancer cell proliferation -the former being a regulator of BRCA [23,53]; CD151, whose deregulation is predictive of poor outcome in node-negative lobular breast carcinoma [70]; and SNED1, high expression of which is correlated with poor outcome for ER-/PR-breast cancer patients [61] (interestingly SNED1 is significantly upregulated in ER+ in this dataset: FC = 2.8, p < 10 −4 ).
Also of note, several variants lie in genes that were previously implicated in the exonic model: two SNVs in ARHGEF16, one in EML4, and one in LMNTD2.   [80] significant ASE events (p < 0.001, 1.71X greater than random mean). Similar to the previously described models, the top 15 SNV features are located in genes enriched with cancer associations, 10 of 15 (Table 8); 6 of which are specifically associated with breast cancer. Some interesting examples include: NOTCH1, which has been shown to promote recurrence in breast cancer [1]; IKBKE, a breast cancer oncogene which is upregulated in TR-breast cancers [6]; LAMB1, a breast cancer biomarker [85]; PDXK, which is associated with breast cancer relapse and metastasis [35]; and COL1A1, which is upregulated in progesterone receptor positive breast cancer patients [50]. Interestingly, the top predictive SNV, rs11515 (a SNP located in the tumor suppressor gene CDKN2A), is associated with poor survival in glibolastoma patients and is moderately associated with breast cancer risk [24,71]. Also of note, one of the SNVs lies in LRRC56, which has appeared in the top selected features in the two previous models.   [5] RNA-editing Figure 9 visualizes allele fraction distributions of the 12 predictive SNVs identified during the creation of the model using SNVs located at known RNA-editing sites. Nine of the SNVs lie in genes that have previous associations with cancer ( Table 9). The top SNV, a synonymous mutation lying in NEIL1, is the only exonic SNV chosen in the model. In fact, there is a paucity of exonic SNVs in the total set (n = 8). Interestingly, RNA-editing sites and SNPs in this gene have been identified in other cancers [65]. Another interesting example is a SNV lying in NEAT1, a lncRNA whose overexpression is associated with poor prognosis in squamous cell carcinoma patients [19]. Intriguingly, one of the RNA-editing SNVs lies in ZNF552, which is also implicated in the 3' UTR and gene expression models.  (Fig. 10) [25]. The enrichment of these peaks provide strong evidence that this locus is regulated in K562. Because this region is highly enriched with selected SNVs associated with the ER+ subtype, we have termed this locus estrogen receptor positive associated hotspot (ERPAHS). In fact, 9 of the top 15 selected intergenic SNVs are located in this concentrated region (Fig. 11). Furthermore, the only characterized transcripts within 100 kbp of ERPAHS are two immediately flanking miRNAs (mir4477A and mir4477B) and a pseudogene FRG1JP, all of unknown function [68], though the two miRNA were shown to be expressed by a subset of lymphoma cell lines in a published study [33].

3' UTR
To assay whether this region is regulated in ER+ and TR-breast samples, we sought to determine if the locus is differentially expressed. The location of ERPAHS (as defined by the 8,610 bp region ± 1,000 bp; chr9: 68, 415, 905-68, 426, 515) was included in the hg19 RefSeq annotation used by featureCounts during read assignment. Importantly, this region does not overlap the annotated transcripts mentioned above. Strikingly, this region is highly expressed in both ER+ and TR-breast samples and is upregulated in ER+ tumors (FC = 1.63, p = 0.0265, Fig. 12). Furthermore, allele fraction values of the most predictive intergenic SNV, rs113539941 (chr9: 68418921 C→T), were significantly correlated with increased expression of this locus (Fig. 12). When comparing expression across samples classified with the binary presence of rs113539941, there is a more pronounced level of differential expression (FC = 2.67, p < 10 −5 , Fig. 13), suggesting a functional role for associated mutations at this locus. This example provides a proof of concept of Fig. 10 Breast Intergenic Hotspot A UCSC genome browser view demonstrating the region of enriched selected intergenic SNVs in genomic position 9q12 [43]. This locus is defined by the presence of several regulatory markers including CTCF binding, H3K27Ac, histone methylation marks, and a DNaseI hypersensitvity site, all of which were found K562 cell lines. The enrichment of top selective intergenic SNV features suggests that this locus is associated and regulated in ER+ primary tumors

Conclusions
This study introduces a new methodology and software pipeline, SNV-DA, which is used to identify differential patterns of mutation between two phenotypes from SNVs called from RNA-Seq data. In the breast cancer dataset, we demonstrated that SNV-DA was able to produce models with high predictive performances and that models were discriminant towards the true group labels, indicating that the methodology can identify and prioritize differentially abundant SNVs that are of biological interest. However, in the NSCLC dataset, the power of the proposed methodology to identify non-exonic SNVs that were predictive of disease outcome was likely limited by a combination of small sample size and/or shallow sequencing. We further demonstrated the utility of the proposed methodology by showing that selected SNV feature rankings were significantly different than univariate rankings by Fisher's exact and Wilcoxon rank sum tests and that the locations of selected SNVs were enriched with genes displaying significant allele-specific expression (Additional file 9). Though the relative performance of SNV models to traditional models trained on gene expression features varied, their performances were comparable -with the all-SNVs model producing the best classification accuracy of triple negative breast cancer samples (Additional file 10).

Fig. 12
Correlation of ERPAHS Expression with Hormone Receptor Status and rs113539941 A boxplot demonstrating the high expression of ERPAHS in both ER+ and TR-primary breast cancer samples as well as the significant upregulation of ERPAHS in ER+ tumors compared to those of TR-. Increased expression of ERPAHS is also correlated with higher allele fraction values in tumors. The grey region represents the 95 % CI from linear regression Fig. 13 Differential Expression of ERPAHS in Tumors Bearing rs113539941 Mean raw read coverage of tumors bearing rs113539941 and tumors lacking the SNV over the genomic location of ERPAHS. The approximate region of regulatory enrichment is highlighted in green with rs113539941 marked with a dashed line. Fold change and significance of differential expression is given from DESeq2 using normalized read counts Importantly, SNV-DA was able to identify small subsets of SNVs that can be used for further analysis (as little as 50 features in the breast exonic model). Characterization of top performing SNV features from optimally performing models demonstrated that there is an enrichment of SNVs originating from genes previously associated with cancer risk, progression, and survival. That the majority of selected features lie in genes that have strong relationships with the analyzed phenotype supports the use of SNV-DA for the identification and prioritization of novel molecular targets associated with disease phenotype. Furthermore, in the breast cancer dataset, SNV-DA was able to identify predictive SNVs located in ncRNA as well as intronic, 5' UTR, 3' UTR, up/downstream, and intergenic regionslocations in the genome that are routinely ignored by whole exome sequencing. One outstanding example was the prioritization of the previously studied SNP, rs11515, located in the 3'UTR of the tumor suppressor CDKN2A, which has clear associations with poor prognosis and risk in different cancers [71]. SNV-DA was also able to implicate SNVs originating from RNA-editing, an analysis exclusive to RNA sequencing data. Lastly, the identification of the differentially expressed ERPAHS locus, its significant correlation with predictive SNV allele fractions, and association with regulatory regions in the K562 cancer cell line demonstrates the utility of the proposed methodology for the identification of interesting unannotated expressed regions of the genome (also ignored by traditional differential gene expression analyses).
Because RNA-Seq variant calling is limited to regions of the genome that are expressed, the proposed methodology would not be able to identify variants that result in a marked decrease of expression below a threshold level. However, SNVs called from RNA-Seq data have the added benefit over traditional whole-genome or exome sequencing in that they provide information on the relative amounts of allelic expression, which -as shown -can be implicated in disease or phenotype. Moreover, this methodology can also be used to analyze WGS and WES data to determine the differential abundance of alleles across heterogenous tissues. Furthermore, the developers of GATK have recently outlined best practices for indel calling from RNA-Seq data; therefore, SNV-DA can also be used for the analysis of those variants where special attention is given to feature definitions (over-lapping intervals, etc). Though SNV-DA identified predictive SNVs lying in miRNA and lncRNA, the SNPiR pipeline is not designed for non-standard RNA-Seq methodologies, such as small RNA-Seq. Consequently, the development of novel variant calling methods for these non-standard approaches may produce new avenues of research that are amenable to classification using SNV-DA. A tremendous amount of RNA-Seq data has already been collected in the private and public domain. This data now has the added benefit in that it can be mined for more clinical insights via SNV-DA.

Availability of supporting data
SNV-DA is freely available at https://github.com/ Anderson-Lab/SNV-DA. Data transformation is implemented in Python and parallelized model creation, evaluation, and permutation tests are implemented in R.