Gene signatures of large and small size can perform equally well in clinical applications. For example, the NIEHS and SAI predictors for breast cancer endpoint E (NIEHS_BR_E_5 [982 features] and SAI_BR_E_1 [51 features], respectively) have the same predictive power (validation MCC of 0.748 for both signatures) despite very different numbers of features. This suggests that it is possible to minimize the size of gene signatures while maintaining their predictive power. This notion is also supported by a previous study showing that small gene signatures can perform well in discriminative analyses.
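Because the comparison above rests on the validation MCC (Matthews correlation coefficient), a minimal sketch of how this metric is computed from a binary confusion matrix may be helpful. The function name and the example counts are hypothetical and serve only as illustration; they do not reproduce the reported value of 0.748.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from binary confusion-matrix counts.

    tp/tn/fp/fn are the numbers of true positives, true negatives,
    false positives, and false negatives in the validation set.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Common convention: MCC is set to 0 when any marginal total is zero.
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical counts, for illustration only.
perfect = mcc(tp=10, tn=10, fp=0, fn=0)    # every sample classified correctly
chance = mcc(tp=5, tn=5, fp=5, fn=5)       # no better than random guessing
inverted = mcc(tp=0, tn=0, fp=10, fn=10)   # every sample misclassified
```

MCC ranges from -1 to 1 and uses all four cells of the confusion matrix, so two signatures with the same MCC on the same validation set can be regarded as performing comparably even when the classes are imbalanced.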
Biological importance can be inferred through simple similarity analyses of the gene signatures for each studied endpoint, based on their overlapping genes. Interestingly, a number of predictive gene markers have been experimentally confirmed to be related to breast cancer (Table 3). These observations are consistent with all other predictable endpoints of the MAQC-II project. For example, CA12, a gene whose expression is highly correlated with that of estrogen receptor α (ERα), is robustly regulated by estrogen via ERα in breast cancer cells, and this regulation involves a distal estrogen-responsive enhancer region [24, 25]. ESR1 encodes an estrogen receptor, a ligand-activated transcription factor composed of several domains important for hormone binding, DNA binding, and activation of transcription. In addition, high levels of MAPT (microtubule-associated protein tau) mRNA expression in ER-positive breast cancer indicate an endocrine-sensitive but chemotherapy-resistant disease. In contrast, low tau expression levels are associated with a subset of ER-positive cancers that have a poor prognosis with tamoxifen alone and may benefit from taxane-containing chemotherapy. Moreover, GATA3 (GATA binding protein 3) is reported to be a breast cancer marker and is expressed in almost all ER-positive tumors. Low levels of GATA3 are associated with invasive breast carcinomas. Numerous studies, notably those based on microarray data, have shown that expression of GATA3 is strongly and positively correlated with that of ESR1. The strong correlation between ESR1 and GATA3 expression in breast cancer tissues implies that GATA3 might cooperate with this steroid receptor to regulate breast tissue-specific hormone-responsive genes.
Since the minimization process can remove probes regardless of their ranking, some top-ranked probes are removed without affecting the predictive power of the model. To understand why, we re-mapped the probes of the two minimized signatures to their corresponding genes. No overlapping gene was found in the re-mapped list for BR_D_Model, and only 4 genes overlapped in BR_E_Model. That is, multiple probes corresponding to the same gene were rarely observed after the minimization process. To investigate this phenomenon further, we also examined the distribution of these genes across the pathways archived in MSigDB. Although the number of genes in the signatures was small, a large number of pathways were represented, and most of these pathways contained only one or two of the signature genes. Among the genes involved in multiple pathways, CCND1, IL8, IGF1R, and MYB each participate in more than 40 pathways, while numerous genes are involved in the same pathways, e.g., BRCA_ER_NEG, BRCA_ER_POS, STEMCELL_NEURAL_UP, and LEI_MYB_REGULATED_GENES. This finding suggests that our minimized gene signatures are highly representative of multiple important pathways that may be involved in the biological processes underlying the discrimination of normal tissue from breast cancer samples. The rationale behind the phenomenon could therefore be as follows: when some top-ranked overlapping probes are removed, the non-overlapping probes retain sufficient discriminatory power because the remaining probes still represent the majority of genes and pathways.
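The pathway distribution described above can be tabulated in two directions: how many signature genes fall into each pathway, and how many pathways each gene participates in. The following is a minimal sketch under stated assumptions: the function name is hypothetical, the pathway collection is assumed to be available as a mapping from pathway name to member genes, and the gene and pathway names are placeholders rather than actual MSigDB content.

```python
from collections import Counter


def pathway_coverage(signature, pathways):
    """Summarize how a gene signature is distributed across pathways.

    signature: iterable of gene symbols in the minimized signature.
    pathways:  dict mapping pathway name -> iterable of member gene symbols.
    Returns (genes_per_pathway, pathways_per_gene).
    """
    sig = set(signature)
    genes_per_pathway = {}
    pathways_per_gene = Counter()
    for name, members in pathways.items():
        hits = sig & set(members)
        if hits:  # keep only pathways that contain at least one signature gene
            genes_per_pathway[name] = sorted(hits)
            pathways_per_gene.update(hits)
    return genes_per_pathway, pathways_per_gene


# Placeholder data, invented for illustration only.
toy_signature = ["GENE_1", "GENE_2", "GENE_3"]
toy_pathways = {
    "PATHWAY_A": {"GENE_1", "GENE_9"},
    "PATHWAY_B": {"GENE_1", "GENE_2"},
    "PATHWAY_C": {"GENE_8"},
}
genes_per_pathway, pathways_per_gene = pathway_coverage(toy_signature, toy_pathways)
```

Here `genes_per_pathway` identifies pathways covered by only one or two signature genes, while `pathways_per_gene` highlights genes that participate in many pathways.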
Given the feasibility of the minimization methodology and the biological functions inferred from the similarity analyses, we further explored the rationale for the consistency and the diversity of the gene signatures. The gene signatures generated by different teams for the same clinical outcome differ from each other, and some do not share a single gene. This diversity could result from the use of different feature selection methods, classification algorithms, etc. Similarly, gene signatures for different clinical outcomes of the same disease have been shown to exhibit little overlap between features [29–31]. This observation has been attributed to multiple factors, such as different datasets, feature selection methods, classification algorithms, sample sizes, and patient diversity [32, 33]. Patient diversity includes environmental effects, age and sex, disease stage, and patient health. In addition, the genes involved at different disease stages or in different disease subtypes could also differ. Furthermore, the assumptions (i.e., gene independence) of the statistical models used in gene marker identification do not typically hold, given the small sample sizes and the complexity of gene interactions.
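The degree to which signatures from different teams share genes can be quantified pairwise. The sketch below uses the Jaccard index as the similarity measure; this choice is an assumption made for illustration and is not necessarily the measure used in the similarity analyses of this study. The function names, team labels, and gene symbols are hypothetical placeholders.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard index of two gene collections: |A and B| / |A or B|."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pairwise_overlap(signatures):
    """Shared genes and Jaccard index for every pair of signatures.

    signatures: dict mapping signature name -> iterable of gene symbols.
    Returns a dict keyed by (name_1, name_2) with values
    (sorted list of shared genes, Jaccard index).
    """
    result = {}
    for m, n in combinations(sorted(signatures), 2):
        shared = sorted(set(signatures[m]) & set(signatures[n]))
        result[(m, n)] = (shared, jaccard(signatures[m], signatures[n]))
    return result


# Placeholder signatures, invented for illustration only.
toy_signatures = {
    "team_a": ["G1", "G2", "G3"],
    "team_b": ["G3", "G4"],
    "team_c": ["G5"],
}
overlap = pairwise_overlap(toy_signatures)
```

A pair with an empty shared list and an index of 0 corresponds to the case noted above, in which two signatures for the same outcome have no gene in common.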
Despite these complex issues, in some rare cases the predictive power of each model has been independently validated with large numbers of patients, and all have shown similar performance.
Models with better performance can be generated by reducing probe redundancy with the MFS process. Several factors may contribute to this improvement. First, the input to MFS is not the full set of overlapping probes but those probes that meet a stricter criterion, which minimizes random effects and improves the predictive power of the signature. The reason is that different DATs used different modeling factors, which introduce randomness and were evaluated by MAQC-II. Second, although numerous statistical strategies have been applied during feature selection, the features in a gene signature were selected purely on the basis of statistical significance. Some of them may have no relation to the phenotype of the studied endpoint but happen to be correlated with genes that are related to it. Such features may not contribute positively to model performance and may instead introduce noise that interferes with predictive ability. Our method can identify those genes and exclude them from the minimized feature set, which eventually improves predictive power.
The MFS method generally reduces redundancy among features within gene signatures and improves the performance of the model (Table 2), which indicates the existence of consistency for the studied endpoint. Clinical applications will benefit from gene signature reduction, since a smaller gene signature with similar or better performance can increase efficiency and reduce cost. Meanwhile, most of the features remaining in the minimized gene signature tend to have a strong association with the disease, and applying such disease-oriented features in diagnosis is more informative. To assess this, we used an MCC-robustness value (Methods) as a measure for the feature selection process and examined the biological functions of the selected features through GO term analysis. However, the predictive power of newly generated classifiers depends on the quality of the training and validation datasets, as well as on the collected features and the selected classification algorithm. A gene signature newly generated by MFS cannot be expected to perform well if its input signatures perform poorly.
MFS could benefit the clinical applications of microarray technology in several ways. First, it could improve the predictive power of signatures, which is a potential contribution to the implementation of personalized medicine. Second, it reduces the number of probes in signatures, which can lower the cost of microarray applications and, more importantly, avoid the weaknesses of large signatures: insufficient samples, correlation among features, and possible inaccuracy. Third, the similarity analyses can reveal the consistency and diversity among signatures for a disease, which relate to the underlying nature of the disease.