Fig. 1 | BMC Genomics

Fig. 1

From: RNA-seq assistant: machine learning based methods to identify more transcriptional regulated genes

Performance comparison of models with various feature selection and classification methods. a Segments associated with protein-coding genes. The features considered for predicting differential gene expression are depicted segment by segment. Listed from the 5′ to the 3′ end of a protein-coding gene are: the region up to 1500 bp (TSS 1500) and up to 200 bp (TSS 200) upstream of the transcription start site (TSS), the 200 bp downstream of the TSS (TSS + 200), the 200 bp downstream of the transcription termination site (TTS) (TTS 200), the first exon, which may include the 5′ UTR, the first intron, the exon body, the last intron, and the last exon, which may include the 3′ UTR. A full transcript region is defined as the UTRs and the coding region together. A full gene region is defined as the UTRs, the coding region, and the introns together. b–f Receiver operating characteristic (ROC) curves and g areas under the curve (AUC) compare the performance of models built from different combinations of feature selection (red line, InfoGain; blue line, Correlation feature selection (CFS); green line, ReliefF) and classification (b Logistic Regression, c Classification Via Regression, d Random Forest, e Logistic Model Trees (LMT) and f Random Subspace), evaluated on the training data with 10-fold cross-validation. The model combining InfoGain-based feature selection with Logistic Regression classification was selected as the best model.
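
As a rough illustration of the AUC metric used in panel g, the sketch below computes the area under the ROC curve directly from labels and classifier scores, using the standard equivalence between AUC and the probability that a randomly chosen positive example is scored above a randomly chosen negative one (ties count as one half). It is not the authors' pipeline: the function name `roc_auc` and the toy labels and scores are assumptions made for this example only.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve for binary labels (1 = positive, 0 = negative).

    Computed as the fraction of (positive, negative) pairs in which the
    positive example receives the higher score; tied pairs count as 0.5.
    """
    if len(labels) != len(scores):
        raise ValueError("labels and scores must have the same length")

    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Hypothetical toy data: two negatives and two positives.
    toy_labels = [0, 0, 1, 1]
    toy_scores = [0.1, 0.4, 0.35, 0.8]
    print(roc_auc(toy_labels, toy_scores))  # 3 of 4 pairs ranked correctly: 0.75
```

Under 10-fold cross-validation, one common choice is to pool the held-out scores from all folds and compute a single AUC from them; averaging the per-fold AUCs is another. The caption does not say which convention the figure uses.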