Performance of the classifier as a function of contig length and training set composition. The validation set consists of the confirmed chrUn contigs, and the training set is a subset of the mapped contigs. (A) Contig length matters. The open circles show results without conditioning on length; instead, the same validation set was classified using training sets of different contig lengths (1 kb to 1000 kb). For each contig length, we randomly selected 200 W and 200 non-W contigs for the training set, and repeated this selection 100 times. The validation set contigs are short (average length <1 kb), and the classifier performs better when shorter contigs are used for training. However, performance is highest when the classifier conditions on length (solid circle). Classifier performance is measured by mean AUC over the repeated selections. (B) AUC for different ratios of non-W to W contigs in the training set. AUC increases as the non-W:W ratio decreases.
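The resampling procedure described for panel A can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the caption does not specify the classifier, so `fit` is a hypothetical callable that takes the two training classes and returns a scoring function, and `toy_fit` with its single numeric feature per contig is an invented stand-in. Conditioning on contig length is not modeled here.

```python
import random
from statistics import mean


def auc(scores_pos, scores_neg):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair.
    """
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


def mean_auc_over_resamples(w_pool, non_w_pool, val_w, val_non_w, fit,
                            n_per_class=200, n_repeats=100, seed=0):
    """Draw balanced training sets repeatedly and average the validation AUC.

    `fit` is a hypothetical placeholder: fit(train_w, train_non_w) must
    return a function that maps a contig to a score (higher = more W-like).
    """
    rng = random.Random(seed)
    aucs = []
    for _ in range(n_repeats):
        # Sample without replacement within each class (200 W, 200 non-W).
        train_w = rng.sample(w_pool, n_per_class)
        train_non_w = rng.sample(non_w_pool, n_per_class)
        score = fit(train_w, train_non_w)
        aucs.append(auc([score(c) for c in val_w],
                        [score(c) for c in val_non_w]))
    return mean(aucs)


def toy_fit(train_w, train_non_w):
    """Invented stand-in classifier: one numeric feature per contig.

    Scores a contig by its signed distance from the midpoint of the
    two class means.
    """
    mean_w, mean_non_w = mean(train_w), mean(train_non_w)
    midpoint = (mean_w + mean_non_w) / 2.0
    sign = 1.0 if mean_w >= mean_non_w else -1.0
    return lambda x: sign * (x - midpoint)


# Toy data: each "contig" is reduced to a single hypothetical feature value.
w_pool = [1.0 + 0.01 * i for i in range(300)]
non_w_pool = [-1.0 - 0.01 * i for i in range(300)]
result = mean_auc_over_resamples(w_pool, non_w_pool,
                                 val_w=[0.5, 0.8], val_non_w=[-0.5, -0.2],
                                 fit=toy_fit)
```

For panel B, the same loop would apply with unequal class sizes in place of the single `n_per_class` argument, so that the non-W:W ratio can be varied.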