Table 3 Discrimination of maize TE-encoded genes based on percent of coding sequence masked using either RepeatMasker (MIPS REcat library) or constituent 20-mer frequencies (WGS index with a threshold log repeat level of 0.8).

From: A new method to compute K-mer frequencies and its application to annotate large repetitive plant genomes

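| Method | Criterion (% of coding sequence masked) | Sensitivity (%) | 95% CI | Specificity (%) | 95% CI |
|---|---|---|---|---|---|
| REcat | > 41.56 | 96.69 | 95.8–97.5 | 92.41 | 88.8–95.1 |
| k-mer frequency | > 17.00 | 92.62 | 91.3–93.8 | 92.08 | 88.4–94.9 |
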
  1. Ab initio gene prediction was carried out on 100 unmasked BAC sequences using FGENESH, and the resulting models were classified as TE-encoded (1842 models) or as presumed genes (303 models) based on a BLASTP similarity search. ROC analysis was used to determine the optimum criterion (threshold percent of coding sequence masked), i.e. the threshold that maximizes the sum of sensitivity and specificity, so that detection of TEs is maximized while false positives are minimized.
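
The criterion selection described in the footnote can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the ROC software used for the table: the function name `best_threshold` and the toy percent-masked values are hypothetical, and a model is assumed to be called TE when its percent of coding sequence masked is strictly greater than the criterion (matching the ">" in the criterion column).

```python
from typing import List, Tuple


def best_threshold(te_scores: List[float],
                   gene_scores: List[float]) -> Tuple[float, float, float]:
    """Return (criterion, sensitivity, specificity) maximizing sens + spec.

    te_scores:   percent of coding sequence masked for models labelled TE
    gene_scores: the same measure for models labelled as presumed genes
    A model is predicted to be TE when its score is > criterion.
    """
    best = None
    # Every observed score is a candidate criterion.
    for c in sorted(set(te_scores) | set(gene_scores)):
        # Sensitivity: fraction of TE models correctly flagged as TE.
        sens = sum(s > c for s in te_scores) / len(te_scores)
        # Specificity: fraction of gene models correctly left unflagged.
        spec = sum(s <= c for s in gene_scores) / len(gene_scores)
        # Keep the first criterion with the largest sens + spec.
        if best is None or sens + spec > best[1] + best[2]:
            best = (c, sens, spec)
    return best


# Toy data (hypothetical values, not taken from the study).
te = [15, 60, 75, 80, 90]
genes = [0, 5, 10, 20, 25]
criterion, sensitivity, specificity = best_threshold(te, genes)
print(f"criterion > {criterion}: "
      f"sensitivity={sensitivity:.2f}, specificity={specificity:.2f}")
```

Maximizing sensitivity + specificity is equivalent to maximizing Youden's J statistic (sensitivity + specificity − 1), so the sketch picks the same point on the ROC curve that this rule would select. It does not compute the confidence intervals reported in the table.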