The major focus of this study was to generate comprehensive and accurate class 2 TE data for comparative analyses. To this end we utilized TARGeT and MITE-Hunter, two programs that have proved efficient at detecting cTEs and nTEs, respectively [22, 23]. In our analyses we separated nTEs from cTEs, classified them into superfamilies, and further identified MITEs among the nTEs. This protocol was necessary because nTEs and cTEs have distinct features. For example, coding Tc1/mariners have only about 50 copies in each of the analyzed genomes, but there are two orders of magnitude more nTc1/mariners (mainly Stowaways) (Figure 1 and Additional file 1). The dramatic amplification potential of small nTEs, in particular MITEs, also differs markedly across superfamilies. Extensive manual curation was performed for each nTE consensus sequence to ensure the accuracy of TE discovery and classification. In this way, we achieved the most comprehensive annotation of class 2 cut-and-paste TEs to date in these four grass genomes. For example, we found several-fold more Stowaway and Tourist elements than a previous annotation of these elements in rice, sorghum and maize. Comparative analysis of this robust dataset led to the identification of several previously unknown features related to copy number, element size, genomic distribution and correlation with the expression level of nearby genes.
The CACTA superfamily is the outlier in all comparisons. Among the superfamilies analyzed in this study, CACTA has the fewest nTEs and the most cTEs (Figure 1, Additional file 1). This paucity of nCACTAs suggests that this superfamily generates fewer short elements than the others. Further, in three of the four genomes analyzed, CACTAs are enriched in intergenic regions, where their copy numbers increase proportionally with genome size (see Figure 1). Finally, in at least two grass species (maize and rice), the presence of nCACTAs in or near genes is negatively correlated with transcription (Figure 4). Taken together, these data suggest that CACTA elements have either evolved a strategy of insensitivity to, or avoidance of, genic regions or are removed from genic regions by selection.
The other four class 2 TE superfamilies also have distinctive features that are conserved in all genomes analyzed. For example, the ratio of nTEs to cTEs is highest for the Tc1/mariner superfamily and next highest for PIF/Harbinger, followed by hAT, Mutator and CACTA (Figure 1). The average length of nTEs increases in this same order (Figure 2 and Additional file 2), suggesting that there is a range of lengths that is optimal for the transposition and amplification of each superfamily.
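The ranking by nTE-to-cTE ratio described above can be sketched in a few lines. This is only an illustration: the copy counts below are hypothetical placeholders chosen for readability, not values from this study, and the function names are assumptions introduced here.

```python
# Hypothetical copy counts for illustration only; these are NOT the
# counts reported in Figure 1 or the additional files.
counts = {
    "Tc1/mariner":   {"nTE": 5000, "cTE": 50},
    "PIF/Harbinger": {"nTE": 4000, "cTE": 100},
    "hAT":           {"nTE": 1500, "cTE": 150},
    "Mutator":       {"nTE": 1200, "cTE": 300},
    "CACTA":         {"nTE": 200,  "cTE": 400},
}


def nte_to_cte_ratio(entry):
    """Return the nTE:cTE ratio, or infinity if no coding copies exist."""
    if entry["cTE"] == 0:
        return float("inf")
    return entry["nTE"] / entry["cTE"]


def rank_by_ratio(counts):
    """Return superfamily names sorted from highest to lowest nTE:cTE ratio."""
    return sorted(
        counts,
        key=lambda name: nte_to_cte_ratio(counts[name]),
        reverse=True,
    )


ranking = rank_by_ratio(counts)
for name in ranking:
    print(f"{name}: {nte_to_cte_ratio(counts[name]):.1f}")
```

Substituting the per-genome counts from an annotation would allow the same ordering to be checked independently in each genome.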
With regard to the distribution of elements in the four superfamilies, we have extended an observation originally made in rice and show that, except for nCACTAs, nTEs are enriched at gene borders (Figure 3). Specifically, nTEs are most abundant at the 5′ gene border and are also enriched, but less so, near the 3′ border (Figure 3 and Additional file 3). This result, however, differs from a recent report in rice. For the PIF/Harbinger superfamily, enrichment of the active MITE mPing was shown previously to result from its preference for insertion into gene-proximal regions. Although our data are descriptive and as such cannot distinguish between insertion preference and winnowing by selection, the strikingly similar patterns around grass genes suggest a preference. A similar insertion preference was observed for Hermes, an active member of the hAT superfamily from the housefly Musca domestica. Characterization of almost two hundred thousand insertion sites in a Saccharomyces cerevisiae transposition assay revealed a marked preference for nucleosome-free regions (NFRs) around genes, presumably because of their accessibility. Given that NFRs have also been found near the 5′ ends of plant genes, it is possible that their distribution underlies the pattern of nTE insertion sites from four of the five superfamilies in plants.
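One way to tabulate a distribution of elements around gene borders is to assign each element to an upstream (5′) or downstream (3′) distance bin relative to a strand-aware gene model. The sketch below is a minimal illustration under assumed conventions (1-based inclusive coordinates, element position taken as a single midpoint, 100 bp bins within a 1 kb window); the function names, bin size and window are hypothetical and are not the parameters used in this study.

```python
from collections import Counter


def classify_position(te_mid, gene_start, gene_end, strand):
    """Locate a TE midpoint relative to a gene.

    Returns (region, distance) where region is "genic", "upstream"
    (5' flank) or "downstream" (3' flank), taking strand into account.
    """
    if gene_start <= te_mid <= gene_end:
        return ("genic", 0)
    if te_mid < gene_start:
        distance = gene_start - te_mid
        region = "upstream" if strand == "+" else "downstream"
    else:
        distance = te_mid - gene_end
        region = "downstream" if strand == "+" else "upstream"
    return (region, distance)


def bin_flanking_tes(te_mids, gene, bin_size=100, window=1000):
    """Count flanking TEs per (region, bin index) within the window.

    Bin 0 covers distances 1..bin_size, bin 1 covers the next bin_size
    bases, and so on. Genic and out-of-window elements are skipped.
    """
    gene_start, gene_end, strand = gene
    counts = Counter()
    for te_mid in te_mids:
        region, distance = classify_position(te_mid, gene_start, gene_end, strand)
        if region == "genic" or distance > window:
            continue
        counts[(region, (distance - 1) // bin_size)] += 1
    return counts


# Toy coordinates for illustration only.
plus_gene = (1000, 2000, "+")
toy_counts = bin_flanking_tes([950, 900, 2050, 1500, 5000], plus_gene)
```

Summing such counts over all annotated genes, separately for each superfamily, would give per-bin totals that can be compared between the 5′ and 3′ flanks. A tabulation of this kind remains descriptive and does not by itself separate insertion preference from selection.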
The dramatic enrichment of class 2 nTEs around genes, especially in promoters, prompted us to analyze the correlation between these elements and the expression levels of nearby genes using microarray data. Genes with PIF/Harbinger and Tc1/mariner elements, the two superfamilies that generate the majority of MITEs, showed significantly higher expression values (Figure 4, Additional files 6). Furthermore, genes with hAT and Mutator elements showed higher expression levels in maize but not in all rice tissues. In contrast, as discussed above, genes with CACTA elements were associated with lower gene expression.
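A comparison of expression between genes with and without a nearby element can be illustrated with a simple two-sided permutation test on the difference in group means. This is a generic stand-in, not the statistical procedure used in this study (which is not specified in this section), and the expression values below are toy numbers, not microarray measurements.

```python
import random
from statistics import mean


def permutation_test(with_te, without_te, n_perm=2000, seed=0):
    """Two-sided permutation test on the difference in group means.

    Returns (observed_difference, p_value). The p-value uses the
    add-one correction so it is never exactly zero.
    """
    rng = random.Random(seed)
    observed = mean(with_te) - mean(without_te)
    pooled = list(with_te) + list(without_te)
    k = len(with_te)
    hits = 0
    for _ in range(n_perm):
        # Reassign group labels at random by shuffling the pooled values.
        rng.shuffle(pooled)
        diff = mean(pooled[:k]) - mean(pooled[k:])
        if abs(diff) >= abs(observed):
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)


# Toy expression values for illustration only.
genes_with_te = [9, 10, 11, 12, 13]
genes_without_te = [1, 2, 3, 4, 5]
observed, p_value = permutation_test(genes_with_te, genes_without_te)
print(f"difference in means = {observed}, p = {p_value:.4f}")
```

A positive observed difference would correspond to higher expression in genes carrying the element and a negative difference to lower expression. As with the analyses above, such a test measures association only and does not establish that the element causes the difference.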