MASiVEdb: the Sirevirus Plant Retrotransposon Database
© Bousios et al.; licensee BioMed Central Ltd. 2012
Received: 19 October 2011
Accepted: 30 April 2012
Published: 30 April 2012
Sireviruses are an ancient genus of the Copia superfamily of LTR retrotransposons, and the only one that has exclusively proliferated within plant genomes. Based on experimental data and phylogenetic analyses, Sireviruses have successfully infiltrated many branches of the plant kingdom, extensively colonizing the genomes of grass species. Notably, it was recently shown that they have been a major force in the make-up and evolution of the maize genome, where they currently occupy ~21% of the nuclear content and ~90% of the Copia population. It is highly likely, therefore, that their life dynamics have been fundamental in the genome composition and organization of a plethora of plant hosts. To assist studies into their impact on plant genome evolution and also facilitate accurate identification and annotation of transposable elements in sequencing projects, we developed MASiVEdb (Mapping and Analysis of SireVirus Elements Database), a collective and systematic resource of Sireviruses in plants.
Taking advantage of the increasing availability of plant genomic sequences, and using an updated version of MASiVE, an algorithm specifically designed to identify Sireviruses based on their highly conserved genome structure, we populated MASiVEdb (http://bat.infspire.org/databases/masivedb/) with data on 16,243 intact Sireviruses (total length >158 Mb) discovered in 11 fully-sequenced plant genomes. MASiVEdb is unlike any other transposable element database, providing a multitude of highly curated and detailed information on a specific genus across its hosts, such as a complete set of coordinates, insertion age, and an analytical breakdown of the structure and gene complement of each element. All data are readily available through basic and advanced query interfaces, batch retrieval, and downloadable files. A purpose-built system is also offered for detecting and visualizing similarity between user sequences and Sireviruses, as well as for coding domain discovery and phylogenetic analysis.
MASiVEdb is currently the most comprehensive directory of Sireviruses, and as such complements other efforts in cataloguing plant transposable elements and elucidating their role in host genome evolution. Such insights will gradually deepen, as we plan to further improve MASiVEdb by phylogenetically mapping Sireviruses into families, by including data on fragments and solo LTRs, and by incorporating elements from newly-released genomes.
The intense activity of long terminal repeat (LTR) retrotransposons has been among the major drivers (together with polyploidization events) for the often enormous size of plant genomes [1–3]. This phenomenon was initially suggested to lead plants to ‘genomic obesity’, before it was shown that mechanisms of LTR retrotransposon removal counterbalance this propensity [5–7]. The relative success of these two opposing forces likely underlies the impressive variation in the LTR retrotransposon content of plant genomes. Just over 7% of the tiny genome (125 Mb) of Arabidopsis thaliana is occupied by this transposable element (TE) type [8, 9], in contrast to approximately 25% of the rice genome (389 Mb), ~75% of the maize genome (2,300 Mb), and ~65% of the wheat genome (16,000 Mb). Although these vast genomic stretches were long dismissed as ‘selfish’ or ‘junk’ DNA [13, 14], they have eventually emerged as a major evolutionary force with profound effects not only on the structure, organization and composition of the host epi/genome, but also on the evolution, function, and regulation of genes [15–18].
Sireviruses are an ancient LTR retrotransposon genus of the Copia superfamily, and the only one (of either Copia or Gypsy) that has exclusively proliferated within the plant kingdom. Due to their host specificity they were originally termed Agroviruses, before being renamed Sireviruses (after the SIRE1 element of soybean) by the International Committee on the Taxonomy of Viruses (ICTV). In contrast to the ICTV, which has divided the Copia (or Pseudoviridae) superfamily into three genera (i.e. Sireviruses, Hemiviruses and Pseudoviruses), the unified classification system for eukaryotic TEs is devoid of an analogous genus-level taxonomy. This has left open questions about the position of Sireviruses within the LTR retrotransposon order, and about whether they should actually be considered viruses. Based on all published work on Sireviruses so far [23–28], it is safe to assume that the characteristics of their life cycle correspond to those of typical LTR retrotransposons and not of viruses.
Sireviruses have infiltrated many phylogenetic branches of flowering plants, as several elements from a plethora of monocot and eudicot species have been classified as Sireviruses [23, 27] (Additional file 1: Figure S1). More specifically, and based on the Sirevirus origin of abundant families in rice, barley and wheat [12, 30], they have extensively colonized the genomes of grasses. Notably, Sireviruses currently take up ~21% of the maize genome and ~90% of the maize Copia population, with the majority of copies accumulating during the last 600,000 years. Moreover, experimental evidence suggests that Sireviruses are present in high numbers in other species such as legumes [20, 31, 32], beets, bananas, and agaves, while approximately half of the Copia sequences deposited in GenBank belong to this genus. Therefore, through their widespread, intense and complex colonization patterns in plant genomes, Sireviruses seem to have been critical in the evolution of their hosts.
Sireviruses are also unique among LTR retrotransposons in terms of their own genome structure [23, 25]. They are the only Copia genus whose members often possess a putative envelope-like (ENV-like) gene, which, however, shares little sequence similarity among elements apart from the presence of transmembrane and coiled-coil domains [19, 27]. The origin of this ENV-like gene in Sireviruses (and some Gypsy plant LTR retrotransposons) is currently unknown. Moreover, although the mechanisms by which it is expressed (i.e. stop codon suppression, internal promoter) have been elucidated [28, 37, 38], its function (if any) has not been experimentally proven. As a result, its role remains highly controversial, and it may not even represent a true envelope gene. Another intriguing characteristic of Sireviruses is the presence of a variety of highly conserved sequence motifs within their extremely divergent genome (Additional file 1: Figure S1), regardless of the evolutionary distance between their hosts. The motifs are located in key non-coding domains known to decisively participate in the life cycle of LTR retrotransposons, and may be the underlying factors for the affinity of Sireviruses for plants.
Due to their abundance and complex insertion patterns, efficient annotation of TEs is among the most cumbersome and problematic tasks of genome sequencing projects. Despite the use of several structural [40–42] and homology-based methods, identification is hampered or misguided by the often recombined, degraded and nested genome structure of LTR retrotransposons, or by their low copy number or uniqueness, which renders them invisible to detection by comparison with previously characterized elements of abundant families. To alleviate such issues in the analysis of Sireviruses, we recently developed an algorithm able to identify Sirevirus elements with high accuracy and sensitivity. Initially implemented in maize, it yielded >2,700 previously unidentified intact Sireviruses and offered insights into their crucial role in the evolution of the maize genome. Herein, the algorithm was updated and applied to a curated collection of 14 fully-sequenced plant and algal genomes to create MASiVEdb (Mapping and Analysis of SireVirus Elements Database).
MASiVEdb offers, through multiple and often novel ways, a comprehensive, highly curated and detailed report on the full-length Sirevirus complement of each species, while it additionally includes an integrated system for analyzing user-provided sequences against MASiVEdb elements. In this way, MASiVEdb is unlike any other TE database, thus complementing the efforts of research groups to collect and organize repetitive sequences. Such widely used and useful databases include the TIGR plant repeat database, Repbase, TREP (the Triticeae Repeat Sequence Database), the maize TE database [48, 49], the species-specific RetrOryza and SoyTEdb of rice LTR retrotransposons and soybean TEs respectively, and the GyDB Gypsy database.
This unique directory of Sireviruses can aid the scientific community in a variety of ways. Firstly, it is a consistent and up-to-date source of full-length Sireviruses (at present totaling 16,243 elements), which can significantly improve TE annotation not only of species currently included in the database but also of other plant genomes. Among other things, this will enable comparative TE studies at the whole-genome level, and analyses of interactions between Sireviruses and host genes. Finally, from an evolutionary perspective, MASiVEdb provides the foundation for studying the depth and impact of infiltration of this intriguing TE genus across plants, and for discerning what underlies their success or failure in massively colonizing different phylogenetic branches of the plant kingdom.
Table 1. Properties of the host species and their Sirevirus populations included in MASiVEdb. Recoverable column headers: genome size (Mb), number of chr., avg age (my). Recoverable row labels: Oryza sativa indica, Oryza sativa japonica.
All sequences were manually inspected to remove smaller contigs and scaffolds, properly formatted and split into chromosomes. For each chromosome, intact Sireviruses were identified and analysed with an updated version of the MASiVE algorithm. MASiVE is based on the step-by-step identification of Sirevirus-specific and other critical sequence motifs of LTR retrotransposons, which have been shown to provide base-pair accuracy in outlining the element. The update mainly concerned the removal of a preliminary run of the LTRharvest algorithm for detecting generic LTR retrotransposons. The exclusion of this step increased sensitivity without sacrificing accuracy, as was confirmed by large-scale manual inspection of the resulting data. Further improvements included an optimized order of steps, element overlap detection, and data output. Specifically for the purposes of MASiVEdb, we developed custom-built Perl scripts for the detection of the integrase (INT), monocot/eudicot ENV-like core domains, the multiple zf-CCHC motif of the gag gene, and the target site duplication of each element.
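The target site duplication step mentioned above can be illustrated with a minimal Python sketch. This is not the authors' Perl code: the function name `find_tsd`, the 0-based half-open coordinates, the exact-match requirement, and the default duplication length of 5 bp are all assumptions made for illustration.

```python
from typing import Optional


def find_tsd(sequence: str, start: int, end: int, length: int = 5) -> Optional[str]:
    """Return the target site duplication flanking an element, or None.

    `start` and `end` are 0-based, half-open coordinates of the element
    within `sequence`. The default length (5 bp) and the requirement of an
    exact match are assumptions of this sketch, not MASiVE parameters.
    """
    if start - length < 0 or end + length > len(sequence):
        return None  # not enough flanking sequence to compare
    upstream = sequence[start - length:start].upper()
    downstream = sequence[end:end + length].upper()
    # A duplication is reported only if both flanks are identical.
    return upstream if upstream == downstream else None
```

For a toy sequence in which the same five bases sit immediately before and after the element, the function returns those five bases; if the flanks differ or run off the end of the sequence, it returns `None`.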
Intact Sireviruses were not detected in three out of the 14 species: the green algae Chlamydomonas reinhardtii and Ostreococcus lucimarinus, and the tree Populus trichocarpa. This provides preliminary evidence that Sireviruses were not present when land plants emerged from algae, and that they have not successfully colonized all branches of the plant kingdom.
The remaining 11 species provided a total of 16,243 full-length elements (Table 1), with a highly variable abundance, ranging from just one Sirevirus in strawberry and four in Arabidopsis, to 1,337 in soybean and 13,833 in maize. The total length of the detected elements exceeds 158 Mb. Sireviruses appear to have been active in different time periods in the genomes of their hosts, with average insertion ages ranging from 0.35 million years ago (mya) in lotus (median of 0.16) to 3.17 mya in cacao (median of 3.06). There are also stark differences in the distribution of elements containing the ENV-like gene. Nearly all Sireviruses identified within the eudicot genomes of soybean, lotus, cacao and grapevine carry the ENV-like gene, in contrast to approximately half of the populations present in grasses, including brome, rice and sorghum. Notably, and as shown in recent work, the vast majority of maize Sireviruses are devoid of it. Given the availability of a large number of such sequences in the Sireviruses of MASiVEdb, as well as in Gypsy LTR retrotransposons and other TEs available in the aforementioned databases, it may now be possible to elucidate its evolutionary origin and test recently formed hypotheses, and also to investigate whether its function (if any) can confer retrovirus-like properties on the carrier elements.
All data were organized per species into tab-delimited, GFF-formatted, FASTA-formatted, and database-ready files. The latter were loaded into a two-table schema in the PostgreSQL software system, whilst the rest are available for download as described below. Data are divided into four categories: ‘basic’, which includes the date of the run and version of the MASiVE algorithm used, the host species, Sirevirus identifier, chromosome, direction, coordinates, and distance to centromere (where available); ‘advanced’, which includes phylogenetic information on Sireviruses (currently unavailable – see Future development), presence of the ENV-like gene, age or time of insertion (in million years, e.g. 0.1 corresponds to an age of 100,000 years), and length of the element and its LTRs; ‘genes’, which includes the starting position (within the element) and length of the core domains of the reverse transcriptase (RT), INT and ENV-like genes; and ‘motifs’, which includes the target site duplication, the primer binding site (PBS) sequence and starting position, and detailed analysis of the zf-CCHC motifs and multiple polypurine tract (PPT) signature. The MASiVEdb Sirevirus identifier is constructed from the four- or five-letter code of the host species (also available online on the home page), and the direction and start coordinate of the element, e.g. Ljap_chr_3-D-12553650 stands for a lotus Sirevirus that was identified on position 12,553,650 of the sense strand (D for direct, in contrast to P for palindromic) of chromosome 3 of the lotus genome.
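The identifier scheme and the age unit described above lend themselves to a short worked example. The Python sketch below splits an identifier into its parts and converts an age from million years to years. The regular expression is inferred from the single example given in the text (Ljap_chr_3-D-12553650), so the exact grammar, as well as the names `SirevirusId`, `parse_sirevirus_id` and `age_in_years`, are assumptions rather than part of MASiVEdb.

```python
import re
from dataclasses import dataclass

# Pattern inferred from the one example in the text; the real identifier
# grammar used by MASiVEdb may differ (assumption).
_ID_PATTERN = re.compile(
    r"^(?P<species>[A-Za-z]{4,5})_chr_(?P<chrom>[^-]+)"
    r"-(?P<direction>[A-Z])-(?P<start>\d+)$"
)


@dataclass(frozen=True)
class SirevirusId:
    species_code: str  # four- or five-letter host species code
    chromosome: str    # chromosome label, kept as text
    direction: str     # single-letter strand code, e.g. "D" for direct
    start: int         # start coordinate of the element


def parse_sirevirus_id(identifier: str) -> SirevirusId:
    """Split an identifier such as 'Ljap_chr_3-D-12553650' into its fields."""
    match = _ID_PATTERN.match(identifier)
    if match is None:
        raise ValueError(f"unrecognized identifier: {identifier!r}")
    return SirevirusId(
        species_code=match["species"],
        chromosome=match["chrom"],
        direction=match["direction"],
        start=int(match["start"]),
    )


def age_in_years(age_my: float) -> int:
    """Convert an insertion age in million years to years (0.1 -> 100000)."""
    return round(age_my * 1_000_000)
```

Calling `parse_sirevirus_id("Ljap_chr_3-D-12553650")` yields species code `Ljap`, chromosome `3`, direction `D` and start `12553650`, matching the reading of the example given in the text.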
The GFF-formatted files, viewable in appropriate browsers, provide the coordinates, direction, identifier and length of the element and each LTR, plus the age of the element; and the coordinates, sequences, and direction of the motifs of the multiple PPT signature of each element. Finally, the FASTA-formatted files provide the sequences of the full-length element and its LTRs, the multiple PPT signature, the zf-CCHC motifs, and the core domains of the RT, INT, and ENV-like genes.
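As a rough guide to working with the downloadable GFF files outside a genome browser, the following Python sketch reads the nine tab-separated columns of the standard GFF layout and derives the feature length. It assumes that MASiVEdb files follow that standard layout; the attribute column is left unparsed because its keys are not described here, and the function name `parse_gff` and the example record are hypothetical.

```python
from typing import Dict, Iterable, Iterator

# Column names of the standard nine-column GFF layout.
GFF_COLUMNS = (
    "seqid", "source", "type", "start", "end",
    "score", "strand", "phase", "attributes",
)


def parse_gff(lines: Iterable[str]) -> Iterator[Dict[str, object]]:
    """Yield one dictionary per GFF feature line, skipping comments."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # blank line or comment/directive
        fields = line.split("\t")
        if len(fields) != len(GFF_COLUMNS):
            raise ValueError(f"expected 9 tab-separated columns: {line!r}")
        record: Dict[str, object] = dict(zip(GFF_COLUMNS, fields))
        start, end = int(fields[3]), int(fields[4])
        record["start"] = start
        record["end"] = end
        # GFF coordinates are 1-based and inclusive.
        record["length"] = end - start + 1
        yield record
```

Passing an open file handle, e.g. `for feature in parse_gff(open(path)): ...`, iterates over the features lazily, so large per-species files do not need to be loaded into memory at once.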
The interactive output matrix contains the information requested by the user (Figure 2C), where additional filters can be used to further process the data. The results can be downloaded in tab-delimited text format, and the related sequences as FASTA-formatted files.
The batch retrieval function (Figure 1), as the name suggests, allows the simultaneous querying of MASiVEdb with one or more Sirevirus identifiers, which means that these have to be available to the user through e.g. a previous query. Again, data to be returned in the results can be selected.
We consider the sequence similarity-based access to the Sirevirus data among the most important aspects of MASiVEdb, for which we developed an integrated system collectively termed ‘LTRphyler’, and linked it to the database sequences. Through LTRphyler users can examine whether their sequences contain Sirevirus-related fragments, visualize the sequence similarity in a highly informative way, and infer the phylogenetic position of their query (if successful) within the Copia tree. More specifically, LTRphyler is a BLAST-based system that combines the use of Circoletto for visualization, of the Wise2 package for the detection of the RT and INT genes, and of MAFFT for the construction of the RT- and INT-derived draft phylogenetic trees. The Circoletto visualization has been complemented with the highlighting of LTRs, the zf-CCHC motifs, and the RT, INT and ENV-like core domains.
Finally, compressed data files (with file size) are available for download. They are divided per species and per content type, with information provided for the date of the run and the version of the MASiVE algorithm used (Figure 1).
Due to the difficulty in correctly assigning LTR retrotransposon families into the genera of the Copia/Gypsy superfamilies (which possibly contributed to the omission of a taxonomic step below ‘superfamily’ in the proposed TE classification system), and to the scarce reference on the Sirevirus origin of elements, research on Sireviruses has been very limited so far. Hence, despite their abundance and wide distribution in plant genomes as suggested in a small number of earlier publications [26, 27, 32, 33], the implications of their colonization dynamics are currently unknown. The recent discovery of their highly conserved genome structure [23, 25], however, enabled their efficient and collective identification in one step, which has already proven crucial in elucidating their role in the structure and evolution of the maize genome. Large-scale studies on other TE superfamilies like Helitrons [59–61] or subclasses like Pack-MULEs [62–64], which are distinguished by their structural characteristics or their amplification intricacies (i.e. carrying gene fragments), have shown that research at these higher classification levels can provide valuable insights into the mechanisms of plant genome evolution.
We argue that MASiVEdb is a step in this direction. It represents the resource and methodological platform that can support research for uncovering the integrative impact of a specific TE genus on plant genomes - the first such attempt for LTR retrotransposons, excluding research on Gypsy chromoviruses [65, 66]. In this respect, but also based on the functionality and technologies it incorporates, MASiVEdb is unique, and hence complementary to the compendium of related databases [45–52]. Consequently, our aim for MASiVEdb is to assist other similar resources in untangling the complex genomic landscape of plants, by means of accurate annotation of TEs and genes, by assisting studies on their interactions, and by enabling whole-genome comparative analysis of their TE complement. Such insights will gradually deepen as MASiVEdb continuously expands its phylogenetic coverage (see below).
We plan to periodically update MASiVEdb with Sireviruses from other plant genomes as they become available, so as to delve deeper into their distribution across plants, and possibly uncover more branches (like maize) where Sireviruses have aggressively amplified to achieve massive numbers, or others in which they have spectacularly failed to establish. We also intend to enrich the database with entries from various species where only limited sequence information is available.
The next major update of MASiVEdb will include a comprehensive phylogenetic analysis of its elements, and their categorization into families within and across species. Although such an analysis could have been performed relatively easily, either by sequence clustering with a number of pre-annotated elements or by constructing, for example, an RT-based tree, we strongly believe that more intricate sequence and genome characteristics of Sireviruses should be taken into consideration, a considerable undertaking that is beyond the scope of this first version of MASiVEdb. Finally, we expect in the near future to be able to incorporate sequence data of fragmented Sireviruses and solo LTRs.
MASiVEdb is so far the most comprehensive directory for Sireviruses, an abundant and distinctive genus of plant LTR retrotransposons, in currently available fully-sequenced plant genomes. Although there are a number of databases (and methods behind them) dealing with the repetitive fraction of genomes, the methodology of MASiVEdb provides unprecedented accuracy in delineating and analyzing Sireviruses, in turn enabling robust and meaningful research into their own life and their impact on their hosts.
MASiVEdb is freely accessible without any restriction to its use by non-academics at http://bat.infspire.org/databases/masivedb.
We thank Prof. Athanasios Tsaftaris and Dr. Kostas Stamatopoulos for reading and improving the manuscript. This work was partially supported by the Hellenic General Secretariat for Research and Technology (GSRT). ND is currently supported by CEITEC MU (CZ.1.05/1.1.00/02.0068) and project SuPReMMe (CZ.1.07/2.3.00/20.0045). AT acknowledges a PhD scholarship from the “Alexander S. Onassis” Public Benefit Foundation.