SoyTEdb: a comprehensive database of transposable elements in the soybean genome

Background Transposable elements are the most abundant components of all characterized genomes of higher eukaryotes. It has been documented that these elements not only contribute to the shaping and reshaping of their host genomes, but also play significant roles in regulating gene expression, altering gene function, and creating new genes. Thus, complete identification of transposable elements in sequenced genomes and construction of comprehensive transposable element databases are essential for accurate annotation of genes and other genomic components, for investigation of potential functional interaction between transposable elements and genes, and for the study of genome evolution. The recent availability of the soybean genome sequence has provided an unprecedented opportunity for the discovery and the structural and functional characterization of transposable elements in this economically important legume crop.
Description Using a combination of structure-based and homology-based approaches, a total of 32,552 retrotransposons (Class I) and 6,029 DNA transposons (Class II) with clear boundaries and insertion sites were structurally annotated and clearly categorized, and a soybean transposable element database, SoyTEdb, was established. These transposable elements have been anchored in and integrated with the soybean physical map and genetic map, and are browsable and visualizable at any scale along the 20 soybean chromosomes, along with predicted genes and other sequence annotations. BLAST search and other infrastructure tools were implemented to facilitate annotation of transposable elements or fragments from soybean and other related legume species. The majority (> 95%) of these elements (particularly a few hundred low-copy-number families) are first described in this study.
Conclusion SoyTEdb provides resources and information related to transposable elements in the soybean genome, representing the most comprehensive and the largest manually curated transposable element database for any individual plant genome completely sequenced to date. Transposable elements previously identified in legumes, the third largest family of flowering plants, are relatively scarce. Thus this database will facilitate structural, evolutionary, functional, and epigenetic analyses of transposable elements in soybean and other legume species.


Background
Transposable elements (TEs) are the most abundant genomic components in flowering plants. For example, approximately 40% of the rice genome [1] and 80% of the maize genome [2] are occupied by TEs. Based on transposition mechanisms, TEs are generally classified into two types: DNA transposons and retrotransposons. DNA elements in plants are further classified into at least seven superfamilies based on their structural features and transposase similarities, whereas retrotransposons are traditionally separated into two superfamilies, the long terminal repeat (LTR)-retrotransposons and the non-LTR retrotransposons [3]. Although they are often referred to simply as 'junk DNA', more and more evidence demonstrates that TEs not only contribute to the shaping and reshaping of plant genomes and epigenomes, including centromeric regions, through their amplification, recombination, and methylation [4,5], but also play significant roles in regulating the expression of adjacent genes [6] and creating the raw material for the evolution of new genes and new genetic functions [7][8][9].
Identification of TEs in a species is the first step towards the understanding of their functional roles. However, precise characterization of TEs in complex genomes is not straightforward. First, many TEs, despite their abundance, have undergone intra- or inter-element unequal recombination [10,11], or accumulation of small deletions by illegitimate recombination [10,11], and thus are structurally incomplete. Second, many TEs are organized in nested patterns [12] or in chimerical structures [7], which hamper the application of programs for automated annotation of such elements. Finally, numerous elements belonging to low-copy or even single-copy number families are highly diverged within or across species, and thus are less likely to be identified by comparison with limited numbers of previously characterized elements belonging to the same families.
Therefore, it remains challenging to identify and characterize the various families of TEs, especially new and low-copy number elements, in plant genomes. These TEs, as shown in rice, are apt to be mis-annotated as genes or to affect the prediction of the structures of genes in which they reside or which they flank [13]. Hence, the full characterization of TEs is a critical step towards the accurate annotation of genes in a sequenced complex genome and for the investigation of interactions between TEs and genes. To this end, RetrOryza, a manually curated database of the rice LTR-retrotransposons, was constructed [14]. The authors characterized many low-copy families of LTR-retrotransposons that were not collected in either Repbase [15] or the TIGR plant repeat database [16], two repeat databases that contain TEs (primarily TE fragments) from multiple plant species. In addition, manual identification and detailed analyses of DNA transposons, such as Pack-MULEs in rice [7] and Helitrons in maize [17], have been performed at the whole or nearly whole genome level, highlighting the necessity and significance of careful characterization of TEs in individual organisms.
Soybean (Glycine max, 2n = 40) is the most valuable legume crop in the world, with numerous nutritional and industrial uses. Previous studies demonstrated that the soybean genome has undergone multiple whole-genome duplications [18], making it one of the most complex plant genomes investigated to date. Because of the economic significance of soybean, its genome has recently been sequenced and assembled by combining whole-genome-shotgun (WGS) sequencing with the integration of physical and genetic maps [19]. The present pseudomolecules (Glyma1.01) of the soybean genome comprise 975 Mb of DNA that is assembled and mapped in the 20 chromosomes [19]. To facilitate gene and genome annotation, and to better understand the organization, structure and evolution of the soybean genome, we characterized all families of TEs in this genome and constructed a comprehensive database of soybean TEs, of which fewer than 5% were previously identified [20][21][22][23][24]. We implemented web-based sequence browsing, visualization, and comparison tools to facilitate the annotation of TEs or TE fragments in genomic sequences from soybean and other closely related legume species. In addition, the resource and tools allow users to study potential gene-TE interaction, TE-mediated gene creation, and TE-mediated evolution of duplicated regions of soybean, to identify active TEs for functional genomics, to develop TE-based molecular markers for applied studies, and to address other relevant biological questions.

Construction and content
A combination of structure-based and homology-based approaches was employed to identify TEs in the 975 Mb of genomic sequence, but the procedures and programs used for different classes or superfamilies of TEs varied. LTR-retrotransposons were characterized by the methods previously described [25]. Non-LTR retrotransposons (such as LINEs), Helitrons, and other DNA transposons were identified following the protocol provided by Holligan et al. [26]. More than a dozen custom Perl scripts were written to facilitate the data mining and analyses. Detailed manual inspection was conducted to confirm each predicted element and to define its structure and boundaries. LTR-retrotransposons were classified into different families based on the criteria proposed by Wicker et al. [3], while other elements were classified into superfamilies as previously described [26]. Only elements with clearly defined boundaries were deposited in the database.
Using the approaches above, we identified 32,370 LTR-retrotransposons, including 14,106 intact elements and 18,264 solo LTRs. These elements are classified into 510 distinct families, of which 353 were categorized as Gypsy-like families and 157 as Copia-like families on the basis of the order of protein coding domains [27] and/or sequence similarity. Of these families, 22 were previously described, and one of them (the SIRE family) has been collected in the TIGR plant repeat database to date [16]. A total of 182 LINEs with clearly defined target site duplications (TSDs) were identified, which are categorized into five distinct families. Overall, the 32,552 class I elements and numerous fragments defined by RepeatMasker [28] make up 42% of the soybean genome. In addition to the class I elements, 6,029 DNA transposons were identified, including nine Tc1-Mariners, 90 PIF-Harbingers, 65 hATs, 2,373 Mutators, 65 CACTAs, 12 PONGs and 82 Helitrons. These manually curated intact elements and fragments defined by RepeatMasker account for 16% of the soybean genome. None of these class II elements from soybean were previously collected in either Repbase or the TIGR plant repeat database. The elements identified and deposited in SoyTEdb are summarized in Table 1.
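The class I totals reported above can be checked with a few lines of arithmetic. The sketch below only re-adds the counts given in the text; the combined TE fraction is an illustrative sum that assumes the class I and class II fractions do not overlap, which the text does not state explicitly.

```python
# Counts copied from the paragraph above.
intact_ltr = 14_106
solo_ltr = 18_264
lines = 182              # LINEs with clearly defined TSDs
gypsy_families = 353
copia_families = 157

ltr_total = intact_ltr + solo_ltr
class_i_total = ltr_total + lines
ltr_families = gypsy_families + copia_families

# Genome fractions (percent) for curated elements plus RepeatMasker-defined fragments.
class_i_pct = 42
class_ii_pct = 16
# Assumption: the two fractions are disjoint, so they can simply be added.
te_pct = class_i_pct + class_ii_pct

print(ltr_total, class_i_total, ltr_families, te_pct)
```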

Utility
The SoyTEdb web interface is organized into functional sections. Each of the main navigation tabs (Figure 1A) provides a specific capability for retrieving information on TEs from the database or viewing the TEs in the context of either the genetic or genome sequence maps.

Sorting TEs in an ontological category
TEs can be retrieved based on their ontological classification. A graphical representation of the ontology is presented (Figure 1B). Clicking on a node retrieves all of the TEs in the ontology hierarchy from that node downwards. Because the list of TEs will typically be very large, a summary of the search results is shown, with the entire results available for download in either tab-delimited or FASTA format.
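As an illustration of how a downloaded FASTA file might be summarized, the following Python sketch reads FASTA records and tallies them by superfamily. The record names and the `superfamily=` tag in the headers are hypothetical; the actual SoyTEdb header layout may differ and the parsing of the header would need to be adapted accordingly.

```python
from collections import Counter
from io import StringIO


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Hypothetical records standing in for a downloaded file.
example = StringIO(
    ">TE_0001 superfamily=Gypsy\nACGTACGT\nACGT\n"
    ">TE_0002 superfamily=Copia\nGGCC\n"
    ">TE_0003 superfamily=Gypsy\nTTAA\n"
)

records = list(read_fasta(example))
# Sequence length per element, keyed by the first word of the header.
lengths = {header.split()[0]: len(seq) for header, seq in records}
# Number of elements per (hypothetical) superfamily tag.
by_superfamily = Counter(header.split("superfamily=")[1] for header, _ in records)
```

For a real download, `example` would be replaced by an open file handle.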

Finding TEs around genes
A list of the TEs for an entire chromosome or in a user-defined window around either a chromosomal position or a gene model can be generated (Figure 2C). Each TE is annotated with chromosome and start/stop position, the complete ontology classification and a short description of the TE's structure. These data can be downloaded in a tab-delimited or FASTA format which includes the sequences of the TEs. This function can help users to identify TEs that surround the genes of interest, and to study the interaction between TEs and genes.
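The window query described above amounts to an interval-overlap filter. The Python sketch below shows one way to reproduce it offline on downloaded coordinates; it is not the SoyTEdb implementation, and the element names, chromosome labels, and coordinates are invented for illustration.

```python
from typing import NamedTuple


class TE(NamedTuple):
    name: str
    chrom: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive


def tes_in_window(tes, chrom, start, end, flank=0):
    """Return TEs on `chrom` that overlap [start - flank, end + flank]."""
    lo, hi = max(1, start - flank), end + flank
    return [t for t in tes if t.chrom == chrom and t.start <= hi and t.end >= lo]


# Hypothetical annotations.
tes = [
    TE("te1", "Gm01", 1_000, 6_000),
    TE("te2", "Gm01", 20_000, 25_000),
    TE("te3", "Gm02", 1_000, 6_000),
    TE("te4", "Gm01", 14_000, 14_500),
]

# TEs within 5 kb of a hypothetical gene model at Gm01:10,000-12,000.
nearby = tes_in_window(tes, "Gm01", 10_000, 12_000, flank=5_000)
nearby_names = [t.name for t in nearby]
```

A linear scan is adequate for a single chromosome; for repeated queries over the full set of elements, sorting by start position or using an interval tree would be a natural refinement.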

Visualizing TEs in the context of genetic map and genome sequence
The soybean TEs can be viewed in the context of either the composite soybean genetic map or the Williams 82 genomic sequence (Figure 2). These views are accomplished using the CMap and GBrowse components of The GMOD Project [29]. The genetic map view is useful for obtaining an overview of the TE distribution and genetic marker distribution for a chromosomal region or an entire chromosome (Figure 2A). As TEs are largely enriched in the low-recombination heterochromatic regions or other gene-poor regions, where few genetic markers are generally mapped, the integration of TE distribution and genetic map can help users to develop unique repeat-junction markers [30] that can be used for construction of finer genetic maps or mapping of genes of interest. The sequence map view allows users to zoom into a region of the chromosome and see the TEs relative to the other sequence annotations (gene models, transcripts, etc.) (Figures 1C and 2B), and thus allows users to identify TEs that may alter the structures and/or regulate the expression of genes. Nested TEs are indicated in the sequence map displays using the familiar box & line glyphs (Figure 1C). The genetic and sequence displays are interconnected via contextual menus, which also allow a quick retrieval of all of the information available for a specific TE.
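Nested insertions of the kind shown in the sequence view can also be recovered from coordinates alone. The Python sketch below assigns each TE to the innermost TE that fully contains it; it assumes nesting can be inferred purely from coordinate containment, which is a simplification of the manual curation described in this paper, and all element names and positions are hypothetical.

```python
def find_nested(tes):
    """Map each TE name to the innermost TE that fully contains it.

    `tes` is a list of (name, chrom, start, end) tuples with inclusive
    coordinates. Elements with identical coordinates are treated as nested
    in whichever one sorts first.
    """
    parents = {}
    stack = []  # (name, end) of candidate enclosing TEs, innermost last
    current_chrom = None
    # Sort by position; at equal start the longer (outer) element comes first.
    for name, chrom, start, end in sorted(tes, key=lambda t: (t[1], t[2], -t[3])):
        if chrom != current_chrom:
            stack, current_chrom = [], chrom
        # Discard candidates that end before this element does.
        while stack and stack[-1][1] < end:
            stack.pop()
        if stack:
            parents[name] = stack[-1][0]
        stack.append((name, end))
    return parents


# Hypothetical coordinates: C sits inside B, which sits inside A.
example = [
    ("A", "Gm01", 100, 10_000),
    ("B", "Gm01", 2_000, 3_000),
    ("C", "Gm01", 2_200, 2_500),
    ("D", "Gm01", 9_000, 12_000),  # overlaps A but is not contained in it
    ("E", "Gm02", 2_100, 2_400),   # different chromosome
]
nesting = find_nested(example)
```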

Searching sequence similarity using BLAST
Because the structural variation and distribution patterns of TEs vary among classes and among families, a single annotation pipeline cannot satisfy all users with different interests. Thus, we did not intend to develop new tools or to integrate currently available tools (except for BLAST) for sequence comparison, editing and/or assembly in our database infrastructure. However, the SoyTEdb website provides the canonical web BLAST interface, which allows users to quickly and conveniently compare their sequences with the soybean TEs deposited in SoyTEdb.
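Users who download the TE sequences and run BLAST locally can post-process the hits themselves. The Python sketch below assumes the results were saved in the standard 12-column tabular format (BLAST+ `-outfmt 6`) and keeps the highest-scoring TE hit per query; the identity, length, and E-value thresholds and the sequence names are illustrative assumptions, not values used by SoyTEdb.

```python
import csv
from io import StringIO

# Default column order of BLAST+ tabular output (-outfmt 6).
FIELDS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def best_hits(handle, min_identity=80.0, min_length=80, max_evalue=1e-5):
    """Return {query: (subject, bitscore)} for the best hit passing the filters."""
    best = {}
    for row in csv.DictReader(handle, fieldnames=FIELDS, delimiter="\t"):
        if row["qseqid"].startswith("#"):
            continue  # skip comment lines if present
        if (float(row["pident"]) < min_identity
                or int(row["length"]) < min_length
                or float(row["evalue"]) > max_evalue):
            continue
        score = float(row["bitscore"])
        query = row["qseqid"]
        if query not in best or score > best[query][1]:
            best[query] = (row["sseqid"], score)
    return best


# Hypothetical hits of two contigs against a TE library.
example = StringIO(
    "contig1\tTE_0001\t95.2\t500\t24\t0\t1\t500\t1\t500\t1e-150\t800\n"
    "contig1\tTE_0002\t85.0\t300\t45\t0\t1\t300\t10\t309\t1e-60\t350\n"
    "contig2\tTE_0003\t70.0\t200\t60\t0\t1\t200\t1\t200\t1e-10\t90\n"
)
hits = best_hits(example)
```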

Discussion
We established SoyTEdb under the infrastructure of SoyBase and the Soybean Breeder's Toolbox [31]. As such, SoyTEdb represents the only TE database that is integrated with a genetic map and a physical map, with annotation tools and annotations of other DNA components, and with nearly 20 years of quantitative trait locus (QTL) analyses of agronomically important genes. SoyBase and the Soybean Breeder's Toolbox were described in the "National Plant Genome Initiative: 2009-2013" [32] as databases that bridge genomics and application for crop improvement. Thus, SoyTEdb can be used for both basic research and applied studies, such as marker development for mapping agronomically important genes. It can also readily be used for both intra- and inter-specific comparison of transposable elements at the whole-genome level.
In light of recent discoveries made from detailed analysis of TEs in plants, such as rice and maize [7,8], the value of creating a complete TE database for an individual genome can be substantial. Although the TIGR plant repeat database is currently available, it contains only approximately 4,000 TEs, many of which are fragments and very few of which were manually inspected. In addition, the majority of TEs collected in the TIGR database are from grasses, and very few were identified in legumes, the third largest family of flowering plants. For example, only 23, eight, and zero TEs or fragments were collected from soybean, Lotus, and Medicago, respectively. It is thus not surprising that this database was rarely used for annotation of even the rice genome. By contrast, RetrOryza, a manually curated rice LTR-retrotransposon database, despite its incompleteness [33], has served as an essential resource for the reannotation of the rice genome [34]. Thus, manual annotation of a complete set of TEs is desirable for any genome sequencing project and research community.

Conclusion
We have generated a comprehensive database of transposable elements, of which ~95% were first identified in this study and ~5% were identified in previous studies [19][20][21][22][23]. This database has been used in the soybean genome annotation pipeline to facilitate accurate annotation of the soybean genes. SoyTEdb will be valuable as the legume community undertakes the structural and functional characterization of TEs and their interaction with genes in soybean and related legume species. In addition, the availability of the complete set of TEs from a complex dicot genome allows evolutionary and comparative analyses of TEs between dicot and monocot species at the whole genome level.

Future perspectives
Future SoyTEdb development includes the integration of TE data from Glycine soja, other Glycine species, and common bean, whose genomes will be completely or partially sequenced [SoyMapII project supported by the US NSF Plant Genome Research Program Grant # DBI-0822258; Common Bean Sequencing Project to be supported by the USDA Agriculture and Food Research Initiative (Jackson, pers. comm.)]. In addition, genes captured by TEs and TEs that carry gene fragments in soybean and these relatives will be identified, classified and integrated into the database in the context of the comparative genome maps of multiple species.

Availability and requirements
All TEs or subsets of TEs can be downloaded from the SoyTEdb website http://www.soytedb.org, which is publicly accessible. These data are freely available, without any restrictions on use by non-academics.