RepPop: a database for repetitive elements in Populus trichocarpa
© Zhou and Xu. 2009
Received: 04 June 2008
Accepted: 09 January 2009
Published: 09 January 2009
Populus trichocarpa is the first tree whose genome has been sequenced, and the whole genome is still being assembled. No functional annotation of the repetitive elements in the Populus trichocarpa genome is currently available.
We predicted 9,623 repetitive elements in the Populus trichocarpa genome and assigned functions to 3,075 of them (31.95%). The 9,623 repetitive elements cover ~40% of the current (partially) assembled genome. Of these, 668 have copies only in contigs that have not been assigned to one of the 19 chromosomes, while the rest all have copies in the partially assembled chromosomes.
All the predicted data are organized into an easy-to-use, web-browsable database, RepPop. Various search capabilities are provided against the RepPop database. A Wiki system has been set up so that functional annotation and curation of the repetitive elements can be done by a community rather than by the database developers alone. RepPop will facilitate the assembly and functional characterization of the Populus trichocarpa genome.
The Poplar was selected to be the first tree genome to be sequenced, mainly because of its extraordinarily rapid growth rate and its relatively compact genome size (450–500 Mbp [1, 2]). Biofuels are produced mainly from two sources, i.e. crops high in sugar or cellulose, e.g. sugar canes and plants, and plants high in vegetable oils like soybean. The rapid growth of Populus trichocarpa, coupled with its high lignocellulose content, has made it one of the model systems for the new generation of biofuels. The current assembly of the Poplar genome was released in June 2004, and its total length is ~485 Mbp. The 19 assembled chromosomes, with 7.66% gaps, account for 63.41% of the whole genome. Further efforts are still needed to close the gaps in the sequenced chromosomes.
Repetitive elements represent a significant fraction of eukaryotic genomes. They can occupy as much as 80% of some land-plant genomes, such as wheat, and as little as 10–35% for Arabidopsis thaliana and rice. There are three main classes of repetitive elements, namely local repeats (tandem and satellite repeats), interspersed repeats (transposons) and segmental duplications (duplicated genomic segments). Among them, transposable elements are the most extensively studied, and they are classified as retrotransposons or DNA transposons according to whether they are transposed through RNA or DNA intermediates. Both interspersed repeats [11–16] and other duplicated elements may induce homologous recombination and insertions/deletions in the host genome, which can make the repetitive regions of the host genome very difficult to assemble correctly.
Repetitive elements have typically been identified in a genome using two approaches: (1) identification of sequences homologous to known repetitive elements, and (2) identification of repeats by self-comparison of a given genome, followed by clustering of the repeats into families [19–21]. The first approach requires manually curated repetitive elements, which may not be available for newly sequenced genomes, though it can identify the precise boundaries of repetitive elements, even for embedded partial copies. The second approach identifies repetitive elements in a de novo fashion, though the boundaries of the predicted elements may require additional manual curation.
The current assembly of the Populus trichocarpa genome was released in June 2004 as version 1.1. It consists of 22,012 nucleotide sequences, covering large pieces of the 19 chromosomes and some unassembled short contigs, with a total length of 485,510,911 bp. These data were downloaded from the web site of the Populus trichocarpa genome sequencing project.
We downloaded four of the most comprehensive databases of repetitive elements in eukaryotes, RepBase version 12.05 (release of July 13, 2007), TREP version 10 (release of July 2008), RetrOryza and AtRepBase, for homology search. We also downloaded the databases RDP and Rfam, and the RNA genes in the rice RAP-DB database. The NCBI database NT containing all the non-redundant protein sequences was also downloaded for homology search.
Because many repeat identification programs [19–21] require a very large amount of memory, we implemented the RepPop database and associated tools on a 64-bit Linux operating system with 32 GB of memory. The repetitive elements with at least 2 copies in the Poplar genome were identified using RepeatScout. We then removed any repetitive elements predicted to be low complexity regions by the program NSEG or tandem repeats by the program TRF. All the programs were run using the default parameters.
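The filtering step described above can be sketched as follows. This is a minimal illustration only, not the authors' implementation: it assumes the predicted library is available as FASTA text and that the element IDs flagged by NSEG and TRF have already been parsed into two sets. The function names `parse_fasta` and `filter_library` and the example IDs are hypothetical.

```python
def parse_fasta(text):
    """Parse FASTA text into a dict mapping element ID to sequence."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].strip()
            # Use the first whitespace-delimited token as the element ID.
            name = header.split()[0] if header else ""
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


def filter_library(library, low_complexity_ids, tandem_ids):
    """Drop elements flagged as low complexity or as tandem repeats.

    Both ID collections are assumed to come from separately parsed
    tool outputs; how they are parsed is outside this sketch.
    """
    flagged = set(low_complexity_ids) | set(tandem_ids)
    return {n: s for n, s in library.items() if n not in flagged}


# Hypothetical three-element library.
example_fasta = """>R=0 predicted repeat
ACGTTGCAAGCTTGCA
GGATCC
>R=1 predicted repeat
AAAAAAAAAAAAAAAA
>R=2 predicted repeat
ACACACACACACACAC
"""
library = parse_fasta(example_fasta)
retained = filter_library(library, low_complexity_ids={"R=1"}, tandem_ids={"R=2"})
```

Keeping the flagged IDs as plain sets separates the filtering logic from the tool-specific output formats, which this sketch deliberately does not model.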
We first identified the regions of the 9,623 repetitive elements that are homologous to entries in the databases RepBase, TREP, RetrOryza and AtRepBase using NCBI Blast with an E-value cutoff of 1e-5. One region might match two homologous elements in the databases. We therefore removed the redundant annotations by keeping, among overlapping regions, only the region with the lowest E-value. A total of 226 homologous regions were identified.
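One way to read the redundancy-removal rule above is as a greedy selection: sort the hits by E-value and keep a hit only if it does not overlap a hit already kept on the same element. The sketch below illustrates that reading under stated assumptions; the `Hit` record, the function names, the treatment of the cutoff as inclusive, and the example hits are all hypothetical and not taken from the RepPop code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    query: str     # ID of the repetitive element
    subject: str   # matched database entry
    start: int     # 1-based start on the query
    end: int       # 1-based inclusive end on the query
    evalue: float


def overlaps(a, b):
    """True if two hits lie on the same element and share at least one base."""
    return a.query == b.query and a.start <= b.end and b.start <= a.end


def remove_redundant(hits, max_evalue=1e-5):
    """Keep, among overlapping hits, only the one with the lowest E-value.

    Assumption: hits are accepted when evalue <= max_evalue.
    """
    kept = []
    passing = [h for h in hits if h.evalue <= max_evalue]
    for hit in sorted(passing, key=lambda h: h.evalue):
        if not any(overlaps(hit, k) for k in kept):
            kept.append(hit)
    return sorted(kept, key=lambda h: (h.query, h.start))


# Hypothetical hits for illustration only.
hits = [
    Hit("R1", "Copia", 10, 200, 1e-30),
    Hit("R1", "Gypsy", 150, 300, 1e-10),  # overlaps the Copia hit, weaker
    Hit("R1", "hAT", 400, 500, 1e-8),     # no overlap, kept
    Hit("R2", "Mutator", 1, 50, 1e-3),    # fails the E-value cutoff
]
annotations = remove_redundant(hits)
```

Here `annotations` retains the Copia and hAT hits on R1. Other readings of the rule are possible, for example clustering overlapping hits transitively before choosing the best one, so this should be taken as one plausible interpretation.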
We then predicted 30 tRNA genes using the program tRNAscan-SE with default parameters. Eight and 40 regions homologous to the RNA genes in the databases RDP and RAP-DB, respectively, were identified using NCBI Blast with an E-value cutoff of 1e-5, after removing the redundancy as above. No homologous regions were identified based on the RNA profiles of Rfam using the program Infernal with default parameters.
A total of 2,720 regions homologous to sequences in the database NT were identified using NCBI Blast with an E-value cutoff of 1e-5, and were annotated as having the functions of the best matched homologous proteins.
Basic knowledge of the RepPop database
A Help interface is provided to familiarize users with RepPop, and it gives a detailed description of how to use the various RepPop interfaces. For users interested in identifying repetitive elements in other genomes, the Help interface also lists comprehensive databases of repetitive elements and computational programs for identifying such elements. A list of Frequently Asked Questions (FAQs) is included in the FAQ interface.
There are quite a few databases focusing on the repetitive elements in plants. RetrOryza collects 242 families of LTR retrotransposons in the rice genome, AtRepBase provides browsing and blasting interfaces for the 63 well annotated repetitive elements in the Arabidopsis genome, and TREP represents a joint community effort to collect and annotate the repetitive elements in the Triticeae genomes. All three of these databases collect a limited number of repetitive elements with well curated annotations in one or a few closely related organisms. For RepPop, we computationally identified all the families of repetitive elements and attempted to annotate them using sequence mapping. We have classified them as RNA, transposon and unknown genes, which is similar to the classification system of TREP.
RepPop is a database currently consisting of all the 9,623 predicted repetitive elements in the Populus trichocarpa genome along with functional annotations for some of them. Various search capabilities are provided in support of using this database by a large community of users. One unique feature of the database is that it allows users to add their annotations and curations to selected repetitive elements in a fashion similar to Wikipedia, which should help to rapidly increase the amount of information stored in this database.
More effort is being put into manual curation to provide more accurate annotations of the predicted repetitive elements, especially the chimeric ones. Curation by other researchers, including users, is encouraged, as discussed above, through the RepPop web site.
Project name: The repetitive elements in Populus trichocarpa genome.
Project home page: http://csbl.bmb.uga.edu/~ffzhou/RepPop/.
Operating system(s): Platform independent.
Programming languages: PHP.
License: Not required.
Any restrictions to use by non-academics: None.
RepPop: Repetitive elements in the Populus trichocarpa genome
LTR: Long Terminal Repeat
This work is supported in part by the National Science Foundation (DBI-0354771, ITR-IIS-0407204, DBI-0542119, CCF0621700), the National Institutes of Health (1R01GM075331 and 1R01GM081682), a Distinguished Scholar grant from the Georgia Cancer Coalition, and the grant for the BioEnergy Science Center (http://genomicsgtl.energy.gov/centers/center_ORNL.shtml), which is a U.S. Department of Energy Bioenergy Research Center supported by the Office of Biological and Environmental Research in the DOE Office of Science. We thank our colleagues in the Biofuel group of UGA CSBL for their comments on this work. We would also like to thank the two anonymous reviewers for their helpful and constructive comments.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.