The rationale behind adapting 454 sequencing to repeat profiling in complex plant genomes is that it provides efficient sequence sampling from a high number of independent genomic loci. The amount of generated sequence data is large enough to include multiple reads from highly repeated elements, thus allowing evaluation of their abundance and sequence composition. However, using 454 sequencing for repeat quantification and reconstruction requires that the template sampling is random and that the sequencing does not introduce a bias towards certain sequences. In this study, we tested these requirements by comparing repeat copy number estimates obtained using experimental approaches [20–24] with estimates based on the frequencies at which these repeats occur in 454 reads. The values were within a two-fold range for most of the repeats, and they never differed by more than 2.8-fold. The observed discrepancies can be explained by inherent limitations of both analytical approaches. The experimental quantification was based on DNA-DNA hybridization, which may bias estimates if the probe fragment spans sequence regions differing in genomic abundance. In that case, the hybridization signal is primarily determined by only a part of the sequence, but it is treated as representative of the whole probe in subsequent calculations. This is not the case for the calculations of genome representation based on sequence coverage by 454 reads, which are performed separately for each nucleotide of the sequence in question. On the other hand, the sensitivity and specificity of the sequence similarity searches employed for this analysis can be affected by the algorithm and the similarity threshold values used. Thus, taking these limitations into account, we consider the experimental and 454 data to be in good agreement.
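The per-nucleotide calculation mentioned above can be sketched as follows. This is a minimal illustration and not the pipeline used in this study: the function names, the half-open interval representation of read hits, and the toy numbers are assumptions made for the example. Read depth at each position of a reference repeat is divided by the fraction of the genome that was sampled (for instance 0.0077 for 0.77% coverage) to give an approximate copy number per haploid genome at that position.

```python
def per_base_depth(ref_len, hits):
    """Count how many read hits cover each position of a reference repeat.

    hits: iterable of (start, end) intervals on the reference,
          0-based and half-open (hypothetical input format).
    """
    diff = [0] * (ref_len + 1)
    for start, end in hits:
        start = max(0, start)
        end = min(ref_len, end)
        if start < end:
            diff[start] += 1   # coverage begins here
            diff[end] -= 1     # coverage stops just before 'end'
    depth = []
    running = 0
    for i in range(ref_len):
        running += diff[i]
        depth.append(running)
    return depth


def copies_per_genome(depth, genome_coverage):
    """Convert per-base read depth into estimated copies per haploid genome.

    genome_coverage: sequenced bases divided by genome size (a fraction).
    """
    if genome_coverage <= 0:
        raise ValueError("genome_coverage must be positive")
    return [d / genome_coverage for d in depth]


if __name__ == "__main__":
    # Toy data: three read hits on a 10-bp reference, 0.77% genome coverage.
    depth = per_base_depth(10, [(0, 5), (3, 10), (3, 8)])
    print(depth)
    print([round(c) for c in copies_per_genome(depth, 0.0077)])
```

Because the estimate is kept per position, regions of a probe that differ in genomic abundance appear as different values rather than being averaged into a single number.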
Reconstruction of repetitive element sequences from 454 reads is a difficult task, complicated by the short length of the reads and the considerable sequence diversity of individual genomic copies of a repeat. A similar problem has been successfully addressed by Li and co-workers, who aimed at the recovery of ancestral sequences for rice mobile elements from whole genome shotgun sequences. They developed an algorithm based on short oligomer (K-mer) frequency analysis for repeat identification and reconstruction. However, the program implementing this analysis was designed for processing conventional sequence reads of at least several hundred nucleotides in length and could not be adapted to work with the short 454 sequences. We therefore used a different strategy, employing sequence similarity-based clustering of the reads followed by their assembly into contigs representing reconstructed fragments of the genomic repeats. Although the TGICL program package used to perform this analysis was originally designed for clustering ESTs, it provides a number of customizable parameters, which after proper adjustment gave the desired performance with our data. It should be noted, however, that even with these settings, most repeats could not be reconstructed as a single contig spanning their full-length sequence and including most of the sequence reads. This is mainly due to the occurrence of multiple subfamilies of the repeats in the genome and the presence of poorly conserved repeat regions. In contrast, the highly conserved rDNA repeats could be reconstructed as a single contig, and their consensus was in excellent agreement with the sequences obtained by conventional cloning and sequencing of pea rDNA fragments.
Instead of performing direct contig assembly from all 454 reads, we preceded the assembly with a clustering step, which partitioned the read collection into groups of overlapping sequences. In addition to reducing the computational complexity of the assembly step, this approach also allowed contigs to be classified according to their cluster of origin. In principle, multiple contigs resulting from the assembly of reads from the same cluster should represent overlapping fragments and sequence variants of the same repeat family. Whereas this was true for many repeat families, there were also clusters including reads from several unrelated repeats. This is likely due to reads that contain parts of two different repeats and therefore act as a bridge joining groups of unrelated sequences during the transitive closure clustering procedure. Such reads can, for example, originate from insertion sites of mobile elements, which are numerous in the genome and often located within other repetitive sequences. This assumption is supported by our results, where the problem occurred in the largest cluster (CL1), which originated from at least five different families of retroelements. Although such clusters can subsequently be broken into smaller sets of overlapping contigs (see Methods), we plan to avoid this problem in the future by developing algorithms that identify such "hybrid" reads and eliminate them from the clustering procedure.
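The bridging effect described above can be illustrated with a small sketch of transitive closure (single-linkage) clustering. This is a generic union-find implementation written for illustration only; it is not the TGICL code, and the read names and overlap pairs are hypothetical. Two groups of reads from unrelated repeats stay separate until a single read overlapping both groups is included, at which point everything collapses into one cluster.

```python
def cluster_reads(read_ids, overlaps):
    """Group reads by transitive closure of pairwise overlaps (union-find).

    read_ids: iterable of read names.
    overlaps: iterable of (read_a, read_b) pairs judged to be similar.
    Returns clusters as sorted lists, largest cluster first.
    """
    read_ids = list(read_ids)
    parent = {r: r for r in read_ids}

    def find(x):
        # Follow parent links to the cluster root, shortening the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in overlaps:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two clusters

    clusters = {}
    for r in read_ids:
        clusters.setdefault(find(r), set()).add(r)
    return sorted((sorted(c) for c in clusters.values()),
                  key=lambda c: (-len(c), c))


if __name__ == "__main__":
    reads = ["a1", "a2", "a3", "b1", "b2", "h1"]
    within_family = [("a1", "a2"), ("a2", "a3"), ("b1", "b2")]
    # h1 is a hypothetical "hybrid" read spanning an insertion site of
    # repeat A inside repeat B, so it overlaps reads from both families.
    bridging = [("a3", "h1"), ("h1", "b1")]

    print(cluster_reads(reads, within_family))
    print(cluster_reads(reads, within_family + bridging))
```

Dropping reads such as `h1` before clustering, as proposed above, would keep the two families in separate clusters.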
Our results have shown that low-depth genome sequencing using massively parallel technology provides sufficient sequence data for comprehensive repeat characterization even in a relatively large plant genome. Compared to the only other study on this topic, which employed 454 sequencing for repeat analysis in soybean, the pea 454 sequences used here provided considerably smaller genome coverage (0.77% vs. 7% in soybean) due to the 4-fold difference in genome size between these species and the smaller reaction scale used in the pea sequencing. Still, it was possible to characterize repeats constituting 35–48% of the pea genome and including all major classes of repetitive DNA. On the other hand, considering the estimated 75–97% proportion of repeats in the genome [5, 6], a relatively large fraction of the repeats remained uncharacterized. Reassociation kinetics studies of pea genomic DNA as well as observations from other species indicate that this fraction includes diverged, low-copy remnants of ancient repeats ("fossil repeats"). Such repeats are below the sensitivity limit of our analysis due to their high sequence variability and low copy numbers.
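The effect of copy number on this sensitivity limit can be made concrete with a rough calculation. The sketch below is an illustration under simplifying assumptions that are not part of the original analysis: reads are sampled uniformly at random, all copies of a repeat are identical, and the number of reads covering a given position follows a Poisson distribution. The function names are hypothetical.

```python
import math


def expected_depth(copies, genome_coverage):
    """Expected number of reads covering one position of a repeat.

    copies: copy number of the repeat per haploid genome.
    genome_coverage: sequenced bases divided by genome size (a fraction).
    """
    return copies * genome_coverage


def prob_uncovered(copies, genome_coverage):
    """Poisson probability that a given repeat position is hit by no read."""
    return math.exp(-expected_depth(copies, genome_coverage))


if __name__ == "__main__":
    coverage = 0.0077  # 0.77% genome coverage, as in the pea data set
    for copies in (100, 1000, 10000):
        print(copies,
              round(expected_depth(copies, coverage), 2),
              round(prob_uncovered(copies, coverage), 3))
```

Under these assumptions, a repeat present in 100 copies has an expected depth below one read per position, with a probability of roughly 0.46 that a given position is not sampled at all, whereas a repeat with 10,000 copies is covered about 77 times. Sequence divergence among copies, which this sketch ignores, lowers the effective depth further.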
As in most higher plants studied so far [34–36], LTR-retrotransposons were found to be the major component of pea repetitive DNA. Ty3/gypsy elements were present in twice as many copies as Ty1/copia and constituted an even larger portion of the genome (24–39%, vs. 5% spanned by Ty1/copia) owing to their much longer element sequences. The prevalence of Ty3/gypsy elements over other groups of retroelements has been observed in other plant genomes, including rice and Vicia sp., and their differential proliferation substantially contributed to the genome size variation among related species. In pea, most of the Ty3/gypsy sequences were classified as Ogre-like retrotransposons, a distinct evolutionary lineage of giant elements occurring in a range of dicot plants including genera of Leguminosae, Solanaceae, and Salicaceae. Ogre elements were found to play an important role in the genome evolution of Vicia, a genus closely related to Pisum. They were differentially amplified in individual species, with the highest abundance in V. pannonica, where their recent expansion to 10^5 copies/1C increased the genome size by more than 50%. In contrast to V. pannonica, the Ogre population in pea is less homogeneous and occurs as several distinct subfamilies differing in their sequences. This suggests that the evolutionary history of Ogre elements in pea was more complex and included both amplification and diversification of the elements. Although they are the most abundant, Ogre elements are probably not the only Ty3/gypsy elements with a significant impact on pea genome evolution. For example, Peabody elements were found to be highly conserved in their nucleotide sequences, implying their recent amplification.
Compared to Ty3/gypsy elements, Ty1/copia elements represented a much smaller portion of the genome but occurred in a larger number of different families. Intraspecific heterogeneity of the Ty1/copia population, resulting from the presence of divergent families, has been reported in a number of other species. Interestingly, these families are well conserved across different taxa in spite of their ancient origin before the divergence of monocots and dicots. This is also true for the pea Ty1/copia sequences, which in some cases show high similarity to elements from phylogenetically distant species (Additional file 5). A typical example is the most abundant family, Ps-copia-1/751, which shows strong similarity to the monocot elements RIRE-1 and BARE; moreover, the high proportion of solo-LTRs derived from Ps-copia-1/751 suggests its long presence in the pea genome.
Whereas the general composition of dispersed repeats represented by various groups of mobile elements resembled that of other plants with complex genomes, our analysis revealed a surprising diversity of tandem repeats in the pea genome. In addition to the previously described PisTR-A and PisTR-B repeats, thirteen novel families of abundant tandem repeats showing genomic organization typical of satellite DNA have been identified. This contrasts with most plants studied so far, for which only a single or a few satellites are known. However, whether this is a specific feature of the pea genome or simply a consequence of highly efficient tandem repeat identification employing 454 data remains to be seen once more species have been analyzed using this technology. Nevertheless, the availability of such a rich set of satellite repeats differing in monomer length, sequence, and chromosomal localization makes pea an attractive model for studying this type of repeated DNA. For example, our previous investigation of PisTR-B repeats using COD-FISH revealed a uniform orientation of its monomers with respect to telomeres at most subtelomeric loci. Extending this study to other satellite families should show whether this is a general feature of the satellite arrangement at pea chromosome termini. Moreover, the wealth of sequence data obtained in this study will allow detailed characterization of the sequence variability of individual families and testing whether it correlates with the chromosomal localization of the repeats, as was shown for other species. Yet another interesting question concerns the possible lack of a satellite repeat conserved among pea centromeres. Although all pea centromeres seem to contain satellite DNA (Table 2), no family of the newly identified tandem repeats occupies all centromeres, as is common in most plant species characterized so far [44, 45]. This might suggest either that the genuine centromeric satellite has not been identified in our sequences or that the centromeric sequences in pea underwent less extensive homogenization among non-homologous chromosomes.
In addition to the Arabidopsis-type telomeric repeats, two other variants of telomeric minisatellite sequences were identified in the pea genome. Although they were both localized at chromosome termini along with the Arabidopsis-type sequences, their origin and role in telomere maintenance are unclear. The occurrence of mixed minisatellite telomeric motifs has been reported from several plants and could be attributed to low fidelity of telomerase. The relatively small number of 454 reads containing telomeric repeats did not allow us to perform a thorough investigation of their variability; however, there were several reads including imperfect or mixed repeat motifs, which could support this hypothesis (data not shown). On the other hand, both alternative repeats were also found to form arrays spanning whole read lengths, which indicates that at least some of them are arranged in longer homogeneous arrays.