Certain difficulties arise when defining repeat elements. One is that REs often are present as mosaics of smaller subsets of commonly occurring sequences [21, 22]. Another is that REs can often sustain considerable mutations, including large truncations and insertions. Two extreme examples of this are the truncation of the 5' end during retrotransposition, and the insertion of one RE into the middle of another. A third difficulty requiring resolution is that segmental duplication will create very long repeated sequences that do not retro-transpose together, and therefore should be broken up into their retro-transposable component parts. RECON, the software for identification of REs described by Bao and Eddy, handles all three of these difficulties [10].
Our approach was intentionally simpler. Whereas RECON aims to reconstruct full repeat elements in a way that makes the best possible additions to the RepeatMasker database and aids the study of their evolutionary history, our goal was to mask out the most commonly repeated regions of the porcine genome. The technique we found most useful in refining the definitions of the MPREs was to plot the frequency of BLAST hits as a function of position within the sequence of each putative repeat element. By the criteria used to define them, the number of hits was at least 10 across the whole sequence, but many showed a much higher hit frequency along part of their lengths. For purposes of comparison, we applied RECON to our pair-wise alignments from Step 2. RECON divided the 1,334,953 BLAST hits into 29,631 potential repeat elements that were then grouped into 14,067 families. Only 249 of these families had 10 or more elements. Note that it is possible for a family containing only one element to correspond to many BLAST hits. Rather than continue with so many families, we used our own method, which yielded a much more parsimonious set of moderately repetitive elements than we were able to generate with RECON. One difference between the two methods was that ours required a minimum copy number before families of repeat elements were formed.
The MPREs have no clear connections to known proteins. The NCBI BLASTX results for these sequences were typically a combination of description-less accessions and unrelated proteins in a variety of organisms. That remained true when the dataset was compared to the TIGR gene index for Sus scrofa [20].
The novel repeat elements were compared to known types of repeats – SINEs, LINEs and LTRs – and did not fit the definitions for those classes of repeat elements. Because RepeatMasker masks out low-complexity regions, the methods used here would not initially find the tail ends of LTRs. Each MPRE was therefore tested for nearby low-complexity regions, and none were consistently found. One of the characteristics of SINEs is the presence of tRNA coding sequence in their 5' regions [23, 24]. Only MPRE58 had a region similar to tRNA, and that was in the middle of its sequence. LINEs are best characterized by their two ORFs, one coding for a reverse transcriptase and the other for a protein with RNA binding activity [6]. All the MPREs were translated to potential proteins and compared to a comprehensive database (NCBI BLASTX). None of the results were similar to the possible translations of a LINE.
Counting repeat elements is challenging
Because of the degeneracy of repetitive elements it is difficult to arrive at an accurate count in the target genome. Another difficulty in the quantification of repeat elements is that REs are often composed of smaller repeat units that occur more frequently than the larger unit [21, 22].
To characterize the prevalence of MPREs, we turned to an independent data set, the Sanger BAC-ends from the CHORI-242 library archived at Ensembl [19]. Table 1 lists three different measures of the prevalence of MPREs within these BAC-ends. The first measure (BLAST hits to BAC-ends) gives the number of hits that were at least half the length of the repeat element. An issue here is the typical size of the traces, an average of 1219 bp: the longer REs will tend to be under-counted because of edge effects in the trace archive. The next two measures of count number were calculated by plotting the number of BLAST hits as a function of position on the RE. Some of the resulting plots were smooth and flat across most of the RE, with an expected drop-off near each end. For these "regular" plots the count number was the average value of the middle 90% of the plot amplitude. Other plots varied considerably in amplitude across the RE. This was likely due to sub-repeats that hit in areas of the genome that the whole repeat did not; in this measure of count number there was no lower limit to the size of a hit other than that needed to bring the expectation value below 0.1. These plots were considered "irregular", and their count number was taken to be the smallest value on the plot after ignoring the first and last 10% of the plot. A few plots were only mildly irregular; for those, both the regular and irregular algorithms were used, and both numbers are reported in Table 1.
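The two plot-based count numbers can be illustrated with a minimal Python sketch. It is not the code used in this study. It assumes that each BLAST hit is given as a 1-based, inclusive (start, end) interval on the RE, and that "the middle 90%" means trimming 5% of positions from each end; the function names and the toy hits are hypothetical.

```python
from typing import List, Tuple


def coverage(length: int, hits: List[Tuple[int, int]]) -> List[int]:
    """Number of hits covering each position of an RE of the given length.

    Hits are assumed to be 1-based inclusive (start, end) pairs on the RE;
    reversed coordinates (minus-strand hits) are normalised.
    """
    diff = [0] * (length + 1)          # difference array, one extra slot
    for start, end in hits:
        lo = max(1, min(start, end))
        hi = min(length, max(start, end))
        if lo > hi:                    # hit lies entirely outside the RE
            continue
        diff[lo - 1] += 1              # depth rises at the hit start
        diff[hi] -= 1                  # and falls just past the hit end
    depth, running = [], 0
    for step in diff[:length]:
        running += step
        depth.append(running)
    return depth


def regular_count(depth: List[int]) -> float:
    """Mean depth over the middle 90% of positions (assumed: 5% trimmed per end)."""
    trim = int(len(depth) * 0.05)
    core = depth[trim:len(depth) - trim]
    return sum(core) / len(core)


def irregular_count(depth: List[int]) -> int:
    """Smallest depth after ignoring the first and last 10% of positions."""
    trim = int(len(depth) * 0.10)
    core = depth[trim:len(depth) - trim]
    return min(core)


# Toy example: three full-length hits plus two hits to an internal sub-repeat.
toy_hits = [(1, 100)] * 3 + [(41, 60)] * 2
toy_depth = coverage(100, toy_hits)
toy_regular = regular_count(toy_depth)
toy_irregular = irregular_count(toy_depth)
```

In the toy example the sub-repeat inflates the regular estimate, whereas the irregular estimate stays at the depth of the full-length hits, which is the behavior the irregular algorithm is meant to provide.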
Comparing the novel repeat element content across genomes
The sequences of novel porcine repetitive elements listed here were compared (BLAST [25]) to a recent build of the complete cow genome (AAFCO2 from [26]) as well as against the mouse and human genomes. In the case of mouse, there were no significant similarities found. The comparison to the human genome yielded only one significant hit – a 37 bp long section of MPRE17 (870 bp long) matched once in chromosome 9 thousands of bp away from any annotated features. The comparison to the cow genome yielded a variety of results. Five of the 27 MPREs did not hit at all (MPREs 6, 22, 28, 50 and 60), and three others (MPREs 44, 49 and 61) had ten or fewer hits (Table 1), despite the fact that the cow genome contains ten times more sequence than the collection of porcine BAC-ends tested. Fourteen of the 27 MPREs appeared frequently in cow as well as pig, as indicated by having at least 1000 BLAST hits to the cow genome.
Not surprisingly, the bovine hits tend to be shorter than the porcine hits: the MPREs were defined from pig sequence and as such would be expected to be more intact in pig. What is interesting is that in both species the endpoints of the hits have a strong tendency to line up at particular positions in the MPRE, as shown in Figure 1 using MPRE12, 15, 17, 41, 51 and 58 as examples. Sometimes the common endpoints are the same in both species, sometimes not. This could be a result of the repeat elements being composed of smaller repeat elements, not all of which have the same frequency of occurrence in either genome. The longer MPREs often had more than one sub-region with multiple extra hits. This, too, could be evidence of internal repeat structure.
Figure 2 shows that MPRE55 occurs in both swine and cattle in orthologous loci. The pig BAC lies along the x-axis, and the cow BAC lies along the y-axis. Also plotted are line segments of high similarity between the two BACs. The preponderance of these segments demonstrates little genomic rearrangement between species, which indicates that these are orthologous regions of likely common ancestry between the two species. This region is highly similar to the human contig NT_005403.16 and the locus of MPRE55 corresponds to the 3' UTR of the model gene LOC643405, which codes for a protein similar to TGF-beta induced apoptosis protein 2.
Because the collection of BACs spans only 1% of the whole pig or cow genome, we cannot rule out the possibility that all of the MPREs have at least one orthologous location in both species. The fact that 12 MPREs did not have BLAST hits in any of the cow BACs makes it seem likely that those 12 are relatively recent evolutionary occurrences. Of the 10 MPREs that appear most frequently in the cow, only two, MPRE55 and MPRE59, were observed to appear in orthologous locations among the tested set of fully-sequenced BACs.
A phylogenetic analysis was performed on the different integration sites of MPRE55 from both the cow and pig BAC libraries using ClustalX (see Additional file 2 for the sequences), and the output (Additional file 3) was then input into R [27] to create Figure 3. The sequences that occurred at orthologous locations in swine and cattle are highlighted. As expected, the pig branches and cow branches tend to be separate. It is notable that the most similar sequences that occur in both species do not come from orthologous locations, but seem to be found in loci that originated after the cow and pig ancestral lines diverged. The evolutionary distance between two sequences is represented by the sum of horizontal distances that one must travel along the tree to connect them. The leftmost part of that path represents a common ancestor. It is not surprising that the two sequences at the orthologous locus have individually diverged a significant amount from the original sequence of the common ancestor at that locus. The more surprising result is that some of the pig and cow sequences are more similar to each other than the sequences at the oldest loci. Coincidental convergence is an unlikely explanation. A more likely explanation is that enough copies of the old sequence were created that some of them experienced much less mutation than the diverged sequences at the ancestral locus. The most recent common ancestors (MRCAs) occurred in a narrow window of evolutionary time relative to the full extent of the tree (< 1/5 of the distance from the root to the most peripheral branch). The MRCA among the orthologous sites occurred within the same time frame as the other MRCAs. The tree clearly shows considerable radiation following speciation, as evidenced by large genetic distances from the MRCAs to the peripheral tips.
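The "sum of horizontal distances" between two tips can be sketched as a short calculation over a rooted tree: branch lengths are summed from each tip up to their most recent common ancestor. The Python below is an illustration only. The tree, node names and branch lengths are invented for the example and are not taken from Figure 3, and the function name is hypothetical.

```python
from typing import Dict, Tuple

# Hypothetical rooted tree: child -> (parent, branch length to parent).
# These values are made up for illustration and do not come from Figure 3.
TOY_TREE: Dict[str, Tuple[str, float]] = {
    "old_locus": ("root", 0.10),
    "young_locus": ("root", 0.20),
    "pig_orth": ("old_locus", 0.30),
    "cow_orth": ("old_locus", 0.25),
    "pig_new": ("young_locus", 0.05),
    "cow_new": ("young_locus", 0.04),
}


def path_distance(parent: Dict[str, Tuple[str, float]], a: str, b: str):
    """Return (distance, mrca): the summed branch lengths connecting a and b."""
    # Record the distance from tip a to each of its ancestors (and itself).
    dist_from_a: Dict[str, float] = {}
    node, dist = a, 0.0
    while True:
        dist_from_a[node] = dist
        if node not in parent:         # reached the root
            break
        node, length = parent[node]
        dist += length
    # Climb from tip b until a node on a's path to the root is reached.
    node, dist = b, 0.0
    while node not in dist_from_a:
        node, length = parent[node]
        dist += length
    return dist_from_a[node] + dist, node


d_orth, mrca_orth = path_distance(TOY_TREE, "pig_orth", "cow_orth")
d_new, mrca_new = path_distance(TOY_TREE, "pig_new", "cow_new")
```

With these invented lengths, the pig and cow copies at the younger, non-orthologous loci are closer to each other than the copies at the older orthologous locus. That is the pattern described for MPRE55, but here it is built into the toy numbers and is not evidence for it.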
A closer look at MPRE61
Allelic differences or SNP can be identified in cases where MPRE61 sites coincide with overlaps among CVSI BACs. MPRE61 sites coincide with three pairs of overlapping BACs: pair 1 (AC145413 and AC144901), pair 2 (AC139879 and AC140099) and pair 3 (AC146932 and AC087424). In addition, an MPRE61 site coincided with a group of three overlapping BACs (AC138784, AC138788 and AC138786). Overlapping BAC pair 2 had two single-base differences, and pair 3 had three single-base differences and one 43 bp insertion/deletion. No sequence differences were observed within MPRE61 for pair 1 or for the group of three overlapping BACs.
To put the apparent allelic diversity rates into context, we examined the genetic sources of the DNA used to construct the BAC library (RPCI-44). The source of DNA for RPCI-44 was a pooled sample with equal contributions from 4 male crossbred pigs, each composed of 3/8 Landrace, 3/8 Yorkshire and 1/4 Meishan [28]. The probability of identifying SNP increases with the diversity of genomes sampled. For the cases of 2 overlapping BACs, the probability of sampling different genomes is 87.5%, different breeds is 65.7%, and one BAC of western (Landrace or Yorkshire) origin and the other of Meishan origin is 37.5%. The probability of sampling diverse genomes is higher for the case of 3 overlapping BACs: the probability of sampling more than one genome is 98.4%, more than one breed is 87.9%, and at least one BAC of western origin combined with at least one BAC of Meishan origin is 56.25%. The fact that we did not observe SNP in one of the three pairs of overlapping BACs is not that unusual, given that the probability of sampling identical genomes in at least one of the 3 pairs of overlapping BACs is 33% (1 − 0.875^3). On the other hand, the fact that we did not observe SNP within the group of 3 overlapping BACs is unexpected, given the relatively high probabilities of diverse genomes being sampled.
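These probabilities can be reproduced with exact fractions under a simple sampling model, which is our reading of the text and is stated here as an assumption. The 4 diploid pigs contribute 8 equally likely source genomes, each BAC is drawn independently, and the breed of origin at a locus follows the breed fractions (3/8, 3/8, 1/4). The function names in this sketch are hypothetical.

```python
from fractions import Fraction

# Assumed model: 4 diploid pigs -> 8 equally likely source genomes per BAC,
# with independent draws for each overlapping BAC.
N_GENOMES = 8
BREED_SHARE = {
    "Landrace": Fraction(3, 8),
    "Yorkshire": Fraction(3, 8),
    "Meishan": Fraction(1, 4),
}


def p_more_than_one_genome(k: int) -> Fraction:
    """P(k overlapping BACs do not all come from the same source genome)."""
    return 1 - Fraction(1, N_GENOMES) ** (k - 1)


def p_more_than_one_breed(k: int) -> Fraction:
    """P(k overlapping BACs do not all share one breed of origin)."""
    return 1 - sum(share ** k for share in BREED_SHARE.values())


def p_western_and_meishan(k: int) -> Fraction:
    """P(at least one western-origin and at least one Meishan-origin BAC)."""
    meishan = BREED_SHARE["Meishan"]
    return 1 - meishan ** k - (1 - meishan) ** k


pairs = (p_more_than_one_genome(2), p_more_than_one_breed(2), p_western_and_meishan(2))
triples = (p_more_than_one_genome(3), p_more_than_one_breed(3), p_western_and_meishan(3))

# P(at least one of 3 independent BAC pairs sampled identical genomes).
p_some_pair_identical = 1 - p_more_than_one_genome(2) ** 3

for label, values in (("2 BACs", pairs), ("3 BACs", triples)):
    print(label, [f"{float(v):.2%}" for v in values])
print("some pair identical", f"{float(p_some_pair_identical):.1%}")
```

Under this model the pair values are 7/8, 21/32 and 3/8, and the triple values are 63/64, 225/256 and 9/16. These match the figures quoted above to within rounding (21/32 is 65.625%), and the final quantity is 169/512, or about 33%.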
To bolster the relatively small number of distinct MPRE61 loci (7) identified in the CVSI BACs, we further investigated the prevalence and diversity of MPRE61 by cloning and sequencing PCR amplification products derived from 16 pigs sampled from 10 breeds (Berkshire, Chester White, Duroc, Hampshire, Landrace, Meishan, Pietrain, Poland China, Spot, and Yorkshire). We used primers designed to match the highly conserved parts of MPRE61 to amplify and clone (see Methods for details) multiple and variable loci for the RE that are differentiable by size as well as sequence. Amplification products from the different breeds showed indistinguishable smears on denaturing PAGE gels, containing many different fragment sizes. Too many fragments and too many sizes were present to identify allelic differences in size among animals. The PCR products were sequenced to yield 91 reads that were not bacterial or vector contamination. The 91 sequences (listed as a FASTA file in Additional file 4) were analyzed with ClustalX (creating a dendrogram file, Additional file 5) and displayed in Figure 4 as a phylogenetic tree. The topology of the tree (number of diverse nodes) is consistent with the estimated copy number of 300 sites in the whole genome given in Table 1. We speculate that the more similar sequences, represented as tips close to their common ancestor (with few sequence differences), are probably allelic differences at the same locus. On the other hand, the more diverse tips and peripheral nodes probably represent different sites or loci. The amount of sequence diversity presented in Figure 4 supports the idea that individual integration sites (loci) and alleles of repetitive elements can be uniquely identified by high-throughput array-based assays that hybridize samples to short probes. This demonstrates that repetitive DNA with similar properties to MPRE61 (i.e., prevalence and diversity) can be harnessed for genetic and physical mapping [29].
This dispels the long-standing myth that repetitive DNA should always be avoided because it is intractable. Our results indicate that some classes of repetitive DNA (those with low to intermediate copy number and high diversity) would be tractable with high-throughput technologies.
MPRE61 size differences are not randomly distributed throughout the phylogenetic tree. Different sizes cluster on different branches of the tree; however, the clustering is not complete. This indicates that insertions and deletions (evolutionary events that cause size differences) occurred throughout the evolution of MPRE61, and in some cases while the element was still replicating. The incomplete clustering of sizes indicates evolutionary plasticity and as a result recurrent insertions and deletions.
MPRE61 was further characterized by plotting its BLAST hits to the 275,595 sequences in the trace archive of BAC-ends submitted by Sanger. These were plotted along with the repeat elements recognized by RepeatMasker. One of the most interesting observations was that in three of the 140 hits a PRE1 was incorporated into MPRE61. PRE1 is a porcine-specific SINE that is included in the RepeatMasker library. Several other examples existed of PRE1 next to a section of MPRE61, but the trace end occurred next to the PRE1, so that it may or may not have had the continuing section of MPRE61 on its other side. No other REs were found to be incorporated into MPRE61, suggesting that MPRE61 replicated relatively recently. Another interesting observation was that the density of REs on the 3' side of MPRE61 was much higher than on the 5' side. To take a closer look at this, we collected the trace sequence 3' of the 62 hits that ended near (within 60 bp of) the 3' end of MPRE61 (length of 1059 bp). This flanking sequence, ranging in length from 12 to 1368 nucleotides, was analyzed for repeat content and distance of that content from the end of MPRE61 (Figure 5). Running RepeatMasker on the entire collection (275,595 sequences) of Sanger BAC-ends shows that the number of SINE elements is 75% greater than the number of LINE elements (203,206 vs. 116,107). The LINEs tend to be longer than the SINEs, so the total percentage of sequence occupied by the LINEs is actually larger (13.29% vs. 10.29%). The most obvious feature of Figure 5 is that LINEs are significantly overrepresented on the 3' side of MPRE61, particularly in the region closest to the end of MPRE61. For the 22 LINEs that occur within 80 bp of the end of MPRE61, 15 are oriented on the opposite strand and 7 on the same strand. At this point, there is no way to know which strand of MPRE61 might be transcribed. We arbitrarily chose one of the strands and used it consistently.
Because the LINEs have a particular internal structure, the 5' and 3' ends are well defined. So another way of looking at the result would be to say that the LINEs occur on the 5' end of MPRE61 (or rather, its reverse complement) with 15 on the same strand and 7 on the opposite strand. Either way, there is less strand conservation than would be expected if MPRE61 used LINEs as a vehicle for either replication or integration.
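As a rough illustration of how weak the strand preference is, the 15 versus 7 split can be compared with a 50:50 expectation using an exact two-sided binomial calculation. This check is an assumption-laden sketch, not part of the analysis reported here: it treats the 22 LINE orientations as independent and takes random orientation (p = 0.5) as the null model. The function name is hypothetical.

```python
from math import comb


def binomial_two_sided_p(successes: int, trials: int) -> float:
    """Exact two-sided p-value against p = 0.5 (symmetric null distribution).

    Sums the probabilities of all outcomes at least as far from trials / 2
    as the observed count.
    """
    observed_dev = abs(successes - trials / 2)
    total = sum(
        comb(trials, k)
        for k in range(trials + 1)
        if abs(k - trials / 2) >= observed_dev
    )
    return total / 2 ** trials


# 15 of the 22 nearby LINEs lie on the opposite strand, 7 on the same strand.
p_value = binomial_two_sided_p(15, 22)
print(f"two-sided p = {p_value:.4f}")
```

Under these assumptions the p-value is about 0.13, so the observed split is compatible with random orientation. This is in line with the interpretation that strand is not strongly conserved.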