Low-pass shotgun sequencing of the barley genome facilitates rapid identification of genes, conserved non-coding sequences and novel repeats
© Wicker et al; licensee BioMed Central Ltd. 2008
Received: 22 May 2008
Accepted: 31 October 2008
Published: 31 October 2008
Barley has one of the largest and most complex genomes of all economically important food crops. The rise of new short read sequencing technologies such as Illumina/Solexa permits such large genomes to be effectively sampled at relatively low cost. Based on the corresponding sequence reads a Mathematically Defined Repeat (MDR) index can be generated to map repetitive regions in genomic sequences.
We have generated 574 Mbp of Illumina/Solexa sequences from barley total genomic DNA, representing about 10% of a genome equivalent. From these sequences we generated an MDR index which was then used to identify and mark repetitive regions in the barley genome. Comparison of the MDR plots with expert repeat annotation drawing on the information already available for known repetitive elements revealed a significant correspondence between the two methods. MDR-based annotation allowed for the identification of dozens of novel repeat sequences, though, which were not recognised by hand-annotation. The MDR data was also used to identify gene-containing regions by masking of repetitive sequences in eight de-novo sequenced bacterial artificial chromosome (BAC) clones. For half of the identified candidate gene islands indeed gene sequences could be identified. MDR data were only of limited use, when mapped on genomic sequences from the closely related species Triticum monococcum as only a fraction of the repetitive sequences was recognised.
An MDR index for barley, which was obtained by whole-genome Illumina/Solexa sequencing, proved as efficient in repeat identification as manual expert annotation. Circumventing the labour-intensive step of producing a specific repeat library for expert annotation, an MDR index provides an elegant and efficient resource for the identification of repetitive and low-copy (i.e. potentially gene-containing sequences) regions in uncharacterised genomic sequences. The restriction that a particular MDR index can not be used across species is outweighed by the low costs of Illumina/Solexa sequencing which makes any chosen genome accessible for whole-genome sequence sampling.
Within the family of Poaceae, the tribe of the Triticeae contains some of the world's most important food crops such as wheat and barley. Despite their economic importance, efforts to produce large amounts of genomic DNA sequences have been dragging along slowly due to the enormous complexity and the high repeat content of the Triticeae genomes. Barley, for example, has a genome size of 5,500 Mbp , almost twice the size of the human genome, which is composed of over 80% of repetitive DNA . Transposable elements (TEs) make the major part of the repetitive fraction, some TE families can reach thousands of copies. The most abundant TE family, BARE, is e.g. present in 80'000 to 120'000 copies, contributing about 9.6% to the total genomic sequence [3–5].
Such vast amounts of repetitive DNA hinder the efficient large scale sequencing and also make sequence analysis much more complicated and time-consuming. Identification and characterisation of transposable elements have, therefore, been an important aspects of Triticeae genomics from the beginning, and the public database TREP was dedicated to host and document annotated repeat sequences from Triticeae .
Triticeae researchers are facing a double challenge in genome sequencing. First, an adequate amount of genomic sequences needs to be produced to reasonably cover these huge genomes and second, at the same time methods to efficiently analyse the accumulated large amount of data need to be developed. While new second generation sequencing technologies such as Illumina/Solexa 1G, Roche/454 GS FLX or ABI SOLiD provide the opportunity to produce genomic sequences at ever-increasing speed and decreasing costs , this step change in technology entails the need for efficient tools and strategies to rapidly narrow down and to accurately describe DNA regions of interest in an ocean of unorganised sequence data.
Mathematically defined repeats (MDRs) are determined by measuring the frequency of k-mer (e.g. a 20-nucleotide oligomer) matches within a sequence dataset. MDRs offer the advantage of requiring no prior knowledge of structurally-defined repetitive elements to identify repeats within a genome. They have been previously used to aid genome assembly by masking repetitive sequences [8, 9], to identify TEs  and reconstruct ancestral TE families . However the full capabilities of MDR analysis can be difficult to achieve due to the high memory requirements of previously implemented data structures (e.g). The accompanying article  introduced a highly efficient tool, Tallymer, to analyze k-mers within the massive sequence datasets available from maize. Tallymer uses an enhanced suffix array data structure that offers reduced hard disk space requirements and processing times compared to suffix trees, permitting quick processing of very large sequence sets.
Repeat masking is an efficient way to narrow down the amount of sequences that may contain genes. This process basically searches the sequence of interest for and flags all regions that show homology to known TEs and other repetitive sequences. The bases in the repeat regions are then replaced either by "X" or "N" (hard masking) or lowercase letters (soft masking). Examples for publicly available repeat masking software are RepeatMasker  or CENSOR . Obviously, the repeat masking process is only as complete as the repeat database it is based upon. Low-copy repeats and variable regions within TE families are usually not recognised.
In this study, we present a practical application of short-read sequencing to the large and complex genome of barley. We raised in a single run of Illumina/Solexa 1G sequencing almost 10% of the sequence information present in the haploid barley genome equivalent. The raw sequences were used to generate an MDR index and mapped to publicly available genomic sequences in order to identify repetitive and low-copy sequences. The so-called MDR plots allowed an efficient exclusion of repetitive sequences and easy identification of gene-containing space. The MDR plots were compared with results of careful expert re-annotations of the same sequences, showing a large overlap but also the complementary value of both approaches. It demonstrated that expert annotation of Triticeae repetitive DNA has already reached a high level of accuracy based on a very limited amount of available Triticeae genomic sequences but the MDR data still led to the identification of several, novel TE sequences.
DNA sequencing – Sanger BAC DNA sequencing
Eight barley bacterial artificial chromosome (BAC) clones originating from the library of cultivar Morex  were sent to commercial Sanger shotgun sequencing (AGOWA, Berlin, Germany). Each clone was sequenced at two- to three-fold coverage considering an average BAC insert size of 110 kb. Sequence data was further processed by the program PHRED  and assembled using the program GAP4 (Roger Staden, MRC Cambridge). Contig consensus sequences were exported into FASTA-flatfiles sorted according to contig length without considering their true physical order in the respective BACs.
DNA sequencing – high-throughput short read shotgun sequencing
Genomic DNA was extracted from barley cultivar Morex as previously described . The DNA was adjusted to a concentration of 1 μg/μl and sent for commercial short read sequencing on the Illumina Genome Analyzer platform (GATC Biotech, Konstanz, Germany). Genomic DNA was fragmented and a size fraction of around 300 bp was selected for sequence sample preparation according to manufacturer's instructions. 15,950,203 individual reads each of 36 nt in length were exported out of the raw data file obtained from 6 lanes of a flowcell in a single sequencing run by applying the default system settings for quality check. This amounted to 574 Mb whole genome shotgun sequence information. No additional measures of quality clipping were applied before utilising the dataset for MDR index construction.
Generation of an MDR index and analysis of BAC DNA sequence
All 15,950,203 individual reads were used for construction of the barley MDR index as an enhanced suffix array [12, 18, 19] of 20-mers, a data structure that allows for fast and efficient retrieval of copy numbers in the index relative to oligomer queries. Algorithms and scripts were developed for use on a Dual 3.8 Ghz Intel 64 bit linux machine with 8GB of memory. In this study, the analysed BAC sequences were virtually digested into overlapping 20-mers and each 20-mer was assessed for its frequency in the index. The resulting data was transformed to reflect the average copy number at each nucleotide position across the BAC sequence. When plotted logarithmically on a genomic scale, these discrete statistics aggregate into a repeat landscape wherein regions of high TE content are easily distinguished from low-copy genic and regulatory content. Indices of this type then become a persistent resource against which query sequences of the same species can be assessed for their repeat content. No special provisions were made to address homopolymers and simple sequence repeats SSRs. The idea was to use available sequence resources directly without any manual or computational annotations/modifications. Thus, homopolymers and SSRs cause spikes of apparently extremely high copy numbers in the MDR plots. These signals are very distinct and can easily be distinguished from high-copy signals that originate from repeated sequences with higher complexity. All software for users to generate and query their own MDR indices is available upon request.
Expert annotation of publicly available sequences
Previously published barley sequences used for re-annotation.
TE content (bp)
Previously published sequences from diploid wheat T. monococcum used for re-annotation and comparison with their MDR profiles
TE content (bp)
Automated repeat masking and processing of MDR data
For repeat masking, the BAC sequences were used in BLASTN searches against the complete TREP database (release 9) which contains multiple individual copies of most TE families and thus covers variable regions within TEs better than a database which contains merely consensus sequences. Additionally, a BLASTX search was done against PTREP in order to identify coding regions of divergent TE families. Regions which produced significant BLAST (E-value < 10e-10) hits were replaced with stretches of "X". This repeat masking was performed with a custom Perl program. The results of this procedure are virtually identical with those of RepeatMasker . We used the described approach rather than RepeatMasker because of its better integration into our custom software package. All parts of TE sequences, protein coding and-non-coding, were masked.
For automatic extraction of novel repeats, regions extending over at least 80 bp with an MDR signal strength of at least 2 were identified. This corresponds to a copy number of about 20, as the Illumina/Solexa dataset covers approximately 10% of a genome equivalent. Since many TEs contain variable regions that would have lower MDR signals, we allowed for stretches of 20 bp for the signal to go below the threshold of 2 (i.e. highly variable or fast-evolving regions within a repeat may have a lower or no coverage at all in the MDR set). For the extraction of MDR data for specific TEs, the position information from the manual annotation was used to obtain the corresponding coverage data from the MDR datafile. In this way, we obtained coverage data for all annotated TEs.
For graphical display of MDR plots and expert annotations, and for processing the MDR and annotation data, custom programs were written in Perl. The source codes of all custom Perl programs that were used for this study are available upon request.
Sequence accession numbers
The barley genomic Illumina/Solexa reads are available at NCBI's Short Read Archive (accession number SRA001155) or as download at http://pgrc.ipk-gatersleben.de/solexa/. The unfinished sequences of the eight barley BAC clones are deposited at NCBI Genbank under the accession numbers EU914123 – EU914130.
An index of mathematically defined repeats (MDR) for the barley genome
Chloroplast and mitochondrial DNA were present in moderate amounts in the DNA sample. An MDR plot of the barley chloroplast genome (GenBank accesion EF115541) showed the chloroplast to be evenly covered approximately 60-fold (see Additional file 2). Given a size of 136.4 kb for the cpDNA molecule of barley the genomic DNA sample utilised for WGS sequencing contained a 1.3% fraction of cpDNA. Mitochondrial DNA was less abundant: judging from a short mitochondrial insertion into nuclear DNA (see below) the mitochondrial genome is covered only about 3-fold.
Comparison of MDR plots with manual annotation
We selected 9 publicly available large genomic sequences from barley with a cumulative length of over 2 Mbp and re-annotated them in detail in order to obtain a standard set of homogeneously annotated sequences. Special attention was given to the analysis of TEs, in order to cover the repetitive fraction as thoroughly as possible (see methods). The 9 sequences range in size from 114 kb to 519 kb (Table 1) and contain 2 to 14 genes, respectively; gene fragments and pseudogenes were included in this count. In some cases, the published annotations contained TE sequences that were mistakenly annotated as genes. Thus, the number of genes we annotated may differ from the ones given in the respective publications. The TE content of the nine barley sequences varied strongly from 37.9% to 87.6% (Table 1). We will hereafter refer to these sequences as the "standard sequence set".
The expert annotation of each sequence was compared with its respective MDR plot, graphically and quantitatively. Three examples of graphical comparisons in barley are shown in Figure 2a through 2c. The complete dataset for the standard sequences is available as supplementary material (see Additional files 3 and 4). The graphical comparison shows that, over the majority of the data, the MDR plots confirm the results of expert annotations, as almost all regions that were annotated as TEs also showed clear signals in the MDR plot. The borders of individual TEs are easily recognisable as sharp changes in the intensity of MDR signals (Figure 2a–c). Some regions show MDR signals where no TEs were annotated: these represent novel repeated sequences (see below).
In a quantitative analysis, we calculated the percentage of all standard sequences that were classified as repetitive by both expert annotation and MDR analysis. As a threshold, we considered as repetitive all regions that showed an MDR coverage higher than 2 as being repetitive. The rationale for that was that our Illumina/Solexa dataset provided an approximately 10% coverage a genome equivalent. Thus, one out of 10 single copy sequences is expected to be covered. Sequences with 10 copies are, on average, covered one-fold and those with 100 copies 10-fold. The coverage with Solexa reads is mathematically equivalent to the MDR coverage. For this study, we considered an MDR signal of 2 as repetitive (corresponding to approximately 20 copies in the genome) and sequences that were covered less than 2-fold by MDR were considered low-copy.
Comparison of the fractions that were identified as repetitive by manual annotation and through MDR analysis.
Interestingly, the results of expert annotation and MDR analysis do not completely overlap as all sequences contain a considerable fraction of 1.3% – 7.1% (average 3.9%) which was not recognised as repetitive in the expert annotation but shows clearly visible MDR signals. Most of these regions have moderate to low copy numbers and, thus, most likely contain novel types of TEs (see below).
Barley MDR signal intensity decreases strongly if applied to analysis in other Triticeae species
To determine the usefulness of barley genomic shotgun sequence information in predicting repetitive DNA in other Triticeae species, we performed an analogous MDR analysis by appying the same Illumina/Solexa dataset on a number of publicly available sequences from a diploid wheat species (Triticum monococcum L., einkorn). These were expert annotated in detail to the same quality as the set of standard sequences from barley described above. Wheat and barley diverged from each other approximately 11 MYA . The comparison of MDR plots with the expert annotation showed that both the density and the intensity of MDR signals on T. monococcum is much decreased compared to those obtained for the barley sequences. As shown in Figure 2d and 2e, MDR signals are distributed somewhat sporadically across the repetitive portions of the two T. monococcum sequences. High-copy repeats such as Angela and WIS (the wheat homologs of BARE elements) still produce relatively strong and consistent signals, however, roughly an order of magnitude weaker than in barley. Less abundant and diverse repeats such as CACTA transposons, LINE s or low-copy Gypsy elements produce signals that are barely (if at all) above the threshold for being recognised as repeats (Figure 2d and 2e). The complete set of MDR plots for wheat sequences is available as supplementary material (see Additional file 5).
Identifying the putative gene space by combining automated repeat masking with MDR analysis
Comparison of fractions of de novo sequenced BACs that were identified as repetitive by repeat masking and MDR analysis.
As immediate candidates for gene-containing regions, we considered all regions that were at least 1 kb in size, were not repeat masked and showed no or very low MDR signals (Figure 3). Small sharp peaks of strong MDR signals in candidate gene islands were ignored as they likely represented miniature inverted-repeat transposable elements (MITEs) or microsatellites, frequently found inside or nearby genes in Triticeae . In total, 34 such candidate regions were identified in the eight BACs. Seven BACs contained between two and five, while 89E23 contained 12 – consistent with its low repeat content (Figure 3).
In 17 of the candidate gene islands, putative coding sequences of genes were found. In two cases (BAC 567E19, island 2 and 4 and BAC 297P14 island 3 and 4), a single gene was split and found on two different sequence contigs which was due to the low-pass sequence status of the BAC clones. One candidate gene island contained no protein coding sequences but a conserved non-coding sequence (CNS, BAC 89E23, island 1, Figure 3). We characterised it as CNS because it contains no homology to protein coding sequences, and no open reading frame could be detected by means of gene prediction algorithms and yet it is highly conserved at the DNA level to rice genomic sequences. The CNS has a size of 207 bp and is 92% identical between rice and barley. In the rice genome, we found nine copies of that sequence. Interestingly, the CNS appears to be expressed as we found a barley EST which is 97% identical to it.
The large overlap in repetitive sequences identified by MDR or "traditional" repeat masking implies that in case of a lack of a comprehensive reference repeat database, as it is available for barley, a simple MDR analysis of a new non-annotated sequence would be sufficient to identify all potential genic regions and no additional repeat masking would be necessary.
MDR plots of individual TEs reveal variable regions and over- and under-represented motifs
Similarly, Caspar (and other CACTA) transposons contain regions with much higher copy numbers than the overall average for the whole element (Figure 4b), because of the presence of large arrays of tandem repeats. Previous studies showed that CACTA transposons can contain arrays with dozens of such repeat units , some of which were initially described as Afa repeats [30, 31]. Thus, the number of individual repeat units can by far exceed the actual copy number of the respective transposon. Additionally, Caspar transposons usually contain at least one region of several hundred bp which consists almost exclusively of low-complexity G/A-rich motifs. This region can also be easily identified in the MDR plot in Figure 3a.
MDR plots of individual TEs can also be used to identify variable regions. For example, the LTR of BARE1 showed a low coverage between positions 1000–1400. This indicates a variable region in the otherwise highly conserved LTR sequence. Indeed, comparison of 19 BARE1 LTR deposited at TREP showed much higher sequence variability in that region (data not shown). The Caspar transposon too has a region with lower sequence conservation between the two coding regions. This is also confirmed by comparison of different copies of Caspar elements (data not shown).
Identification of novel repeats based on MDR plots
Visual inspection of the graphical output of the standard set and the newly sequenced BACs yielded several regions that showed a clear MDR signal but for which no TE sequences were annotated. The first was located close to the right end of sequence AF521177 and contained the only Helitron found so far in barley (Figure 2b). This element was originally identified as an insertion in orthologous loci between two barley varieties Morex and Cebada Capa . However, it was previously unclear if (i) the entire 22 kb insertion indeed consisted of a single element or whether it (ii) contained nested insertions of other, yet unknown, TEs or (iii) fragments of non-TE genes, which Helitrons often carry . In the present analysis, the MDR plot showed for the entire element a relatively even signal corresponding to a 2–3-fold coverage with Illumina/Solexa reads (linear scale), indicating that it is present in this form in about 20–30 copies in the barley genome.
Similarly, a region of more than 20 kb close to the right end of the Cebada Capa sequence AY642926 (approx positions 150 kb – 170 kb) showed an even MDR signal, indicating the presence of a very large TE with moderate copy number. Indeed, comparison of that region with its ortholog in Morex led to the identification of an exotic TE with a size of more than 28 kb. It contains a transposase that resembles slightly those of Mutator elements but it does not contain any of the typical structural features such as terminal inverted repeats or target site duplication. Thus, its classification remains uncertain yet.
A third interesting repeat was found in an intron of a gene of the de novo survey sequenced BAC 798F12 (gene island 1, Figure 3). Specific analysis of the region revealed it to be a short insertion of a fragment of mitochondrial DNA. Insertions of fragments of organellar DNA into the nuclear genome of plants are a frequently observed phenomenon . The inserted fragment is part of the mitochondrial MatR gene which contains a reverse transcriptase domain and is, according to gene ontology, involved in RNA splicing and RNA-dependent DNA replication.
In order to systematically extract all novel repeat sequences from the standard set, a Perl program was written to recognize regions with strong MDR signals which were not identified by expert annotation. We were able to identify by this approach 62 novel repeats which ranged in size from 82 to 1999 bp (see Additional file 6). Three of them simply represented microsatellite sequences and were excluded. The average MDR coverage of the other 59 repeats (i.e. the average strength of the MDR signal across the whole element) ranged from 1.7 up to 142, indicating that some are low-copy and other are rather high-copy repeats. The low coverage of less than 2 for some repeats is due to the fact that, although we required a minimum MDR coverage of 2 for a sequence to be considered repetitive, we still allowed stretches with lower coverage within the repeat (see methods). All 59 novel repeats were used in BLASTN searches against a publicly available set of 315 genomic Triticeae sequences larger than 20 kb. Most of the repeats were present in multiple copies, thus high MDR coverage mostly correlated with a high copy number in the public dataset.
The focus of this study was to evaluate the potential of short read sequencing (SRS) in combination with mathematically defined repeat (MDR) analysis for surveying and draft annotating repetitive DNA in the large and complex barley genome. Differing from previous studies which have explored the potential of genomic 454 sequencing for stretches of plant genomic DNA packed with repetitive elements of different kind  and in purpose of characterising the global repetitive DNA landscape of legume genomes [36, 37], we studied the usefulness of Illumina/Solexa sequencing which produces a higher number of bases per run but much shorter reads than 454. Barley and other Triticeae species can serve as an appropriate training system for such an evaluation because comprehensive knowledge of the TE composition of their genomes is available and has cumulated to formation of a high-quality Triticeae repeat database . Thus, the pre-existing information of accurately annotated full length repeat elements allowed for the detailed and comprehensive comparison with the generated MDR data. The observed strong regional overlap between the statistical survey for MDR distribution with repeats identified by expert annotation, imply that even for organisms without significant genomic sequence information available, a limited investment into SRS genomic shotgun sequencing allows to accumulate a large amount of valuable biological information which can be efficiently exploited for comprehensive repeat and putative gene space identification.
MDR plots also allow to specifically focus on repeat sequences that are likely of interest to a particular research project. For example, it could be important to know whether a gene family contains TEs in promoter, intron or upstream/downstream regulatory sequences as they could affect expression of the gene. MDR analysis would thus allow to quickly identify most repetitive elements in and around genes of interest.
The comparison between MDR analysis and expert annotation in barley clearly demonstrated that manual annotation, if carefully done, can identify almost all TEs present in a sequence. However, this requires a critical amount of available training data for the organism of interest. Our experiences with the highly repetitive genomes of wheat and barley show that several hundred kb of large genomic sequences are required to identify most major high-copy TE families. Indeed, the main repeat types of wheat and barley were all discovered early on in only a few large genomic sequences [28, 38–40]. Thus, as an important outcome of the present study we can conclude that the accumulated knowledge of barley repetitive elements has reached already a very comprehensive level in the available repeat databases. The generated MDR data helped demonstrating that expert annotation of TEs can reach a high level of completeness, as the repeat identification by MDR analysis revealed in barley only relatively few new repeat types. This information is particularly valuable for sequences like AF521177 (Figure 2b) which has a very low repeat content. Without having access to statistical repeat prediction by MDR analysis we previously assumed that the sequence contained a substantial proportion of yet unknown TE sequences. The comparison to MDR plots now provided confirmation that AF521177 in fact bears extensive regions which neither contain genes nor TEs. The obvious bottleneck of manual annotation is that it is labour-intensive and time-consuming and requires a lot of expert knowledge.
Automated repeat masking and MDR analysis both provide rapid and reliable ways to determining the low-copy fraction and putative gene space in highly repetitive BAC clones
We assumed that for most projects, researchers will be interested mainly in the gene content of a particular BAC instead of fine annotation of the full repeat composition and organization. Thus, an initial search for gene containing regions will be the first step of most analyses, before efforts would be undertaken towards finishing the BAC sequence and its annotation. Our approach of combining simple repeat masking with MDR data proved to be an efficient way of identifying gene containing sequences. Exactly half of the candidate gene islands that were identified this way indeed contained (non-TE) coding sequences. If unmasked genomic sequences were used for gene identification, for example, in a BLAST search against rice proteins, interpretation of outputs tend to be labour-intensive as most of the coding sequences that produce BLAST hits will belong to TEs and the few actual genes might be missed.
Our data showed there is a great overlap in repetitive regions identified by MDR and "traditional" repeat masking. Thus, for the reliable identification of potential gene islands both methods perform similarly well. The advantage of MDR analysis is that an MDR index can be constructed from a Illumina/Solexa dataset in a relatively short time at limited investment without going through the labour-intensive process of constructing a repeat library for the species of interest.
An important result of the repeat masking vs. MDR analysis experiment was the identification of a large conserved non-coding sequence (CNS) that occupied one of the candidate gene islands. CNS were described previously in mammals as gene-regulatory sequences  or as remnants of sequence identity due to lack of divergence time . In plants, CNS function is still unclear [43, 44] and the average motif length was usually found to be much smaller . The approach described here provides the means to identify such sequences at a large scale and in large numbers.
Although sample size of genomic sequences considered for this study is small, our results suggest that the barley genome contains relatively low amounts of low-copy DNA which does not encode proteins. Nevertheless, one still has to expect many regions with no obvious coding capacity as half of the candidate gene islands on the de novo sequenced BACs harboured no genes. Another example is the sequence AF521177 which contains large regions where neither genes nor TEs were found. Such low-copy islands may still contain TE sequences that have fewer than 20 copies in the genome and are thus not detected at the 0.1 X coverage provided by the Illumina/Solexa data of our study. Lower abundance repeats could probably be identified by increasing the genome coverage with Illumina/Solexa reads. However, at such low-copy numbers, the borderline between genic sequences (e.g. multi-copy gene families) and TE and other selfish sequences may be blurred.
Some of the low-copy sequences might also represent highly degenerate TE sequences which have lost most sequences homology due to a long time of degeneration. We expect that fraction to be minimal, because intergenic sequences in Triticeae are rapidly turned over through permanent creation of DNA by TE amplification and DNA loss through illegitimate recombination and unequal crossing-over [3, 28, 38, 45]. Thus, hardly any TEs older than 3–4 Million years are found [38, 45, 46].
Characterisation of novel repeats
The availability of a highly curated set of barley sequences allowed the automated isolation of novel repeat sequences by simply using strong MDR signals outside of annotated TEs as indicator of repetitiveness. This led to the rapid identification of 60 novel TE sequences. In addition to the typical and highly abundant plant TEs (e.g. LTR retrotransposons), the barley (or any plant) genome contains a multitude of TE-derived sequences that do not contain typical and easily identifiable structural features or coding regions [25, 29]. Many of them have only a moderate copy number per genome and are therefore not detectable if only a limited dataset, i.e. publicly available large genomic Triticeae sequences, is being used, or if closely related orthologous regions are analyzed and compared [47, 48]. MDR data derived from whole genome SRS allows for the identification of novel elements merely on the basis of copy number. Since the described approach is a quantitative assessment and no qualitative interpretation, the identified novel repeats have to be analysed further and characterised in more detail in order to sustain the discovered TE information for future sequence annotations.
Already simple visual inspection of MDR plots and comparison to expert annotation or repeat masking results allows identifying possible novel repeat regions. Such an approach led to the characterisation of a novel large 28 kb TE, which was identified solely due to the information from MDR analysis. As this element was not found in any other publicly available sequence, MDR data was the only indication for it being indeed a multi-copy sequence. Additionally, a small sharp spike in MDR signal within an intron led to the rapid identification of a small mitochondrial insertion. Overall, MDR analysis in barley provided relatively few novel repeat sequences, but this is mainly attributable to the already well documented repeat landscape of the Triticeae. Thus, if the same approach would be undertaken in a species with a less well characterized repetitive fraction, a single Illumina/Solexa run covering 10% of the genome could help identifying most of the main repeat types when matched against a large genomic (e.g. BAC) sequence. Probably, the novel repetitive elements identified in this study could have been discovered as well through exhaustive TE annotation and comparison with other publicly available sequences. However, the obvious advantage of MDR analysis over manual annotation is its much lower time requirement.
Low-pass short-read-genomic shotgun sequencing of approximately 10% of a haploid genome equivalent allowed rapid and easy identification of the low-copy fraction (i.e. putative gene space) and repetitive DNA regions in barley. Converted into an index of mathematically defined repeats (MDR) the low-pass genomic shotgun sequence dataset proved to be as efficient as traditional repeat masking based on BLAST searches. In addition, it proved very efficient in the identification of novel repeat types thus further refining our knowledge on the organization of the barley genome. The obvious advantage of MDR over the traditional methods of repeat masking and expert annotation is its lower time and pre-existing expert knowledge requirement, as no repeat databases have to be constructed and de novo repeat annotation (which is inherently time-consuming and labour-intensive) becomes fast and easy. An MDR index produced from SRS reads thus provides an elegant tool for the identification of repetitive and low-copy regions in genomic sequences. An experiment as described in this study takes about one to two weeks and can generate the necessary dataset for comprehensive repeat prediction by MDR for any chosen genome. As sequencing becomes cheaper and more and more efficient, similar low-effort datasets for genome analysis and comparative genomics could be obtained for a multitude of species for which no or only few sequence data is or will become available in short time.
This work has been supported by grants to DW (National Science Foundation Division of Biological Infrastructure #0321467 and #0333074; U.S. Department of Agriculture-Agricultural Research Service) and to NS (Bundesministerium für Bildung und Forschung, BMBF, GABI-0313098A).
- Bennett MD, Smith JB: Nuclear DNA amounts in angiosperms. Philos Trans R Soc Lond B Biol Sci. 1976, 274: 227-274. 10.1098/rstb.1976.0044.View ArticleGoogle Scholar
- Smith DB, Flavell RB: Characterisation of the wheat genome by renaturation kinetics. Chromosoma. 1975, 50: 223-242. 10.1007/BF00283468.View ArticleGoogle Scholar
- Vicient CM, Suoniemi A, Anamthawat-Jonsson K, Tanskanen J, Beharav A, Nevo E, Schulman AH: Retrotransposon BARE-1 and its role in genome evolution in the genus Hordeum. Plant Cell. 1999, 11 (9): 1769-1784. 10.1105/tpc.11.9.1769.PubMed CentralView ArticleGoogle Scholar
- Kalendar R, Tanskanen J, Immonen S, Nevo E, Schulman AH: Genome evolution of wild barley (Hordeum spontaneum) by BARE-1 retrotransposon dynamics in response to sharp microclimatic divergence. Proc Natl Acad Sci USA. 2000, 97 (12): 6603-6607. 10.1073/pnas.110587497.PubMed CentralView ArticleGoogle Scholar
- Soleimani VD, Baum BR, Johnson DA: Quantification of the retrotransposon BARE-1 reveals the dynamic nature of the barley genome. Genome. 2006, 49 (4): 389-396. 10.1139/G05-119.View ArticleGoogle Scholar
- Wicker T, Matthews D, Keller B: TREP, a database for Triticeae repetitive elements. Trends Plant Sci. 2002, 7: 561-562. 10.1016/S1360-1385(02)02372-5.View ArticleGoogle Scholar
- Mardis ER: The impact of next-generation sequencing technology on genetics. Trends Genet. 2008, 24 (3): 133-141.View ArticleGoogle Scholar
- Wang J, Wong GK, Ni P, Han Y, Huang X, Zhang J, Ye C, Zhang Y, Hu J, Zhang K: RePS: a sequence assembler that masks exact repeats identified from the shotgun data. Genome Res. 2002, 12: 824-831. 10.1101/gr.165102.PubMed CentralView ArticleGoogle Scholar
- Batzoglou S, Jaffe DB, Stanley K, Butler J, Gnerre S, Mauceli E, Berger B, Mesirov JP, Lander ES: ARACHNE: a whole-genome shotgun assembler. Genome Res. 2002, 12: 177-189. 10.1101/gr.208902.PubMed CentralView ArticleGoogle Scholar
- Yu J, Wang J, Lin W, Li S, Li H, Zhou J, Ni P, Dong W, Hu S, Zeng C: The genomes of Oryza sativa : a history of duplications. PLoS Biol. 2005, 3 (2): e38-10.1371/journal.pbio.0030038.PubMed CentralView ArticleGoogle Scholar
- Li R, Ye J, Li S, Wang J, Han Y, Ye C, Wang J, Yang H, Yu J, Wong GK-S: ReAS: Recovery of ancestral sequences for transposable elements from the unassembled reads of a whole genome shotgun. PLoS Comp Biol. 2005, 1 (4): e43-10.1371/journal.pcbi.0010043.View ArticleGoogle Scholar
- Kurtz S, Narechania A, Stein J, Ware D: A new method to compute K-mer frequencies and its application to annotate large repetitive plant genomes. BMC Genomics. 2008, 9: 517-PubMed CentralView ArticleGoogle Scholar
- RepeatMasker. [http://www.repeatmasker.org]
- Kohany O, Gentles AJ, Hankus L, Jurka J: Annotation, submission and screening of repetitive elements in Repbase: RepbaseSubmitter and Censor. BMC Bioinformatics. 2006, 7: 474-10.1186/1471-2105-7-474.PubMed CentralView ArticleGoogle Scholar
- Yu Y, Tomkins JP, Waugh R, Frisch DA, Kudrna D, Kleinhofs A, Brueggeman RS, Muehlbauer GJ, Wise RP, Wing RA: A bacterial artificial chromosome library for barley (Hordeum vulgare L.) and the identification of clones containing putative resistance genes. Theor Appl Genet. 2000, 101 (7): 1093-1099. 10.1007/s001220051584.View ArticleGoogle Scholar
- Ewing B, Hillier L, Wendl MC, Green P: Base-calling of automated sequencer traces using Phred. I. Accuracy assessment. Genome Res. 1998, 8 (3): 175-185.View ArticleGoogle Scholar
- Graner A, Jahoor A, Schondelmaier J, Siedler H, Pillen K, Fischbeck G, Wenzel G, Herrmann RG: Construction of an RFLP map of barley. Theor Appl Genet. 1991, 83 (2): 250-256. 10.1007/BF00226259.View ArticleGoogle Scholar
- Abouelhoda MI, Kurtz S, Ohlebusch E: Replacing suffix trees with enhanced suffix arrays. Journal of Discrete Algorithms. 2004, 2 (1): 53-86. 10.1016/S1570-8667(03)00065-0.View ArticleGoogle Scholar
- The Vmatch large scale sequence analysis software. [http://www.vmatch.de]
- Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997, 25 (17): 3389-3402. 10.1093/nar/25.17.3389.PubMed CentralView ArticleGoogle Scholar
- The Triticeae Repeat Database. [http://wheat.pw.usda.gov/ITMI/Repeats]
- Sonnhammer E, Durbin R: A dot-matrix program with dynamic threshold control suited for genomic DNA and protein sequence analysis. Gene. 1995, 167 (1–2): GC1-10. 10.1016/0378-1119(95)00714-8.Google Scholar
- Rutherford K, Parkhill J, Crook J, Horsnell T, Rice P, Rajandream M-A, Barrell B: Artemis: sequence visualization and annotation. Bioinformatics. 2000, 16 (10): 944-945. 10.1093/bioinformatics/16.10.944.View ArticleGoogle Scholar
- The Institute of Genomics Research (TIGR). [http://www.tigr.org]
- Kalendar R, Vicient C, Peleg O, Anamthawat-Jonsson K, Bolshoy A, Schulman A: Large retrotransposon derivatives: abundant, conserved but non-autonomous retroelements of barley and related genomes. Genetics. 2004, 166 (3): 1437-1450. 10.1534/genetics.166.3.1437.PubMed CentralView ArticleGoogle Scholar
- Gaut BS: Evolutionary dynamics of grass genomes. New Phytol. 2002, 154 (1): 15-28. 10.1046/j.1469-8137.2002.00352.x.View ArticleGoogle Scholar
- Sabot F, Guyot R, Wicker T, Chantret N, Laubin B, Chalhoub B, Leroy P, Sourdille P, Bernard M: Updating of transposable element annotations from large wheat genomic sequences reveals diverse activities and gene associations. Mol Gen Genom. 2005, 274 (2): 119-130. 10.1007/s00438-005-0012-9.View ArticleGoogle Scholar
- Shirasu K, Schulman AH, Lahaye T, Schulze-Lefert P: A contigous 66-kb barley DNA sequence provides evidence for reversible genome expansion. Genome Res. 2000, 908-915. 10.1101/gr.10.7.908.Google Scholar
- Wicker T, Guyot R, Yahiaoui N, Keller B: CACTA transposons in Triticeae. A diverse family of high-copy repetitive elements. Plant Physiol. 2003, 132: 52-63. 10.1104/pp.102.015743.PubMed CentralView ArticleGoogle Scholar
- Rayburn AL, Gill BS: Isolation of a D-genome specific repeated DNA sequence from Aegilops squarrosa. Plant Mol Biol Report. 1986, 4 (2): 102-109. 10.1007/BF02732107.View ArticleGoogle Scholar
- Nagaki K, Tsujimoto H, Sasakuma T: Dynamics of tandem repetitive Afa-family sequences in Triticeae, wheat-related species. J Mol Evol. 1998, 47 (2): 183-189. 10.1007/PL00006375.View ArticleGoogle Scholar
- Scherrer B, Isidore E, Klein P, Kim JS, Bellec A, Chalhoub B, Keller B, Feuillet C: Large intraspecific haplotype variability at the Rph7 locus results from rapid and recent divergence in the barley genome. Plant Cell. 2005, 17 (2): 361-374. 10.1105/tpc.104.028225.PubMed CentralView ArticleGoogle Scholar
- Morgante M, Brunner S, Pea G, Fengler K, Zuccolo A, Rafalski A: Gene duplication and exon shuffling by helitron-like transposons generate intraspecies diversity in maize. Nat Genet. 2005, 37 (9): 997-1002. 10.1038/ng1615.View ArticleGoogle Scholar
- Leister D: Origin, evolution and genetic effects of nuclear insertions of organelle DNA. Trends Genet. 2005, 21 (12): 655-663. 10.1016/j.tig.2005.09.004.View ArticleGoogle Scholar
- Wicker T, Schlagenhauf E, Graner A, Close TJ, Keller B, Stein N: 454 sequencing put to the test using the complex genome of barley. BMC Genomics. 2006, 7: 275-10.1186/1471-2164-7-275.PubMed CentralView ArticleGoogle Scholar
- Swaminathan K, Varala K, Hudson ME: Global repeat discovery and estimation of genomic copy number in a large, complex genome using a high-throughput 454 sequence survey. BMC Genomics. 2007, 8: 132-10.1186/1471-2164-8-132.PubMed CentralView ArticleGoogle Scholar
- Macas J, Neumann P, Navrátilová A: Repetitive DNA in the pea (Pisum sativum L.) genome: comprehensive characterization using 454 sequencing and comparison to soybean and Medicago truncatula. BMC Genomics. 2007, 8: 427-10.1186/1471-2164-8-427.PubMed CentralView ArticleGoogle Scholar
- SanMiguel PJ, Ramakrishna W, Bennetzen JL, Busso C, Dubcovsky J: Transposable elements, genes and recombination in a 215-kb contig from wheat chromosome 5Am. Funct Integr Genomics. 2002, 2: 70-80. 10.1007/s10142-002-0056-4.View ArticleGoogle Scholar
- Wicker T, Stein N, Albar L, Feuillet C, Schlagenhauf E, Keller B: Analysis of a contiguous 211 kb sequence in diploid wheat (Triticum monococcum L.) reveals multiple mechanisms of genome evolution. Plant J. 2001, 26 (3): 307-316. 10.1046/j.1365-313X.2001.01028.x.View ArticleGoogle Scholar
- Panstruga R, Büschges R, Piffanelli P, Schulze-Lefert P: A contigous 60 kb genomic stretch from barley reveals molecular evidence for gene islands in a monocot genome. Nucleic Acids Res. 1998, 26 (4): 1056-1062. 10.1093/nar/26.4.1056.PubMed CentralView ArticleGoogle Scholar
- Loots GG, Locksley RM, Blankespoor CM, Wang ZE, Miller W, Rubin EM, Frazer KA: Identification of a coordinate regulator of Interleukins 4, 13, and 5 by cross-species sequence comparisons. Science. 2000, 288 (5463): 136-140. 10.1126/science.288.5463.136.View ArticleGoogle Scholar
- Dubchak I, Brudno M, Loots GG, Pachter L, Mayor C, Rubin EM, Frazer KA: Active conservation of noncoding sequences revealed by three-way species comparisons. Genome Res. 2000, 10 (9): 1304-1306. 10.1101/gr.142200.PubMed CentralView ArticleGoogle Scholar
- Kaplinsky NJ, Braun DM, Penterman J, Goff SA, Freeling M: Utility and distribution of conserved noncoding sequences in the grasses. Proc Natl Acad Sci USA. 2002, 99 (9): 6147-6151. 10.1073/pnas.052139599.PubMed CentralView ArticleGoogle Scholar
- Bossolini E, Wicker T, Knobel PA, Keller B: Comparison of orthologous loci from small grass genomes Brachypodium and rice: implications for wheat genomics and grass genome annotation. Plant J. 2007, 49 (4): 704-717. 10.1111/j.1365-313X.2006.02991.x.View ArticleGoogle Scholar
- Wicker T, Yahiaoui N, Guyot R, Schlagenhauf E, Liu ZD, Dubcovsky J, Keller B: Rapid genome divergence at orthologous low molecular weight glutenin loci of the A and A(m) genomes of wheat. Plant Cell. 2003, 15 (5): 1186-1197. 10.1105/tpc.011023.PubMed CentralView ArticleGoogle Scholar
- Wicker T, Keller B: Genome-wide comparative analysis of copia retrotransposons in Triticeae, rice, and Arabidopsis reveals conserved ancient evolutionary lineages and distinct dynamics of individual copia families. Genome Res. 2007, 17 (7): 1072-1081. 10.1101/gr.6214107.PubMed CentralView ArticleGoogle Scholar
- Fu HH, Dooner HK: Intraspecific violation of genetic colinearity and its implications in maize. Proc Natl Acad Sci. 2002, 99 (14): 9573-9578.PubMed CentralView ArticleGoogle Scholar
- Sabot F, Sourdille P, Chantret N, Bernard M: Morgane, a new LTR retrotransposon group, and its subfamilies in wheats. Genetica. 2006, 128 (1): 439-447. 10.1007/s10709-006-7725-5.View ArticleGoogle Scholar
- Rostoks N, Park Y, Ramakrishna W, Ma J, Druka A, Shiloff B, SanMiguel P, Jiang Z, Brueggeman R, Sandhu D: Genomic sequencing reveals gene content, genomic organization, and recombination relationships in barley. Funct Integr Genomics. 2002, 2 (1–2): 51-59. 10.1007/s10142-002-0055-5.View ArticleGoogle Scholar
- Brunner S, Keller B, Feuillet C: A large rearrangement involving genes and low-copy DNA interrupts the microcollinearity between rice and barley at the Rph7 locus. Genetics. 2003, 164 (2): 673-683.PubMed CentralGoogle Scholar
- Gu YQ, Anderson OD, Londeore CF, Kong XY, Chibbar RN, Lazo GR: Structural organization of the barley D-hordein locus in comparison with its orthologous regions of wheat genomes. Genome. 2003, 46 (6): 1084-1097. 10.1139/g03-071.View ArticleGoogle Scholar
- Yan L, Echenique V, Busso C, SanMiguel P, Ramakrishna W, Bennetzen J, Harrington S, Dubcovsky J: Cereal genes similar to Snf2 define a new subfamily that includes human and mouse genes. Mol Gen Genom. 2002, 268 (4): 488-499. 10.1007/s00438-002-0765-3.View ArticleGoogle Scholar
- Caldwell KS, Langridge P, Powell W: Comparative sequence analysis of the region harboring the hardness locus in barley and its colinear region in rice. Plant Physiol. 2004, 136 (2): 3177-3190. 10.1104/pp.104.044081.PubMed CentralView ArticleGoogle Scholar
- Wicker T, Zimmermann W, Perovic D, Paterson AH, Ganal M, Graner A, Stein N: A detailed look at 7 million years of genome evolution in a 439 kb contiguous sequence at the barley Hv-eIF4E locus: recombination, re-arrangements and repeats. Plant J. 2005, 41 (2): 184-194. 10.1111/j.1365-313X.2004.02285.x.View ArticleGoogle Scholar
- Komatsuda T, Pourkheirandish M, He C, Azhaguvel P, Kanamori H, Perovic D, Stein N, Graner A, Wicker T, Tagiri A: Six-rowed barley originated from a mutation in a homeodomain-leucine zipper I-class homeobox gene. Proc Natl Acad Sci USA. 2007, 104 (4): 1424-1429. 10.1073/pnas.0608580104.PubMed CentralView ArticleGoogle Scholar
- Chantret N, Cenci A, Sabot F, Anderson O, Dubcovsky J: Sequencing of the Triticum monococcum Hardness locus reveals good microcolinearity with rice. Mol Gen Genom. 2004, 271 (4): 377-386. 10.1007/s00438-004-0991-y.View ArticleGoogle Scholar
- Miller A, Galiba G, Dubcovsky J: A cluster of 11 CBF transcription factors is located at the frost tolerance locus Fr-A m 2 in Triticum monococcum. Mol Gen Genom. 2006, 275 (2): 193-203. 10.1007/s00438-005-0076-6.View ArticleGoogle Scholar
- Yan L, Loukoianov A, Tranquilli G, Helguera M, Fahima T, Dubcovsky J: Positional cloning of the wheat vernalization gene VRN1. Proc Natl Acad Sci USA. 2003, 100 (10): 6263-6268. 10.1073/pnas.0937399100.PubMed CentralView ArticleGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.