HANDS: a tool for genome-wide discovery of subgenome- specific base-identity in polyploids
© Mithani et al.; licensee BioMed Central Ltd. 2013
Received: 21 May 2013
Accepted: 18 September 2013
Published: 24 September 2013
The analysis of polyploid genomes is problematic because homeologous subgenome sequences are closely related. This relatedness makes it difficult to assign individual sequences to the specific subgenome from which they are derived, and hinders the development of polyploid whole genome assemblies.
We here present a next-generation sequencing (NGS)-based approach for assignment of subgenome-specific base-identity at sites containing homeolog-specific polymorphisms (HSPs): ‘HSP base Assignment using NGS data through Diploid Similarity’ (HANDS). We show that HANDS correctly predicts subgenome-specific base-identity at >90% of assayed HSPs in the hexaploid bread wheat (Triticum aestivum) transcriptome, thus providing a substantial increase in accuracy versus previous methods for homeolog-specific base assignment.
We conclude that HANDS enables rapid and accurate genome-wide discovery of homeolog-specific base-identity, a capability having multiple applications in polyploid genomics.
KeywordsNext-generation sequencing RNA-seq Polyploidy Wheat Base-identity
Polyploidy has played a key role in the evolution of many plants , and has important consequences with respect to transcriptional regulation and the silencing of genes [2–6]. However, polyploidy presents a particular problem with respect to genome analysis. Polyploid genomes consist of two or more homeologous subgenomes, and sequence relatedness between these subgenomes makes determining the homeoallelic identity of genes and gene transcripts a particular challenge .
The recent advent of next-generation sequencing technology now makes it possible to survey multiple HSPs on a genome-wide basis , and to determine the homeologous subgenome-identity of individual bases at those HSP sites (Additional file 1: Figure S2). Thus, we have developed ‘HSP base Assignment using NGS data through Diploid Similarity’ (HANDS) as a method to characterize HSPs in polyploid genomes. In essence, HANDS involves comparative alignments of next-generation sequencing reads from polyploid and diploid progenitor-relatives onto a suitable reference sequence. Similarity of polyploid subgenomes with progenitor-relative diploids has recently been exploited to classify gene assemblies in bread wheat subgenomes  and sequencing reads in allopolyploid cotton , thus suggesting the viability of this approach for homeolog classification. In contrast, HANDS classifies individual HSP positions, and provides an added advantage in that it works even if only EST or mRNA datasets are available for polyploid genomes, thus allowing base characterization without the need of genomic assemblies.
Although HANDS was developed with bread wheat in mind, it is applicable to all allopolyploid genomes, and relies on sequence similarity between the polyploid subgenomes and the corresponding diploid progenitors. In the specific case of wheat HANDS uses the sequence similarity between the three A, B and D bread wheat subgenomes and the genomes of three extant diploid relatives of the original subgenome donors (Additional file 1: Figure S2). When applied to high-throughput RNA sequencing (RNA-seq) for bread wheat, HANDS correctly predicts subgenome-specific base-identity at over 90% of assayed HSPs. Thus HANDS allows rapid and accurate discovery of homeolog-specific base-identity at a genome-wide level.
The HANDS framework
HANDS uses the sequencing alignment/mapping (SAM) file  of the polyploid genome (for example, bread wheat) obtained using sequence alignment tools such as Burrows-Wheeler Aligner  along with the variants lists (HSPs for the polyploid and single base substitutions (SBSs) for the progenitor-relative diploids) to assign base-identity to the polyploid subgenomes. The variant lists are tab-separated files containing sequence name, position, reference base and consensus base, and can be generated using tools such as SAMtools  (see Methods).
HANDS starts by optionally preprocessing the data to validate the HSPs and SBSs identified in the polyploid and progenitor-relative diploid genomes respectively. The preprocessing step is described in detail in Note S1 in Additional file 1. Once the preprocessing is done, HANDS considers the read alignments (from SAM file) for each gene simultaneously and proceeds in five steps (Figure 1B). In step 1, HANDS creates a base pattern for each read and counts the reads containing a particular base pattern. In step 2, base patterns that are a result of sequencing error or embedded in other base patterns are removed. In step 3, base patterns are assigned to one or more subgenomes based on their similarity with the corresponding diploid. In step 4, base patterns are prioritized based on their percentage identity with the diploids, and bases are assigned to each subgenome iteratively discarding base patterns that contradict with the already assigned bases. In step 5, the base patterns discarded in step 4 are re-used to characterize bases at unassigned positions. These steps are described in the next five subsections and are further explained in Note S1 in Additional file 1 using a detailed example.
Step 1: Creation of base patterns from aligned reads
In step 1, HANDS considers each read aligned against the current gene in turn and creates a base pattern for that read. A base pattern is regarded as the sequence of the pairs (position, nucleotide) for all HSPs found in the read. For paired-end data, alignments corresponding to both reads in the pair are considered together to generate a single base pattern for the pair. Read pairs, where a nucleotide discrepancy is found at a particular HSP position in the forward and reverse reads are ignored as this suggests that one of the two reads had a sequencing error (Additional file 1: Figure S3). Note that this scenario will only occur when reads in a pair have some overlap between them. While creating the base patterns, HANDS groups the same base patterns together keeping track of the number of reads supporting a particular base pattern. Once base patterns are obtained all further processing is done on these patterns and not on reads.
Step 2: Removal of potential sequencing errors and embedded base patterns
To remove the base patterns arising due to sequencing errors, HANDS calculates the base pairs (two consecutive HSPs) in each base pattern along with the number of reads supporting each pair. Base patterns containing a base pair that is present in less than 5% (a user-specified parameter) of the reads at these positions are ignored. HANDS further removes those base patterns that are embedded within another base pattern.
Step 3: Assignment of base patterns to individual genomes
Once a list of valid base patterns is obtained for the current gene, HANDS assigns the base patterns to individual genomes using the SBS data for the diploid progenitors. A base pattern is assigned to a genome if at least 50% (a user-specified parameter) of its bases at HSP positions match with the corresponding diploid genome. When calculating the percentage identity, positions with low coverage in the diploid (less than 3 (a user-specified parameter) reads) as well as positions where a unique nucleotide is not present (for example, heterozygous base substitution) are ignored.
Step 4: Assignment of bases to individual genomes
In step 4, HANDS assigns bases to a genome iteratively using the patterns assigned to the genome in step 3. For each HSP position, HANDS tabulates all nucleotides that are present in different base patterns assigned to an individual genome along with the base pattern(s) having the maximum percentage identity with the diploid genome containing that particular nucleotide. If only one nucleotide is present at a particular position, it is assigned at that position. Otherwise the nucleotide corresponding to the base pattern with the maximum percentage identity with the diploid progenitor is assigned at that position. If multiple nucleotides have a base pattern with the maximum percentage identity with the diploid then no assignment is made. Once all the positions have been checked, base patterns that contradict with the assigned positions are removed and the process is continued until no more positions can be assigned.
Step 5: Assignment of bases to individual genomes using discarded base patterns
HANDS finally uses the base patterns that were discarded in Step 4 from different genomes because they contradicted the assigned bases (see above) to correct the bases assigned to the genomes. The base patterns and subsequently bases are assigned to the genome using the same strategy as above but now the percentage identity is calculated using the previously assigned bases instead of diploid data. A nucleotide is only changed at a particular HSP position if the base pattern containing that nucleotide has a better percentage identity than the base pattern corresponding to the currently assigned nucleotide.
Provision for distant genomes
HANDS also caters for the scenario where one diploid progenitor may be relatively distant from the corresponding genome in the polyploid. One such example is the B genome in wheat. The donor of the B genome is unknown but is hypothesized to be similar to Aegilops speltoides[7–9]. In this scenario, a base pattern may have very low or no identity with the diploids. HANDS uses the fact that the base pattern must come from one of the subgenomes of the polyploid and consequently assigns the base pattern that is not assigned to any genome (due to low identity with the diploids) in step 3 to the genome designated as the distant genome.
Characterization of subgenome-specific base-identity in bread wheat
We used HANDS to characterize the HSPs in bread wheat (T. aestivum cv. Chinese Spring (CS)). To avoid the complexity due to the large genome size (16,000 Mb) and high repetitive content (>80%) of wheat [14, 15], we sequenced the transcriptomes of T. aestivum (hexaploid), T. urartu (diploid A-genome progenitor-relative), Ae. speltoides (diploid B-genome progenitor-relative) and Ae. tauschii (diploid D-genome progenitor-relative) using Illumina paired-end sequencing technology (Additional file 1: Table S1, Methods), and aligned the resultant reads onto a wheat transcriptome reference sequence (Additional file 1: Table S2). This wheat transcriptome reference was constructed using Wheat UniGene Build 60 (http://www.ncbi.nlm.nih.gov/UniGene/) by concatenating the UniGenes such that they were separated by 200 ‘N’s (Additional file 1: Figure S4). The aligned reads were filtered to remove low quality alignments and variants (HSPs and SBSs for polyploid and diploid progenitor-relatives respectively) were called (Additional file 1: Table S2, Methods). HANDS was then used to characterize the HSPs in T. aestivum based on the similarity with the SBSs in the diploid progenitor-relatives. Out of a total of 342,266 HSP positions identified in T. aestivum, HANDS assigned bases at 325,517 positions (95.1%) to one or more subgenomes (Additional file 2: Table S3). Out of the remaining 16,749 positions where base assignments could not be made 16,454 (98.24%) positions were such that one or more diploids had low coverage or an ambiguous base (a heterozygous base substitution) was present at these positions and were consequently ignored.
Assessment of HANDS base-identity assignment accuracy
Performance evaluation of HANDS
Diploid coverage low
Sub genome silenced
We have presented HANDS, a method for rapid and accurate characterization of HSPs in polyploid genomes. HANDS uses next-generation sequencing data from the polyploid and the diploid progenitors to characterize bases in the polyploid genome at HSP positions without requiring to sequence a large number of genomes. This not only reduces the cost associated with genome sequencing but also eliminates the need of generating NT lines for each chromosome of the polyploid genome. The current implementation allows characterization of base-identities in any allopolyploid containing up to three subgenomes and is therefore applicable to a number of important polyploid genomes such as Brassica napus (rapeseed) , Gossypium hirsutum (cotton)  and Avena sativa (oat) . Although, the results are presented in this paper using SAM file generated by BWA and variant lists generated by SAMtools, it is important to note that HANDS is independent of the sequence mapping and variant calling software, and can characterize HSPs using output from any standard mapping/variant calling tool.
When tested on chromosomes 1 and 5 of bread wheat, HANDS characterized bases to the subgenomes with high accuracy for both chromosomes. However, for both chromosomes 1 and 5, the base assignment accuracy was lowest for the B subgenome (Table 1). This is likely due to the fact that Ae. speltoides, used as the B genome diploid progenitor-relative for base assignments, is actually more distantly related to the T. aestivum B subgenome than is the case for the A and D genome progenitor relatives [7–9]. Nevertheless, HANDS still assigns bases with great accuracy (~94%) to the B subgenome, because it caters for the scenario where one progenitor may be relatively distant from the corresponding polyploid subgenome when assigning bases (see above).
Given this defined level of accuracy for group 1 and group 5 chromosomes, it is likely that HANDS works with >90% accuracy throughout the entire bread wheat genome. To further support this proposition, we calculated the percentage of shared bases between different subgenomes of wheat, and between wheat subgenomes and corresponding diploid progenitors across the whole in silico reference using the base-identities predicted by HANDS, and compared the results to those obtained for chromosomes 1 and 5 (Figures 2B and 3B). No significant difference was found between the full and partial datasets (Fisher’s exact test, P-values ≈ 1).
We next sought to calculate the percentage of HSPs that were tri-homeoallelic (i.e. positions containing three different alleles) in our datasets. It has been recently reported that the percentage of tri-homeoallelic HSPs present in bread wheat is surprising low although it contains three homeologous subgenomes . We found that the percentages of tri-homeoallelic HSPs in partial (chromosomes 1 and 5) as well as full datasets were comparable to those reported earlier (Additional file 1: Figure S5). This further supports the proposition that HANDS works with high accuracy across the whole in silico reference. Moreover, it is possible that HANDS classification of individual HSPs accounts for the greater accuracy of this method (exploiting the similarity of the polyploid subgenomes with the diploid progenitor relatives) versus the previous gene-assembly based method, which respectively gave accuracies of ~72% for the A, ~85% for the D and ~60% for the B genomes .
In summary, HANDS allows accurate genome-wide discovery of subgenome-specific base-identity in polyploids. In the specific case of wheat, this advance will enhance our understanding of the complex genome architecture of this important crop plant. In particular, it provides the ability to relate, on a genome-wide basis, specific homeoallelic variants to particular agronomic traits, thus increasing the precision of crop-breeding solutions to address the challenge of global food security.
Three spikelets of T. aestivum CS, T. urartu, Ae. speltoides and Ae. tauschii and chromosomes 1 and 5 Nullisomic-Tetrasomic (NT) lines were placed on Whatman filter papers soaked with water in Petri-dishes. The spikelets were stratified at 4°C for 2 days in the dark before extracting the seeds from the spikelets. The seeds were then transferred back to 4°C for a further 2 days after which they were allowed to grow at room temperature. Three replicates for each plant were then grown hydroponically in a controlled environment room (16/8-h light/dark cycle at 21°C, irradiance 120 μmol photons m-2 s-1). Root and shoot tissue samples were collected when the fifth leaf appeared. At this point, the fourth leaf was taken as the shoot sample.
RNA extraction and sequencing
RNA was extracted from the shoot and root samples using Trizol (Invitrogen) using the standard protocol. The Illumina Genome Analyzer II (GAII) platform was used for transcriptomic sequencing, according to the manufacturer’s instructions, at the Wellcome Trust Centre for Human Genetics, Oxford, UK. Four lanes of 36, 51 and/or 71 bp paired-end data were generated for diploids and NT lines whereas two lanes of 100 bp paired-end data were generated for CS for both root and shoot samples of a single plant. Between 244.6 and 694.9 million reads were produced for each line (Additional file 1: Table S1).
Sequence alignment, filtering and visualization
Sequencing reads for T. aestivum CS, T. urartu, Ae. speltoides, Ae. tauschii, and chromosome 1 and 5 NT lines were mapped to the in silico reference genome created using Wheat UniGene Build 60 (http://www.ncbi.nlm.nih.gov/UniGene/, Additional file 1: Figure S4) using Burrows-Wheeler Aligner (BWA)  using default parameters. Reads with low mapping quality (phred score ≤ 20), reads for which the read pair did not map onto the reference genome, and reads that did not map uniquely onto the reference genome were removed from subsequent analyses using custom scripts written in C++. The alignments were visualized using Integrated Genome Viewer (IGV) .
Variant calling and filtering
The filtered alignments were used to generate HSP and SBS lists for the polyploid and diploids respectively using SAMtools  (Additional file 1: Table S2). First, pileups were generated using ‘mpileup’ command with probabilistic realignment for the computation of base alignment quality (BAQ) disabled and maximum coverage threshold set to 50,000. These pileups were then used to call variants using bcftools (available as part of SAMtools) using default parameters. The HSP and SBS lists were filtered to remove potential false positives including those due to errors arising during DNA sequencing itself. Variants with low quality (phred score ≤ 20) as well as those with read coverage of less than 3 reads per site were removed. For diploids, ambiguous base calls were also ignored.
Identification of chromosomes 1 and 5 UniGenes
To identify the UniGenes located on chromosomes 1 and 5, we first characterized HSP bases for all 56,954 UniGenes present in the Wheat UniGene build 60 separately for chromosomes 1 and 5 using NT lines (Note S2 in Additional file 1). For each chromosome, we then extracted those UniGenes for which base assignments were made for at least 60% of the HSPs present in the UniGene. This process identified 2,773 and 3,218 UniGenes on chromosomes 1 and 5 respectively. HSPs present on these UniGenes were used for the evaluation of HANDS’s performance.
Availability of supporting data
The Illumina sequencing data files have been submitted to the NCBI Sequence Read Archive (SRA) (http://www.ncbi.nlm.nih.gov/sra) under accession number SRA097144. HANDS is implemented in Java and is available online (http://lums.edu.pk/sse/biology/aziz-mithani-hands/).
This publication is based on work supported by Award No. KUK-I1-002-03, made by King Abdullah University of Science and Technology (KAUST). We thank Steve Reader, John Innes Centre, Norwich, UK, for supply of wheat lines, and David Buck, the Head of Genomic Services, and his team at the Wellcome Trust Centre for Human Genetics, Oxford, UK with advice and help regarding Illumina sequencing.
- Wendel JF: Genome evolution in polyploids. Plant Mol Biol. 2000, 42: 225-249. 10.1023/A:1006392424384.View ArticlePubMedGoogle Scholar
- Udall JA, Swanson JM, Nettleton D, Percifield RJ, Wendel JF: A novel approach for characterizing expression levels of genes duplicated by polyploidy. Genetics. 2006, 173: 1823-1827. 10.1534/genetics.106.058271.PubMed CentralView ArticlePubMedGoogle Scholar
- Chaudhary B, Flagel L, Stupar RM, Udall JA, Verma N, Springer NM, Wendel JF: Reciprocal silencing, transcriptional bias and functional divergence of homeologs in polyploid cotton (Gossypium). Genetics. 2009, 182: 503-517. 10.1534/genetics.109.102608.PubMed CentralView ArticlePubMedGoogle Scholar
- Akhunova AR, Matniyazov RT, Liang H, Akhunov ED: Homoeolog-specific transcriptional bias in allopolyploid wheat. BMC Genomics. 2010, 11: 505-10.1186/1471-2164-11-505.PubMed CentralView ArticlePubMedGoogle Scholar
- Blanc G, Wolfe KH: Functional divergence of duplicated genes formed by polyploidy during Arabidopsis evolution. Plant Cell. 2004, 16: 1679-1691. 10.1105/tpc.021410.PubMed CentralView ArticlePubMedGoogle Scholar
- Chen ZJ, Pikaard CS: Transcriptional analysis of nucleolar dominance in polyploid plants: biased expression/silencing of progenitor rRNA genes is developmentally regulated in Brassica. Proc Natl Acad Sci USA. 1997, 94: 3442-3447. 10.1073/pnas.94.7.3442.PubMed CentralView ArticlePubMedGoogle Scholar
- Shewry PR: Wheat. J Exp Bot. 2009, 60: 1537-1553. 10.1093/jxb/erp058.View ArticlePubMedGoogle Scholar
- Dvorak J, Zhang HB: Variation in repeated nucleotide sequences sheds light on the phylogeny of the wheat B and G genomes. Proc Natl Acad Sci U S A. 1990, 87: 9640-9644. 10.1073/pnas.87.24.9640.PubMed CentralView ArticlePubMedGoogle Scholar
- Brenchley R, Spannagl M, Pfeifer M, Barker GL, D’Amore R, Allen AM, McKenzie N, Kramer M, Kerhornou A, Bolser D: Analysis of the bread wheat genome using whole-genome shotgun sequencing. Nature. 2012, 491: 705-710. 10.1038/nature11650.PubMed CentralView ArticlePubMedGoogle Scholar
- Saintenac C, Jiang D, Akhunov ED: Targeted analysis of nucleotide and copy number variation by exon capture in allotetraploid wheat genome. Genome Biol. 2011, 12: R88-10.1186/gb-2011-12-9-r88.PubMed CentralView ArticlePubMedGoogle Scholar
- Page JT, Gingle AR, Udall JA: PolyCat: a resource for genome categorization of sequencing reads from allopolyploid organisms. G3 (Bethesda). 2013, 3: 517-525. 2013.View ArticleGoogle Scholar
- Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R: The sequence alignment/Map format and SAMtools. Bioinformatics. 2009, 25: 2078-2079. 10.1093/bioinformatics/btp352.PubMed CentralView ArticlePubMedGoogle Scholar
- Li H, Durbin R: Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics. 2009, 25: 1754-1760. 10.1093/bioinformatics/btp324.PubMed CentralView ArticlePubMedGoogle Scholar
- Francki M, Appels R: Wheat functional genomics and engineering crop improvement. Genome Biol. 2002, 3: reviews1013.5Google Scholar
- Arumuganathan K, Earle ED: Nuclear DNA content of some important plant species. Plant Mole Biol Reporter. 1991, 9: 208-218. 10.1007/BF02672069.View ArticleGoogle Scholar
- Sears ER: Nullisomic-tetrasomic combinations in hexaploid wheat. Chromosome Manipulations and Plant Genetics. Edited by: Riley R, Lewis KR. 1966, London: Oliver and Boyd, 29-45.View ArticleGoogle Scholar
- Cheung F, Trick M, Drou N, Lim YP, Park JY, Kwon SJ, Kim JA, Scott R, Pires JC, Paterson AH: Comparative analysis between homoeologous genome segments of Brassica napus and its progenitor species reveals extensive sequence-level divergence. Plant Cell. 2009, 21: 1912-1928. 10.1105/tpc.108.060376.PubMed CentralView ArticlePubMedGoogle Scholar
- Paterson AH, Wendel JF, Gundlach H, Guo H, Jenkins J, Jin DC, Llewellyn D, Showmaker KC, Shu SQ, Udall J: Repeated polyploidization of Gossypium genomes and the evolution of spinnable cotton fibres. Nature. 2012, 492: 423-10.1038/nature11798.View ArticlePubMedGoogle Scholar
- Linares C, Ferrer E, Fominaya A: Discrimination of the closely related A and D genomes of the hexaploid oat Avena sativa L. Proc Natl Acad Sci USA. 1998, 95: 12450-12455. 10.1073/pnas.95.21.12450.PubMed CentralView ArticlePubMedGoogle Scholar
- Winfield MO, Wilkinson PA, Allen AM, Barker GL, Coghill JA, Burridge A, Hall A, Brenchley RC, D’Amore R, Hall N: Targeted re-sequencing of the allohexaploid wheat exome. Plant Biotechnol J. 2012, 10: 733-742. 10.1111/j.1467-7652.2012.00713.x.View ArticlePubMedGoogle Scholar
- Thorvaldsdottir H, Robinson JT, Mesirov JP: Integrative Genomics Viewer (IGV): high-performance genomics data visualization and exploration. Brief Bioinform. 2013, 14: 178-192. 10.1093/bib/bbs017.PubMed CentralView ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.