Sequencing of the wheat transcriptome
Wheat mRNA from a single cultivar, the elite variety “Kukri” [22, 23], was sequenced using: a) short-read Illumina GAIIx technology for sequencing depth, and b) long-read Roche GSFLX Titanium technology for homoeolog-sensitivity. RNA was extracted from root and shoot tissue from seedlings ranging from 8–12 days after germination, as well as florets collected at various stages from pre-meiosis to just prior to anthesis. Collection from multiple tissues and developmental stages was essential in order to obtain a reasonably comprehensive representation of the complete transcriptome. The RNA was normalized [24] in order to reduce the dominance of abundantly expressed genes.
After quality checks, trimming of adapters and size selection using custom scripts (see Methods) 14,563,748 Illumina GAIIx reads (6,913,826 read pairs, insert size ~250-300 bases, and 736,096 single reads; mean sequence length 107.8 bases) and 1,495,941 GS FLX sequences (mean sequence length 363.2 bases), i.e. 16,059,689 reads in total, were used as input to the sequencing assembly pipeline.
Assembly algorithm performance testing
We investigated the suitability of various assembly algorithms by comparison to 65 validated bread wheat (cv. Chinese Spring) homoeologous sequence triplets (the “OM” set; see Methods), obtained from [18] and [25]. Reads that bore some similarity to sequences in the OM dataset were extracted from the Illumina and GS FLX reads and assembled, using various parameters, with ABySS ([26], Version 1.2.6), Velvet ([16], Version 1.0.18), Velvet/Oases (Version 0.1.18) as well as MIRA ([19, 20] Version 3.2.1; [19, 20]). Assembled contigs were subsequently compared to the OM homoeologs, as described in Methods, and evaluated according to criteria such as evidence of chimeric sembly of homoeologs, sequence length, total number of homoeologs assembled, etc. As can be seen in Figure 1A, genome assemblers such as Velvet and ABySS tend to produce a significant number of chimeric assemblies of homoeologous copies of each gene, thereby reducing the total number of homoeologs in the assembly. The chance of chimeric assembly decreased as the k-mer length associated with the underlying de-Bruijn graph was increased, as one might expect. Usage of the transcript assembler Velvet/Oases significantly increased the rate of chimeric assembly to around 60-80%. While this was undesirable, the lack of homoeolog-specificity allowed Velvet/Oases to produce significantly longer contigs than Velvet and ABySS (Figure 1B). We found that MIRA, which is not a de-Bruijn graph-based assembler, exhibited good homoeolog-specificity (Figure 1A) over a wide range of assembly parameters (Additional file 1: Table S1), without significantly compromising contig length (Figure 1B), but at the expense of prohibitively increased memory and CPU-time requirements. For comparison, Figure 1 also shows results obtained with Trinity, another short-read de-Bruijn graph based assembler developed more recently ([27], Version 5-19-2011). Homoelog-specificity (Figure 1A) obtained with this assembler is somewhat better than that obtained with Velvet/Oases, but considerably worse than that obtained with MIRA, with evidence for chimeric assemblies in around 50% of cases. Coverage is reduced somewhat (Figure 1B) compared to Velvet/Oases, presumably reflecting the greater homoelog-specificity.
Sequence assembly
In view of the results shown in Figure 1, a two-step assembly procedure using Velvet/Oases and MIRA was employed. Initially, a total of 16,059,689 Illumina and GS FLX reads were assembled with Velvet/Oases as described in Methods. This assembly resulted in a total of 69,975 contigs, with an average length of 840 bases. Using the number of genes in rice (~41,000; [28]) as a guide, and taking the polyploidy of wheat into account, it is to be expected that the total number of wheat genes significantly exceeds this number of contigs. This supports the conclusion reached during assembly testing against the OM set that the Velvet/Oases assembler is largely homoeolog insensitive. Definitive conclusions, however, are difficult to draw because not all genes would have been expressed in the samples that were sequenced (decreasing the expected number of contigs), while alternate splice-forms would tend to separate in the assembly (increasing the expected number of contigs).
In order to produce clean homoeolog-specific contigs, the Velvet/Oases contigs were subsequently used purely to cluster the initial reads, as described in Methods, and then discarded. Importantly, as we wanted to avoid individual (possibly mis-assembled) homoeologs ending up in separate clusters at this early stage, we also included a lenient clustering of the Velvet/Oases contigs in this step. 78% of reads could be clustered in this way, with the largest cluster containing 24,806 GS FLX and 1,066,593 Illumina reads. The largest 14,000 clusters contained just over 95% of the total number of reads (Figure 2A). In each cluster, the number of GS FLX reads amounted to, with large fluctuations, 15-20% of the total number of reads in that cluster.
Next, each read cluster was assembled separately using MIRA implemented on a compute cloud, as described in Methods. After filtering for quality (average base error probability <10-4) and contig length (>250 bases), 128,628 contigs were left in the assembly. Their cumulative size distribution is shown in Figure 2B, with 27,958 contigs larger than one thousand bases and the longest contig being 11,572 bases. For comparison, the cumulative size distribution of rice cDNAs is also shown (rice cDNA sequences were obtained from [28], with those rice sequences starting with ATG discarded as they did not contain the 5’ UTR region; the total number of sequences was normalized to that of the wheat assembly for comparison). As can be seen, while the cumulative distributions are of similar shape, in general the assembled wheat contigs are shorter than the rice cDNAs. While longer contigs could presumably be obtained by reducing assembly stringency, this would in general reduce contig accuracy and, in particular, homoeolog specificity.
Sequences were annotated through comparison to rice cDNAs, and assigned to map locations when available, as described in Methods. In this way, almost 78% of sequences could be annotated through the rice cDNAs, while 27,694 sequences (including 1,667 over 1,000 bases long) could not be identified.
Sequence coverage
Characteristics of transcriptome coverage provided by the assembly were estimated by comparison to a set of 6,166 full length cDNA clones (“FLs”; [29, 30]) as well as the Harvard Tentative Contigs (“TCs”; [31, 32]) that are themselves assembled from ESTs in public databases. While it is difficult to be certain about the level of homoeolog-specificity in the TCs, the assembly algorithm that generated them (see Methods) makes it likely that they are dominated by the most abundantly expressed homoeologs. ESTs originating from homoeologs with lower expression are likely to be frequently mistaken as sequencing errors of the more abundant member of the homoeologous group.
In Figure 3, we show the average coverage along the length of the FLs and TCs as a function of the minimum%-identity demanded for a match (% identity, %ID for short, refers to the similarity of the highest scoring pair in a BlastN hit; only BlastN hits with an E-value <10-50 are considered). The coverage is fairly uniform, only dropping off significantly in the first 20% and in the last 5% of the length of the FLs. The distributions for the TCs are more symmetrical than for the FLs as the former’s orientation is, in general, not fixed. The average coverage of the FLs drops by a factor of around 3.0 (2.7 for the TCs) when moving from low specificity (%ID~90%) to high specificity (%ID~99%). As discussed below, we expect a %ID of 90% or less to be largely homoeolog-insensitive, while a %ID of 99% or more likely to be homoeolog-discriminating. The drop by a factor of around two or three is, therefore, consistent with the assembly being a homoeolog-discriminating assembly of a (homozygous) hexaploid.
5,034 of 6,166 (81.6%) FLs were matched by at least one assembled contig at 90% identity. This reduced to 4,110 (66.7%) at 99% identity. For 1,439 and 642 of these, respectively, the alignment extended for more than 80% of the length of the FL sequence. For the TCs, 69,187 out of 93,508 (74.0%) were matched by a least one assembled contig at 90% identity and 39,430 (42.2%) at 99% identity. We note, however, that at 99% ID the results have a strong dependence on the %ID (e.g., at 98 %ID, the numbers are 74.2% and 54.7% for the FLs and TCs, respectively). We comment on the implications of this in the Discussion.
Homoeolog specificity
The quality of the final assembly was investigated by again comparing the results to the 65 validated bread wheat (cv. Chinese Spring) homoeologous sequence triplets of the OM set, as described earlier. The sequence diversity of the OM sequences is shown in Figure 4A. An all-vs-all BlastN comparison was used to quantify this diversity, with hits with an E-value better than 10-50 and an alignment length larger than 400 bases being retained. As can be seen, homoeologs in the OM set predominantly exhibit a %ID of between 93 and 99%.
The sequence diversity of our own assembly was quantified in the same way and is shown in Figure 4B. This plot is normalized in such a way that at most one hit per sequence pair, with the greatest sequence similarity, is retained. A sharp rise in sequence-pairs with %ID above 93% may again be observed and it is tempting to conclude, by comparing to Figure 4A, that this peak above background is largely caused by the presence of homoeologous sequences. If this is indeed the case, one would deduce from Figure 4B that the average sequence identity of homoeologs in wheat is about 97.2%, with a spread (standard deviation) of about 1.8%. This in turn corresponds to a SNP frequency per sequence of ~1.4%, which is a little higher (1 SNP/71 bases) than that reported by Mochida et al. [18]. Furthermore, under the extreme assumption that the excess above background seen in Figure 4B (approx. 83,600 Blast hits) is entirely due to homoeologous sequence triplets, one ends up with a lower bound of around 27,900 × 3 transcripts in the dataset. Because in reality one would expect the RNA sample to consist of a mixture of expressed homoeologous triplets, doublets and single sequences, this lower bound, being 68% × 3 of the gene content in rice, is not unreasonable.
Next, the assembled sequences were compared to the OM set using BlastN (E-value < 10-100, word size 11). Sequences were allocated to individual homoeologs, using %ID as a criterion, by iteratively identifying and removing highest quality matches, as before (see Methods). Results of this comparison are shown in Additional file 2: Table S2. On average, the %ID of the identified matches is 99.67%. The SNP frequency between varieties of bread wheat is naturally somewhat variable and dependant on cultivar and lineage, with published estimates ranging from 1 SNP/540 bases [33] to 1 SNP/335 bases [34]. It appears likely, therefore, that the 1 SNP/300 bases observed between the assembled Kukri sequences and the OM set (Chinese Spring) can largely be attributed to inter-cultivar polymorphisms.
In total, this procedure identified 186 out of a total of 195 (95.4%) possible homoeologs in the OM set. For 57 triplets all three homoeologs were identified (87.7%), for 7 triplets two out of three homoeologs were identified (10.8%) and in one case only one member of the triplet was identified.
The level of chimeric assembly between reads originating from separate homoeologs was quantified in the final assembly, as described earlier. In total, only 27 out of 186 (14.5%) assigned sequences showed some evidence of chimeric assembly.
These results are almost identical to those shown for the assemblies produced by MIRA and shown in Figure 1, which did not involve an initial Velvet/Oasis assembly. This is consistent with the interpretation that the two-stage assembly was not compromised by mis-assembly in stage one.
Finally, a set of 19 random clusters was selected from the final assembly in such a way that 3 contigs therein were mutually overlapping. These contigs were then compared to unassembled reads from a genomic wheat sequencing project of the variety Chinese Spring ([35]; see Methods). The existence of any reads that could clearly be associated with more than 1 contig was taken as evidence of chimeric assembly. In total, 11 suspect contigs were identified in this way, leading to an independent estimate of the rate of chimeric assembly of about 18%. This is in rather good agreement with the estimate provided by the comparison to the OM set described earlier.
Comparison to other grass genomes
The assembled wheat sequences were compared to those available for the published genomes of rice [36], sorghum [37] and brachypodium [38] (see Methods). Just under 80% of the wheat contigs have an apparent homologue in one of these diploids (see Figure 5). In most cases (70%) a homologue was identified in all three species and, quite reasonably given the relative proximity of brachypodium to wheat in the phylogenetic tree, more homologues were identified in brachypodium than in the other two species. We do not believe that all of the 20% of wheat sequences that did not have a match in the other species are unique to wheat; rather, closer inspection reveals that these sequences tend to be those contigs that are shorter than average, and so we think that inadequate sequence length prevented a reasonable match from being identified. This interpretation is supported by the fact that, if only contigs longer than 2000 bases are retained in Figure 5, the proportion of unmatched sequences reduces dramatically to around 1.2%.
In most cases, several wheat sequences are associated with each brachypodium, rice and sorghum sequence in Figure 5 simply because of the polyploid nature of the wheat genome and so this figure does not in itself provide information about the proportion of sequences in brachypodium, rice and sorghum that are represented in wheat. Inspection of the sequence comparison results (see Methods) shows that 88.8% of brachypodium sequences, 89.4% of sorghum sequences and 70.0% of rice sequences are matched by at least one wheat contig. These numbers lend support to the estimates of coverage resulting from the comparison to the FL and TC wheat datasets earlier on.
Finally, the wheat contigs were compared against 23,614 published sequences of full length barley cDNAs [39]. 86,631 wheat sequences (67.4%) were represented in the barley full length cDNAs and 19,762 barley sequences (83.7%) were found to have at least one wheat homologue. One would expect the number of wheat sequences represented in the full length cDNAs to be lower than that for rice, brachypodium and sorghum simply because the barley cDNA dataset [39] does not correspond to the full barley genome. This appears to be the case, however the comparison to barley sequences was performed on the DNA level, while the comparison to the other three genomes was performed on the peptide level, so an accurate quantitative comparison of numbers is not possible.
Gene Ontology analysis and Pfam domains
The wheat contigs were classified into the gene ontology hierarchy [40] by transferring the GO terms of the best sequence match in rice to each matched wheat sequence, as described in Methods. In this way, 64,071 of the 128,628 wheat contigs inherited 248,403 GO terms. This compares to 18,307 of the 57,624 rice loci annotated with 70,425 GO terms.
The results of this GO analysis for the “Molecular function” ontology are shown in Additional file 3: Table S3. It is noticeable that GO categories associated with binding to RNA (particularly RNA silencing-related proteins DCL3 and HEN1, occurring in the ratios 25/1 and 22/1 in wheat vs. rice), chromatin binding (particularly various regulators of chromosome condensation) and translation factor activity (particularly various translation initiation and elongation factors) appear greatly enriched in wheat compared to rice. It is tempting to speculate that all three categories are enriched in wheat because of its enormous number of transposable elements [41]. This may require an increased need for RNA silencing and/or translational inhibition [42].
The wheat contigs were also scrutinized for occurrences of Pfam domains (see Methods). In total, 65,826 contigs were found to contain 3,073 unique domains. For comparison, the rice coding sequences [36] contained 51,437 sequences with 3,265 unique domains. 3,043 of these domains are in common between the two species. Generally speaking, the number of occurrences of any particular Pfam domain is strongly correlated in wheat and rice (Spearman rank coefficient 0.69; see Additional file 4: Figure S2), indicating that there is no strong evidence of genome wide functional bias associated with the assembly procedure. Some notable exceptions to this are shown in Additional file 5: Table S4, corresponding to off-diagonal points in Additional file 4: Figure S2 with domains associated with transposable elements (e.g. MULE, RVT_2, rve, Plant_trans etc.) being particularly over-represented in wheat, in line with the expectation outlined above [41].
Homologous cluster analysis
OrthoMCL [43, 44] was used to cluster wheat contigs into 19,086 putative homologous groups together with the peptide sequences of the three sequenced genomes of brachypodium, rice and sorghum, as described in Methods. As shown in Figure 6, the three diploid species contribute, on average, around 11% of the sequences of each group, while wheat contributes the remaining 2/3.
Groups with an abnormally low number of wheat sequences (Additional file 6: Figure S3) include one annotated as subtilisin (45 sequences only present in sorghum) and a group of 44 cytochrome P450s, while groups with an abnormally large number of wheat sequences (Additional file 7: Figure S4) include a group of 55 alanyl-tRNA synthetates and 72 splicing factors, broadly consistent with the over-representation of translation and RNA binding related activities indicated in Additional file 6: Table S3.