RNAseq and the genomic sequence of related species can be used to define SNVs
To develop a collection of putative SNVs for Miscanthus, we sequenced transcriptomes of M. sinensis 'Grosse Fontaine' and 'Undine' leaves and leaf rolls using deep RNAseq. Across both accessions, we generated over 21 Gbp in predominantly paired 80 bp Illumina GA II reads (Additional file 7: Table S2, NCBI Short Read Archive accession number SRA051293). From these RNAseq data we assembled a unified set of 29,933 contigs longer than 100 bp (Additional file 2). The median contig length was 522 bp, with half of the total contig length accounted for by 6,433 contigs longer than 1,071 bp (the contig N50). We identified SNVs by realigning the RNAseq reads against the assembled transcriptome contigs and requiring strong support for two alternate variants embedded in otherwise nearly identical flanking sequence, to enable straightforward high-throughput genotyping. Other variation observable in the dataset was not considered further.
Since our aim was to define variants that could be genotyped by a GoldenGate assay with genomic rather that transcriptomic samples, we excluded from consideration probe sequences that spanned a putative exon-exon boundary. To do this in the absence of a Miscanthus genomic reference, we took advantage of the extensive conservation of exon-exon boundaries in grasses [5] to identify and reject likely exon-junction-spanning probe sequences by comparison with the genomes of sorghum, maize, and rice. To facilitate syntenic comparisons between Miscanthus and related species, we also chose for genotyping those SNVs that (1) could be readily assigned to homologs in sorghum by sequence similarity and (2) had homologs that were distributed across all sorghum chromosomes (Additional file 8: Figure S1).
Results of GoldenGate genotyping
Out of 1,536 putative markers on the Miscanthus GoldenGate array (Additional file 3: Table S3), 1,243 showed one or more clusters in GoldenGate signal space (Figure 2), indicating consistent genotyping across individuals. The remaining 293 putative markers showed dispersed or very low signal in Genome Studio and were considered failed assays, and not investigated further. Of the 1,243 successful oligonucleotide assays, we found that 93 assays showed signal for only one probe, and appear to be homozygous across both parents and their progeny or represent cases where the second oligo probe failed. After excluding these failed or invariant assays, we were left with 1,150 markers, of which 658 formed 2 or 3 clusters in signal space. The remaining markers appear as either a single centrally located cluster, more than three clusters, or dispersed signal, and were not considered further.
Intepretation of GoldenGate SNP genotypes
By considering the patterns of genotypes across our F1 mapping population, we found that many of the SNV's discovered by RNAseq analysis are indeed segregating biallelic markers (i.e., single nucleotide polymorphisms, or SNPs). Others, however, represent fixed differences between closely related paralogous loci. Furthermore, many segregating biallelic markers have their GoldenGate signal affected by a closely related paralog that has the same sequence as the marker allele. Signal from such paralogous alleles causes the cluster positions in Genome Studio to be skewed in a characteristic manner that is readily recognized. A plot of normalized theta (ratio of signal intensities assayed for A and B SNP alleles) against normalized R (signal intensity) per marker for each individual can be used to visualize genotypes in a segregating population (Figure 2A-2F). The values of normalized theta are close to 0 in samples where the genotype is AA, close to 0.5 if it is AB and close to 1 if it is BB.
In situations where more than one locus is being sampled, and where the sequence of a second (paralogous) locus matches one of the two allelic states of the SNV in the segregating locus, the clusters are skewed towards the allele sharing the common nucleotide (Figure 2B, C and 2D). In Figure 2B, locus 1 is heterozygous for A and B SNVs in both parents and hence produces AA, AB, or BB progeny, whereas the second paralogous locus is fixed for the B SNV in both parents and progeny. This results in all three clusters being skewed to the right due to the higher dosage of SNV B. Figure 2D shows a scenario where the GF parent is AB and the Undine parent BB at locus 1, whereas the second locus is fixed for SNV A in both parents and progeny, which shifts clusters to the left due to higher dosage of SNV A. A similar situation is shown in Figure 2F where UN rather than GF is segregating at locus 1. For mapping of segregating loci, panels A and B indicate markers that are heterozygous in both GF and UN parents, panels C and D show markers heterozygous in only the GF parent, and panels E and F markers heterozygous in only the Undine parent. Markers shown in Figure 2E and 2F share the feature where the genotype of the two different sampled doubled haploid lines carry either the A or B SNV, but no progeny share the B/B genotype because their parents have either an A/A (GF) or A/B (UN) genotype.
Notably, 26% of the two-cluster SNV's showed skewed signal intensities in the GoldenGate assay, indicating that the two alternative sequences are not present in equal dosages. This observation is consistent with the sequence variants being detected from more than one locus, and suggests that many of the variant sequence pairs A and B appear as heterozygous alleles at one locus (A/B) but are fixed at a second locus (i.e., A/A or B/B), resulting in a ~3:1 ratio of signal intensities on the GoldenGate assay. If both parents show allelic variation at one locus but are fixed for the same allele at a second paralogous locus, then segregating progeny may show 2:2, 3:1, and 4:0 dosages, consistent with observations (Figure 2A. EBI 832, EBI 693 and EBI 635).
A second class of SNV (33%) formed only a single cluster of genotypes (data not shown). For these SNVs, both parents and all progeny had the same genotype. This is consistent with the pattern expected from fixed differences between paralogous loci (e.g., A/A at one locus and B/B at another) that do not segregate in progeny. These SNV's are not useful as genetic markers, since both parents and all progeny fall into a single "heterozygous" cluster and there is no genetic segregation of alleles. The proportion of both single cluster and skewed two-cluster SNVs (59%) should not be used as a direct estimate of the degree of paralogy due to the potential biases introduced by our SNV discovery and selection. These paralogous loci, however, do suggest extensive paralogy in the Miscanthus genome, which is corroborated by the genetic map as shown below.
Only a small minority (5 out of 1536) of the SNVs that we identified by RNAseq analysis formed more than three clusters in signal space, and could not be simply interpreted either as segregating alleles or fixed paralogous variants. The rarity of such SNV's in this analysis suggests that a similar RNAseq-based protocol could be useful in SNP discovery from other Miscanthus populations and species lacking genomic reference sequences.
For 658 out of 1,150 genotyped Miscanthus SNVs, the GoldenGate intensities in our F1 mapping population could be grouped into two (467) or three (191) clusters of genotypes in signal space, indicating variants that are found in both homozygous and heterozygous states in the population. We interpreted the two-cluster class of SNV's as segregating SNPs that are heterozygous in one parent and homozygous in the other, with progeny of both types. Similarly, the three-cluster classes of SNVs are interpreted as SNPs that are heterozygous in both parents, allowing for homozygous offspring of two types as well as heterozygotes. The interpretation of these SNV as segregating SNPs in our cross is supported by the integration of these markers into a consistent linkage map with limited segregation distortion (below).
Corroboration of allelic and fixed differences using doubled haploid lines
To test our hypothesis that many SNV's represent fixed differences between paralogous loci, we also genotyped two M. sinensis double haploid lines and their parents. Since the doubled haploids were developed by another culture from outbred diploid parents (Glowacka, unpublished observations), we had two expectations.
First, for the SNV's that are inferred to be biallelic SNPs in our F1 cross, we expect that some of them will correspond to heterozygous loci in other M. sinensis accessions, including the outbred parents of the doubled haploid lines. If these SNV's are bona fide allelic variants, however, then the doubled haploids should be homozygous for all such variants. Figure 2G shows the segregation of alleles in the GoldenGate assay. In situations where two or three clusters are observed in the GoldenGate, consistent with a biallelic SNP, the double haploids are either A/A or B/B homozygotes while the mapping population has all three allelic states, as expected.
Second, for SNV's that are inferred to be fixed differences between paralogs, both variant states should be observed in the doubled haploids as well as their parents. This is observed as a single AB cluster on the GoldenGate array (Figure 2G).
Genotyping summary
Taken together, our analyses of the F1 mapping population and the two doubled haploid lines show that we can distinguish segregating allelic variants at a single locus from fixed differences between paralogs, even in the face of extensive gene duplication. These data suggest that many Miscanthus genes have a closely related paralog that cannot be easily differentiated in the short read transcript data, but which assort independently. Using segregation patterns from a high density of genetic markers a linkage map can be constructed.
SSR primers from sugarcane identify allelic and paralogous polymorphism in Miscanthus
Since Saccharum (sugarcane) is a close relative of Miscanthus, we reasoned that primer pairs that amplify simple sequence repeats in Saccharum would also be likely to amplify polymorphic SSRs in Miscanthus[36]. Sixty-eight percent of the 2,640 SSRs primer pairs mined from sugarcane ESTs produced amplicons when tested with Miscanthus. Only 51% of the 2,628 SSR primer pairs derived from Saccharum genomic sequences produced amplicons with Miscanthus. Of these, 188 EST- and 237 genome-derived primers generated polymorphic amplicons between the two parental genotypes. Primers that produced non-specific amplicons were excluded. We genotyped the F1 mapping population using 107 primers pairs (29 and 78 primers from EST and intergenic sequences, respectively) out of 425 polymorphic primers. One hundred and seven primers produced 20 marker configurations (Additional file 4: Table S4, Additional file 5: Figure S2). Among them, 69 primers follow disomic marker configurations but 38 primers (35.5%) do not fit disomic configurations, producing more than 3 amplicons in one or both parents (Additional file 4: Table S4). One hundred and seven primers produced a total of 301 amplicons and among them, 210 were polymorphic between two parental genotypes and segregated in progeny populations. One hundred ninety three amplicons out of 210 were actually mapped (Table 1 Additional file 9: Table S5).
An integrated linkage map for M. sinensis
Using the 868 segregating markers defined above, we constructed an integrated linkage map for M. sinensis using JoinMap 4.1. We took advantage of a newly implemented multipoint maximum likelihood model for constructing a map from an F1 cross of two outbred parents, using the Haldane mapping function [30]. In contrast to a pseudo-testcross approach, which utilizes markers that are heterozygous in one parent but homozygous in the other, the new method can also incorporate markers that are heterozygous in both parents. While pseudo-testcross based analysis results in separate maps for each parent, the combined approach allows direct integration into a single map of crossovers that occur in either or both parents by using the markers that are heterozygous in both parents as anchors.
Only 48 out of 868 markers show segregation distortion (p < = 0.005 using the chi-squared goodness of fit test). Of these, 35 were highly distorted markers, and not included in the initial framework map. These highly distorted markers include 21 with a segregation ratio greater than 2:1 when they should have been 1:1, and 14 with a segregation ratio greater than 10:1 when they should have been 3:1 (Table 1). Of the remaining 833 segregating markers (641 SNPs and 192 SSRs), 829 were incorporated into 19 major linkage groups using a minimum logarithm of odds (LOD) score of 11 and maximum recombination frequency of 0.4. The JoinMap 4.1 maximum likelihood method was used to calculate the map order of the framework map. Four SNP markers that were placed more than 40 cM away from the nearest marker on the linkage group were excluded and marker order for those linkage groups was recalculated. An attempt was then made to replace the 35 highly distorted markers on the ML map, keeping the marker order of the framework map constant and using the same map calculation parameters as before. Seventeen of the 35 highly distorted markers were incorporated. This map with 846 markers is shown in Figure 3.
Independent regression maps for each parent were also constructed to corroborate the robustness of marker order (Additional file 9: Table S5, Additional file 10: Figure S3 and Additional file 11: Figure S4). The total length of the 19 linkage groups on the ML map is 1782 cM, with an average intermarker spacing of 2.7 cM (excluding markers with identical map positions). Thus we expect that the missing map length from the telomeric ends of the linkage groups [37, 38] accounts for roughly 2 × 19 × 2.7 cM = 102 cM, for a total estimated map length of 1884 cM. In the Grosse Fontaine map, 94% of the markers lie within 10 cM of each other, while in the Undine map only 90% meet this criterion. In the integrated map, 97% of the mapped markers lie within 10 cM of another marker, attesting to the dense coverage of the map.
Disomic inheritance and limited segregation distortion
Transmission of each linkage group is consistent with pure disomic inheritance in M. sinensi s (i.e., complete preferential pairing of homologs), with no evidence for tetrasomic inheritance (i.e., pairing and recombination between homoeologs). Furthermore, very few markers show segregation distortion (48 out of 868; p < = 0.005 using the chi-squared goodness of fit test), and those that do are concentrated on Ms2, Ms3, Ms4, Ms12, and Ms13. Overall there is more segregation distortion in Undine. Twenty of the 24 distorted UN markers lie on Ms4 (Additional file 12: Table S6). Potential causes of segregation distortion include the following three possibilities: (1) Failure to complement deleterious recessive alleles heterozygous in both GF and UN parents that reduce viability of F1 progeny; (2) Interactions between genomes, e.g., meiotic drive in F1 gametophytes, gametophytic competition or pollen-pistil interactions like self-incompatibility; (3) Proximity to areas of suppressed recombination like centromeres and nucleolus organizer regions. The design of our cross makes it difficult to differentiate among these possible explanations.
Whole genome duplication with extensive conserved synteny to sorghum
Since our Miscanthus markers were derived from (1) transcribed regions with reduced sequence variation (SNPs) and (2) sequences from conserved ESTs and intergenic regions (SSRs), many of them could be unambiguously assigned to orthologous (i.e., evolutionarily homologous) positions on the Sorghum bicolor genome sequence by straightforward sequence alignment. Out of 653 SNP loci on the integrated Miscanthus map, 618 could be placed on the sorghum genome. Similarly, out of 193 SSRs on the map, 126 could be placed on the sorghum genome.
A simple dot plot (Figure 4A) strikingly reveals complete whole-genome duplication in M. sinensis relative to sorghum, with most chromosomes showing near perfect colinearity at the scale of our genetic map. After recognizing this extensive synteny, we oriented and renumbered the Miscanthus linkage groups to emphasize this correspondence between Miscanthus and sorghum. Every sorghum chromosome exhibits nearly complete marker synteny with a pair of Miscanthus linkage groups. Eight sorghum chromosomes are completely duplicated, showing a 1:2 correspondence to Miscanthus linkage groups. We infer the whole genome nature of the duplication by the density of colinear markers in sorghum euchromatin, where our gene-biased markers are found. The only evident rearrangement in these chromosomes is a small inversion near the top of Sb4 relative to Ms8 and Ms9. Since Ms8 and Ms9 share the same ordering in this region, this inversion either occurred in the sorghum lineage, or in the stem lineage of Miscanthus prior to the tetraploidization event, or is an error in the sorghum genetic map or sequence assembly.
The remaining two sorghum chromosomes, Sb4 and Sb7, are also duplicated over their entire euchromatic spans, but show a more complex pattern of synteny with Miscanthus. Ms8 is an intact copy of Sb4, and Ms13 is an intact copy of Sb7. The second copies of these two sorghum chromosomes, however, are fused into the single linkage group Ms7. Ms7 then appears as a copy of Sb7 inserted into the centromeric region of Sb4 (Figure 4B). This single fusion explains the odd base chromosome number of Miscanthus. By following the relative orientations of sorghum chromosome arms in Miscanthus, we see that this fusion has the characteristic form of a type of insertion previously observed in other grasses [39]. Since all Miscanthus species have the same base chromosome number, this fusion presumably occurred in the lineage leading to the last common Miscanthus ancestor.
Mapping C4-PPDK loci in Miscanthus
C4 photosynthesis in the Panicoideae (including maize, Saccharinae, millet, switchgrass, Miscanthus) is facilitated by a C4-specific form of the pyruvate, phosphate dikinase enzyme (C4-PPDK). Physiological and molecular evidence suggest that altered expression of C4-PPDK may contribute to cold tolerant C4 photosynthesis in Miscanthus x giganteus[40, 41]. The closely related Sorghum bicolor has a single C4-PPDK gene located on chromosome 9 [5]. Sequencing of cloned cDNAs from triploid Miscanthus x giganteus identified five distinct transcripts, including one apparent pseudogene [40], which suggests even greater genetic complexity than three homoeologous C4-PPDK alleles. Based on our observation of whole genome duplication, we reasoned that M. sinensis might have an unlinked pair of paralogous C4-PPDK genes. Based on synteny considerations, we expected that these C4-PPDK's would lie on Miscanthus LG's, 16 and 17, both of which are syntenic to Sorghum 9.
To look for C4-PPDK paralogs in M. sinensis, and to identify the genetic map position or positions of these gene(s), we designed primers to amplify the third intron of the gene based on the single previously known Miscanthus C4-PPDK. Two amplicons were observed in both parents, and both were cloned and sequenced. Cladistic methods identified two distinct paralogs of C4-PPDK (Figure 5A), which we named C4-PPDK1 and C4-PPDK2.
By aligning partial sequences of C4-PPDK in M. sinensis with the homologous sequence in S. bicolor, S. officinarum, and Z. mays, we measured the sequence divergence and phylogenetic relationship between the two Miscanthus homoeologs and homologous sequences in related outgroups (Figure 5A). The divergences between Ms C4-PPDK1 and sorghum and sugarcane C4-PPDK are comparable, suggesting that the origin of Miscanthus could be contemporaneous with the split between sorghum and sugarcane. Ms C4-PPDK2 branches outside of the Ms C4-PPDK1/sorghum/sugarcane clade, which could indicate that the other parent involved in Miscanthus tetraploidy was more divergent. These inferences, however, are weak due to the limited sequence length used in the analysis.
To map the two evident paralogs of C4-PPDK, we designed markers for each gene based on observed intronic sequence variation. Marker EBI 847 is a Cleaved Amplified Polymorphic Sequence (CAPS) marker designed to detect the SNV at position 397 in Additional file 6: Figure S6A, and marker EBI 848 is a sequence length polymorphism (SLP) marker that detects two indels between 1354 bp and 1388 bp (Additional file 6: Figure S6B). Both markers show a 1:1 segregation ratio (Additional file 13: Table S7). EBI 847 maps to Miscanthus linkage group Ms16 at 36.8 cM on the integrated map (41.2 cM on the GF maximum likelihood map) while EBI 848 is placed on linkage group Ms17 at 19.2 cM on the integrated map (20.1 cM on the UN maximum likelihood map). Miscanthus linkage groups 16 (C4-PPDK1) and 17 (C4-PPDK2) are the homoeologs of S. bicolor chromosome 9, which contain sorghum C4-PPDK (Figure 5D). This demonstrates both the utility of our genetic map and sorghum synteny for mapping genes in Miscanthus. This is the first documentation of the presence of two paralogous (indeed, homoeologous) C4-PPDKs in Miscanthus. The presence of two paralogs provides an opportunity for regulatory divergence and could contribute to the ability of Miscanthus to perform cold tolerate photosynthesis.