Genome and transcriptome sequencing identifies breeding targets in the orphan crop tef (Eragrostis tef)

Background Tef (Eragrostis tef), an indigenous cereal critical to food security in the Horn of Africa, is rich in minerals and protein, resistant to many biotic and abiotic stresses and safe for diabetics as well as sufferers of immune reactions to wheat gluten. We present the genome of tef, the first species in the grass subfamily Chloridoideae and the first allotetraploid assembled de novo. We sequenced the tef genome for marker-assisted breeding, to shed light on the molecular mechanisms conferring tef’s desirable nutritional and agronomic properties, and to make its genome publicly available as a community resource. Results The draft genome contains 672 Mbp representing 87% of the genome size estimated from flow cytometry. We also sequenced two transcriptomes, one from a normalized RNA library and another from unnormalized RNASeq data. The normalized RNA library revealed around 38000 transcripts that were then annotated by the SwissProt group. The CoGe comparative genomics platform was used to compare the tef genome to other genomes, notably sorghum. Scaffolds comprising approximately half of the genome size were ordered by syntenic alignment to sorghum producing tef pseudo-chromosomes, which were sorted into A and B genomes as well as compared to the genetic map of tef. The draft genome was used to identify novel SSR markers, investigate target genes for abiotic stress resistance studies, and understand the evolution of the prolamin family of proteins that are responsible for the immune response to gluten. Conclusions It is highly plausible that breeding targets previously identified in other cereal crops will also be valuable breeding targets in tef. The draft genome and transcriptome will be of great use for identifying these targets for genetic improvement of this orphan crop that is vital for feeding 50 million people in the Horn of Africa. 
Electronic supplementary material The online version of this article (doi:10.1186/1471-2164-15-581) contains supplementary material, which is available to authorized users.


Supplementary Notes
[...] M13 primer (5′-CACGACGTTGTAAAACGAC), 0.2 mM of each dNTP, and 0.5 U of GoTaq polymerase (Promega, Dübendorf, Switzerland). Thermocycling started with a denaturation step of 2 min at 94 °C, followed by 45 cycles of 20 s at 94 °C, 20 s at 50 °C, and 1 min at 72 °C, and ended with a final extension step of 10 min at 72 °C. After PCR, samples were denatured by adding 30 μL formamide stained with bromophenol blue. Finally, 0.5 μL of the PCR products were loaded on 7% polyacrylamide gels. Gel pictures were analyzed using the program GelBuddy [5]. Eighteen other tef ecotypes as well as four other Eragrostis species (E. curvula, E. minor, E. pilosa and E. trichodes) were amplified with the same M13-tailed forward and reverse primers used for the marker amplification, under the same PCR conditions, and the amplicons were Sanger sequenced with the M13 primers by Microsynth (Microsynth AG, Balgach, Switzerland).
Sanger sequencing of the entire 10 kbp region, followed by alignment of the scaffold to the amplicon, resulted in 9,707 aligned nucleotides between CNLTs316 and CNLTs472 on scaffold2429 with 99% sequence identity. Sequencing of the other fragment, of length 8,369 bp between CNLTs77 and CNLTs322 on scaffold8420, resulted in an alignment with 97% sequence identity between the tef scaffold and the corresponding Sanger sequence. The number of N's (unknown bases) in the scaffolds was often poorly estimated.

Supplementary Note 2. Comparison of tef genome to other grasses. The tef genome and Maker gene predictions were uploaded to CoGe [6,7], a platform containing many draft and whole genomes and providing numerous tools for genome alignment, comparison and visualization [8][9][10]. The SynMap function of CoGe aligns two genomes by using sequence similarity as well as syntenic information.
First, putative genes or regions of homology are found between two genomes, then collinear sets of genes are used to infer synteny and syntenic pairs of genes are assigned. These can be used to generate dotplots of homology as in Figure 2 and Supplementary Figure S3. In addition, a host of integrated tools can then be used for genome analysis and visualization. SynMap was run with default settings including the LastZ option for Blastz [11] as well as the following parameters: Minimum number of aligned pairs=5 or 3, Maximum distance between two matches=20, Tandem duplication distance=10.
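The chaining step can be illustrated with a minimal sketch. It is a simplified stand-in for DAGChainer, not its actual scoring scheme: it assumes homologous pairs are given as gene-order indices, treats "Maximum distance between two matches" as a distance in genes, handles only chains in the forward orientation, and uses the hypothetical function name `collinear_chains`.

```python
from typing import List, Tuple

Pair = Tuple[int, int]  # (gene index in genome A, gene index in genome B)


def collinear_chains(pairs: List[Pair],
                     max_dist: int = 20,
                     min_pairs: int = 5) -> List[List[Pair]]:
    """Return forward-oriented collinear chains of homologous gene pairs.

    Two pairs may be linked if the second lies after the first in both
    genomes and no more than max_dist genes away in either genome.
    Chains with fewer than min_pairs members are discarded.
    """
    pts = sorted(set(pairs))
    n = len(pts)
    best_len = [1] * n   # length of the longest chain ending at each pair
    prev = [-1] * n      # back-pointer to the previous pair in that chain

    # Dynamic programming over pairs sorted by position in genome A.
    for k in range(n):
        a_k, b_k = pts[k]
        for m in range(k):
            a_m, b_m = pts[m]
            if 0 < a_k - a_m <= max_dist and 0 < b_k - b_m <= max_dist:
                if best_len[m] + 1 > best_len[k]:
                    best_len[k] = best_len[m] + 1
                    prev[k] = m

    # Trace back chains, longest first, using each pair at most once.
    used = [False] * n
    chains: List[List[Pair]] = []
    for k in sorted(range(n), key=lambda i: -best_len[i]):
        if used[k]:
            continue
        chain = []
        cur = k
        while cur != -1 and not used[cur]:
            chain.append(pts[cur])
            used[cur] = True
            cur = prev[cur]
        chain.reverse()
        if len(chain) >= min_pairs:
            chains.append(chain)
    return chains
```

Lowering `min_pairs` from 5 to 3 in this sketch keeps shorter collinear runs, which corresponds to the more permissive of the two settings listed above.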
SynMap first finds regions of high homology using BLAST or Last, a much faster variant of BLAST [12]. SynMap then identifies collinear putative homologous sequences in the two genomes using DAGChainer [13]. The SynMap function was first used to align the tef scaffolds with the Sorghum bicolor genome. The tef scaffolds ordered according to the sorghum genome were then downloaded as a list and their sequences joined to form artificial tef "pseudo-chromosomes". These tef pseudo-chromosomes were used to orient the linkage groups of Zeid [1] in Figure 2 and Supplementary Table S17. Circos was used to generate the plot [14]. The SynMap function of CoGe was used to do pairwise comparisons of the following genomes: Eragrostis tef (CoGe id 38364; current work), Sorghum bicolor (CoGe id 38364; [15]), Zea mays (CoGe id 333; [16]), Oryza sativa japonica (CoGe id 3; [17]) and Setaria italica (CoGe id 32546; with CNS PL2.0l v2.1,id2240 [18]) using the default settings.
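Joining ordered scaffolds into a pseudo-chromosome can be sketched as follows. This is only an illustration under stated assumptions: the spacer of 100 N's between scaffolds, the function name `build_pseudochromosome` and the returned offset table are hypothetical and not taken from the original workflow, and scaffold orientation is ignored.

```python
from typing import Dict, List, Tuple


def build_pseudochromosome(ordered_ids: List[str],
                           scaffolds: Dict[str, str],
                           gap_len: int = 100) -> Tuple[str, Dict[str, int]]:
    """Concatenate scaffolds in the given order, separated by runs of N.

    Returns the joined sequence and the 0-based start offset of each
    scaffold within it, so that features can be mapped back later.
    """
    gap = "N" * gap_len          # assumed spacer; the true gap size is unknown
    parts: List[str] = []
    offsets: Dict[str, int] = {}
    position = 0
    for i, scaffold_id in enumerate(ordered_ids):
        if i > 0:
            parts.append(gap)
            position += gap_len
        seq = scaffolds[scaffold_id]
        offsets[scaffold_id] = position
        parts.append(seq)
        position += len(seq)
    return "".join(parts), offsets
```

Keeping the offsets makes it possible to translate marker positions on individual scaffolds into pseudo-chromosome coordinates, for example when comparing against a genetic map.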
CodeML of PAML [19] is integrated into CoGe and can be used to estimate the number of synonymous and nonsynonymous substitutions per site (Ks and Ka, respectively) for the complete set of orthologous genes between two genomes. The mode(s) of the distribution of Ks values between two genomes represents either a speciation or a genome duplication event. The ages of the modes of the peaks were estimated using a molecular clock rate of 6.5 × 10⁻⁹ synonymous substitutions per synonymous site per year [20]. These estimates can be found in Supplementary Table S16. Additionally, the Maker gene predictions were uploaded to CoGe and can there be visualized and compared to other grasses as shown for the SAL1 gene in Figure 4B. The CoGe URL for this analysis is http://genomevolution.org/r/bsyp.
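The conversion from a Ks mode to an age can be sketched as below. The text gives only the clock rate; the relation T = Ks / (2r), in which the factor of two accounts for substitutions accumulating along both diverging lineages, is the commonly used form and is assumed here rather than confirmed by the source. The function name `ks_to_age_years` is hypothetical.

```python
CLOCK_RATE = 6.5e-9  # synonymous substitutions per synonymous site per year [20]


def ks_to_age_years(ks_mode: float, rate: float = CLOCK_RATE) -> float:
    """Estimate the age (in years) of a Ks peak, assuming T = Ks / (2 * rate)."""
    if ks_mode < 0 or rate <= 0:
        raise ValueError("ks_mode must be non-negative and rate positive")
    return ks_mode / (2.0 * rate)


if __name__ == "__main__":
    # Illustrative Ks value only; not a value reported in the paper.
    example_ks = 0.13
    print(f"Ks = {example_ks} -> {ks_to_age_years(example_ks) / 1e6:.1f} million years")
```

Under this assumption, an illustrative Ks mode of 0.13 corresponds to roughly 10 million years.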

Supplementary Note 3. Annotation.
Annotation of the proteins predicted from the transcriptome was performed by the Praise (PRotein Automated annotatIon SystEm) UniProtKB/Swiss-Prot internal automated annotation platform [21]. Praise is an annotation templating system driven by sequence analysis results via manually curated context sensitive annotation templates. Templates (called annotation "rules", UniRules) are manually curated context sensitive annotation fragments and represent a language that is interpreted by the Praise template engine. It propagates detailed functional annotation (e.g. active site positions) derived from Prosite and HAMAP motif matches, resolves redundant or conflicting predictions (e.g. for transmembrane) and aggregates all generated annotations into UniProtKB/Swiss-Prot format entries. The Praise platform annotates fewer proteins than systems using a simple BLAST or InterPro-associated GO terms but generates high quality and more numerous annotation "elements" (Supplementary Table S21). All different annotation types were pooled by entry. When excluding non-informative matches (against hypothetical proteins), the percentage of proteins annotated drops to 57%.
Supplementary Note 4. Abiotic stress. The sequences of 27 genes implicated in abiotic stress in various grasses were downloaded from NCBI and used to find the protein sequence of the Sorghum bicolor homolog (Phytozome, version 79) using blastx. Then tblastn was used to search for each sorghum abiotic stress protein sequence in the tef genome, in the tef transcriptomes (core, extended, 454Isotigs, drought (TrinityGNY11and2), waterlogging (TrinityGNY12and3) and control (TrinityGNY10and1)), and in other grasses, using an e-value of 1e-05. The number of copies found with length greater than or equal to 70% of the length of the query sequence was recorded.
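The copy-counting rule can be sketched as a filter over tabular BLAST results. This is an illustration, not the original script: it assumes the standard 12-column tabular output (BLAST+ `-outfmt 6`), uses the alignment length column as the hit length, takes query lengths from a separate dictionary, and does not merge multiple alignments at one locus. The function name `count_copies` and the identifiers in the usage note are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable


def count_copies(blast_lines: Iterable[str],
                 query_lengths: Dict[str, int],
                 min_frac: float = 0.70,
                 max_evalue: float = 1e-5) -> Counter:
    """Count hits per query whose alignment covers >= min_frac of the query.

    Columns assumed: qseqid sseqid pident length mismatch gapopen
    qstart qend sstart send evalue bitscore (tab-separated).
    """
    counts: Counter = Counter()
    for line in blast_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank and comment lines
        fields = line.rstrip("\n").split("\t")
        query_id = fields[0]
        aln_len = int(fields[3])      # alignment length (amino acids for tblastn)
        evalue = float(fields[10])
        long_enough = aln_len >= min_frac * query_lengths[query_id]
        if evalue <= max_evalue and long_enough:
            counts[query_id] += 1
    return counts
```

For example, with a 400-residue query named `geneA`, an alignment of 300 residues at e-value 1e-50 would be counted, while a 200-residue alignment or a hit at e-value 1e-3 would not.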

Supplementary Figure S8. Tandem duplication of SAL1 gene confirmed by Sanger sequencing.
A) Structure of the SAL1 gene in tef and sorghum. B) Eight primer sets were designed for PCR amplification and Sanger sequencing of the region on scaffold6855 where the tandem duplication of the SAL1 gene was found. C) Sequences of the primers used to amplify the region of the SAL1 tandem triplication on scaffold6855. D) The piece of scaffold6855 containing the tandem triplication. Sanger sequencing confirmed the genomic sequence containing three SAL1 genes in a tandem arrangement. Despite repeated attempts, one region (exons 2-4 of the first copy, in blue) could not be confirmed because PCR amplification failed. Exons 1 and 5-8 were found. Exons are shown in uppercase and bolded.

Supplementary Table S1. Summary of genome sequencing data for the tef genome.

Supplementary Table S4. Percentage of genes and bases found in tef transcriptome and genome.
Transcripts were compared using blastn with an e-value cutoff of 1e-10. The number of genes with a homolog found and the percentage of query bases in genes with a homolog that were aligned are reported.

[...] [6,7,19] and compared to the estimates generated from the rht1 and sd1 genes [44]. In CoGe, each analysis is assigned a unique URL so that the workflow can be recalled later. The complete URL for these analyses is http://genomevolution.org followed by the directory name given in the table.

Supplementary Table S22. Abiotic stress genes and their numbers in grass genomes.
The number of copies of genes known from the literature to be implicated in abiotic stress was counted in the genomes of tef (Eragrostis tef), sorghum (Sorghum bicolor), rice (Oryza sativa), Brachypodium distachyon and foxtail millet (Setaria italica). For each genome, the number of matches with a length greater than or equal to 70% of the length of the query sequence is shown. Sources for the indicated genes in the different species are B. distachyon [66], S. italica [28], Z. mays [16], S. bicolor [15], O. sativa [67], H. vulgare [68], S. cereale [68], T. aestivum [68], and E. tef (the current work).

Supplementary Table S23. Presence of wheat, barley and rye gluten epitopes and their amounts in grass genomes.
Epitopes of lengths 20, 16, 13, 12 and 11 from wheat, barley and rye [22] were searched for in several grass genomes. The epitopes were found only in wheat, barley and rye. No epitope was found in tef, sorghum, Setaria, Brachypodium, rice or maize. Sources for B. distachyon, S. italica, Z. mays, and S. bicolor were Phytozome; for O. sativa, IRGSP; for H. vulgare, MIPS; for S. cereale and T. aestivum, NCBI; and for E. tef, the current work. Columns with gray shading show the three species with gluten reaction, while the other columns indicate the six species with no gluten reaction. Sources for the indicated genes in the different species are B. distachyon [66], S. italica [28], Z. mays [16], S. bicolor [15], O. sativa [67], H. vulgare [68], S. cereale [68], T. aestivum [68], and E. tef (the current work).
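A presence/absence search of this kind can be sketched as exact peptide matching. The source does not describe how the search was implemented, so this is an assumption: the sketch looks for each epitope as an exact substring of predicted protein sequences, and the function name `find_epitopes` is hypothetical. Real epitope sequences from [22] would have to be supplied by the user.

```python
from typing import Dict, List


def find_epitopes(epitopes: Dict[str, str],
                  proteins: Dict[str, str]) -> Dict[str, List[str]]:
    """Map each epitope name to the IDs of proteins containing it exactly.

    Matching is case-insensitive and requires the full peptide to occur
    as a contiguous substring of the protein sequence.
    """
    upper_proteins = {pid: seq.upper() for pid, seq in proteins.items()}
    hits: Dict[str, List[str]] = {}
    for name, peptide in epitopes.items():
        pep = peptide.upper()
        hits[name] = sorted(pid for pid, seq in upper_proteins.items()
                            if pep in seq)
    return hits


def count_hits(hits: Dict[str, List[str]]) -> int:
    """Total number of (epitope, protein) matches in a proteome."""
    return sum(len(protein_ids) for protein_ids in hits.values())
```

A species would then be reported as lacking the epitopes when `count_hits` returns 0 for its proteome.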