Building the sugarcane genome for biotechnology and identifying evolutionary trends
- Nathalia de Setta1, 2,
- Cláudia Barros Monteiro-Vitorello3,
- Cushla Jane Metcalfe1,
- Guilherme Marcelo Queiroga Cruz1,
- Luiz Eduardo Del Bem4,
- Renato Vicentini4,
- Fábio Tebaldi Silveira Nogueira5,
- Roberta Alvares Campos6,
- Sideny Lima Nunes6,
- Paula Cristina Gasperazzo Turrini1,
- Andreia Prata Vieira1,
- Edgar Andrés Ochoa Cruz1,
- Tatiana Caroline Silveira Corrêa1,
- Carlos Takeshi Hotta6,
- Alessandro de Mello Varani3,
- Sonia Vautrin7,
- Adilson Silva da Trindade8,
- Mariane de Mendonça Vilela4,
- Carolina Gimiliani Lembke6,
- Paloma Mieko Sato6,
- Rodrigo Fandino de Andrade6,
- Milton Yutaka Nishiyama Jr6,
- Claudio Benicio Cardoso-Silva4,
- Katia Castanho Scortecci8,
- Antônio Augusto Franco Garcia3,
- Monalisa Sampaio Carneiro9,
- Changsoo Kim10,
- Andrew H Paterson10,
- Hélène Bergès7,
- Angélique D’Hont11,
- Anete Pereira de Souza4,
- Glaucia Mendes Souza6,
- Michel Vincentz4,
- João Paulo Kitajima12 and
- Marie-Anne Van Sluys1
© de Setta et al.; licensee BioMed Central Ltd. 2014
Received: 17 December 2013
Accepted: 19 June 2014
Published: 30 June 2014
Sugarcane is the source of sugar in all tropical and subtropical countries and is becoming increasingly important for bio-based fuels. However, its large (10 Gb), polyploid, complex genome has hindered genome-based breeding efforts. Here we release the largest and most diverse set of sugarcane genome sequences to date, as part of an ongoing initiative to provide a sugarcane genomic information resource, with the ultimate goal of producing a gold standard genome.
Three hundred and seventeen chiefly euchromatic BACs were sequenced. A reference set of one thousand four hundred manually annotated protein-coding genes was generated. A small RNA collection and an RNA-seq library were used to explore expression patterns and the sRNA landscape. In the sucrose and starch metabolism pathway, 16 non-redundant enzyme-encoding genes were identified. One of the sucrose pathway genes, sucrose-6-phosphate phosphohydrolase, is duplicated in sugarcane and sorghum, but not in rice and maize. A diversity analysis of the s6pp duplication region revealed haplotype-structured sequence composition. Examination of hom(e)ologous loci indicates both sequence structural and sRNA landscape variation. A synteny analysis shows that the sugarcane genome has expanded relative to the sorghum genome, largely due to the presence of transposable elements and uncharacterized intergenic and intronic sequences.
This release of sugarcane genomic sequences will advance our understanding of sugarcane genetics and contribute to the development of molecular tools for breeding purposes and gene discovery.
Keywords: Saccharum, Bacterial artificial chromosome sequencing, Polyploidy, Genome, Genetics, Grasses
Sugarcane is an important crop worldwide, producing 80% of the world’s raw sugar, and is increasingly used for biofuel. A key goal in meeting growing demand is to improve sugarcane yield and accelerate selection for desirable traits. Genomics has been shown to be successful in genome-assisted breeding programs for selecting superior genotypes and devising more efficient breeding strategies.
Species of the Saccharum complex (sugarcane) are part of the Poaceae family and together with Sorghum, Zea and other genera comprise the Panicoideae subfamily, one of the C4 photosynthetic grass lineages (Additional file 1: Figure S1). At the end of the nineteenth century, early sugarcane breeders in Java and India carried out crosses between S. officinarum and S. spontaneum in order to introduce vigor and resistance genes from wild S. spontaneum, while quickly recovering the high sugar content of S. officinarum cultivars. Modern sugarcane cultivars are derived from those early interspecific genotypes, followed by several cycles of intercrossing and selection. They are polyploid aneuploid hybrids with unequal contributions from the S. officinarum (80–90%) and S. spontaneum (10–20%) parental genomes and a small percentage of recombinant chromosomes [4, 5]. Sugarcane hybrids have ploidy levels of 10 or more and have a much larger total genome size (R570 cultivar, 10,000 Mb and 2n = 115) than that of maize (5500 Mb, 2n = 20), sorghum (1600 Mb, 2n = 20) or rice (860 Mb, 2n = 24), reflecting the high polyploidy level of sugarcane cultivars.
The sorghum genome, the closest related fully sequenced and annotated genome to sugarcane, is widely recognized as the reference genome for comparative analysis. The origin of modern sugarcane cultivars raises issues not only related to the extent and nature of the divergence of the sugarcane and sorghum genomes, but also about the relationships (meiosis and expression dosage) among hom(e)ologous loci. Equally importantly, deciphering the sugarcane genome is a major goal for improving genome-wide assisted selection breeding opportunities worldwide. However, the hybrid polyploid nature of modern cultivars limits breeders' ability to understand how allelic variation and dosage link genotype to phenotype. The present study was undertaken within the framework of a larger sequencing initiative to generate a comprehensive dataset, providing information on sugarcane genome structure and function as a basis for future functional genetic studies.
BAC sequencing and repeat annotation
Three hundred and seventeen sugarcane bacterial artificial chromosome (BAC) inserts from an R570 cultivar genomic library were sequenced. A total of 189 BACs were selected using probes homologous to 84 previously described expressed genes. Seventy-eight BACs were selected using probes based on five superfamilies of transcriptionally active transposable elements (TEs). The remaining 50 BACs were selected in a previous study using RFLP markers from nine sugarcane linkage groups (Additional file 2: Table S1).
In total, 36.58 million bases were sequenced with an average of 361 bp per read, 25,000 reads per BAC, and 92× coverage (Additional file 2: Table S1). This represents 3.7% of the monoploid complement, based on the estimate of a 10 Gb genome size for the decaploid hybrid cultivar R570. Two hundred and five BACs were assembled into one contig each and the remaining 112 BACs were assembled into an average of 3.15 contigs. Although not all BACs were assembled into a single contig, all have a proposed scaffold and each is provided as a single FASTA file. To date, most of the gaps have repetitive sequences at the ends. The minimum, maximum, median, and average sizes of BAC assemblies were 12.25, 259.2, 115.38 and 112.34 Kb, respectively. A BLASTn search indicates that none of the sequences were derived from chloroplast or mitochondrial genomes.
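The 3.7% figure follows from dividing the sequenced bases by the size of one monoploid genome. The sketch below reproduces that arithmetic, assuming the monoploid size is simply the 10 Gb total divided by a ploidy of 10 and that BAC sequences do not overlap; the function name is illustrative only.

```python
def monoploid_fraction(sequenced_mb, total_genome_mb, ploidy):
    """Return the sequenced share of one monoploid genome, in percent."""
    # Assumption: monoploid size = total genome size / ploidy level.
    monoploid_mb = total_genome_mb / ploidy
    return 100.0 * sequenced_mb / monoploid_mb


# Values taken from the text: 36.58 Mb sequenced, 10 Gb genome, decaploid.
pct = monoploid_fraction(36.58, 10_000, 10)
print(f"{pct:.1f}% of the monoploid complement")
```

Because hom(e)ologous BACs cover the same monoploid region more than once, the non-redundant fraction is probably somewhat lower than this simple ratio.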
The repetitive content was estimated from BAC assemblies using the Repbase database and a curated sugarcane Long Terminal Repeat (LTR) retrotransposon database. Fifty percent of the BAC sequences are repetitive, 49.4% transposable elements (TEs) and 0.43% satellite repeats (Additional file 3: Table S2 and Additional file 4: Table S3). Of the TEs, LTR retrotransposons are the most abundant (40.86%), followed by DNA transposons (7.93%), and non-LTR retrotransposons. TE content of individual BACs is highly heterogeneous, varying from zero (a ribosomal DNA BAC) to 98.7%. Miniature inverted-repeat TEs (MITEs) represent 3% of the sequences. Of 3,663 curated MITEs, the most abundant types are Tourist (63.8%), followed by Stowaway (27.9%), hAT (5.6%), MULE (1.7%), unclassified (0.6%) and CACTA (0.4%). Sugarcane has a ratio of Gypsy-Ty3 to Copia-Ty1 elements (1.3 to 1) more closely resembling that of maize (1.6 to 1) than that of sorghum (3.7 to 1) or rice (4.9 to 1), suggesting a closer correlation with genome size than with phylogeny.
Gene annotation and CDS validation
BAC assemblies masked for repetitive sequences were analyzed by a combination of de novo gene prediction software programs and searches against databases to identify non-TE coding genes. For 14 BACs, no protein-coding genes were predicted. A total of 1,400 coding regions were predicted and annotated. An average of 3.8 CDSs (Coding DNA Sequences) were found per 100 Kb, representing one protein-coding gene per 26.12 Kb. RNA-seq data from the cultivar RB92-5345 and the sugarcane assembled EST sequences (SASs) from the SUCEST (sugarcane EST) Project were used to validate CDSs. All CDSs mapped against at least one SAS and 1,218 (87% of the total) mapped against at least one pair of RNA-seq reads. The lack of RNA-seq support for the remaining 182 CDSs may reflect no detectable expression of these genes under the experimental conditions used, high sequence divergence between the two cultivars (R570 and RB92-5345) at these specific loci, and/or false positive gene annotation.
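The gene density and validation figures above can be checked with a few lines of arithmetic. This is a minimal sketch using only numbers quoted in the text; the variable names are illustrative.

```python
# Values quoted in the text.
TOTAL_SEQUENCED_KB = 36_580   # 36.58 Mb of BAC sequence
N_CDS = 1_400                 # annotated protein-coding genes
N_RNASEQ_SUPPORTED = 1_218    # CDSs with at least one mapped read pair

kb_per_gene = TOTAL_SEQUENCED_KB / N_CDS            # average spacing between genes
cds_per_100kb = 100 / kb_per_gene                   # gene density
pct_supported = 100 * N_RNASEQ_SUPPORTED / N_CDS    # RNA-seq validation rate
n_unsupported = N_CDS - N_RNASEQ_SUPPORTED          # CDSs lacking RNA-seq reads

print(f"one gene per {kb_per_gene:.1f} Kb, {cds_per_100kb:.1f} CDSs per 100 Kb")
print(f"{pct_supported:.0f}% RNA-seq supported, {n_unsupported} CDSs unsupported")
```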
Using Blast2GO, GO (gene ontology) terms were assigned to 1,081 of the 1,400 predicted protein-coding sequences (77.8%). A total of 4,730 GO functional terms were assigned to the 1,081 sequences. GO terms were placed into three broad categories: Biological Process, 1,884 (39.8%); Molecular Function, 1,502 (31.7%); and Cellular Component, 1,344 (28.5%). The most abundant terms in the Biological Process category include cellular process and metabolic process and, in the Molecular Function category, catalytic activity and binding. In the Cellular Component category, the most common terms were organelle, membrane and cell (Additional file 5: Figure S2). BLASTp searches against the NCBI nr database confirmed that most of the sugarcane protein sequences are most similar to those of sorghum (Additional file 6: Figure S3). The top BLAST match for 908 protein sequences was to sorghum sequences.
CDSs were broadly distributed amongst the 17 functional categories described by the SUCEST Project (Additional file 7: Figure S4). Transcriptionally active genes (as determined by SUCEST) were evaluated by a WU-blast search using SASs as queries against the BACs. A total of 16.5% of the SASs matched the unmasked BACs, i.e. for 83.5% of the SASs there was no match to the unmasked BACs. In the masked BACs, there were matches for 13% of the SASs. These percentages may represent an overestimation due to multiple matches to hom(e)ologous or paralogous genes. Annotated TEs were homologous to 3.5% of SASs, suggesting that 3.5% of the transcriptome is derived from TEs. This estimate is close to a previous estimate of 2.3%.
Metabolic pathway genes
Mapping of annotated CDSs and SASs using the KEGG Mapper tool at MG-RAST provided a global view of known sugarcane metabolic pathways. The comparison between BAC CDS and SAS mapping identified genes not previously reported in the sugarcane transcriptome. EC numbers were assigned to 803 predicted enzyme-coding genes distributed amongst various metabolic pathways, including those involved in carbohydrate, lipid and amino acid metabolism (Additional file 8: Figure S5). Most of the predicted enzymes (594) were identified in the SAS collection only, 122 were common to the SAS and BAC sequences and 66 were identified from BAC sequence alone. Genes predicted from BAC sequence alone included enzyme-coding genes from the carotenoid, amino acid, diterpenoid and other fatty acid biosynthesis pathways (Additional file 9: Table S4).
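The three-way split described above (SAS only, shared, BAC only) amounts to a set comparison of EC numbers. The following is a minimal sketch of that comparison; the function name is illustrative, and the assignment of EC numbers to each collection is toy input, not data from the study.

```python
def partition_ec(sas_ec, bac_ec):
    """Split EC numbers into SAS-only, shared, and BAC-only sets."""
    sas_ec, bac_ec = set(sas_ec), set(bac_ec)
    return {
        "sas_only": sas_ec - bac_ec,
        "shared": sas_ec & bac_ec,
        "bac_only": bac_ec - sas_ec,
    }


# Toy input (hypothetical membership, for illustration only).
sas = {"2.7.7.27", "2.4.1.13", "2.4.1.14"}
bac = {"2.7.7.27", "3.1.3.24"}

groups = partition_ec(sas, bac)
for name, members in groups.items():
    print(name, sorted(members))
```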
Three loci were identified for the glucose-1-phosphate adenylyltransferase gene, which encodes the enzyme that catalyzes the conversion of α-D-glucose-1-phosphate into ADP-glucose. This is the first and key regulatory step in starch synthesis. Based on our RNA-seq mapping data, one locus was more highly expressed than the other two (Figure 1). The glucose-1-phosphate adenylyltransferase enzyme is composed of two large and two small subunits. In maize, the large subunit is encoded by the Sh2 locus, which is well characterized in plants, and in particular in grasses. Four loci are responsible for this reaction in sorghum (Sb09g029610, Sb01g008940, Sb02g020410 and Sb03g028850). Three of these loci were identified in sugarcane BACs: Sb09g029610 (SHCRBa_003_M06, SHCRBa_026_K06 and SHCRBa_078_K12), Sb01g008940 (SHCRBa_027_I16, SHCRBa_033_L20, SHCRBa_073_J10 and SHCRBa_119_J13) and Sb03g028850 (SHCRBa_009_B01, SHCRBa_012_A01 and SHCRBa_022_D05). The sugarcane orthologs in BACs SHCRBa_022_D05, SHCRBa_012_A01 and SHCRBa_009_B01 correspond to the Sh2 locus (Figure 1).
Sucrose-6-phosphate phosphohydrolase (S6PP) catalyzes the reaction from sucrose-6P to sucrose. There is a tandem duplication of this gene, not previously published but evidenced by genomic sequences in the Phytozome database, in sorghum (Sb09g003460.1 and Sb09g003463.1), Setaria italica (Si022142m and Si024709m) and Panicum virgatum (Pavirv00037112m and Pavir00037113m). However, the same duplication is not found in maize or any of the other grass genomes available. The BAC SHCRBa_104_G22 annotation suggests that s6pp is duplicated in the sugarcane genome. In order to better understand this region, which is important for sucrose synthesis, we examined the composition of the intergenic region between the two copies of s6pp in modern sugarcane cultivars, Saccharum species, sorghum and Miscanthus sp. by sequencing a PCR-amplified 1,539 bp fragment. One hundred and ninety amplicons were aligned against sorghum sequences and the R570 BAC (SHCRBa_104_G22). Overall nucleotide identity is high, 99.988% (SD 0.001). The sorghum sequence is the most divergent, with an average 99.935% (SD 0.005) identity compared with all other sequences. S. spontaneum sequences are more divergent from the other sugarcane sequences (99.985%, SD 0.006). Neighbor-joining and maximum likelihood phylogenetic analyses resulted in unresolved evolutionary relationships (data not shown).
sRNAs in sugarcane BACs
An sRNA library from sugarcane leaf tissue was mapped against the BACs to evaluate the sRNA landscape and to identify new microRNA (miRNA) genes (Additional file 11: Figure S6). This library was derived from the hybrid SP80-3280, the main cultivar used to produce the SUCEST database. Most sRNAs from grasses are in the 24-nucleotide size range and therefore are most likely small interfering RNAs (siRNAs) or repeat-associated small interfering RNAs (rasiRNAs). Sixty-one percent of the sRNAs in the SP80-3280 sRNA library were in the 23–25 nt range, and 48% of them mapped to TEs identified in the BACs. The 23–25 nt RNAs mapped to TEs three times more frequently than did the smaller 20–22 nt RNAs. This pattern is expected for rasiRNAs and suggests that TEs are the origin of the 23–25 nt rasiRNAs, as well as the target for sRNA-mediated gene regulation.
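Size classes such as the 20–22 nt and 23–25 nt ranges can be tallied directly from read lengths. The following is a minimal sketch with made-up read lengths; the function name and the toy input are assumptions and do not reproduce the pipeline used in the study.

```python
from collections import Counter


def size_class_fractions(read_lengths):
    """Return the fraction of reads in the 20-22 nt, 23-25 nt and other classes."""
    counts = Counter()
    for length in read_lengths:
        if 20 <= length <= 22:
            counts["20-22"] += 1
        elif 23 <= length <= 25:
            counts["23-25"] += 1
        else:
            counts["other"] += 1
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no reads supplied")
    return {key: counts[key] / total for key in ("20-22", "23-25", "other")}


# Toy read lengths (hypothetical, for illustration only).
toy_lengths = [21, 24, 24, 23, 22, 24, 19, 25, 24, 30]
fractions = size_class_fractions(toy_lengths)
print(fractions)
```

The same tally restricted to reads that map to annotated TEs would give the per-class TE mapping frequencies compared in the text.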
Ribosomal and pericentromeric and/or centromeric BACs
No protein-coding genes were identified in 14 BACs and these were further analysed to better understand their sequence composition. Three BACs (SHCRBa_239_N21, SHCRBa_013_I13 and SHCRBa_029_018) were predicted to be pericentromeric and/or centromeric and one (SHCRBa_039_D18) was entirely composed of ribosomal tandem repeats. The other BACs were TE rich or had no significant matches to the grass protein sequences available.
The ribosomal DNA (rDNA) BAC consisted of 14 45S ribosomal transcription units with a portion of one unit in the reverse orientation to the other 13 (Additional file 12: Figure S7A). Each 45S ribosomal unit was 8.8 Kb long, consisting of the 18S (1.8 Kb) ribosomal gene, the 208 bp internal transcribed spacer 1 (ITS1), the 5.8S (163 bp) ribosomal gene, the 216 bp internal transcribed spacer 2 (ITS2), the 26S (3.39 Kb) ribosomal gene, and the 527 bp intergenic spacer (IGS). The 45S ribosomal transcription units were 99.8% identical at the nucleotide level.
The three BACs classified as pericentromeric and/or centromeric contain the previously described sugarcane 137 bp centromeric repeat SCEN, and plant-specific Gypsy-Ty3 centromeric specific-like retrotransposons (CRM). Annotation of one of the centromeric BACs is shown in more detail in Additional file 12: Figure S7B. SCEN repeats make up 23% of SHCRBa_239_N21, which also contains multiple copies of CRM and Tat elements. CRM_3 was the only complete CRM element identified. Three Tat elements were identified (Tat_2, 3 and 5), all full-length. Four hundred and thirty-nine copies of the SCEN repeat were identified with a pairwise nucleotide identity of 76.8%.
R570 cultivar metaphase spreads were examined for localization of the pericentromeric/centromeric BAC SHCRBa_239_N21 and the ribosomal BAC by FISH (Additional file 12: Figure S7C). The pericentromeric/centromeric BAC SHCRBa_239_N21 hybridized to a region consistent with it being a component of the centromeric or pericentric region of all chromosomes; however, signal strength varied among chromosomes. Additional fainter signals observed on chromosome arms were probably from non-centromeric specific LTR retrotransposons in the BAC. For the ribosomal BAC, there were seven terminal, three interstitial and two undetermined signals.
Comparative genomics with sorghum
Comparison of sugarcane (Sc) and sorghum (Sb) genome size variation using a two-tailed Welch’s t-test
Values are average ± SD.
Intron length/gene (Kb): Sc 2.22 ± 3.58; Sb 1.53 ± 1.57
Intergenic length/syntenic block (Kb): Sc 19.30 ± 25.05; Sb 11.15 ± 18.92
CDS length (Kb): Sc 1.28 ± 0.75; Sb 1.29 ± 0.75
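Welch’s t-test compares two means without assuming equal variances, and its statistic can be computed from summary values alone. The sketch below applies the standard formulas to the intron length means and SDs given above; the sample sizes (100 genes per species) are hypothetical placeholders because the number of observations is not stated here, so the resulting values are illustrative rather than a reproduction of the published test.

```python
import math


def welch_t(mean1, sd1, n1, mean2, sd2, n2):
    """Return Welch's t statistic and the Welch-Satterthwaite degrees of freedom."""
    v1 = sd1 ** 2 / n1  # squared standard error of sample 1
    v2 = sd2 ** 2 / n2  # squared standard error of sample 2
    t = (mean1 - mean2) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df


# Intron length per gene (Kb) from the table; n = 100 is an assumed sample size.
t_stat, dof = welch_t(2.22, 3.58, 100, 1.53, 1.57, 100)
print(f"t = {t_stat:.2f}, df = {dof:.1f}")
```

A two-tailed p-value would then be read from the t distribution with the computed degrees of freedom.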
The syntenic blocks were further analysed to determine the nature of the sugarcane genome expansion. These were 3.533 Mb long in total in sugarcane and 1.990 Mb in sorghum. Sugarcane and sorghum have equivalent numbers of bases encoding gene exons, 0.446 Mb and 0.449 Mb, respectively. Therefore, introns, promoters and intergenic regions may account for the sugarcane syntenic region being 1.543 Mb larger. Repeat content in sugarcane was 1.356 Mb (consisting of 1.334 Mb of TEs and 0.022 Mb of simple sequence repeats (SSRs) and low complexity regions). Repeat content in sorghum was 0.580 Mb (TEs: 0.562 Mb; SSRs and low complexity: 0.018 Mb). The difference between the two species in repeat content indicates that 0.776 Mb of the sugarcane genome expansion is due to repeats, mainly TE amplification. Among the TE sequences, Copia-Ty1 elements are the most common (16.17%), followed by Gypsy-Ty3 (12.28%) and DNA transposons, including CACTA (5.64%), hAT (1.09%) and Mutator elements (0.19%). The large fraction of unaccounted nucleotides in both sorghum (0.961 Mb) and sugarcane (1.731 Mb) may represent unidentified novel genes, uncharacterized TEs, or as-yet-unknown genomic elements.
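The size figures in this paragraph fit together as a simple budget: total length minus exons minus repeats leaves the unaccounted fraction. The sketch below reproduces that bookkeeping from the values in the text; the dictionary layout and function name are illustrative.

```python
# Syntenic block composition in Mb, as quoted in the text.
BLOCKS = {
    "sugarcane": {"total": 3.533, "exon": 0.446, "repeat": 1.356},
    "sorghum": {"total": 1.990, "exon": 0.449, "repeat": 0.580},
}


def unaccounted_mb(block):
    """Bases that are neither exon nor annotated repeat."""
    return block["total"] - block["exon"] - block["repeat"]


sc, sb = BLOCKS["sugarcane"], BLOCKS["sorghum"]
expansion = sc["total"] - sb["total"]          # extra sugarcane sequence
repeat_diff = sc["repeat"] - sb["repeat"]      # extra sugarcane repeats
repeat_share = repeat_diff / expansion         # part of the expansion in repeats

print(f"expansion: {expansion:.3f} Mb, repeat difference: {repeat_diff:.3f} Mb")
print(f"repeats explain {100 * repeat_share:.0f}% of the size difference")
print(f"unaccounted: sugarcane {unaccounted_mb(sc):.3f} Mb, "
      f"sorghum {unaccounted_mb(sb):.3f} Mb")
```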
Hom(e)ologous diversity and expression of the rpa1a locus
The CDSs of the six genes shared by the 12 BACs and sorghum were aligned and concatenated to construct a phylogenetic tree (Figure 5A). Most of the sugarcane sequences fell into three well-supported groups (I, II and III) that were in agreement with the structural analysis. BACs SHCRBa_232_H22 and SHCRBa_227_O17 did not group with the BACs to which they were structurally most closely related, but fell in a separate, less related group (group III). BACs SHCRBa_035_B09 and SHCRBa_196_O13 fell outside all these groups. Interestingly, the topology of the phylogeny based on the concatenated CDSs generally reflects differences in TE content in the region examined. The main structural variation between the BACs in groups I and II is the presence of a DNA transposon between the fifth and sixth genes in group I, instead of the Harbinger cluster found in group II. No structural variation between groups II and III was detected, apart from a variant region downstream from the sixth gene, which contained several copies of different LTR retrotransposons.
The BACs selected using the rpa1a gene were also mapped against the SUCEST database (EST sequences) and the RNA-seq and sRNA libraries. Figure 5B shows mapping for one BAC of each of the three phylogenetic groups (I, II and III) and the other two BACs (SHCRBa_035_B09 and SHCRBa_196_O13). While most of the mRNA transcripts mapped against the CDSs, some sRNAs also mapped against TE sequences or non-coding regions, as expected. Two expression patterns were of particular note. First, there was a common region in all BACs, downstream from an LTR retrotransposon, with peaks of mRNA sequences (Figure 5Bi). Transcription of this region may be directed by promoter sequences in the 3′ LTR of the LTR retrotransposon. Alternatively, there may be an unidentified gene in this region. Second, differential hom(e)olog expression was identified (Figure 5Bii). There are different sRNA patterns between BACs in the region between the rpa1a gene (white arrow) and gene 5. No sRNA reads were mapped to BACs SHCRBa_035_B09 or SHCRBa_196_O13, in which the intergenic region between the rpa1a gene and gene 5 is the shortest (1030 and 866 bp). On the other hand, in BACs SHCRBa_101_B12, SHCRBa_201_D09 and SHCRBa_232_H22, where the intergenic regions are longer (1556, 1556 and 1553 bp, respectively), there is also a higher number of mapped sRNAs. The main difference in the intergenic region between the BAC with the lowest number of sRNA reads (SHCRBa_196_O13) and the BAC with the highest (SHCRBa_101_B12) is the presence of DNA transposon fragments and a Harbinger TE in BAC SHCRBa_101_B12 (not shown in diagram), which supports current models in which these elements contribute to gene modulation.
The present study releases the largest and most diverse collection of sugarcane genomic regions to date. Based on comparative analysis, these regions are distributed throughout all sorghum chromosomes and are chiefly euchromatic. An understanding of these genomic regions will increase our knowledge of the structure of the sugarcane genome. The selected BAC collection includes genes known to be expressed and reveals a diverse set of sugarcane sequences associated with major biological processes. Insights into transcriptional patterns and epigenetic regulation were provided by the complementary RNA sequencing approaches.
Previous studies have shown that there is a high level of colinearity, gene structure, and sequence conservation between sorghum and sugarcane [28–31]. However, these reports conflict in terms of whether the sugarcane genome is expanding, or has expanded, relative to the sorghum genome or vice versa. Our data, based on a much larger sampling of linear genomic sequence and assembled regions (about 100,000 bases per BAC), confirm the colinearity and conservation between the sorghum and sugarcane genomes. They also suggest that overall the sugarcane genome has undergone or is undergoing expansion within euchromatic regions compared with sorghum. This expansion is highly variable depending on the syntenic block examined (Additional file 13: Table S6), possibly explaining why the previous reports are conflicting. Nearly one-fourth of the sugarcane genome expansion compared with sorghum can be attributed to differences in TE content, largely LTR retrotransposons. The presence of these dynamic elements within euchromatic regions may act as a key factor in chromosome rearrangements, gene gain and loss, as well as epigenetic marks. Similar mechanisms have been shown to be associated with TEs in other grass genomes such as maize [32–34].
BACs from repetitive genomic regions were examined. Among these was a BAC composed entirely of 45S ribosomal units. A consistent variation in signal intensity from chromosome to chromosome was observed using rDNA and pericentromeric and/or centromeric BACs as probes for FISH (Additional file 12: Figure S7C). Centromeres within a species are generally composed of the same types of repeats, while the abundance and arrangement of repeats can vary both between and within species. Given the hybrid nature of modern sugarcane cultivars, variation in the pericentromeric/centromeric BAC signal may therefore be a reflection of the differences in pericentromeric/centromeric composition within or between the parental species. Based on previous findings using the same cultivar, the less intense interstitial rDNA signals are on S. spontaneum chromosomes, while the more intense terminal signals are on S. officinarum chromosomes. While the rDNA genes are highly conserved, ITS sequence divergence can be used to resolve species relationships within a genus. Following polyploidization events, ITS units can suffer several different fates, depending on the species and time since polyploidization, for example, loss of one parental type, or homogenization. It would appear that the positions of the rDNA units from both parental species have been retained in modern sugarcane cultivars.
We estimate that almost one-half of the sugarcane BAC sequences are TEs. This estimate is close to that based on BAC-end sequences (BESs) from two sugarcane cultivars, R570 (42.8%) and SP80-3280 (45.16%). In general, as genome size increases the proportion of the genome composed of repeats increases. A significant proportion of grass genomes is composed of repetitive sequences: 40%, 62% and 82% for rice (420 Mb), sorghum (740 Mb), and maize (2160 Mb), respectively [25, 32, 39]. The basic genome size (1×) of S. officinarum, the main component of modern sugarcane cultivar genomes, is 930 Mb, larger than that of sorghum (740 Mb). The basic genome size of S. spontaneum, also one of the ancestors of modern sugarcane cultivars, is 750 Mb, similar to that of sorghum. The total monoploid genome size of the R570 modern cultivar, however, is 1 Gb. The percentage of the sugarcane genome composed of repeats based on the BAC sequences is most likely an underestimate because the BACs were mainly from euchromatic gene-rich regions. Nevertheless, the low percentage of repeats compared to genomes of a comparable size may also reflect that the size of the modern sugarcane cultivar genome results from polyploidization events rather than from massive TE expansion.
We examined all hom(e)ologous regions identified containing the rpa1a gene to better understand the consequences of polyploidization in terms of genome structure and regulation. Most of the structural variation among the BACs was due to variability in TE insertion patterns, although the topology of the phylogenetic tree inferred using the coding gene sequences reflects structural variation between hom(e)ologous regions. The topology indicates that there are at least three well-defined haplogroups in this region. We speculate that these haplotypes are derived from the parental species. The 10 BACs from groups I, II and III (approximately 80% of the BACs) were inherited from S. officinarum and the two remaining from S. spontaneum. We were not able to evaluate whether any selective constraint is driving the diversification of the putative hom(e)ologous sequences from S. officinarum into haplogroups. Further sequencing of this region in S. officinarum and S. spontaneum may identify any selective constraint.
The high conservation of gene content and colinearity between sugarcane haplotypes has been previously shown; here we confirm this finding by analyzing more hom(e)ologs of a single region [29, 30]. These results contrast with the high DNA sequence elimination and recombination observed between hom(e)ologous chromosomes in allopolyploid wheat and other monocot and eudicot plants (see Liu et al. for a review). In all these cases, it is not clear if the changes occurred immediately after polyploidization. We consider that the high conservation is not due to the gene richness of the regions, since the gene frequency in the rpa1a BACs is much lower than in the regions examined in previous colinearity studies in sugarcane [28–30]. There are several possible reasons for the low recombination rate in this region. The hybridization events resulting in the modern sugarcane cultivar are very recent, within the last two centuries, and it has been estimated that there have been few meiotic divisions since. The parental species themselves have recently diverged, between 1.5 and 2.0 million years ago. Finally, the evolution of the haplogroups in this polyploid genome may have been shaped by the phenomenon of pairing behavior, which favors the transmission of non-mutated chromosomes to the progeny.
We mapped sRNA and mRNA libraries against the rpa1a region. The results show that the hom(e)ologous BACs have differential mapping patterns for both kinds of RNAs. Most of the variation was observed in promoters, TEs and intergenic sequences. The promoter regions within the LTRs at each end of an LTR retrotransposon can act as novel promoters or enhancers, driving changes in host-gene expression patterns. There are peaks of RNA (ESTs and RNA-seq) mapping in a region downstream of an LTR retrotransposon, where no non-TE-coding genes have been identified. Promoters within the 3′ LTR region may be driving expression of this region in an allele-dependent manner. The region between two host genes is variable both in length and in the presence/absence of TE fragments. Peaks of sRNA mapped between the two non-TE coding genes correlate with the intergenic length and the presence of TEs. Several studies have shown that hom(e)ologous diversity needs to be evaluated not only in terms of gene coding DNA, but also in terms of regulatory regions, since regulatory regions have important roles in genetic control and are under independent evolutionary pressures. This can be particularly important in polyploids, due to the high number of hom(e)ologous loci. In the rpa1a BACs there are indications of mRNA and sRNA variation, as evidenced by the sRNA and transcriptome mapping, that could influence gene expression and function. Interestingly, the two most highly expressed rpa1a hom(e)ologs correspond to a BAC that did not cluster within the three main haplotypes and another from cluster III.
Crop genomics is being used to increase the effectiveness of breeding, since traits of interest can be selected more precisely, directly and cost-effectively. For over 10 years directed genetic modification of sugarcane has been a reality in laboratories, with field trials also being conducted. Genomics could also aid in traditional marker-based breeding, providing putative marker sequences derived from genes, TEs, intergenic or low complexity regions [45–47]. Here we have sequenced and annotated 1,400 sugarcane protein-coding genes and several non-coding genes, including ribosomal and miRNA genes. The protein-coding genes code for enzymes in several metabolic pathways. For some of the genes that are clearly important in sugarcane breeding, for example, genes from sucrose and starch metabolism, the complete CDSs are available in the transcriptome database, but no information about introns and intergenic regions has been previously published. Other genes have not been previously sequenced in sugarcane, among them genes involved in metabolic pathways not traditionally considered in sugarcane breeding, such as the carotenoid biosynthesis pathway. The sequencing of complete genes, including coding regions, UTRs, introns and promoters in genomic sequencing projects, instead of sequencing only transcribed sequences, as in transcriptome sequencing projects, provides a broader database for the design of transgene constructs. This work has shown that it is fundamental to combine genome and transcriptome sequencing approaches (sRNA and mRNA) to validate genome annotation and provide a broad understanding of functional genomics.
The potential of the sequenced BAC collection was demonstrated by sequencing 16 genes related to the central enzymatic steps of carbon partitioning in source-sink-growth in plants. Three main conclusions were drawn. First, the sequenced regions enabled the identification of differential expression levels in specific enzymatic steps in actively growing bud tissue. Second, we were able to differentiate the expression of paralogous loci. Finally, a previously unreported gene duplication was described for the s6pp gene in sugarcane and sorghum. Examination of a region covering the intergenic region and part of the two genes from a commercial hybrid breeding panel, S. officinarum, S. spontaneum, Miscanthus sp. and the sorghum genome shows that S. spontaneum did not contribute to the haplotype identified in hybrid cultivars. Interestingly, the Miscanthus sp. sequences fell into four major haplotype groups. Another haplotype group consisted exclusively of commercial hybrids. Sequence variability among paralogous or hom(e)ologous allelic loci has to be considered, since the most effective gene copy should be selected in order to avoid non-additive effects. Thus, it is essential to sequence further candidate hom(e)ologous regions that have the potential to be valuable in transgenic breeding, in order to increase our knowledge of sugarcane gene variability.
The genome sequence released in the present work contributes towards a fundamental understanding of the structure of the sugarcane genome. The present data will also contribute to improving our understanding of the genetic basis of sucrose content and physiology, providing molecular tools for breeding purposes and gene discovery related to traits such as plant defense, metabolism, flowering, and responses to biotic and abiotic stresses.
Sugarcane BAC library and BAC selection
Two approaches were used to select BACs from the sugarcane hybrid R570: macro-array hybridization using PCR-amplified and overgo probes, and 3D pool screening by real-time PCR. Overgo probes were designed using the BACMAN database and were used as queries in a BLASTn search against the sorghum (v1.0) and rice (v6.1) genome assemblies. Because the overgo probes are only 40 nt in length, we used a tiered series of cut-off values for matches of 38 to 40 nt: up to 2 mismatches for 40-nt alignments, up to 1 mismatch for 39-nt alignments, and no mismatches for 38-nt alignments. To reduce false positives and to exclude multigene families, only those probes that hybridized to 1–10 BACs were used to select BACs for sequencing.
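The tiered overgo cut-off and the 1–10 BAC rule described above can be expressed as a small filter. This is only an illustrative sketch: the function names `passes_overgo_cutoff` and `select_probes` and the input layout (a mapping from probe name to number of hybridizing BACs) are assumptions made for the example, not part of the original pipeline.

```python
def passes_overgo_cutoff(aligned_length: int, mismatches: int) -> bool:
    """Return True if a BLASTn match of a 40-nt overgo probe meets the tiered cut-off."""
    # Allowed mismatches per alignment length, as stated in the text:
    # 40 nt -> up to 2, 39 nt -> up to 1, 38 nt -> none.
    max_mismatches = {40: 2, 39: 1, 38: 0}
    if aligned_length not in max_mismatches:
        return False  # alignments shorter than 38 nt are rejected
    return 0 <= mismatches <= max_mismatches[aligned_length]


def select_probes(bac_hits_per_probe: dict) -> list:
    """Keep probes that hybridized to between 1 and 10 BACs (inclusive)."""
    # Hypothetical input: {probe_name: number_of_BACs_hybridized}.
    return sorted(name for name, n in bac_hits_per_probe.items() if 1 <= n <= 10)
```

For example, a probe with 11 hybridizing BACs would be dropped as a likely multigene-family probe, while one with a single hit would be kept.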
Macro-array hybridizations were performed according to the manufacturer’s instructions and Bowers et al. After purification of the PCR fragment using the GFX PCR DNA and Gel Band Purification Kit (GE Healthcare), the probes were labeled using the Random Primer DNA Labeling System (Invitrogen), following the manufacturer’s instructions. 3D pools were constructed according to Adam-Blondon et al. Briefly, the 269 plates of the SHCRBa library were arranged in 11 blocks of 24 plates and the BACs pooled by plate, line and row in growth medium. Each pool was amplified using the Phi29 enzyme from the Illustra GenomiPhi V2 DNA Amplification Kit (GE Healthcare), according to the manufacturer’s instructions. We screened for specific markers by real-time PCR as follows: in a final volume of 5 μl, 100 ng of the pool DNA was amplified using 0.4 μM of each primer and 1 X SYBR Green Master Mix (Roche). The cycling parameters used for amplification were 95°C for 5 min for initial denaturation, followed by 40 cycles of 95°C for 20 sec, 60°C for 20 sec and 72°C for 40 sec.
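The logic of 3D pool screening is that a clone is located at the intersection of a positive plate pool and the two positive within-plate pools. The sketch below illustrates that intersection step only. The function name `candidate_addresses` and the treatment of the "line" and "row" pools as the two axes of a plate are assumptions made for the example.

```python
from itertools import product


def candidate_addresses(positive_plates, positive_lines, positive_rows):
    """List every (plate, line, row) address consistent with the positive pools.

    With exactly one positive pool per dimension the clone address is unique.
    With several positives the candidates are ambiguous and would need to be
    confirmed individually.
    """
    return list(
        product(sorted(positive_plates), sorted(positive_lines), sorted(positive_rows))
    )
```

With one positive pool per dimension, for instance plate 12, line "C" and row 7, the function returns the single address `(12, "C", 7)`.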
BAC sequencing and assembling
Three hundred and thirteen BACs were sequenced using the 454/Roche sequencing platform and four were sequenced using Sanger/ABI technology. BAC DNA was extracted using the QIAGEN Large-Construct Kit, following the manufacturer’s protocols. 454 DNA libraries were prepared using either the General or the RAPID GS FLX Titanium Library Preparation Kit, individually emPCR amplified and sequenced using 454 Titanium kits, according to the manufacturer’s specifications and default parameters. Nine different gasket pooling strategies were tested (Additional file 2: Table S1). Sanger sequencing was performed according to Manetti et al. The sequencing reads were assembled using Phrap with different parameter values, depending on the results of the first assembly, which was performed using default parameters and repeat_stringency set at 0.3 (for ease of contig joining). Given the deep coverage achieved for each BAC, all sequences were used for assembly. Poor-quality reads were automatically kept as singletons by Phrap. Medium-to-low quality reads were used in the assembly because, given the high coverage, their impact is expected to be low. This strategy was adopted because, in regions that are difficult to sequence (e.g., homopolymeric regions), medium-to-low quality reads could help to close gaps or at least confirm a scaffold.
Gene and repeat annotation
BACs were first annotated using an automated pipeline for identification of genes and TEs based on de novo prediction and BLASTx. TEs were screened using RepeatMasker against repeats from Viridiplantae and sugarcane LTR retrotransposons, with a cut-off score of 250. Both complete and incomplete elements were identified. MITE Hunter with default parameters was used to extract all MITEs from the BACs. The alignment files generated were grouped together using CLANS and screened using RepeatMasker. MITEs were then classified according to type of target site duplication and terminal inverted repeat features.
Genes were annotated on masked BACs using the software programs Augustus, GlimmerHMM, PASA, EVidenceModeler, SignalP, and TMHMM. Exon-intron boundaries were examined using Artemis, compared to results based on BLAST alignments against sugarcane ESTs and annotated sorghum and rice proteins [14, 66, 67], and adjusted if necessary. If no hits against sugarcane ESTs or rice and sorghum protein sequences were found, the predicted ORF was not modified. Validation of splice sites was performed by GenBank tools at sequence submission. Lastly, putative intergenic sequences were screened by BLASTx (nr database) for additional genes not annotated by the de novo prediction programs. Manual categorization was performed according to the SUCEST project database. Blast2GO analyses were performed as previously described using BLASTp with an e-value cut-off of 1e-10. Screening for mitochondrial and chloroplast sequence contamination was done by BLASTn against the sugarcane organellar genomes (GenBank: NC_005878 and NC_008360).
We used the CDSs from the BACs, the 43,141 SASs and the KeggMapper tool available at MG-RAST for global automatic mapping, with cut-offs of an e-value of 1e-5, 60% identity, and a minimum alignment length of 15 bp. We used the sugarcane CDSs and the KEGG Automatic Annotation Server, with the BBH search option, to map sucrose and starch pathway genes.
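The three mapping cut-offs stated above (e-value, percent identity and minimum alignment length) amount to a simple conjunctive filter over similarity hits. The sketch below is not the MG-RAST implementation. The `Hit` record, its field names, and the choice to treat each threshold as inclusive are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """Hypothetical record for one similarity hit."""
    query: str
    subject: str
    evalue: float
    percent_identity: float
    alignment_length: int


def passes_mapping_cutoffs(hit: Hit,
                           max_evalue: float = 1e-5,
                           min_identity: float = 60.0,
                           min_length: int = 15) -> bool:
    """Apply the e-value, identity and alignment-length cut-offs together."""
    return (
        hit.evalue <= max_evalue
        and hit.percent_identity >= min_identity
        and hit.alignment_length >= min_length
    )
```

A hit must satisfy all three conditions to be retained, so a long, highly significant alignment with only 55% identity would still be discarded.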
All BAC sequences generated in this study, with protein-coding gene and full-length TE annotation, can be accessed in the GenBank database under accession numbers [GenBank: KF184657 to KF184973].
Colinearity and synteny analyses
Two analyses were performed to confirm that the BACs represent a homogeneous sampling of the sugarcane genome, and to evaluate colinearity and synteny with the sorghum genome. First, we estimated the chromosomal location of the BACs in the sorghum genome by a BLASTn analysis, using sugarcane non-TE coding sequences as queries against the sorghum genome (v1.0). We then checked this localization by blasting the predicted protein sequences against the masked sorghum protein database. The chromosomal location of the BAC in the sorghum genome was directly assigned for predicted protein sequences with single hits. A high-throughput maximum-likelihood phylogenetic analysis was applied for CDSs with multiple hits using the sorghum predicted proteome. Redundant sugarcane predicted protein sequences from putative hom(e)ologous BACs were manually evaluated. Sorghum orthologous genes assigned to more than one BAC were examined when there were two or more colinear predicted protein sequences on a single BAC. Redundant predicted protein sequences were removed from the analysis. Welch’s t test was applied to test the size variation hypothesis.
The structure of the 12 BACs selected using the rpa1a gene as probe was analyzed in detail using Artemis. The CDSs of the six genes shared by the 12 BACs and sorghum were aligned and concatenated. A neighbor-joining phylogenetic tree was inferred using the highest ranked substitution model (Tajima-Nei) and 1000 bootstrap replications, using MEGA 5.
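For readers unfamiliar with Welch’s t test, the statistic and its approximate degrees of freedom (Welch–Satterthwaite) can be computed from two samples with unequal variances as sketched below. The function name `welch_t` and the example values in the usage note are illustrative assumptions and are not data from this study. Converting the statistic to a p-value additionally requires the t distribution, which is omitted here.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom)."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Sample variances (n - 1 denominator), each divided by its sample size.
    se_a = variance(sample_a) / n_a
    se_b = variance(sample_b) / n_b
    t = (mean(sample_a) - mean(sample_b)) / sqrt(se_a + se_b)
    df = (se_a + se_b) ** 2 / (se_a ** 2 / (n_a - 1) + se_b ** 2 / (n_b - 1))
    return t, df
```

Called on two lists of region sizes, for example sugarcane and sorghum intergenic lengths, the function returns the pair `(t, df)` without assuming that the two groups share a common variance.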
Analysis of ribosomal and centromeric BACs
The BAC SCHRBa_039_D18 was initially identified through the gene annotation as containing ribosomal genes. The region syntenic to the BAC in rice, maize, sorghum and Arabidopsis was identified using CoGE . The top 2 hits from BLASTn where the description was not an unidentified sequence were downloaded and manually aligned using BioEdit  with BAC SCHRBa_039_D18 to determine the beginning and end of each ribosomal gene.
Three BACs were identified as putatively pericentromeric and/or centromeric by the absence of coding genes during the repeat annotation. The nucleotide sequences of the BACs were further analyzed by appropriate BLASTs against the SCEN repeat, LTR nucleotide sequences and conceptually translated coding domains from the sugarcane curated LTR retrotransposon database. The centromeric repeat, SCEN, was extracted and aligned using ClustalW in BioEdit, and the pairwise % identity was calculated using Geneious.
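Pairwise percent identity between two aligned repeat copies can be illustrated as follows. This sketch uses one common convention (columns where both sequences have a gap are ignored, and a gap opposite a base counts as a mismatch), which is an assumption and may differ from the exact definition used by the alignment software. The function name `percent_identity` is likewise hypothetical.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent identity between two rows of the same multiple alignment."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for base_a, base_b in zip(aligned_a.upper(), aligned_b.upper()):
        if base_a == gap and base_b == gap:
            continue  # column is a gap in both rows: not informative
        compared += 1
        if base_a == base_b:
            matches += 1
    if compared == 0:
        raise ValueError("no comparable alignment columns")
    return 100.0 * matches / compared
```

Applied to every pair of rows in the SCEN alignment, this yields a matrix of pairwise identities from which the divergence among repeat copies can be summarized.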
Fluorescence in situ hybridization (FISH)
The distributions of the sugarcane ribosomal BAC (SCHRBa_039_D18) and a pericentromeric/centromeric BAC (SCHRBa_239_N21) were analyzed by FISH on metaphase chromosomes. FISH procedures were as previously described, except for preparation of the probes and blocking DNA. All kits were used according to the manufacturer’s instructions. One μg of each BAC was used in a 20 μL nick translation reaction using the NT mix (Roche) with Digoxigenin (DIG)-11-dUTP (Invitrogen) or Biotin-16-dUTP (Invitrogen). Labeling efficiency was tested according to Heslop-Harrison and Schwarzacher (protocol 4.7). Blocking DNA was prepared from genomic DNA from the sugarcane cultivar SP80-3280. Genomic DNA was extracted from meristem according to Aljanabi et al., except that the meristem was first ground in liquid nitrogen before adding the homogenization buffer. The genomic DNA was sheared by placing it at 95°C until the fragments were less than 1 kb in size.
Raw sequences were retrieved in a FASTQ formatted file and the adapter sequences were removed using Perl scripts. Reads of 20–25 nucleotides were sorted into two separate files, 20–22 nt and 23–25 nt, for subsequent analyses. We used the MAQ software to map the collection of sRNA reads against the BACs. We used a low-stringency cut-off of 0–2 nt mismatches because the BACs and sRNA reads were derived from different sugarcane cultivars. Graphical representations of sRNA mapping were created using the SeqMonk software.
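The size partitioning of adapter-trimmed sRNA reads into 20–22 nt and 23–25 nt classes can be sketched as below. The original work used Perl scripts, so this Python version is only an illustration. The function names `size_class` and `split_reads` and the in-memory list of read sequences are assumptions made for the example.

```python
def size_class(read: str):
    """Return the size-class label for a trimmed read, or None if out of range."""
    length = len(read)
    if 20 <= length <= 22:
        return "20-22"
    if 23 <= length <= 25:
        return "23-25"
    return None  # reads shorter than 20 nt or longer than 25 nt are discarded


def split_reads(reads):
    """Partition read sequences into the two size classes used for mapping."""
    bins = {"20-22": [], "23-25": []}
    for read in reads:
        label = size_class(read)
        if label is not None:
            bins[label].append(read)
    return bins
```

In practice each bin would be written to its own FASTQ file, keeping the quality line of every retained read, before mapping.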
RNA-seq sequencing and analysis
A sugarcane transcriptome was constructed from germinating shoot axillary buds five days after planting. Single budded setts from the sugarcane variety RB92-5345 were placed in trays with buds facing upwards, covered with moist vermiculite, and incubated at 26–30°C under greenhouse conditions. Total RNA was extracted from pooled breaking buds using a lithium chloride protocol. For the construction of RNA-seq libraries, all procedures were carried out according to Illumina’s instructions using the ‘TruSeq RNA Sample Prep v2 Low Throughput (LT)’ kit. The libraries were paired-end sequenced on the Illumina system (HiScanSQ) (GA3 – ESALQ-USP). Sequencing reads were mapped using the Burrows-Wheeler Aligner (BWA) and SAMtools. The RNA-seq library can be accessed in the NCBI high-throughput DNA and RNA sequence read archive under the accession number [SRA: SRX500284].
S6pp tandem gene duplication network analysis
The evolutionary relationship of the putative tandem gene duplication of the s6pp loci in two clones of S. spontaneum (Mandalay and IN8458), one S. officinarum (Badila), 11 modern sugarcane hybrid cultivars (R570, SP80-3280, SP70-1143, RB835486, RB72454, RB867515, Co-290, POJ2878, NCo-310, NA5679 and SP81-3250), one Miscanthus species, and the sorghum genome was evaluated by sequencing and network analysis. The 1539 bp region was first identified in the SHCRBa_104_G22 BAC (position 2977 to 4515, Additional file 14: Figure S8). The sorghum sequence was taken from the published genome (3,984,822 to 3,988,630 nt in chromosome_9). The primers 2995F and 4500R were used to amplify the fragment from the other cultivars and species (Additional file 14: Figure S8). PCR reactions were performed in a final volume of 25 μL, using 50 ng of genomic DNA, 0.4 μM of each primer, 0.2 mM of each dNTP, and 0.5 μL Elongase Enzyme Mix (Invitrogen) in 0.5 X PCR buffer A and 0.5 X PCR buffer B. The cycling parameters used for amplification were: 94°C for 10 min for initial denaturation, followed by 35 cycles of 94°C for 30 sec, 55°C for 30 sec, and 68°C for 6 min. The fragments obtained were purified directly from the PCR product, using the NucleoSpin Extract II (Macherey-Nagel), and cloned into the pGEM-T Easy Vector System (Promega). Seven to 15 randomly chosen clones from each sample were automatically sequenced in an ABI PRISM 3730 (Applied Biosystems) using the primers M13F and M13R, and the six internal primers (Additional file 14: Figure S8). The following sequencing reaction conditions were used: in a final volume of 10 μL, 300 ng of plasmid DNA, 1 μM of each primer, 2 μL of BigDye Terminator v3.1 (Applied Biosystems) in 1 X BigDye buffer. Sequence alignment was performed using Clustal W and the reduced median-joining network analysis using the NETWORK software with default parameters. The sequences and alignments are available on request.
Availability of supporting data
The BAC sequence data set supporting the results of this article is available in the GenBank repository [KF184657 to KF184973 at http://www.ncbi.nlm.nih.gov/genbank], in the CoGe website [Saccharum hybrid cultivar R570 (id23984) in https://genomevolution.org/CoGe/] and in the GaTElab website, as a GBrowser search tool [https://gate.ib.usp.br/GateWeb/en/gbrowse-pagina]. The RNA-seq library data set is available in the Sequence Read Archive (SRA) repository [SRX500284 in http://www.ncbi.nlm.nih.gov/sra].
CDS: Coding DNA sequence
CRM: Centromeric-specific retrotransposon of maize
ITS: Internal transcribed spacer
MITE: Miniature inverted-repeat transposable element
rasiRNA: Repeat-associated small interfering RNA
SAS: Sugarcane assembled EST sequence
siRNA: Small interfering RNA
SSR: Simple sequence repeats
SUCEST: Sugarcane EST project
We gratefully acknowledge funding from FAPESP (MAVS– 2008/52074-0; CBMV- 2010/05591-9; RV- 2008/58031-0; APS 2008/52197-4, GMS 08/52146-0) and CNPq (MV- INCT BIOETHANOL). NS (2009/51632-1), CJM (2009/09217-7), GMQC (2008/58243-8), RAC (2009/09116-6), SLI (2011/05317-7) and CGL (2008/54201-9) were recipients of FAPESP fellowships. CEAO, MYN, VM and RFA were recipients of CNPq fellowships. PCT and PMS were recipients of CAPES fellowships. Overgo probe selection was supported by grants to AHP (International Consortium for Sugarcane Biotechnology (#24), Consortium for Plant Biotechnology Research (DE-FG36-02GO12026), and the Univ. Georgia Office of the Vice President for Research). We thank the CATG and LNCC labs for the use of their BAC sequencing facilities, and the CNRGV team for help with the 3D pool development and BAC sequencing. Drs Douglas Domingues, Daniel S. Moura and Marcio C. Silva-Filho provided the DUR3, Sugarwin and RALF gene primers for BAC selection. Finally, we also thank Dr Paul Moore for critically reading the manuscript.
- European Commission: Agriculture and Rural Development: Sugar. http://ec.europa.eu/agriculture/sugar/index_en.htm
- Kellogg EA: Evolutionary history of the grasses. Plant Physiol. 2001, 125: 1198-1205.
- Grivet L, Arruda P: Sugarcane genomics: depicting the complex genome of an important tropical crop. Curr Opin Plant Biol. 2001, 5: 122-127.
- Piperidis G, Piperidis N, D’Hont A: Molecular cytogenetic investigation of chromosome composition and transmission in sugarcane. Mol Genet Genomics. 2010, 284: 65-73.
- D’Hont A: Unraveling the genome structure of polyploids using FISH and GISH; examples of sugarcane and banana. Cytogenet Genome Res. 2005, 109: 27-33.
- D’Hont A, Glaszmann JC: Sugarcane genome analysis with molecular markers: a first decade of research. Int Soc Sugar Cane Technol Proc XXIV Congr. 2001, 556-559.
- Tomkins J, Yu Y, Miller-Smith H, Frisch D, Woo S, Wing R: A bacterial artificial chromosome library for sugarcane. Theor Appl Genet. 1999, 99: 419-424.
- Vettore L, Silva FR, Kemper EL, Souza GM, Silva AM, Ferro M, Henrique-Silva F, Giglioti ÉA, Lemos MVF, Coutinho LL, Nobrega MP, Carrer H, França SC, Bacci MJ, Goldman MHS, Gomes SL, Nunes LR, Camargo LEA, Siqueira WJ, Van Sluys M-A, Thiemann OH, Kuramae EE, Santelli RV, Marino CL, Targon MLPN, Ferro JA, Silveira HCS, Marini DC, Lemos EGM, Monteiro-Vitorello CB, et al: Analysis and functional annotation of an expressed sequence tag collection for tropical crop sugarcane. Genome Res. 2003, 13: 2725-2735.
- Repbase. http://www.girinst.org/repbase/
- Domingues DS, Cruz GMQ, Metcalfe CJ, Nogueira FTS, Vicentini R, Alves C de S, Van Sluys M-A: Analysis of plant LTR-retrotransposons at the fine-scale family level reveals individual molecular patterns. BMC Genomics. 2012, 13: 137.
- National Center for Biotechnology Information (NCBI). http://www.ncbi.nlm.nih.gov/
- Meyer F, Paarmann D, D’Souza M, Olson R, Glass EM, Kubal M, Paczian T, Rodriguez A, Stevens R, Wilke A, Wilkening J, Edwards RA: The metagenomics RAST server - a public resource for the automatic phylogenetic and functional analysis of metagenomes. BMC Bioinformatics. 2008, 9: 386.
- Keeling PL, Myers AM: Biochemistry and genetics of starch synthesis. Annu Rev Food Sci Technol. 2010, 1: 271-303.
- Phytozome v9.1: Home. http://www.phytozome.net/
- Dias ES, Carareto CMA: Ancestral polymorphism and recent invasion of transposable elements in Drosophila species. BMC Evol Biol. 2012, 12: 119.
- Posada D, Crandall K: Intraspecific gene genealogies: trees grafting into networks. Trends Ecol Evol. 2001, 16: 37-45.
- Swaminathan K, Alabady MS, Varala K, De Paoli E, Ho I, Rokhsar DS, Arumuganathan AK, Ming R, Green PJ, Meyers BC, Moose SP, Hudson ME: Genomic and small RNA sequencing of Miscanthus x giganteus shows the utility of sorghum as a reference genome sequence for Andropogoneae grasses. Genome Biol. 2010, 11: R12.
- Zanca AS, Vicentini R, Ortiz-Morea FA, Del Bem LE, da Silva MJ, Vincentz M, Nogueira FT: Identification and expression analysis of microRNAs and targets in the biofuel crop sugarcane. BMC Plant Biol. 2010, 10: 260.
- Piriyapongsa J, Jordan IK: A family of human microRNA genes from miniature inverted-repeat transposable elements. PLoS ONE. 2007, 2: e203.
- Barrera-Figueroa BE, Gao L, Wu Z, Zhou X, Zhu J, Jin H, Liu R, Zhu J-K: High throughput sequencing reveals novel and abiotic stress-regulated microRNAs in the inflorescences of rice. BMC Plant Biol. 2012, 12: 132.
- Nagaki K, Tsujimoto H, Sasakuma T: A novel repetitive sequence of sugar cane, SCEN family, locating on centromeric regions. Chromosom Res. 1998, 6: 295-302.
- Nagaki K, Neumann P, Zhang D, Ouyang S, Buell CR, Cheng Z, Jiang J: Structure, divergence, and distribution of the CRR centromeric retrotransposon family in rice. Mol Biol Evol. 2005, 22: 845-855.
- Vicentini R, Del Bem LE, Van Sluys M-A, Nogueira F, Vincentz M: Gene content analysis of sugarcane public ESTs reveals thousands of missing coding-genes and an unexpected pool of grasses conserved ncRNAs. Trop Plant Biol. 2012, 5: 199-205.
- Kim C, Lee T-H, Compton RO, Robertson JS, Pierce GJ, Paterson AH: A genome-wide BAC end-sequence survey of sugarcane elucidates genome composition, and identifies BACs covering much of the euchromatin. Plant Mol Biol. 2013, 81: 139-147.
- Paterson AH, Bowers JE, Bruggmann R, Dubchak I, Grimwood J, Gundlach H, Haberer G, Hellsten U, Mitros T, Poliakov A, Schmutz J, Spannagl M, Tang H, Wang X, Wicker T, Bharti AK, Chapman J, Feltus FA, Gowik U, Grigoriev IV, Lyons E, Maher CA, Martis M, Narechania A, Otillar RP, Penning BW, Salamov AA, Wang Y, Zhang L, Carpita NC, et al: The Sorghum bicolor genome and the diversification of grasses. Nature. 2009, 457: 551-556.
- Chang Y, Gong L, Yuan W, Li X, Chen G, Li X, Zhang Q, Wu C: Replication protein A (RPA1a) is required for meiotic and somatic DNA repair but is dispensable for DNA replication and homologous recombination in rice. Plant Physiol. 2009, 151: 2162-2173.
- Feschotte C: Transposable elements and the evolution of regulatory networks. Nat Rev Genet. 2008, 9: 397-405.
- Wang J, Roe B, Macmil S, Yu Q, Murray JE, Tang H, Chen C, Najar F, Wiley G, Bowers J, Van Sluys M-A, Rokhsar DS, Hudson ME, Moose SP, Paterson AH, Ming R: Microcollinearity between autopolyploid sugarcane and diploid sorghum genomes. BMC Genomics. 2010, 11: 261.
- Garsmeur O, Charron C, Bocs S, Jouffe V, Samain S, Couloux A, Droc G, Zini C, Glaszmann J-C, Van Sluys M-A, D’Hont A: High homologous gene conservation despite extreme autopolyploid redundancy in sugarcane. New Phytol. 2011, 189: 629-642.
- Jannoo N, Grivet L, Chantret N, Garsmeur O, Glaszmann JC, Arruda P, D’Hont A: Orthologous comparison in a gene-rich region among grasses reveals stability in the sugarcane polyploid genome. Plant J. 2007, 50: 574-585.
- Figueira TRES, Okura V, da Silva FR, da Silva MJ, Kudrna D, Ammiraju JSS, Talag J, Wing R, Arruda P: A BAC library of the SP80–3280 sugarcane variety (saccharum sp.) and its inferred microsynteny with the sorghum genome. BMC Res Notes. 2012, 5: 185.
- Schnable PS, Ware D, Fulton RS, Stein JC, Wei F, Pasternak S, Liang C, Zhang J, Fulton L, Graves TA, Minx P, Reily AD, Courtney L, Kruchowski SS, Tomlinson C, Strong C, Delehaunty K, Fronick C, Courtney B, Rock SM, Belter E, Du F, Kim K, Abbott RM, Cotton M, Levy A, Marchetto P, Ochoa K, Jackson SM, Gillam B, et al: The B73 maize genome: complexity, diversity, and dynamics. Science. 2009, 326: 1112-1115.
- Tenaillon MI, Hufford MB, Gaut BS, Ross-Ibarra J: Genome size and transposable element content as determined by high-throughput sequencing in maize and Zea luxurians. Genome Biol Evol. 2011, 3: 219-229.
- Zhang J, Yu C, Krishnaswamy L, Peterson T: Transposable Elements as Catalysts for Chromosome Rearrangements. Methods Mol Biol. Edited by: Birchler JA. 2011, Totowa, NJ: Humana Press, 315-326.
- Ma J, Wing RA, Bennetzen JL, Jackson SA: Plant centromere organization: a dynamic structure with conserved functions. Trends Genet. 2007, 23: 134-139.
- D’Hont A, Grivet L, Feldmann P, Rao S, Berding N, Glaszmann JC: Characterisation of the double genome structure of modern sugarcane cultivars (Saccharum spp.) by molecular cytogenetics. Mol Gen Genet. 1996, 250: 405-413.
- Bao Y, Wendel JF, Ge S: Multiple patterns of rDNA evolution following polyploidy in Oryza. Mol Phylogenet Evol. 2010, 55: 136-142.
- Lynch M: The Origins of Genome Architecture. 2007, Sunderland, Massachusetts, USA: Sinauer Associates Inc.
- International Rice Genome Sequencing Project: The map-based sequence of the rice genome. Nature. 2005, 436: 793-800.
- Liu B, Xu C, Zhao N, Qi B, Kimatu JN, Pang J, Han F: Rapid genomic changes in polyploid wheat and related species: implications for genome evolution and genetic improvement. J Genet Genomics. 2009, 36: 519-528.
- Lisch D: How important are transposons for plant evolution?. Nat Rev Genet. 2012, 14: 49-61.
- Udall JA, Wendel JF: Polyploidy and crop improvement. Crop Sci. 2006, 46: S3-S14.
- Varshney RK, Graner A, Sorrells ME: Genomics-assisted breeding for crop improvement. Trends Plant Sci. 2005, 10: 621-630.
- Menossi M, Silva-Filho MC, Vincentz M, Van-Sluys M-A, Souza GM: Sugarcane functional genomics: gene discovery for agronomic trait development. Int J Plant Genomics. 2008, 2008: 458732. doi:10.1155/2008/458732
- Palhares AC, Rodrigues-Morais TB, Van Sluys M-A, Domingues DS, Maccheroni W, Jordão H, Souza AP, Marconi TG, Mollinari M, Gazaffi R, Garcia AAF, Vieira MLC: A novel linkage map of sugarcane with evidence for clustering of retrotransposon-based markers. BMC Genet. 2012, 13: 51.
- Andersen JR, Lübberstedt T: Functional markers in plants. Trends Plant Sci. 2003, 8: 554-560.
- Kalendar R, Flavell AJ, Ellis THN, Sjakste T, Moisy C, Schulman A: Analysis of plant diversity with retrotransposon-based molecular markers. Heredity (Edinb). 2011, 106: 520-530.
- PGML BACMan On The Web: Grasses. http://www.plantgenome.uga.edu/bacman/BACManwww.php
- Rice Genome Annotation Project. http://rice.plantbiology.msu.edu/
- Bowers JE, Arias MA, Asher R, Avise JA, Ball RT, Brewer GA, Buss RW, Chen AH, Edwards TM, Estill JC, Exum HE, Goff VH, Herrick KL, Steele CLJ, Karunakaran S, Lafayette GK, Lemke C, Marler BS, Masters SL, McMillan JM, Nelson LK, Newsome GA, Nwakanma CC, Odeh RN, Phelps CA, Rarick EA, Rogers CJ, Ryan SP, Slaughter KA, Soderlund CA, et al: Comparative physical mapping links conservation of microsynteny to chromosome structure and recombination in grasses. Proc Natl Acad Sci U S A. 2005, 102: 13206-13211.
- Adam-Blondon A-F, Bernole A, Faes G, Lamoureux D, Pateyron S, Grando MS, Caboche M, Velasco R, Chalhoub B: Construction and characterization of BAC libraries from major grapevine cultivars. Theor Appl Genet. 2005, 110: 1363-1371.
- Manetti ME, Rossi M, Cruz GMQ, Saccaro NL, Nakabashi M, Altebarmakian V, Rodier-Goud M, Domingues D, D’Hont A, Van Sluys MA: Mutator system derivatives isolated from sugarcane genome sequence. Trop Plant Biol. 2012, 5: 233-243.
- Phrap. http://www.phrap.org/
- RepeatMasker. http://www.repeatmasker.org/
- Jurka J, Kapitonov VV, Pavlicek A, Klonowski P, Kohany O: Repbase update, a database of eukaryotic repetitive elements. Cytogenet Genome Res. 2005, 110: 462-467.
- Han Y, Wessler SR: MITE-Hunter: a program for discovering miniature inverted-repeat transposable elements from genomic sequences. Nucleic Acids Res. 2010, 38 (22): e199. doi:10.1093/nar/gkq862
- Frickey T, Lupas A: CLANS: a Java application for visualizing protein families based on pairwise similarity. Bioinformatics. 2004, 20: 3702-3704.
- Han Y, Qin S, Wessler SR: Comparison of class 2 transposable elements at superfamily resolution reveals conserved and distinct features in cereal grass genomes. BMC Genomics. 2013, 14: 71.
- Keller O, Kollmar M, Stanke M, Waack S: A novel hybrid gene prediction method employing protein multiple sequence alignments. Bioinformatics. 2011, 27: 757-763.
- Majoros WH, Pertea M, Salzberg SL: TigrScan and GlimmerHMM: two open source ab initio eukaryotic gene-finders. Bioinformatics. 2004, 20: 2878-2879.
- Haas BJ, Delcher AL, Mount SM, Wortman JR, Smith RK, Hannick LI, Maiti R, Ronning CM, Rusch DB, Town CD, Salzberg SL, White O: Improving the Arabidopsis genome annotation using maximal transcript alignment assemblies. Nucleic Acids Res. 2003, 31: 5654-5666.
- Haas BJ, Salzberg SL, Zhu W, Pertea M, Allen JE, Orvis J, White O, Buell CR, Wortman JR: Automated eukaryotic gene structure annotation using EVidenceModeler and the Program to assemble spliced alignments. Genome Biol. 2008, 9: R7.
- Petersen TN, Brunak S, von Heijne G, Nielsen H: SignalP 4.0: discriminating signal peptides from transmembrane regions. Nat Methods. 2011, 8: 785-786.
- TMHMM Server v. 2.0. http://www.cbs.dtu.dk/services/TMHMM-2.0/
- Rutherford K, Parkhill J, Crook J, Horsnell T, Rice P, Rajandream MA, Barrell B: Artemis: sequence visualization and annotation. Bioinformatics. 2000, 16: 944-945.
- UniProt. http://www.uniprot.org/
- InterPro: Protein sequence analysis and classification. http://www.ebi.ac.uk/interpro/
- Conesa A, Götz S: Blast2GO: a comprehensive suite for functional analysis in plant genomics. Int J Plant Genomics. 2008, 2008: 1-12.
- SUCEST-FUN Project. http://sucest-fun.org/
- MG-RAST: metagenomics analysis server. http://metagenomics.anl.gov/
- KAAS - KEGG automatic annotation server. http://www.genome.jp/kegg/kaas/
- Tamura K, Peterson D, Peterson N, Stecher G, Nei M, Kumar S: MEGA5: molecular evolutionary genetics analysis using maximum likelihood, evolutionary distance, and maximum parsimony methods. Mol Biol Evol. 2011, 28: 2731-2739.
- Lyons E, Freeling M: How to usefully compare homologous plant genes and chromosomes as DNA sequences. Plant J. 2008, 53: 661-673.
- Hall TA: BioEdit: a user-friendly biological sequence alignment editor and analysis program for Windows 95/98/NT. Nucleic Acids Symp Ser. 1999, 41: 95-98.
- Geneious - Homepage. http://www.geneious.com/
- Heslop-Harrison P, Schwarzacher T: Practical In Situ Hybridization. 2000, Oxford, UK: BIOS Scientific Publishers Ltd
- Aljanabi S, Forget L, Dookun A: An improved and rapid protocol for the isolation of polysaccharide-and polyphenol-free sugarcane DNA. Plant Mol Biol Report. 1999, 17: 1-8.
- Maq: Mapping and assembly with qualities. http://maq.sourceforge.net/
- SeqMonk. http://www.bioinformatics.babraham.ac.uk/projects/seqmonk/
- Gasic K, Hernandez A, Korban SS: RNA extraction from different apple tissues rich in polyphenols and polysaccharides for cDNA. Plant Mol Biol Report. 2004, 22 (December): 437a-437g.
- Li H, Durbin R: Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics. 2009, 25: 1754-1760.
- Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R: The sequence Alignment/Map format and SAMtools. Bioinformatics. 2009, 25: 2078-2079.
- Thompson JD, Higgins DG, Gibson TJ: CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 1994, 22: 4673-4680.
- Bandelt HJ, Forster P, Röhl A: Median-joining networks for inferring intraspecific phylogenies. Mol Biol Evol. 1999, 16: 37-48.
- Paterson AH, Freeling M, Tang H, Wang X: Insights from the comparison of plant genome sequences. Annu Rev Plant Biol. 2010, 61: 349-372.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.