Duplicate gene evolution, homoeologous recombination, and transcriptome characterization in allopolyploid cotton
© Flagel et al.; licensee BioMed Central Ltd. 2012
Received: 9 June 2011
Accepted: 6 July 2012
Published: 6 July 2012
Modern allotetraploid cotton contains an “A” and “D” genome from an ancestral polyploidy event that occurred approximately 1–2 million years ago. Diploid A- and D-genome species can be compared to the A- and D-genomes found within these allotetraploids to make evolutionary inferences about polyploidy. In this paper we present a comprehensive EST assembly derived from diploid and model allotetraploid cottons and demonstrate several evolutionary inferences regarding genic evolution that can be drawn from these data.
We generated a set of cotton expressed sequence tags (ESTs), comprising approximately 4.4 million Sanger and next-generation (454) transcripts supplemented by approximately 152 million Illumina reads from diploid and allotetraploid cottons. From the EST alignments we inferred 259,192 genome-specific single nucleotide polymorphisms (SNPs). Molecular evolutionary analyses of protein-coding regions demonstrate that the rate of nucleotide substitution has increased among both allotetraploid genomes relative to the diploids, and that the ratio of nonsynonymous to synonymous substitutions has increased in one of the two polyploid lineages we sampled. We also use these SNPs to show that a surprisingly high percentage of duplicate genes (~7 %) show a signature of non-independent evolution in the allotetraploid nucleus, having experienced one or more episodes of nonreciprocal homoeologous recombination (NRHR).
In this study we characterize the functional and mutational properties of the cotton transcriptome, produce a large genome-specific SNP database, and detect illegitimate genetic exchanges between duplicate genomes sharing a common allotetraploid nucleus. Our findings have important implications for our understanding of the consequences of polyploidy and duplicate gene evolution. We demonstrate that cotton genes have experienced an increased rate of molecular evolution following duplication by polyploidy, and that polyploidy has enabled considerable levels of nonreciprocal exchange between homoeologous genes.
KeywordsCotton Duplicate gene evolution Gene conversion Gossypium Homoeologous recombination Polyploidy Transcriptome
Cotton produces the world’s most utilized natural fiber, making it a staple of the textile and garment industries. In 2009 cotton fiber production in the US was valued at approximately 3.74 billion dollars, with an additional 666 million dollars derived from cotton seed extracts . This places cotton among the top five most valuable US crops. Accordingly, it is important to understand cotton’s gene repertoire and its evolutionary history. With the vast majority of cotton production being derived from the allotetraploid species Gossypium hirsutum, followed by smaller contributions from a second domesticated allotetraploid, G. barbadense, an appreciation of cotton biology in the context of its allotetraploid history has become a major research focus for the improvement of this crop. Here we present a comprehensive assembly of approximately 4.4 million EST sequences from these two allotetraploid species and from the diploid species representing the progenitor genomes of allotetraploid cotton. We illustrate how this assembly facilitates diagnosis of the genome-of-origin for what otherwise would have been ambiguously derived transcripts, and demonstrate the utility of this assembly for addressing questions pertaining to the phenomenon of allopolyploidy and cotton transcriptome evolution.
The cotton genus comprises about 50 species, divided into eight monophyletic diploid genome groups (historically denoted alphabetically as A through G, and K), and a clade of five allotetraploid species, the latter containing a genome from both the A and D diploid genome groups (with allotetraploids denoted as the AD group). The cytological and phylogenetic history of this genus is well-resolved (reviewed in: [2–5]), and data demonstrate that the best living models of the progenitor diploids of the A- and D-genomes of modern allotetraploid cotton are the diploid species G. arboreum and G. raimondii, respectively. In addition, extensive seed collections are available and include both wild progenitor forms along with their domesticated descendants. Collectively, these resources provide a useful system for studying the genomic and phenotypic changes that accompany domestication. Embedding these resources within their phylogenetic context results in a powerful framework for the study of polyploidy and domestication, and consequently cotton has emerged as a model system in these areas of research. Recent examples of the utility of the cotton system in this regard include detailed analyses of the cotton fiber transcriptome during development [6–10], as well as explorations of the mutational and expression changes that accompany allopolyploidization [11–19]. These types of studies, as well as those aimed at crop improvement, are enriched or facilitated by the availability of cotton EST collections, underscoring the utility and importance of developing a rich and deep database of EST resources.
Prior to this study, the community-driven global assembly of cotton ESTs included ~ 170,000 Sanger sequences, with ~ 31,000 G. arboreum (A-diploid model progenitor), ~ 69,000 G. raimondii (D-diploid model progenitor), and ~ 70,000 G. hirsutum (AD allotetraploid) sequences assembled into a single reference collection yielding 51,107 contigs . Because this assembly (and subsequent incremental updates) contained sequences from both model diploid parents and a natural allotetraploid representing their combination, it could be used to identify A- and D-genome-specific SNPs, which can in turn be used to assign genomic origin to homoeologous transcripts in allotetraploid cotton (homoeologs are duplicate genes originating via polyploidy). Implementation of this EST collection strategy has enabled several research possibilities, including monitoring genome-specific expression in natural allotetraploids [14, 16, 19, 21–23] and the quantification of the extent of nonreciprocal homoeologous recombination among natural allotetraploid species (i.e. recombination between homoeologs of the A- and D-genomes in an AD allotetraploid) . These efforts have elucidated the dynamics of gene expression evolution and genic content variation that accompanied allotetraploid formation and evolution in cotton.
Here we present a vastly expanded cotton EST assembly, which contains approximately 4.4 million Sanger and next-generation (454) transcripts. Like previous assemblies , this one incorporates ESTs from both the A- and D-genome diploid progenitors, along with allotetraploid ESTs from two species of allotetraploid cotton, G. barbadense and G. hirsutum. The 56,373 contigs obtained from this assembly represent a vastly expanded representation of the genic content of cotton. To add additional depth to the assembly, we also generated ~152 million 82 bp Illumina reads, representing the fiber transcriptome of diploid A- and D-genome cotton as well as the allotetraploids G. barbadense and G. hirsutum. Together these resources allow us to detect 259,192 genome-specific SNPs, which in turn can be used to distinguish the A- and D-genome homoeologs found in the allotetraploid cotton genome.
We describe this collection and document its utility for genome-specific transcriptome analysis in allotetraploid cotton. We also present a characterization of the functional properties of the cotton transcriptome and analyses of molecular evolution following the most recent whole genome duplication that accompanied allotetraploid formation 1–2 million years ago[3, 5]. At the time of writing, allotetraploid cotton is now among the most important crops lacking a whole genome sequence, but as progress is made in this regard, the EST assembly and genome-specific SNP resources presented here will be of use in assembling and annotating the cotton genome.
EST assembly characteristics
Tissues, sequence platform, and number of reads generated for each of the four cotton species used in this study
Tissues in library
Type of sequence
Approx. number of ESTs (X 1000)
UnassembledESTs (X 1000)
G. arboreum (cv. AKA8401)
gynoecium, calyx, fiber, roots, whole seedlings
G. arboreum 1
cotton fiber 7–10 dpa
G. raimondii (GN33)
gynoecium, calyx, fiber, roots, whole seedlings
G. raimondii (GN33)
meristem, calyx, fiber, root, petal, seedling
G. raimondii (GN33)
whole seedling, normalized floral organs including developing embryos
G. hirsutum (cv. Maxxa)
gynoecium, calyx, fiber, roots, whole seedlings
G. hirsutum (Tx2094)
bud, leaf, stem, whole seedling (with roots)
G. hirsutum 1
various sources. ESTs that are publically available in GenBank
G. barbadense (cv. Pima S6)
bud, leaf, stem, whole seedling (with roots)
G. barbadense (K101)
bud, leaf, stem, whole seedling (with roots)
G. barbadense 1
various sources. ESTs that are publically available in GenBank
Among the contigs, 48,729 include a predicted coding region, with an average coding length of 863.0 bp. This estimated coding sequence length for cotton is a little less than the average coding sequence length of some angiosperms with sequenced genomes. For example, the average coding sequence lengths of A. thaliana Populus trichocarpa (poplar), and Vitis vinifera (grape) are 1209.3, 1115.9, and 1072.3 bp, respectively (retrieved from Phytozome v5.0 ). If we assume that the true coding sequence length in cotton is comparable to the average length for these species (1132.5 bp), we estimate that our EST assembly provides coverage of about 76% of the coding sequence for the average contig. In addition to breadth of coverage, the assembly achieves a new level of depth of coverage; the median number of ESTs per contig was 44, with a range from 2 to 2,167, and the average depth of coverage for each assembled nucleotide was 23.6. These results mark a significant improvement over previous cotton transcriptome assemblies, where the average contig length was 791 bp, the median number of ESTs per contig was 3, and the average depth of coverage for each assembled nucleotide was 3.4 .
Interestingly, we observed 17,577 contigs for which sequences were lacking only for G. arboreum and only 750 contigs constructed entirely from G. arboreum ESTs (Figure 1). These findings together suggest that the G. arboreum ESTs collection represents a significantly narrower pool of genes relative to the other species. Although it is possible that some G. arboreum orthologs are missing because of gene loss or gene silencing, there is substantial evidence that G. arboreum has not lost a large portion of its transcriptome or experienced massive global shifts in gene expression [7, 11, 19, 20, 25]. We note that G. arboreum was the only species lacking 454 sequences generated by Titanium chemistry, having instead only 454 FLX reads. In support of this point, we do observe a 13 % enrichment of 454 Titanium reads from G. barbadense G. hirsutum, and G. raimondii in the contigs missing only G. arboreum when compared to their proportions in all other contigs (data not shown). Our best hypothesis is that technical differences between these two sequencing chemistries, including sequencing bias, differences in library construction, and assembly characteristics underlie the unexpectedly narrow breadth of the G. arboreum EST collection.
Sequence similarity searches (BLASTX) to the RefSeq (NCBI) database were used to provide an approximation of gene function. The contigs and singletons (>300 bp) were compared to annotated proteins from RefSeq with a minimum e-value score of 1e-15. 6,763 contig sequences had a high similarity (e-value = 0) to RefSeq sequences and 43,336 contig sequences had a match less than 1e-15. The contig sequences were also compared to the annotated genomes of Carica papaya and A. thaliana. When only the top match was considered, 4,984 contigs had a near identical match to an A. thaliana gene (e-value = 0) and 17,419 contigs had top matches with an e-value lower than 1e-100. More than 43,000 contigs had a top match to an A. thaliana gene with an e-value lower than 1e-15. When only the top C. papaya match was considered, 4,844 contigs had a near identical match to a C. papaya gene (e-value = 0) and 16,569 had top matches with an e-value lower than 1e-100. More than 40,000 contigs had a top match to a C. papaya gene with an e-value lower than 1e-15. The similar results obtained with both Arabidopsis and Carica lend confidence that this set of annotations is accurate.
Identification of protein coding domains
Among the 56,373 contigs in the assembly, 48,729 had a predicted translational product (assessed using ESTScan ), and among this subset 31,921 protein products had at least one annotated Pfam domain or could be assigned to a Pfam family. Not surprisingly, the most abundant domain was the Pentatricopeptide repeat (PPR; Pfam: PF01535; Additional file 1). This domain is among the most numerous in the angiosperms and may be associated with RNA stabilization and processing . Other highly prevalent Pfam annotations include the WD40, protein kinase, and Myb DNA binding domains, as well as the leucine rich repeat and RNA recognition motifs. Like the PPR repeat, these domains are also abundant in other angiosperm genomes.
Because our cotton EST assembly includes a large number of sequences from A- and D-genome diploid progenitors as well as AD-genome allotetraploid derived ESTs, it can be used to diagnose homoeologs and homoeolog-specific SNPs. In a strategy similar to Buggs et al., , we identified homoeolog-specific SNPs directly from the 454/Sanger assembly and supplemented this assessment with 151,863,736 Illumina sequences, also derived from the A- and D-genome diploid progenitors and AD-genome allotetraploids, and mapped to the de novo 454/Sanger EST assembly. This strategy leverages the strengths of both the 454 and Illumina next-generation platforms. The 454 (and legacy Sanger) reads are longer and assemble into contigs more readily, while the Illumina platform offers deep sequence coverage that can be used to confirm the 454/Sanger SNPs and find new SNPs in regions with low 454/Sanger read coverage.
In total, we identified 259,192 SNPs between the diploid A- and D-genomes, including 186,464 that were exclusive to the de novo 454/Sanger assembly, 122,940 exclusive to the Illumina reference mapping, and 50,212 common to both strategies (Additional file 2). The intersection of both SNP pools is only a fraction of the total. However, to identify homoeologous SNPs in the 454/Sanger assembly it is necessary to have ESTs from the A- and D-genome diploids present, as is the case for 21,643 contigs (Figure 1). Furthermore, within these contigs the A- and D-genome diploid sequences must overlap one another. This occurs for approximately 29 % of all available nucleotide positions within the eligible contigs (data not shown). Finally, cross-validating a SNP with both the 454/Sanger assembly and the Illumina reads requires an overlap between both diploid parents from both sources of ESTs (454/Sanger and Illumina ESTs). This occurs for about 23% of all sites from the 21,643 eligible contigs (data not shown). The diminishing effect these requirements have on SNP identification likely explains the modest proportion of SNPs that intersect between both platforms.
Among those genome-specific SNPs found in both the 454/Sanger and Illumina data sets, only 677 (~1.3%) were in disagreement about the nucleotides representing the A- and D- genomes, leaving 49,535 (~98.7%) in agreement (Additional file 2). This last class of SNPs represents a high confidence category, though the SNPs identified only in the 454/Sanger or Illumina data sets have considerable internal support and also are likely reliable. For the cross-platform verified SNPs, we confirmed the presence of all A- and D-genome diploid SNP variants within the allotetraploid Illumina reads. This additional constraint of no further nucleotide evolution subsequent to polyploidization in either the diploid or allotetraploid lineages ensures that these SNPs almost certainly occurred prior to the formation of the allotetraploids. These SNPs can be used to diagnose both diploid homologs and allotetraploid homoeologs.
Detection of nonreciprocal homoeologous recombination (NRHR)
A total of 6,122 unique NRHR were detected in the new assembly. Of these, approximately 46 % and 31 % are found only in G. hirsutum and in G. barbadense, respectively, whereas about 23% of the exchanges were found in both species (Figure 3). Interestingly, the unique set of 6,122 NRHR belong to only 3,852 contigs. Contigs with multiple exchanges may have experienced independent NRHR events, or alternatively, they may represent only one complex event, with several internal exchanges, that we have been unable to bridge together and resolve. This finding was initially reported by Salmon et al., and its reiteration here with a much larger sample size provides evidence that certain duplicated genes or regions of the allotetraploid genome may harbor lengthy complex NRHR or could be “hotspots” of repeated exchange.
Molecular evolution in diploid and allotetraploid cotton
Molecular evolution in diploid and allopolyploid cotton
Number of Contigs1
Mean dS (Standard Dev.)
Mean dN (Standard Dev.)
Mean dN/dS (Standard Dev.)
A2 vs. D5
AT vs DT - G. barbadense
AT vs DT - G. hirsutum
A2 vs AT - G. barbadense
A2 vs AT - G. hirsutum
D5 vs DT - G. barbadense
D5 vs DT - G. hirsutum
We isolated the coding regions (as predicted by ESTScan) from which we calculated the rates of synonymous (dS) and nonsynonymous (dN) divergence within and between species (Table 2). These two metrics can also be combined in as a ratio (dN/dS), which gives an indication of the selection pressures operating on the protein composition of a gene. Elevated dN/dS ratios (> 1) may be indicative of positive selection, though with low overall levels of divergence, as is seen in cotton, this interpretation must be viewed with caution as stochastic processes operating on a small number of total mutations can lead to spurious inferences of selection. Nonetheless, gene and genome duplication are thought to provide the opportunity for rapid gene evolution of one duplicate, so long as the other duplicate maintains sufficient ancestral functionality [29, 30]. In the present context, we might expect to find an elevated dN/dS ratio between cotton AT and DT homoeologs, when compared to the dN/dS ratio of their A2 and D5 diploid progenitors.
Phylogenetically polarized lineage-specific substitutions for a) Gossypium barbadense , and b) Gossypium hirsutum allotetraploid genomes relative to their A- and D-genome diploid counterparts
Number of Contigs
a) Gossypium barbadense lineage-specific SNPs relative to A- and D-genome diploids
b) Gossypium hirsutum lineage-specific SNPs relative to A- and D-genome diploids
Number of Contigs
The cotton EST assembly
In the introduction we noted that because of cotton’s economic significance, it is important to understand both its transcriptome and its evolutionary history. Here we illustrate the interconnectedness of these two aspects of Gossypium, showing how insight into the transcriptome of allotetraploid cotton requires a sound phylogenetic and molecular evolutionary framework. It has become clear from Gossypium and investigations of other allotetraploids that genome doubling often entails violations of the assumption of additivity, i.e., the allotetraploid genomes do not represent a simple combination of their diploid parents in their gene expression or content. For example, microarray analyses in allotetraploid cotton have shown that various tissues exhibit a pattern of parental genome expression described as “genomic dominance” (over-expression of one parental genome relative to the other when compared to the ancestral expression state of both diploid parents) [14, 17, 19]. Additionally, our analyses and those of Salmon et al. have catalogued a surprising amount of NRHR in cotton allotetraploids that exhibit exclusive bivalent pairing at meiosis with no obvious cytogenetic hints at homoeologous exchanges. In light of these findings, it is important to understand the gene content similarities and differences among diploid and allotetraploid cotton. Toward that end, we present here a combined diploid and allotetraploid cotton transcriptome assembly, and utilize this assembly to assess broad patterns of genic evolution prior to and following polyploidy.
The cotton transcriptome assembly presented here contains 56,373 contigs. This assembly offers only a slight increase in contig number over the previous iteration , which had 51,107 contigs, but increases nucleotide coverage by nearly 7-fold. This expanded coverage makes it much easier to identify nucleotide substitutions. We used this added power and further supplementation by Illumina reads to document approximately 250,000 genome-specific SNPs and demonstrate the utility of these SNPs in diagnosing genic evolution and NRHR and in cotton allotetraploids. We expect that this collection of genome-specific SNPs will have broad utility in a diverse array of future studies spanning the full breadth of investigations in cotton that are enriched by genomic insight, including myriad applied objectives where SNP information will facilitate breeding, for example, or in fundamental investigations of genome-specific expression.
A surprising level of non-independent evolution among homoeologs
With respect to the concept that allopolyploidy is accompanied by novel genetic or genomic phenomena of possible evolutionary significance, we build upon the approach of Salmon et al.  to show the extent of non-independent evolution among genic homoeologs in Gossypium. For comparison, in an initial survey of NRHR, Salmon et al., attempted to determine the age of these events using a phylogenetic approach involving the 5 natural Gossypium allotetraploid species. Among six NRHR events, they found that five occurred in only one species, with the sixth event being ancestral. Here we extend this approach to examine tens of thousands of genes for conserved NRHR from two allotetraploid species (Figure 3). Based on these data, we report the surprising result that about 7% of all contigs in the assembly show evidence of NRHR in at least one of the two polyploidy species. As discussed by Salmon et al. , bioinformatic inferences of NRHR are subject to several possible sources of false positives, including assembly artifacts, sequencing error, and autapomorphic homoplasy subsequent to polyploid formation. However, Salmon et al.  established a 70% rate of validation using independent laboratory methods (PCR, cloning and sequencing), thus applying this false-positive rate we infer a true rate of NRHR in cotton to be approximately 5% of all contigs.
This high frequency of homoeologous contact is an astonishing result given the absence of prior cytogenetic observations that would have suggested this possibility, and apparently complete bivalent formation at meiosis. Furthermore, about one-quarter of all detected events are shared by both allotetraploids, indicating that they likely occurred in a common ancestor and have been retained to the present. The remaining exchanges occurred more recently, since the divergence of the two allotetraploid lineages, indicating that homoeologous exchange need not be restricted to the initial stages of allopolyploidization, during which time bivalent formation may not yet have been evolutionarily stabilized, but that it arises persistently on an evolutionary time scale. We also uncover approximately 50% more NRHR events in G. hirsutum compared to G. barbadense. The detection of NRHR relies on diagnostic reads, and thus detection should increase with deeper read sampling. Interestingly, G. hirsutum is only represented by about 7% more reads than G. barbadense in our assembly. For this reason, the increase in identified NRHR in G. hirsutum likely does not stem from greater read sampling alone. Instead, our results suggest that the rates of NRHR in the G. hirsutum and G. barbadense lineages have diverged following speciation.
Equally interesting is the possibility that NRHR events are not randomly distributed, instead that some genes maybe “hotspots” for NRHR. It will be fascinating to explore the genomic distribution of these hotspots once genome sequences become available. This approach might lead to insights into the mechanistic underpinnings of this form of “illegitimate intergenomic contact” following allotetraploidy in cotton. In addition, experimental analyses are required to assess whether any of the detected NRHR events have had physiological consequences and hence may be adaptively relevant.
Molecular evolution in diploids and allotetraploids
The data set generated here provides an exceptionally detailed view of genic evolution in diploid and allotetraploid cotton. One of the key biological insights of the present study emerges from the observation of evidence for an increased rate of nonsynonymous to synonymous substitution between the A- and D-genomes in G. hirsutum when compared to the diploid A- and D-genomes or those in G. barbadense. We reiterate, however, the caveats that there is a limited timeframe for substitutions to have accumulated, which would tend to make these ratios sensitive a small number of substitutions. On the other hand, we do find strong evidence for a generalized increase in the nucleotide substitution rate in both allotetraploid genomes relative to their diploid progenitors. This same data set also provides a rich quantitative depiction of the genomic composition and history of allotetraploids, revealing the evolutionary relationships among the A2, D5, AT and DT genomes.
To assess whether the data could be mined to detect genic targets of selection, we estimated the rates of nonsynonymous to synonymous (dN/dS) mutation between the diploid A2 and D5 genomes and the allotetraploid AT and DT genomes. Overall, the vast majority of these dN/dS values are less than one. Regarding outliers, we find 943 contigs have a value greater than one between the A- and D-genome diploids, 470 between the AT and DT genomes in G. barbadense, and 532 between the AT and DT genomes in G. hirsutum. These contigs have experienced a greater amount of amino acid substitution than silent substitution (per site), a hallmark of positive selection, but we note again (as above) the high likelihood of false positives for dN/dS ratios when dealing with small numbers of substitutions. Among the genes that might be explored for selection, the largest outliers in each lineage are a cupin domain containing protein, a conserved plant protein of unknown function, and a putative phytocyanin protein (for the A- and D-genomes of G. barbadense, G. hirsutum, and the diploid A2 and D5 genomes, respectively). None of these rapidly evolving genes have yet been linked to phenotypic traits in cotton, but this is not surprising as relatively few cotton genes have a demonstrated phenotypic effect.
In this manuscript we produce a new transcriptome assembly from diploid and allotetraploid cotton species and demonstrate several biological insights that can be derived from this resource. The large number of EST sequences generated for this study offer the most comprehensive view of the cotton transcriptome to date, and possibly the largest examination of the transcriptome from a polyploid plant and its diploid progenitors. By sequencing extant A- and D-genome diploids along with A- and D-genome containing allotetraploids, we are able to identify a vast number of SNPs that differentiate the allotetraploid A- and D-genomes. These SNPs can be used to explore molecular evolution within and between these genomes in both the presence and absence of polyploidy. Moreover, though cotton is a major crop, its genome is unsequenced. Our characterization of the transcriptome sheds light on the functional properties of the cotton genome.
Our results reveal an increased rate of nucleotide substitution in both polyploid lineages, though this increase only impacts the nonsynonymous to synonymous substitution ratio in G. hirsutum. We also uncover a surprising level of nonreciprocal homoeologous recombination (NRHR) between the allotetraploid A- and D-genomes. From these two findings we can conclude that genome duplication has impacted the mutation rate, and through long-term co-residency between the cotton A- and D-genomes, resulted in frequent illegitimate contact, resulting in homoeologous sequence conversion through nonreciprocal exchange.
Plant material and EST library construction and sequencing
454-FLX and Titanium ESTs were derived from various Gossypium species and tissue types (Table 1). RNA was independently extracted from each tissue source using a modified hot-borate method (Wilkins and Smart, 1996) and checked for integrity on Bioanalyzer (Agilent Technologies, Santa Clara, CA). Equimolar amounts of RNA from each extraction were combined into a single sample for cDNA library construction. cDNA libraries were constructed using SMART method (Clontech, Mountain View, CA) and the resulting amplified, double-stranded libraries were normalized using a double-strand nuclease (Trimmer, Evrogen, Moscow, Russia). To prevent poly-A (or poly-T) homopolyers in the 454 reads, we employed two strategies. The first strategy was applied to the FLX reads where a TypeIIS endonuclease was used to cleave 18–20 bp of transcript from a modified 3' SMART adapter (K. Delehaunty, personal communication). The second strategy, used for the 454 Titanium reads, employed PCR-suppression oligos to target particular regions in the transcript (5', internal, or 3'; ). 5’, internal, and 3’ transcript segments were pooled for cDNA sequencing of the G. raimondii sample. Only 5’ and internal segments were pooled for Titanium sequencing of G. hirsutum (Tx2094) and G. barbadense (K101 and Pima S6).
Sequencing was performed using 454 sequencing (454 Life Sciences, Branford, CT) at the Brigham Young University DNA sequencing center (FLX and Titanium) and Washington University (FLX). The reads have been made publically available through NCBI's Sequence Read Archive (Study #SRP001603). All publicly available Sanger reads were downloaded from GenBank (Feb. 2009) and filtered for duplicates, short ESTs (< 30 bp), and low-quality, vector, and low-complexity sequences using Lucy v. 1.20 .
All of the ESTs were combined to create a single, omnibus assembly of the Gossypium transcriptome that we have named Cotton46a where ‘46a’ refers to the assembly iteration. This assembly can be accessed and explored at our project website . Sanger and 454 ESTs were assembled using the CLCBio Genomics Workbench (v. 3.7.1; CLC bio, Aarhus, Denmark) with the following parameters set for all collections of input sequence (similarity = 0.95, minimum length fraction = 0.5, insertion cost = 3, deletion cost = 3, mismatch cost = 2). Quality values were used for all 454 reads and as well as the G. raimondii Sanger reads, though most other Sanger reads retrieved from GenBank lacked quality scores. ESTScan v. 3.0.3  was used to predict the coding sequences of each contig based on the codon preference matrix of A. thaliana.
We screened the contigs from our Cotton46a EST against well-annotated cloned and sequenced coding regions from the cotton diploid and tetraploid A- and D-genomes. In total we compared 14, 21, 19, and 13 genes and from A2, D5, and G. hirsutum AT and DT, respectively (Additional file 4). We found no evidence for homoeologous miscategorization between AT and DT. Among all accessions we analyzed 22,828 nucleotides and found 81 total substitutions, resulting in an error rate of 0.35 %. This error rate is likely inflated by allelic polymorphisms, as the exact genotypes used in our study were not necessarily the same as those retrieved from GenBank. For this reason this error rate should be considered a conservative upper-bound on the true error rate.
Illumina mRNA-Seq generation and alignment to 454/Sanger reference contigs
Cotton fiber RNAs were extracted from G. arboreum (accession cv. AKA8401), G. barbadense (accessions K101 and cv. Pima S6), G. hirsutum (accessions cv. Maxxa and TX2094), and G. raimondii (accession unnamed) using a modified hot-borate protocol described in Hovav et al.. RNA extractions were prepared for sequencing using the Illumina mRNA-Seq Sample Prep. Kit. Illumina sequencing (Illumina, Inc., San Diego, CA) was performed by the Iowa State DNA Facility. The reads have been made publically available through NCBI's Sequence Read Archive (Study #SRP001603). Reads were aligned to the 56,373 454/Sanger reference contigs using the bwtsw algorithm implemented by the BWA read-mapping software , leaving all parameters set to default.
SNPs were extracted from the 454/Sanger contig assembly and the Illumina mRNA-Seq alignments in parallel. In the 454/Sanger assembly SNPs each contig alignment of the assembly consisted of overlapping reads. Where possible, a consensus sequence of each diploid genome was created from its respective reads. Similar to the parameters of CLCBio, a majority-rule at each position was used to determine all consensus sequences. Individual allotetraploid reads were then compared to the diploid consensus sequences so that they could be categorized as either belonging to the A- or D-genome. Consensus A- and D-genome allotetraploid sequences were then created from these allotetraploid ESTs and SNPs were identified based on differences between these two consensus sequences. For all consensus sequences, the major allele had a frequency greater than 90 %. This method was implemented using bioperl  and custom perl scripts.
In parallel, SNPs were extracted from the Illumina mRNA-Seq alignments using the pileup program from the SAMtools package . All Illumina SNP calls were required to have a pileup SNP-quality score ≥ 20. Diploid A- and D-genome (G. arboreum and G. raimondii) specific SNPs and allotetraploid specific gene losses can confound homoeolog detection within allotetraploid cotton (see Figure 1 in Salmon et al. for additional description). To overcome this problem, we checked if both the A and D parental SNP alleles were present among the allotetraploid reads. In so doing, we created 2 categories, Illumina SNPs with confirmed or unconfirmed presence in the allotetraploids. The Illumina SNPs with a confirmed presence are likely the result of shared ancestral substitutions within the cotton A- and D-genome lineages, and therefore have the highest reliability. The unconfirmed class of Illumina SNPs may include many bona fide SNPs shared by both the parental diploids and the allotetraploids, but we lack the evidence necessary to prove that they are not a result of diploid lineage-specific mutations or allotetraploid gene loss. For this reason we report only the confirmed class of Illumina SNPs.
Pfam and Gene Ontology annotation
The search for protein domains and families within the current cotton assembly was performed with HMMER3  using the manually curated Pfam-A protein database . To broadly assess functional characteristics of the cotton assembly we assigned the Gene Ontology (GO) terms to contigs by converting their Pfam terms to GO terms using the pfam2go database [40, 41].
Sequence similarity searches
The assembled contigs were compared to the refseq_protein database from NCBI  using BLASTX to characterize the functional diversity within the transcriptome of cotton. BLASTX comparisons of the cotton transcriptome and Arabidopsis thaliana and Carica papaya (retrieved from Phytozome ) were also performed to compare gene content among these species.
Nonreciprocal homoeologous recombination detection
The computational methods used to identify putative NRHR events are described in Salmon et al., although the present study makes use of approx. 1.5 million G. barbadense ESTs that were not available to Salmon et al.. Custom scripts were used to traverse the EST alignment for each contig, searching for G. barbadense or G. hirsutum reads that contain both A- and D-genome specific SNPs. Once identified, these reads were resolved into sets that support unique NRHR events. This was done by binning together all reads that share or extend a continuous NRHR SNP haplotype. Then, by tracking the species found within these sets we were able to identify events that were specific to either allotetraploid, as well as those that were shared by both species and thus likely occurred in a common allotetraploid ancestor.
Using the same method of consensus sequence construction that was used for the 454/Sanger SNP analysis, the coding sequences of A2 and D5 were extracted from their respective consensus sequences using the coding frame predicted by ESTScan . In addition for both G. barbadense and G. hirsutum, we extracted the homoeologous AT and DT genome sequences. After excluding genes with a putative NRHR event, synonymous and nonsynonymous substitutions were calculated in a pairwise manner using coding sequence alignments for each pair of species and the method of Nei and Gojobori . These alignments were constructed using the maximum number of bases for each pairwise contrast, rather than the subset common to all species. Paired alignments less than 198 bp were not used and codons with ambiguous bases were removed. These parameters and variable amounts of sequence overlap between the individual consensus sequences within contigs resulted in differential numbers of alignments for overall diversity estimates. For these reasons, the number of contigs used for Table 2 and Table 3 is smaller than the total number of available contigs reported in Figure 1. All dN an dS estimates were corrected for multiple substitutions using the Jukes-Cantor correction implemented in bioperl . The lineage-specific SNPs compiled in Table 3 were extracted from the alignments described above by phylogenetically polarizing substitutions to find those that occur only in one lineage, as assessed by comparing to the remaining lineages.
Allotetraploid Gossypium A-genome
Contiguous DNA fragment
Nonsynonymous mutation rate
Synonymous mutation rate DT, allotetraploid Gossypium D-genome
Expressed sequence tag
Nonreciprocal homoeologous recombination
Single nucleotide polymorphism.
We thank Kara Grupp and Daniel Standage for technical assistance and Cotton Incorporated and the National Science Foundation Plant Genome Program for their generous support.
- USDA Crop Values: Summary. 2009,http://usda.mannlib.cornell.edu/usda/current/CropValuSu/CropValuSu-02-19-2010.pdf,
- Endrizzi JE, Turcotte EL, Kohel RJ: Genetics, cytology, and evolution of Gossypium. Adv Genet. 1985, 23: 271-375.View Article
- Wendel JF, Brubaker CL, Seelanan T: The origin and evolution of Gossypium. Physiology of cotton. Edited by: Stewart JM, Oosterhuis D, Heithholt JJ, Mauney JR. 2010, Springer, Netherlands, 1-18.View Article
- Wendel JF, Cronn R: Polyploidy and the evolutionary history of cotton. Adv Agron. 2003, 78: 139-186.View Article
- Wendel JF, Brubaker CL, Alvarez I, Cronn RC, Stewart JM: Evolution and natural history of the cotton genus. Genomics of cotton, plant genetics and genomics; crops and models 3. Edited by: Paterson AH. 2009, Springer, New York, 3-22.View Article
- Rapp R, Haigler C, Flagel L, Hovav R, Udall J, Wendel J: Gene expression in developing fibres of upland cotton (Gossypium hirsutum L.) was massively altered by domestication. BMC Biology. 2010, 8: 139-10.1186/1741-7007-8-139.PubMed CentralView ArticlePubMed
- Arpat A, Waugh M, Sullivan J, Gonzales M, Frisch D, Main D, Wood T, Leslie A, Wing R, Wilkins T: Functional genomics of cell elongation in developing cotton fibers. Plant Mol Biol. 2004, 54: 911-929. 10.1007/s11103-004-0392-y.View ArticlePubMed
- Chaudhary B, Hovav R, Flagel L, Mittler R, Wendel J: Parallel expression evolution of oxidative stress-related genes in fiber from wild and domesticated diploid and polyploid cotton (Gossypium). BMC Genomics. 2009, 10: 378-10.1186/1471-2164-10-378.PubMed CentralView ArticlePubMed
- Chaudhary B, Hovav R, Rapp R, Verma N, Udall JA, Wendel JF: Global analysis of gene expression in cotton fibers from wild and domesticated Gossypium barbadense. Evol Dev. 2008, 10: 567-582. 10.1111/j.1525-142X.2008.00272.x.View ArticlePubMed
- Wu Y, Machado AC, White RG, Llewellyn DJ, Dennis ES: Expression profiling identifies genes expressed early during lint fibre initiation in cotton. Plant Cell Physiol. 2006, 47: 107-127.View ArticlePubMed
- Grover CE, Kim H, Wing RA, Paterson AH, Wendel JF: Microcolinearity and genome evolution in the AdhA region of diploid and polyploid cotton (Gossypium). Plant J. 2007, 50: 995-1006. 10.1111/j.1365-313X.2007.03102.x.View ArticlePubMed
- Flagel LE, Chen L, Chaudhary B, Wendel JF: Coordinated and fine-scale control of homoeologous gene expression in allotetraploid cotton. J Hered. 2009, 100: 487-490. 10.1093/jhered/esp003.View ArticlePubMed
- Senchina DS, Alvarez I, Cronn RC, Liu B, Rong J, Noyes RD, Paterson AH, Wing RA, Wilkins TA, Wendel JF: Rate variation among nuclear genes and the age of polyploidy in Gossypium. Mol Biol Evol. 2003, 20: 633-643. 10.1093/molbev/msg065.View ArticlePubMed
- Flagel LE, Wendel JF: Evolutionary rate variation, genomic dominance and duplicate gene expression evolution during allotetraploid cotton speciation. New Phytol. 2010, 186: 184-193. 10.1111/j.1469-8137.2009.03107.x.View ArticlePubMed
- Adams KL, Cronn R, Percifield R, Wendel JF: Genes duplicated by polyploidy show unequal contributions to the transcriptome and organ-specific reciprocal silencing. Proc Natl Acad Sci USA. 2003, 100: 4649-4654. 10.1073/pnas.0630618100.PubMed CentralView ArticlePubMed
- Flagel L, Udall J, Nettleton D, Wendel J: Duplicate gene expression in allopolyploid Gossypium reveals two temporally distinct phases of expression evolution. BMC Biology. 2008, 6: 16-10.1186/1741-7007-6-16.PubMed CentralView ArticlePubMed
- Rapp R, Udall J, Wendel J: Genomic expression dominance in allopolyploids. BMC Biology. 2009, 7: 18-10.1186/1741-7007-7-18.PubMed CentralView ArticlePubMed
- Salmon A, Flagel L, Ying B, Udall JA, Wendel JF: Homoeologous nonreciprocal recombination in polyploid cotton. New Phytol. 2010, 186: 123-134. 10.1111/j.1469-8137.2009.03093.x.View ArticlePubMed
- Yang S, Cheung F, Lee JJ, Ha M, Wei NE, Sze SH, Stelly DM, Thaxton P, Triplett B, Town CD, et al: Accumulation of genome-specific transcripts, transcription factors and phytohormonal regulators during early stages of fiber cell development in allotetraploid cotton. Plant J. 2006, 47: 761-775. 10.1111/j.1365-313X.2006.02829.x.PubMed CentralView Article
- Udall JA, Swanson JM, Haller K, Rapp RA, Sparks ME, Hatfield J, Yu Y, Wu Y, Dowd C, Arpat AB, et al: A global assembly of cotton ESTs. Genome Res. 2006, 16: 441-450. 10.1101/gr.4602906.PubMed CentralView ArticlePubMed
- Udall JA, Swanson JM, Nettleton D, Percifield RJ, Wendel JF: A novel approach for characterizing expression levels of genes duplicated by polyploidy. Genetics. 2006, 173: 1823-1827. 10.1534/genetics.106.058271.PubMed CentralView ArticlePubMed
- Udall JA, Cheung F, Woodward AW, Swanson JM, Flagel L, Hovav R, Rapp RA, Nettleton D, Lee JJ, Gingle AR, et al: Spotted cotton oligonucleotide microarrays for gene expression analysis. BMC Genomics. 2007, 8: 81-10.1186/1471-2164-8-81.PubMed CentralView ArticlePubMed
- Chaudhary B, Flagel L, Stupar RM, Udall JA, Verma N, Springer NM, Wendel JF: Reciprocal silencing, transcriptional bias and functional divergence of komeologs in polyploid cotton (Gossypium). Genetics. 2009, 182: 503-517. 10.1534/genetics.109.102608.PubMed CentralView ArticlePubMed
- Grover CE, Kim H, Wing RA, Paterson AH, Wendel JF: Incongruent patterns of local and global genome size evolution in cotton. Genome Res. 2004, 14: 1474-1482. 10.1101/gr.2673204.PubMed CentralView ArticlePubMed
- Iseli C, Jongeneel C, Bucher P: ESTScan: a program for detecting, evaluating, and reconstructing potential coding regions in EST sequences. Proc Int Conf Intell Syst Mol Biol. 1999, 138: 48-
- Schmitz-Linneweber C, Small I: Pentatricopeptide repeat proteins: a socket set for organelle gene expression. Trends Plant Sci. 2008, 13: 663-670. 10.1016/j.tplants.2008.10.001.View ArticlePubMed
- Buggs RJ, Chamala S, Wu W, Gao L, May GD, Schnable PS, Soltis DE, Soltis PS, Barbazuk WB: Characterization of duplicate gene evolution in the recent natural allopolyploid Tragopogon miscellus by next-generation sequencing and Sequenom iPLEX MassARRAY genotyping. Mol Ecol. 2010, 19: 132-146.View ArticlePubMed
- Conant GC, Wolfe K: Turning a hobby into a job: how duplicated genes find new functions. Nat Rev Genet. 2008, 9: 938-950. 10.1038/nrg2482.View ArticlePubMed
- Ohno S: Evolution by gene duplication. 1970, Springer, New York, NYView Article
- Meyer E, Aglyamova G, Wang S, Buchanan-Carter J, Abrego D, Colbourne J, Willis B, Matz M: Sequencing and de novo analysis of a coral larval transcriptome using 454 GSFlx. BMC Genomics. 2009, 10: 219-10.1186/1471-2164-10-219.PubMed CentralView ArticlePubMed
- Chou H-H, Holmes MH: DNA sequence quality trimming and vector removal. Bioinformatics. 2001, 17: 1093-1104. 10.1093/bioinformatics/17.12.1093.View ArticlePubMed
- . , , , http://www.cottonevolution.info/
- Hovav R, Udall JA, Hovav E, Rapp R, Flagel L, Wendel JF: A majority of cotton genes are expressed in single-celled fiber. Planta. 2008, 227: 319-329.View ArticlePubMed
- Li H, Durbin R: Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics. 2009, 25: 1754-1760. 10.1093/bioinformatics/btp324.PubMed CentralView ArticlePubMed
- Stajich JE, Block D, Boulez K, Brenner SE, Chervitz SA, Dagdigian C, Fuellen G, Gilbert JGR, Korf I, Lapp H, et al: The bioperl toolkit: perl modules for the life sciences. Genome Res. 2002, 12: 1611-1618. 10.1101/gr.361602.PubMed CentralView ArticlePubMed
- Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R, Subgroup" GPDP: The sequence alignment/map format and SAMtools. Bioinformatics. 2009, 25: 2078-2079. 10.1093/bioinformatics/btp352.PubMed CentralView ArticlePubMed
- Hunter S, Apweiler R, Attwood TK, Bairoch A, Bateman A, Binns D, Bork P, Das U, Daugherty L, Duquenne L, et al: InterPro: the integrative protein signature database. Nucleic Acids Res. 2009, 37: D211-215. 10.1093/nar/gkn785.PubMed CentralView ArticlePubMed
- Nei M, Gojobori T: Simple methods for estimating the numbers of synonymous and nonsynonymous nucleotide substitutions. Mol Biol Evol. 1986, 3: 418-426.PubMed
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.