We assembled ~24 Mbp of transcriptome sequence in each of four individuals of two species of Eichhornia. The data represent an important genetic resource of nearly 27,000 transcripts, many of which are common to all four samples. Further, using read mapping to a consensus transcriptome we have generated statistically informed genotypes for each individual in our study. By choosing to extract RNA from buds and mature flowers we were able to recover many genes involved in reproduction and floral development, some of which are likely to provide future insights on genetic changes to floral traits governing mating-system variation. We now compare our assembly and analysis with previous de novo transcriptome sequencing projects and briefly review some of the challenges and interpretations specific to our assembly. In addition, we also discuss the utility of short-read sequencing for characterizing genetic changes in transcriptomes and the expression level of different loci.
Assembly & consensus transcriptome
Our study is among the first published attempts at a de novo transcriptome assembly using short-read (Illumina GAII) sequencing. Although there are now many studies using next generation transcriptome sequencing, most have used the Roche 454 platform (but see [15–17]). As previously mentioned, this platform has the advantage of longer reads but at the expense of less sequence data per run. Longer reads may be critical for resolving assembly challenges associated with repetitive elements, such as gene duplications, allelic differences and alternative splicing (reviewed in ). However, despite using shorter reads, our assembly is comparable to other published transcriptomes, which at this time average 37,286 contigs (12,883 - 72,977, n = 19 studies), a number similar to our own (26,994 contigs). In addition, despite the longer read lengths (~ 400 bp; 197 - 581 bp, n = 19 studies), of previously published transcriptome studies using the Roche 454 platform, our assembly generated contigs that were over twice as long averaging 884 bp. One of the reasons that this is the case is because the Illumina generates greater depth of sequencing thus ensuring more complete coverage of the transcriptome.
Although summaries of the distribution of contig lengths are informative, the ultimate goal of transcriptome assembly is not long sequences, but accurate assembly of full-length sequences. However, it is difficult to assess the success of an assembly without a priori knowledge of the transcriptome. One metric that may be informative is the proportion of contigs that have significant similarity to known proteins. It is difficult to compare this measure across studies because each reports slightly different results using different BLAST parameters and databases. However, nearly 87% of our contigs had matches in NR and this value is as high or higher than all other comparable statistics reported in other de novo assemblies. Another useful metric is the proportion of the contig and its corresponding BLAST hit that align to one another (Figure 2). 7273 (26.9%) contigs cover greater than 75% of their best BLAST hit and 12,659 (47%) contigs are fully covered by their best BLAST hit. This means that although we assembled a large number of full-length proteins, many of the contigs appear to be fragments of larger proteins. One explanation is that gene duplication or alternative splicing creates repetitive elements in the assembly and these cannot be resolved. Although we found a fraction of our ESTs (5.1%) that had conflicts with Sanger sequenced ESTs, this is likely an overestimate of assembly error because some of these conflicts could result from paralogy or alternative splicing. There was no evidence that any of the 11 conflicting ESTs were chimeric assemblies of two or more proteins and most of the conflicts (9 of 11) appear to be the result of misaligned untranslated regions flanking coding regions. It is possible that some of the discrepancies between the ESTs we assembled and Sanger ESTs or known proteins are true differences, for example, paralogous transcripts or alternatively spliced isoforms.
Genotype calling & SNP detection
One of the major challenges in dealing with very short sequence reads is that they must be assembled into longer contigs based on overlap with other reads. The algorithms used in many de novo assemblers, including Velvet , may misinterpret small differences between alleles (SNPs) or gene duplicates as sequencing errors. If this occurs they can be 'collapsed' or purged from the assembly. Although the program Oases has begun to address this problem for transcriptome data, we chose to use read mapping to a consensus transcriptome because it allowed us to use allele frequencies at each site to statistically determine the genotype of each individual . From this approach we obtained, on average, more than 20 reads for each position to inform genotype calls. This allowed us to generate sequences for ~23,500 loci/genotype, 18, 000 of which were found in all samples. An additional benefit of using this approach is that we were able to identify heterozygous loci and potential read mapping errors. Of the 26,994 consensus sequences, 8712 heterozygous loci were identified. As expected, the fraction of heterozygous loci was highest in the outcrossing genotype of E. paniculata from Brazil. The selfing genotypes from Jamaica and Nicaragua appear to retain some residual heterozygosity, despite their predominantly autogamous mating systems. We also found evidence of read mapping errors in 8469 loci where more than one selfer appeared heterozygous. Possible explanations for these read-mapping errors include, sequencing errors, alternative splicing and most likely paralogy. With our current data and the available methods we have no way of determining their relative contribution to read mapping errors.
Sequencing of multiple paralogous transcripts will generate short reads that are similar or identical to many other reads derived from different loci. As a result, when there has not been sufficient divergence between gene copies, reads may be erroneously mapped to the reference sequence. One possible explanation for false heterozygosity is that repetitive elements, for example, conserved motifs in gene families, may be difficult or impossible to assemble into long contigs using current technology. As a result, a fraction of the original short reads may not have been assembled by Oases, or included in our consensus transcriptome. However, if they share enough similarity with a paralog in the transcriptome they may be incorrectly mapped. This could explain, in part, why the original Oases assemblies contain so many short contigs (< 100 bp, see Figure 1). If there is divergence between paralogous loci, incorrectly mapped reads may create a signature similar to heterozygosity. For future analyses it is critical that potentially paralogous sequences are identified because evolutionary inferences from non-orthologous genes are misleading. Although our approach does not allow us to unambiguously characterize all paralogous sequences it has provided a useful method for detecting single copy transcripts.
Using a conservative set of 5011 transcripts for which there was no evidence of paralogy, based on homozygosity in all three selfers and presence in all four genotypes, we determined the number of SNPs between each pair of genotypes. As expected, E. paniculata samples were more differentiated from E. paradoxa than with one another (Table 4). The patterns of divergence among E. paniculata genotypes reflect relationships previously reported (see figure three ). Specifically, the Nicaraguan selfing genotype is more similar to the outcrossing Brazilian genotype than it is to the selfing Jamaican genotype, despite more similar biogeographical origins and mating systems. This suggests that the Nicaraguan population is more closely related, or more recently derived from the Brazilian population. Further, when we calculate nucleotide polymorphism across 5011 sequences for the three E. paniculata genotypes the value we obtained (θ
= 0.0104) is comparable to our previously published species-wide estimate of total diversity (θ
= 0.0101 ), based on 10 nuclear-derived EST loci assayed in samples of 225 individuals from 26 populations. This evidence supports the validity of our SNP detection method.
Expression & enrichment
There was a weak trend indicating that the selfing genotypes of E. paniculata were more correlated in gene expression; however, this difference was not significant using bootstrapping to generate 95% confidence intervals. All three genotypes of E. paniculata retain highly correlated gene expression despite phenotypic divergence and geographic isolation. The slight elevation in correlated expression between the two selfing genotypes of E. paniculata may be caused by a small number of genes that are differentially expressed in both selfing genotypes (see below), although overall patterns of expression during flower development appear to remain largely conserved. This may be because we combined all stages of flower development in our assays and a more careful dissection of expression in each stage individually could potentially reveal different patterns of gene expression in selfers and outcrossers.
Enrichment tests of our annotated transcriptomes demonstrated that genes that were differentially expressed in selfers exhibit a paucity of particular gene classes. This can be interpreted as the conservation of expression of these gene functions, which are rarely differentially expressed. Of the 313 GO categories found to be significantly enriched among all differentially expressed genes only 21 were found to be over-abundant, and 20 of these 21 categories were over-represented in genes absent or expressed at lower levels in the two selfers. Therefore, it appears that many of the differences common to the selfing lineages of E. paniculata are associated with reductions of gene expression in floral tissue. This may be related to the convergence of floral traits in these two lineages, both of which have much smaller, less pigmented flowers, with reduced pollen production compared to the outcrossing genotype. 96 of the 108 gene classes that are expressed at a low level in the selfers are under-represented but 5 of the 12 over-represented GO categories were associated with the regulation of cell development and structure. Selfing flowers display modified stamen positions and floral instability including twisted, fused or missing perianth parts, whereas outcrossing plants rarely display these floral modifications . It is possible that the changes in gene expression we have documented influence the regulation of cell growth and division and are responsible for changes to floral morphology that characterize selfing populations. If so, these regulatory loci could be used as a set of candidate genes to investigate aspects of the evolution of the selfing syndrome.
Differentially expressed floral genes in selfers
By sampling different stages of floral development up to and including anthesis we were able to sequence and annotate a large number of florally expressed genes. In total 812 genes with the GO annotation 'reproduction' were identified, which is a large fraction of the number reported for Arabidopsis thaliana in which 1184 genes for reproduction have been documented . The lower number of genes in our annotation is not unexpected because we did not include tissue from reproductive stages after flowering, such as fruit and seed development. Within the GO category 'reproduction' we found 269 genes involved in floral development, similar to the number that has been annotated in A. thaliana (323 genes). Of particular significance are the floral development genes that are differentially expressed in the three selfing genotypes (Table 5), several of which affect structures that are modified in the selfers. Anther development and filament elongation are influenced by AFB2 , ARF6, ARF8 , BAM1 , GID1c, myb24 , PGP1  and PTR1  genes in A. thaliana, and pollen maturation and pollen tube growth are altered by AFB2, CSDL3 , CER1 , POP2 , myb24 and TIR1 . ERL1 plays an important roles in normal anther lobe formation and anther cell differentiation  and mutants in the ERECTA gene family have reduced lateral organ size and abnormal flower development, including defects in petal polar expansion and carpel elongation . We also identified a number of genes involved in flowering time including DCL1 , PGI1 , SHK1 , SUF4 . Lastly, all of the differentially expressed genes that influence ovule development were significantly up-regulated, including ARF6, ARF8 , OVA9 , PEP . Significantly, most of the candidate genes discussed above cause alterations to attractive structures (perianth) and male function (stamen development) consistent with the relaxation of selection for showy flowers, reduced allocation to pollen production and the loss of herkogamy (stigma-anther separation) through filament elongation of stamens. In contrast, the requirement for functional ovules to maintain seed fertility in selfers may explain the apparent absence of changes to gene expression in female traits.