Gene expression profiling via LongSAGE in a non-model plant species: a case study in seeds of Brassica napus

Background Serial analysis of gene expression (LongSAGE) was applied for gene expression profiling in seeds of oilseed rape (Brassica napus ssp. napus). The usefulness of this technique for detailed expression profiling in a non-model organism was demonstrated for the highly complex, neither fully sequenced nor annotated genome of B. napus by applying a tag-to-gene matching strategy based on Brassica ESTs and the annotated proteome of the closely related model crucifer A. thaliana. Results Transcripts from 3,094 genes were detected at two time-points of seed development, 23 days and 35 days after pollination (DAP). Differential expression showed a shift from gene expression involved in diverse developmental processes including cell proliferation and seed coat formation at 23 DAP to more focussed metabolic processes including storage protein accumulation and lipid deposition at 35 DAP. The most abundant transcripts at 23 DAP were coding for diverse protease inhibitor proteins and proteases, including cysteine proteases involved in seed coat formation and a number of lipid transfer proteins involved in embryo pattern formation. At 35 DAP, transcripts encoding napin, cruciferin and oleosin storage proteins were most abundant. Over both time-points, 18.6% of the detected genes were matched by Brassica ESTs identified by LongSAGE tags in antisense orientation. This suggests a strong involvement of antisense transcript expression in regulatory processes during B. napus seed development. Conclusion This study underlines the potential of transcript tagging approaches for gene expression profiling in Brassica crop species via EST matching to annotated A. thaliana genes. Limits of tag detection for low-abundance transcripts can today be overcome by ultra-high throughput sequencing approaches, so that tag-based gene expression profiling may soon become the method of choice for global expression profiling in non-model species.


Background
Seed developmental processes of plants are complex and follow coordinated gene expression programs. Some genes are exclusively expressed during seed development [1]. Understanding the genetics of developmental proc-esses and regulatory networks involved in embryogenesis and seed development is crucial for improvement of seed quality in crop plants. Gene expression during seed development has been studied intensively in the model plant Arabidopsis thaliana using mutagenesis [2,3] and microar-ray analyses (e.g., [3,4]). These studies revealed major gene expression changes during seed filling and desiccation, along with distinct expression patterns related to carbohydrate metabolism, lipid biosynthesis and storage protein accumulation. However, the regulatory mechanisms that ensure the proper execution of seed development in A. thaliana and other plants remain largely unknown. Furthermore, generalisation of findings regarding gene regulation from model to crop plants is difficult because most major crops have considerably more complex genomes.
Oilseed rape (Brassica napus ssp. napus), the closest major crop relative of A. thaliana, is the world's second most important oilseed crop. The high-value nutritional oil is also an excellent substrate for biodiesel production, whereas the protein-rich seed meal remaining after oil extraction is a valuable livestock feed. Oligonucleotide microarrays constructed for A. thaliana have been used in the past for expression profiling in B. napus, but have not provided optimal signal intensity and reproducibility [4][5][6][7]. Recently the total number of ESTs from Brassica species deposited in public databases has risen dramatically to more than 800,000 entries with about 280,000 from seed developmental stages. 67,000 ESTs from seed developmental stages have been used to develop a B. napus cDNA microarray for analysis of seed gene expression patterns [7].
An alternative for accurate, quantitative global expression profiling is serial analysis of gene expression (SAGE; [8]), which in contrast to microarray hybridization allows the detection of new transcripts. SAGE is an expression profiling technique that simultaneously measures the levels of thousands of genes expressed in a given tissue. The method is based on the excision of short tags from poly A+ RNAs and end-to-end ligation of ditags to form high molecular weight concatemers. This allows cost-effective high-throughput cloning and sequencing of concatemers. Matching of tags to genomic sequences is a critical step in SAGE data analysis, and normally this requires the availability of high quality genome annotation data [9]. SAGE was first used for quantification of gene expression in human using 13-15 bp tags [8]. Modifications of the original SAGE protocols producing 21 bp tags (LongSAGE; [9]) and 26 bp tags (SuperSAGE; [10]) have been developed to enable more efficient and unambiguous tag-togene assignment in higher organisms with more complex transcriptomes. SAGE is commonly used in animal genomics, but has been increasingly used in plant species and tissues [11]. The intention of the present study was to adapt the LongSAGE technique for analysis of global gene expression in B. napus and other similarly complex polyploid plant genomes where the complete genome sequence and annotation are not yet available. A data processing pipeline was adapted by matching B. napus tags via Brassica ESTs to annotated A. thaliana gene loci, including detection of tags matching in sense and antisense orientation.

B. napus seed LongSAGE libraries
Two LongSAGE libraries were produced from B. napus seeds harvested at 23 and 35 DAP, respectively. A total of 3,136 clones from the 23 DAP library yielded 34,984 ditags (22 tags/clone), while 3,168 sequence traces from the 35 DAP library yielded 15,919 ditags (10 tags/clone) ( Table 1). The relatively low average number of tags per clone resulted from a high number of cloned empty vector plasmids (28.2% and 60.3% of all high-quality reads for the 23 DAP and 35 DAP libraries, respectively). The vector plasmid pZErO (Invitrogen) was designed to prevent cloning of empty vector by using a LacZa-ccdB fusion cDNA insert that should destroy bacteria containing only the vector. However, as described by other authors (e.g. [12]) we found that pZErO could efficiently re-ligate without containing inserts. Thus, PCR screening of plasmid libraries for clones containing an empty vector might considerably reduce total Sanger sequencing costs and increase total tag sample sizes in LongSAGE transcription profiling experiments.  (Table 2). This discrepancy might be due to the reduced numbers of ESTs in the database derived from transcripts expressed at low levels, compared to transcripts expressed at high levels within B. napus seeds. Broadly different estimates exist on the proportion of errors within tag sequences caused by experimental artefacts, ranging from 1.67% to 15.6% of LongSAGE tags [13,14] with at least one erroneous base-pair introduced by sequencing. However, the singleton tag category might contain a higher proportion of tags with sequencing artefacts, also contributing to the different ratio of successfully matched tags in the singleton and ≥ 2 count categories. 12,728 tags from library 23 DAP were matched to 7,208 of 20,462 Brassica EST-matched A. thaliana proteins. 7,232 tags from library 35 DAP were matched to 4,875 A. thaliana proteins ( Table 2).

Comparison of LongSAGE and Real-time RT-PCR analysis
A good correlation was observed between expression levels estimated by LongSAGE and Real-time RT-PCR for most of the selected genes (Table 3), although some minor differences in estimated fold-change were observed (e.g. for CCR1). This could arise from the more selective specificity of Real-time RT-PCR for transcripts derived from closely-related gene loci, since specific RT-PCR primers were designed from contigs of B. napus ESTs putatively derived from different B. napus gene loci. Furthermore, Real-time RT-PCR may be more sensitive in detecting low abundant transcripts than LongSAGE under the experimental conditions and sequencing depth applied in the present study.

Transcript diversity
Average tag frequencies for the two analysed seed libraries differed significantly (  sequences were also obtained using different napin tag primers, with up to 30% divergence, within 200 bp of the 3'-UTR, from 16 clones. These observations suggest that a high number of distinct protease inhibitor/seed storage/ lipid transfer protein (LTP) family protein genes or transcript variants are expressed in B. napus seeds. Also these data support the increasing recognition of the high complexity and variability of 3' UTR regions in genes from higher organisms [13,15,16].
Based on the cDNA library production using polyA+ trapped mRNA, LongSAGE tags are expected to match at the anchoring enzyme site closest to the 3' end of fulllength ESTs expressed in sense orientation (canonical position). However, only 95.3% of the matched tags in the sense EST dataset were matched in the last canonical position. There are numerous possible explanations for the high frequency of tags matching a non-canonical position. In particular, they could arise from (a) cloning artefacts caused by partial digestion with the anchoring enzyme NlaIII during SAGE library preparation, (b) alternative transcript variants with varying length derived from one gene locus, or (c) EST cloning artefacts creating chimeric EST molecules derived from two or more different genes. Also, due to partial EST sequences present in the dataset, a non-canonical position might be falsely annotated to the canonical position. Except in the third case, non-canonical tags are still valid tags for accurate annotation. We attempted to reduce annotation artefacts (c) by removing putative chimeric ESTs from analysis (see section 'Genes matched in sense and antisense orientation' for details). This increased the ratio of tags matched to the canonical position of oriented Brassica ESTs from 95.3% to 98.3%. However, even after filtering out putative experimental artefacts using this approach, only 77% of all 243,528 ESTs matched by tags in the canonical position are matched exclusively at this position. Instead, 23% of these individual ESTs are additionally matched by tags at other non-canonical positions (up to the 9 th before last position), suggesting that the usage of alternative transcripts (e.g. by alternative polyadenylation) is a common mechanism. In other words, a high diversity exits in the 3' UTR of many expressed transcripts, as has been described before in humans [13]. No information exists about transcription start points for polyA+ containing transcripts expressed in antisense orientation. Therefore different proportions of matched antisense tags might be expected at canonical and non-canonical positions. In contrast to the tags matched in sense orientation, only 65.5% of tags were matched to canonical positions for the tags matched in antisense orientation.

Comparison of the most abundantly expressed tags and assigned genes
Tag-to-EST matches based on 15,955 tags were assigned to a total of 8,255 A. thaliana gene loci at both time-points (Table 2). About 85% of the tags (13,641) were assigned exclusively to a single A. thaliana gene locus. The other 15% of the tags matched 2 to 45 loci, with some tags matching more than 10 genes exhibiting a polyA-like homopolymer composition. Due to the positioning of the anchoring enzyme restriction site close to the polyA tail in some transcripts, these contained stretches with up to 16 adenosine bases derived from phylogenetically unrelated genes. Other tags matching more than 2 genes were clearly derived from groups of phylogenetically related genes. For example, the tag CATGAACAGTTTCATCAACGA matched to six different members of the histone gene family (AT2G37470, AT3G53650, AT5G22880, AT5G59910, AT1G07790 and AT5G02570).
On the other hand, some A. thaliana loci were matched by up to 96 different B. napus tags via the Brassica ESTs. Moreover, some loci were not only matched by different tags, but these tags also matched at different positions within one particular matched EST molecule. These multiple matches underline the complex paleopolyploid structure of the B. napus genome in comparison to A. thaliana [17]. For example, the 4 A. thaliana gene loci of the 2S seed storage protein family were matched by 47 different tags, including 9 of the 30 most abundant tags, at 35 DAP. Due to this complexity, a three-step procedure was applied for calculating average values of the relative abundance of gene expression, based on the tag-to-gene matching results for each single A. thaliana gene locus. In a first step, the measured counts were evenly distributed to the matched gene loci if a tag matched more than one gene locus. In a second step, tag counts were added together if different tags matched the same gene locus. In a final step the summed tag counts for each gene locus were normalized to a total tag count of 1,000,000 for both libraries and the relative abundances were calculated.
Average cumulative counts and frequencies for the 20 most abundantly expressed genes at 23 and 35 DAP are shown in Table 4 and 5, respectively. The most abundant transcripts at 23 DAP, with 1.59% of all matched tag counts, correspond to the A. thaliana senescence-associated cysteine-type protease SAG12 (AT5G45890), which was also among the 20 most highly-expressed genes at 35 DAP. Two protease inhibitor/seed storage/lipid transfer family proteins (AT5G38195, AT1G48750) were also among the 20 most highly expressed genes at both timepoints. Other highly expressed transcripts at 23 DAP are a number of catalytic enzymes and other genes that are involved in diverse biological processes. At 35 DAP, transcripts related to the four closely-related A. thaliana 2S seed storage protein genes showed the highest counts, with a cumulative abundance of about 10%. Other highly expressed transcripts at 35 DAP are related to other storage proteins (cruciferin, oleosin), protease inhibitor or lipid transfer proteins and genes involved in fatty acid biosyn-   (122) overlapping genes were detected that were mainly highly abundant genes (e.g. 2S seed storage proteins). The percentage of overlapping genes identified here between the two different crucifer species during morphological similar seed developmental stages appears to be high when taking into account that the correlation of microarray hybridization and LongSAGE data analysis has been found to be low (below 0.5) even when using the same RNA preparations and LongSAGE has been characterized to be more efficient at identifying differentially expressed tags than microarray technology [19].

Gene ontology enrichment analysis
A total of 7,208 of the 8,255 matched genes were expressed at 23 DAP, while 4,875 genes were expressed at 35 DAP (Table 2). 3,828 of 8,255 genes (46.4%) were expressed at both time-points. Figure 1 gives a comparison of GO categories that are statistically over-represented for the genes expressed at 23 DAP compared to all other A. thaliana genes. From the 490 genes matched by statistically significant differentially expressed tags between 23 and 35 DAP, 194 were up-regulated at 23 DAP and 296 were up-regulated at 35 DAP. Statistically enriched GO terms found for the differentially expressed genes either up-regulated at 23 or up-regulated at 35 DAP were calculated using GOEAST [20] and details of the results are provided in additional file 2: Enriched_GO_terms_490_differentially_expressed_genes .xls. Within the category 'Biological Process', the GO terms 'developmental process', 'localization' and 'metabolic process' are enriched at both time-points. Generally, GO terms that are statistically enriched at the highly general information level 2 are similar for the two time-points. Comparison of statistically enriched GO terms for 490 differentially expressed genes at 23 and 35 DAP reveals their involvement in seed maturation, regulation of meristem organization and photomorphogenesis, seed coat development, water and fluid transport, cell-cell signalling, cell wall modification and glycerolipid, neutral lipid, acylglycerol, triglyceride, and carbohydrate metabolic processes at 23 DAP, and in protein processing, protein targeting, photosynthesis, fatty acid biosynthesis, lipid localization, storage and lipid metabolic processes at 35 DAP.

Genes matched in sense and antisense orientation
In the 23 DAP library, 6 of the 30 most abundant tags matched in sense as well as in antisense orientation to the Brassica EST dataset, oriented based on the A. thaliana proteome. The same was true for 10 of the 30 mosXt abundant tags at 35 DAP. In most of these cases 500 to 1,500 ESTs were matched by a single tag in one direction, whereas only 1 to 13 ESTs were matched by the same tag in the opposite direction. A detailed analysis of these matches revealed that in all cases the smaller fraction of matched ESTs were chimeric sequences composed of two regions, derived from different gene loci, that aligned to the A. thaliana proteome or to B. napus full-length cDNAs in different orientations. These apparently chimeric ESTs may represent cloning artefacts derived from tail-to-tail or head-to-head ligations during cDNA preparation for EST library production. In some cases a putative chimeric composition was also found for individual ESTs, with two different regions derived from different genes being aligned in sense and antisense orientation to the A. thaliana proteome or to oriented B. napus full-length cDNAs. To remove these putative chimeric ESTs from our analysis, tag-to-EST-to-A. thaliana gene locus matches were only accepted if at least 4 tag-matched Brassica ESTs matching a particular A. thaliana locus were found. In addition, tagto-EST-to-gene matches of particular tags matching in sense as well as in antisense orientation were only included in the analysis if the frequency of ESTs matched in one direction was at least 1% of the frequency of ESTs matched in the other direction. After removal of putative chimeric ESTs, the number of accepted tags that were successfully matched via Brassica ESTs to A. thaliana genes was reduced strongly from 8,255 to 5,120 (Table 2), due to the exclusion of many low-copy ESTs from the analysis. The occurrence of chimeric ESTs in databases has been documented previously. For example, Hillier et al. [21] found chimeric ESTs in a dataset of 280,000 human ESTs at a frequency of up to 1.04%. The strategy we applied  Based on the tag-to-gene matching strategy described above it was found that 24.3% of all matched tags matched A. thaliana genes via oriented Brassica ESTs in antisense orientation (corresponding to 1762 of 5,120, 34.4% of all matched genes, Table 2). In the 23 and 35 DAP libraries, only 68.3% and 74.6% of the respective genes were matched exclusively by sense tags ( Table 2). On the other hand, 6.6% and 6.8% of all genes were matched exclusively by antisense tags at 23 DAP and 35 DAP, respectively, while 25.1% (23 DAP) and 18.6% (35 DAP) of the genes were matched by both sense and antisense tags. Genes expressed with high abundance in sense orientation, and particularly members of complex gene families, often also showed coexpression of antisense transcripts at a lower frequency (Figure 2, 275 differentially expressed genes matched in sense and antisense orientation). A detailed analysis of tags matching to the four 2S seed storage proteins of A. thaliana (AT4G27140, AT4G27150, AT4G27160, AT4G27170) was performed by aligning them to Brassica spp. napin cDNAs and genomic napin sequences. This revealed that some tags matched in antisense orientation to highly conserved regions of napin transcripts, derived from multiple B. napus loci, whereas other tags matched in sense orientation to highly diverse regions of napin transcripts derived from a limited number of B. napus loci. (37,182 to 41,806 tags per million). The four A. thaliana 2S seed storage proteins share 75-91% amino acid (aa) sequence identity with each other. The four 2S seed storage protein genes and 7 other protease inhibitor/seed storage/lipid transfer protein (LTP) family protein genes were also among the genes with the highest tag counts matched in antisense orientation (see additional file 3: All_matched_tags.xls).

Detection of napin-related sense and antisense transcript expression by Real-time RT-PCR
For comparison of SAGE results with quantitative Realtime RT-PCR detection, two sets of primers and their antisense reverse-complemented sequences were derived from SAGE primer amplified napin 3'-termini (see above) and from assembled contigs of 8,977 B. napus ESTs aligning with an e-value cut-off of 1e-6 to A. thaliana 2S seed storage protein 1 (AT4G27140). One primer set (napin) was designed specifically for the major cluster of B. napus assembled EST contigs most similar to the A. thaliana 2S seed storage proteins 1-4 (59% to 70% on the aa level). Another set of primers (napin-related) was designed specifically for the minor cluster of B. napus assembled EST clusters grouping in-between the four A. thaliana 2S seed storage proteins and the related protease inhibitor/seed storage/lipid transfer (LTP) family protein AT5G54740 (about 47% to 57% identity on the aa level). AT5G54740 represents the next closely related protease inhibitor/seed storage/lipid transfer protein (LTP) family protein gene in A. thaliana, which like other more distantly related proteins of this large family (about 113 members in A. thaliana) exhibits an AAI_SS domain and shares about 53-57% aa identity with the A. thaliana 2S seed storage proteins 1-4. Figure 3 shows the expression profile from days 17 to 70 after pollination from seeds for sense and antisense napin and napin-related transcript expression. The coexpression of sense and antisense transcripts was confirmed for a minor subgroup of napin-related transcripts by strand-specific Real-time RT-PCR, but not for the major napin transcript group as indicated by LongSAGE.

Discussion
B. napus and A. thaliana are members of the Brassicaceae family and exhibit around 87% conservation of their protein-coding sequences [22]. Based on this close phylogenetic relationship, the tag-to-gene matching strategy applied in this study for LongSAGE expression profiling in the non-model plant B. napus uses the information available for the well-annotated proteome of the model species A. thaliana. Our results show that expression profiling can be achieved for B. napus seeds using publically available EST resources and the annotated proteome of A. thaliana for tag-to-gene assignment.
Major factors influencing the accuracy of the first step of the SAGE analysis are the sequencing error rate, the tag length and the transcriptome complexity. The accuracy of SAGE depends on the ability to unambiguously match the tags to the genes of origin. Theoretical calculations show that >99.8% of 21 bp LongSAGE tags are expected to occur only once in large genomes such as the human genome, which is about twice the size of the B. napus genome, with the remaining tags matching duplicated genes or repeated sequences [9]. In a polyploid plant genome the proportion of duplicated genes is generally considerably higher, and our strategy of matching tags via ESTs to a related model plant proteome will not per se allow differentiation between closely related duplicated genes. However, matching to multiple members of gene families still allows biological inferences from the LongSAGE data, since the predicted function in each case is identical. In our case only around 90% of LongSAGE tags were assigned to a unique A. thaliana gene. This underlines the highly duplicated nature of the B. napus and A. thaliana genomes [17] and reflects the well-documented ancient hexaploidization events in major angiosperm phyla [23]. Despite this we found no example where a B. napus tag matched multiple A. thaliana genes with apparently unrelated functions. This suggests that the matching strategy we applied is specific enough with 21 bp B. napus tags to draw biological conclusions on the expression levels of functionally related gene families in B. napus seeds. Raising the blastx e-value stringency to 1e-100 for matching of Brassica ESTs to the A. thaliana proteome was found to reduce the average number of genes per gene family to which the observed tags were assigned. However, this also resulted in a higher number of tags that could no longer be assigned to A. thaliana genes and thus could not be annotated (data not shown). For LongSAGE analysis from B. napus we found an e-value of >1e-6 useful; this is also the cut-off value used by NCBI in the UniGene database for EST-to-protein matching to create clusters of transcript sequences that are expected to come from the same transcription locus.
The development of oilseeds can be broadly divided into three phases, (1) the pattern formation/cell proliferation phase, (2) the maturation phase, with lipid and storage protein accumulation, and (3) the desiccation phase. At 23 DAP B. napus seed has reached the end of phase 1 and is beginning the transition to phase 2, while at 35 DAP the seed is in the middle of the maturation phase. The most abundantly expressed gene detected at 23 DAP, SAG12, is a cysteine protease expressed during leaf senescence in A. thaliana [24]. In B. napus SAG12 is encoded by two orthologous copies, BnSAG12-1 and BnSAG12-2, involved in leaf senescence and seed development, respectively. BnSAG12-2 and BnCysP1, a cysteine protease with around 91% identity to BnSAG12-2, are expressed in the early phase of B. napus seed development within the inner integument of the testa and are predicted to function in the disposal of proteins in the inner integument cells committed to programmed cell death [25]. Other genes encoding proteases and protease inhibitor genes are also strongly represented within the 30 most abundantly expressed genes at 23 DAP and 35 DAP, e.g. a δ-vascular processing enzyme (δ-VPE) at 23 DAP. VPEs are cysteinetype proteases which were originally discovered to be responsible for maturation of seed storage proteins. In A. thaliana δ-VPE was found to be involved in developmental cell death during embryogenesis [26]. In general, proteases are considered to be key regulators of plant development involved in post-translational modification or activation of proteins and enzymes of diverse biological processes. Proteases play a role both in general protein turnover and in highly specific regulation of plant development. Dong et al. [27] also described an elevated expression of genes coding for subtilisin-like protease and vacuolar processing enzymes (VPE), consistent with seed endosperm development, along with early expression of lipid transfer proteins that are required for embryo pattern formation [28,29]. The high transcription levels of genes encoding seed storage proteins are also consistent with the results of Dong et al. [27] based on cDNA cloning and Northern blot expression profiling. On the other hand, our LongSAGE analysis revealed that transcripts encoded by the peroxidase gene AT4G21960 (Protein ID CAA66957) had a high abundance at 23 DAP and a subsequent drop in expression at 35 DAP, whereas Northern blot analysis appeared to reveal the opposite expression pattern. Such differences might be due to different specificities and resolutions of the two techniques, e.g. cross-hybridization of Northern blot probes with other closely related peroxidase transcripts.
Antisense transcripts and sense-antisense transcript pairs (natural antisense transcripts, NATs) were found to be prevalent at both developmental stages investigated. Antisense transcripts were observed earlier in A. thaliana and rice cDNA/EST databases and confirmed using tag-based expression profiling approaches like SAGE or Massively Parallel Signature Sequencing [30], with matching to the sequenced genomes. In some of these studies it was noted that the sense and antisense transcripts of an overlapping NAT pair (cis-NATs) tend to be expressed in different tissues or different conditions. Furthermore, in cases where the sense and antisense transcripts of a NAT pair were expressed in the same library, one type of transcript was usually more abundant than the other. One speculation is that cis-antisense transcripts could be involved in downregulating the expression levels of their target mRNA to achieve a low protein concentration by interfering with the transcription of their sense transcript. Due to the matching strategy we applied, the sense-antisense pair matches we found to the A. thaliana proteome might either indicate natural sense-antisense pair expression from one B. napus gene, or a concurrent expression of sense and antisense transcripts from paralogous B. napus genes. Either way, the results clearly indicate that a significant number of antisense transcripts are expressed during Real-time RT-PCR expression profile during B. napus seed development using primers specific for the coding (A -sense tran-scripts) and non-coding strands (B -antisense transcripts) of a group of napin and napin-related transcripts Figure 3 Real-time RT-PCR expression profile during B. napus seed development using primers specific for the coding (A -sense transcripts) and non-coding strands (B -antisense transcripts) of a group of napin and napin-related transcripts. Square symbols represent napin transcripts most similar to 2S seed storage proteins AT4G27140, AT4G27150, AT4G27160, AT4G27170, while triangular symbols represent napin-related transcripts most similar to protease inhibitor/seed storage/lipid transfer protein AT5G54740.
B. napus seed development. The coexpression of sense and antisense transcripts was confirmed for a subgroup of napin-related transcripts coding for proteins from the large group of protease inhibitor/seed storage/lipid transfer protein (LTP) family by strand-specific Real-time RT-PCR. Although Real-time RT-PCR confirms the detection of antisense transcripts that are related to protease inhibitor/seed storage/lipid transfer protein (LTP) family proteins by LongSAGE, it also indicates that the use of the proteome of A. thaliana for annotation of tag-matched B. napus transcripts in case of diverse gene families limits the resolution and the detailed analysis of the genome and transcriptome structure and availability of more genetic resources for B. napus will help to increase the resolution in the future. The detection of transcripts for a diverse number of genes in antisense orientation by LongSAGE at both time-points suggests strong involvement of antisense transcripts in regulatory processes during B. napus seed development.
The comparison of LongSAGE and Real-time RT-PCR data suggests that the sequencing depth applied in this experiment limits the LongSAGE detection to transcripts with medium to high copy numbers in B. napus seeds. On the other hand, recent developments in next-generation sequencing (NGS) technologies allow ultra-deep transcriptome profiling using transcript tag-based techniques (e.g. DeepSAGE; [31]) that can realistically achieve transcriptome saturation and accurate quantification of lowcopy transcripts. The data analysis techniques developed in the present study represent a valuable platform for future NGS-based transcriptome tagging studies in Brassica species. As suggested by Shendure [32], such approaches may ultimately replace microarrays as the method of choice for quantitative global transcriptome profiling as a basis for genetical genomics or systems genetics approaches. In the present study we have demonstrated that tag-based transcriptome profiling can also be effectively applied in a non-model, complex polyploid plant species.

Conclusion
This study underlines the potential of transcript tagging approaches for gene expression profiling in Brassica crop species via EST matching to annotated A. thaliana genes. The data processing pipeline adapted by matching B. napus tags via Brassica ESTs to annotated A. thaliana gene loci enabled differential expression profiling of during seed development in the complex B. napus genome, and furthermore detected an unexpectedly high proportion of EST tags matching in antisense orientation. This suggests a strong involvement of antisense transcript expression in regulatory processes during B. napus seed development. Limits of tag detection for low-abundance transcripts can today be overcome by ultra-high throughput sequencing approaches, so that tag-based gene expression profiling may soon become the method of choice for global expression profiling in non-model species.

Plant materials and RNA isolation
Plants of the homozygous male-sterile winter oilseed rape maternal line 'MSL-Express' were cultivated under controlled growth chamber conditions (16 hour, 20°C day and 8 hour, 16°C night, 60% relative humidity) and manually fertilised with pollen from plants of the isogenic male-fertile line 'Express 617'. Pods were harvested weekly from the main racemes of pollinated plants at time-points 14 days up to 70 days after pollination (DAP). The pods were shock-frozen in liquid nitrogen and stored at -80°C, and seeds were collected by splitting the frozen pods. Total RNA was extracted from seeds using RNA Reagent (Invitrogen) and purified from DNA residues by incubation with RQ1 RNAse-free DNAse (Promega), according to the manufacturer's recommendations.

LongSAGE library production
Total RNA extracted from two seed developmental timepoints, 23 DAP and 35 DAP, was processed using the I-SAGELong Kit (Invitrogen) with several modifications to allow more efficient library production and cloning. Briefly, 50 μg of polyA+ RNA was captured on the surface of magnetic oligo(dT) beads and double stranded cDNA was synthesized using SuperScript II reverse transcriptase. Double stranded cDNA was digested for 2.5 hours with NlaIII, and adapters containing MmeI recognition sites were ligated overnight in ligation buffer with 15% polyethylene glycol (PEG) to the 5' end of the cDNAs. Adaptor-linked tags were released from the beads by restriction with MmeI overnight, then precipitated and washed five times with 70% ethanol before being ligated overnight in pairs to form adaptorlinked ditags. These were amplified by PCR and 200 PCR reactions were pooled. Subsequently, 130 bp fragments were purified from 20 lanes on 16% polyacrylamide gels. Purified adapter-linked ditags were redigested overnight with NlaIII to remove adapters and 34 bp ditags were purified from a 16% polyacrylamide gel. The ditags were ligated overnight according to Kenzelmann and Mühlemann [33] to form concatemers, then partially digested with 2 U of NlaIII for 1 min at 37°C according to Gowda et al. [34] to create linearized clonable concatemers of suitable sizes. The partiallydigested ligation mixture was separated on a 6% polyacrylamide gel. DNA fragments between 0.8 and 2.5 kb were isolated and 1.6 μg concatemers were ligated to 25 ng of linearized pZErO-1 vector digested with SphI. For each of the 23 and 35 DAP libraries more than 3,000 plasmid clones were sequenced in one direction by SeqWright (Houston, TX, USA).

SAGE data processing
Ditags were extracted from sequence traces using the perl script sage-prhed.pl [35]. Due to the 1% error rate associ-ated with single pass Sanger sequencing [14], Phred sequence trace quality scores were used to remove all tags of low sequence quality (containing bases with a Phred score <20). 21 bp monotags were extracted from both ends of all correctly-sized ditags. Contaminating linker and vector sequences and low-complexity polyA tags were identified and removed by screening a local tag database using formatdb and blast (valid tags in Table 1 About 30% of all Brassica ESTs downloaded from the NCBI dbEST database were produced from seed developmental stages. The orientation of all Brassica ESTs was determined by blastx analysis against the TAIR7 A. thaliana proteome accepting only matches with an e-value of less then 1e-6. Any Brassica ESTs that did not match the A. thaliana proteome using these criteria were removed from the dataset. ESTs that did match the NCBI VecScreen vector database using blastn with pre-set parameters were removed from the analysis. The remaining 834,732 ESTs from NCBI and PBI were annotated based on the alignment with the A. thaliana proteome and orientated in sense using ReadSeq [36]. Identitag perl scripts were modified so that virtual tags from all positions were extracted from the Brassica ESTs, and the modified Identitag perl scripts were used to create a MySQL database of virtual tags [37]. To differentiate between tags that matched B. napus seed transcripts in sense and in antisense orientation, the oriented sense EST dataset was used in combination with an oriented reverse-complemented EST dataset (antisense EST dataset). For identification of differentially expressed tags the online-tool DiscoverySpace (http:// www.bcgsc.ca/discoveryspace; [38]) was used and significance calculated according to Audic and Claverie [39]. The online-tools WEGO (http://wego.genomics.org.cn; [40]) and GOEAST (http://omicslab.genetics.ac.cn/GOEAST; [20]) were used for gene ontology (GO) enrichment analysis. Hierarchical cluster analysis with average linkage (UPGMA) and Euclidean distance was performed for GC-RMA processed, normalized data using the software program HCE 3.0 [41]. Spearman's rank correlations were used to compare Affymetrix chip and LongSAGE [19]. Differentially expressed genes were identified using the software program SAM [42].
tags. CO designed the primers for Real-time RT-PCR and BH performed the experiments. CO and RS developed the experimental strategy, and were responsible for the preparation of the manuscript. RS and WF participated in the coordination of the study. All authors read and approved the final manuscript.