Rapid transcriptome characterization and parsing of sequences in a non-model host-pathogen interaction; pea-Sclerotinia sclerotiorum

Background White mold, caused by Sclerotinia sclerotiorum, is one of the most important diseases of pea (Pisum sativum L.), however, little is known about the genetics and biochemistry of this interaction. Identification of genes underlying resistance in the host or pathogenicity and virulence factors in the pathogen will increase our knowledge of the pea-S. sclerotiorum interaction and facilitate the introgression of new resistance genes into commercial pea varieties. Although the S. sclerotiorum genome sequence is available, no pea genome is available, due in part to its large genome size (~3500 Mb) and extensive repeated motifs. Here we present an EST data set specific to the interaction between S. sclerotiorum and pea, and a method to distinguish pathogen and host sequences without a species-specific reference genome. Results 10,158 contigs were obtained by de novo assembly of 128,720 high-quality reads generated by 454 pyrosequencing of the pea-S. sclerotiorum interactome. A method based on the tBLASTx program was modified to distinguish pea and S. sclerotiorum ESTs. To test this strategy, a mixture of known ESTs (18,490 pea and 17,198 S. sclerotiorum ESTs) from public databases were pooled and parsed; the tBLASTx method successfully separated 90.1% of the artificial EST mix with 99.9% accuracy. The tBLASTx method successfully parsed 89.4% of the 454-derived EST contigs, as validated by PCR, into pea (6,299 contigs) and S. sclerotiorum (2,780 contigs) categories. Two thousand eight hundred and forty pea ESTs and 996 S. sclerotiorum ESTs were predicted to be expressed specifically during the pea-S. sclerotiorum interaction as determined by homology search against 81,449 pea ESTs (from flowers, leaves, cotyledons, epi- and hypocotyl, and etiolated and light treated etiolated seedlings) and 57,751 S. sclerotiorum ESTs (from mycelia at neutral pH, developing apothecia and developing sclerotia). Among those ESTs specifically expressed, 277 (9.8%) pea ESTs were predicted to be involved in plant defense and response to biotic or abiotic stress, and 93 (9.3%) S. sclerotiorum ESTs were predicted to be involved in pathogenicity/virulence. Additionally, 142 S. sclerotiorum ESTs were identified as secretory/signal peptides of which only 21 were previously reported. Conclusions We present and characterize an EST resource specific to the pea-S. sclerotiorum interaction. Additionally, the tBLASTx method used to parse S. sclerotiorum and pea ESTs was demonstrated to be a reliable and accurate method to distinguish ESTs without a reference genome.


Background
White mold, caused by Sclerotinia sclerotiorum (Lib.) de Bary, is a devastating disease of over 400 reported dicotyledonous hosts [1]. The disease causes economically significant losses of many crop plants including pea (Pisum sativum L.) under the appropriate environmental conditions [2]. Currently, little is known about the genetic control of pathogenicity in the fungus and mechanisms of resistance in pea. Although hundreds of pea cultivars have been screened for white mold resistance in replicated greenhouse and laboratory tests [3], only partial resistance has been identified to date.
The identification of genes underlying S. sclerotiorum pathogenicity and resistance in pea would increase our knowledge of the pea-S. sclerotiorum interaction and facilitate the introgression of resistance into pea varieties. However, progress in these areas has been hampered by the lack of sequence information regarding the pea genome. Although other legume genomes, including the models Medicago truncatula, Lotus japonicus and economically important Glycine max (soybean) are available [4], Pisum sativum is still genome resource-poor in part due to the large genome size and large fraction of highly repetitive DNA [5].
The performance of Next-Generation sequencing (NGS) technologies continue to rise while costs continue to fall which enables researchers to conduct whole transcriptome sequencing (RNAseq) studies of interactions between plants and pathogenic fungi [6]. The application of NGS in plant-fungal interaction research promises to shorten the overall time of development of molecular genetic information necessary for functional and translational studies. However, RNAseq has rarely been used to study plant-pathogen interactions, particularly in nonmodel systems. One reason for this is the difficulty in distinguishing plant and fungal ESTs and even virus or viroid contamination, particularly when reference genomes are not available. Here we report novel transcriptome sequence information from the pea-S. sclerotiorum interaction obtained by 454 pyrosequencing and propose a method of rapid and efficient transcriptome characterization in a non-model species with little prior molecular information. We also report on the development and validation of a strategy to distinguish plant and fungal ESTs using the tBLASTx program and "proxyreference" genomes in the absence of true reference genomes.

Contiguous EST assembly
10,158 contigs were obtained by de novo assembly of 128,720 high-quality reads produced on a Roche 454 GS FLX sequencer (see Additional file 1). Minimum contig length was 50 bp, maximum length was 1,015 bp and average length was 200 bp. Average read coverage of contigs was 4.5X, and the maximum read coverage was 2,303X ( Figure 1).

Filter for virus or viroid contamination
The tBLASTx program identified 51 contigs with a BLAST hit (alignment identity) to virus or viroid DNA with an e-value cutoff threshold of less than 1e -3 . Further evaluation of these 51 EST contigs with tBLASTx against 3 legume and 7 fungal genome databases (which acted as proxy-reference genomes) revealed that 46 contigs showed significant alignment with the proxy plant genome database, 43 showed significant alignment with the fungal genome database and 40 showed significant alignment with both databases, 2 contigs showed significant alignment only with virus genomes. By comparing the e-value ratio (virus/fungi or virus/plant) of all proxyreference genome alignments, 10 contigs were assigned to pea and 9 contigs were assigned to S. sclerotiorum based on an e-value ratio >1e 20 , 30 contigs were difficult to distinguish, with e-value ratios between 1e -20 and 1e 20 . BLASTn analysis of the 32 unassigned contigs against pea and S. sclerotiorum ESTs from known sources revealed that 20 contigs, including the 2 that only aligned with virus genomes, had high identity matches to pea with 95% accuracy and 95% query coverage, and 1 to S. sclerotiorum. The 11 contigs which were difficult to assign to a genome database by either tBLASTx or BLASTn methods were far more similar to plant or fungi than to virus genomes.
Development and testing a method to distinguish pea and S. sclerotiorum ESTs using an artificially mixed pool Pea and S. sclerotiorum ESTs were downloaded from GenBank to test the tBLASTx sorting method. Three hundred twenty-one ESTs with vector contamination and 71 ESTs highly similar to virus or viroids were removed from the total pool of 36,080 known pea and S. sclerotiorum ESTs. Using an e-value threshold of 1e -3 , 35,688 mixed ESTs from pea and S. sclerotiorum were compared to legume and fungal proxy-reference genome databases and parsed using the tBLASTx program ( Figure 2). 11,191 ESTs only aligned with the legume proxy-reference genomes, 11,259 ESTs only aligned with the fungal proxy-reference genomes, 11,266 ESTs similar to both plant and fungal proxy-reference genomes and 1,972 ESTs did not match either proxyreference genome database. The ESTs with tBLASTx results to both plant and fungal genomes were analyzed further by comparing the e-value ratio (fungi/plant) of fungal and plant proxy-reference genome alignments. 4,098 ESTs were assigned to pea based on an e-value ratio >1e 20 , 5,649 ESTs were assigned to S. sclerotiorum based on an e-value ratio <1e -20 , while 1,519 ESTs were difficult to distinguish due to high e-value alignments to both proxy-reference genome databases, with e-value ratios between 1e -20 and 1e 20 . This method successfully separated 90.1% of the known ESTs into pea or S. sclerotiorum categories, with only a 0.1% misallocation rate. Only 5.5% of ESTs had zero similarity to either of the proxy-reference genomes, and 4.3% of ESTs had high similarity to both the plant and fungal proxy-reference genome databases (Table 1).

Parsing 454 pyrosequence pea and S. sclerotiorum ESTs with tBLASTx and BLASTn
Initial tBLASTx parsing of 10,158 contigs with an evalue threshold of 1e -3 resulted in identification of 4,523 pea ESTs, 2,304 S. sclerotiorum ESTs, 1,974 ESTs that matched both pea and S. sclerotiorum, and 1,357 ESTs that did not matched either proxy-reference genome database. The 1,974 ESTs that matched both proxyreference genomes at the e-value threshold of 1e -3 were further subdivided using the e-value ratio method into 544 pea ESTs (fungi/plant e-value ratio >1e 20 ), 355 S. sclerotiorum ESTs (fungi/plant e-value ratio <1e -20 ) and 1,075 that were ambiguous (fungi/plant e-value ratio <1e 20 and >1e -20 ). This brought the number of classified ESTs for each category to 5,067 for pea, 2,659 for S. sclerotiorum, 1,075 as ambiguous with high matches to both proxy-reference genomes, and 1,357 with no significant alignment. The remaining 2,432 EST contigs that were ambiguous or showed no significant alignment were further parsed with BLASTn analysis against known pea and S. sclerotiorum ESTs if identity and query coverage were both equal to or greater than 95%. 1,232 ESTs of this pool were assigned with BLASTn to pea and 121 ESTs were assigned to S. sclerotiorum, leaving 310 ambiguous and 769 EST contigs with no significant alignment. In total with tBLASTx and BLASTn, 10,158 contigs were separated into 6,299 pea ESTs, 2,780 S. sclerotiorum ESTs, 310 ambiguous ESTs and 769 unassigned ESTs ( Figure 3).

Validation of tBLASTx and BLASTn EST parsing results by PCR
Validation of the tBLASTx and BLASTn assignment was performed for 50 S. sclerotiorum and 50 pea EST contigs randomly sampled from the two assigned categories. All 50 primer sets designed to the pea EST contigs amplified the expected amplicon size in both the pea-S. sclerotiorum and non-inoculated pea cDNA indicating correct parsing assignment of the pea ESTs ( Figure 4). Of the 50 PCR primers designed to the S. sclerotiorum ESTs, 47 amplified a PCR product from both the pea-S. sclerotiorum and S. sclerotiorum only cDNA samples and most of them amplified the same size amplicon in both cDNA samples. Two of the 50 S. sclerotiorum PCR primer pairs amplified the expected PCR products from the pea-S. sclerotiorum cDNA sample but not the S. sclerotiorum only cDNA, perhaps indicating that this transcript is only expressed during the interaction with pea. One S. sclerotiorum primer set failed to amplify any PCR product from either template.
Unique ESTs expressed in the pea-S. sclerotiorum interaction To detect unique genes expressed in our pea-S. sclerotiorum interaction, the 6,299 classified pea ESTs in our data set were compared with BLASTn against 81,449 recently published pea ESTs from flowers, leaves, cotyledons, epi-and hypocotyl, and etiolated and light treated etiolated seedlings [7]. Of these 6,299 ESTs, 3,459 ESTs had significant alignments with an e-value cutoff of 1e -10 , in which 1,668 contigs had a percentage identity threshold of 95% for 95% or more of the query sequence, leaving 2,840 potentially unique pea ESTs to the pea-S. sclerotiorum interaction. It was possible to annotate 1,631 of these ESTs of which 67 contigs encode transcription factors (Table 2), 69 were involved in signaling pathways (Table 3) and 82 contigs were involved in encoding defense-associated proteins ( Table 4). The 2,780 S. sclerotiorum EST contigs were also assessed with BLASTn against 57,751 S. sclerotiorum ESTs (from mycelia at neutral pH, developing apothecia and developing sclerotia). Of these, 1,784 ESTs matched with an e-value cutoff of 1e -10 , in which 294 ESTs matched with 95% identity for 95% of more of the query length to the S. sclerotiorum EST growth libraries. Of the remaining 996 unique ESTs, it was possible to annotate 438 ESTs of which 95 ESTs were described as being related to pathogen virulence or pathogenicity (   Prediction of secretory/signal peptides for the S. sclerotiorum contigs A total of 2,754 coding regions were predicted with OrfPredictor from the set of 2,780 S. sclerotiorum ESTs. The peptide sequences were then used as a query for SignalP 3.0, which predicts the presence and location of signal peptide cleavage sites in amino acid sequences and identifies them as secretory proteins. The neural network (NN) method predicted 244 secretory signals, and the Hidden Markov Model (HMM) predicted 216. A total of 142 ESTs were identified by both NN and HMM and can be considered putative secretory peptides with high confidence (see Additional file 2). Of these 142 predicted secretory proteins, 21 were reported to be involved in pathogen virulence or pathogenicity (Table 6).

Significance of study and summary of the main findings
Despite Pisum sativum being used by Gregor Mendel to propose a model of particulate inheritance and being a highly nutritious food source for populations worldwide, few genomic resources exist for pea. One of the pathogens of pea, S. sclerotiorum is not only capable of causing devastating disease of pea but is able to infect over 400 plant species [1]. By sequencing a normalized cDNA pool of the pea-S. sclerotiorum interaction with next generation sequencing we have catalogued a number of novel genes putatively involved in pathogenicity and resistance. To our knowledge this is the first study to examine the pea-S. sclerotiorum "interactome". Sequencing the transcriptome (RNA-seq) is the method of choice in non-model systems for transcript discovery and genome annotation [8]. However, it has rarely been used to study plant-fungal interactions; one reason for this is the difficulty in distinguishing plant and fungal ESTs, particularly when reference genomes are not available. Using genomes of closely related species and tBLASTx to parse pea and S. sclerotiorum ESTs we demonstrated that Roche 454-pyrosequencing is a useful technique to characterize the host-pathogen interactome when genome resources are limited.  Two different strategies have been utilized previously to identify transcript origins in mixed plant and fungal EST datasets. One is a predictive method based on triplet nucleotide usage frequencies [9] and the other is a homology method using the BLASTp algorithm [10]. One shortcoming of the BLASTp method is that it could not be applied to novel genes or sequences from the noncoding regions of genes. Although the triplet nucleotide frequency method extends the application of the algorithm to both coding and non-coding sequences, the classification accuracy is approximately 90%, and required the use of a training set of ESTs to develop the nucleotide frequency for separation. A combined method was also used by Fernandez et al. [11], although this method distinguished 91% of the ESTs from the Coffea arabica-Hemileia vastatrix interaction no validation of the method was presented [11]. Classification of genes from a pool of mixed cDNA by traditional sequence similarity analysis (BLAST) is of interest to many investigations into plant-pathogen interactions. DNA sequencing is becoming more affordable and whole genome sequences of many organisms are becoming available and will aid in plant-pathogen interaction studies. However, in pea these resources are not available, therefore, we used a standalone BLAST approach against proxy-reference genome databases with high genetic similarity to pea or S. sclerotiorum to distinguish mixed transcripts. Using an artificial mixture of known pea and Sclerotinia ESTs, we found the error rate     using the BLAST method was significantly lower than the triplet nucleotide frequencies method (Table 7). We also demonstrated that the tBLASTx algorithm provided improved sorting of contigs relative to the BLASTn algorithm, and results in fewer ambiguous reads (see Additional file 3). In addition, although one individual genome of S. sclerotiorum (strain 1980) has been sequenced [12], there are still 1.6 Mb of predicted gaps in the 39.6 Mb assembly. To avoid ignoring unique genes between two different strains of the same species, a multi-fungal genome approach was adopted in this study. It was demonstrated that the assignment error rate based on 7 closely related fungal genomes was slightly decreased relative to assignment based on the single S. sclerotiorum genome (see Additional file 4). The e-value and e-value ratio utilized in our study to differentiate pea and S. sclerotiorum reads chosen selected after comparing several e-values, to maximize discrimination while reducing the error rate (see Additional file 5). Additionally, we determined error rates for this method using the artificial EST mix and validated the technique using our EST data set. We found that the percentage of unassigned ESTs (23.9%) in the 454 data set was higher than in the test EST data set (9%). One

Pea ESTs unique to the pea-S. sclerotiorum interaction
In response to pathogen attack, plants have evolved complex signaling and defense pathways. Putatively unique ESTs in our pea-S. sclerotiorum interactome were defined and identified by comparing EST contigs in our library against those of non-interaction EST libraries of pea and S. sclerotiorum. Although we identified a total of 2,840 (45.1%) putatively unique pea ESTs it was only possible to annotate 1,631 of these and only 451 had annotations suggesting roles in defense or response to biotic and abiotic stress. Most of the annotated genes are consistent with previous expression profiling analyses in Brassica napus infected with Sclerotinia sclerotiorum [13]. Following infection, many genes, including those encoding defense-associated proteins, enzymes involved in signaling pathways, and genes encoding transcription factors were induced. Transcriptional control of the expression of stressresponsive genes is a crucial part of plant response to a range of abiotic and biotic stresses [14]. We demonstrated that 67 putative transcription factors were detected. These genes were classified into the MYB family, the Apetala2/Ethylene responsive element binding protein (AP2/EREBP) family, WRKY family and others ( Table 2). Seven MYB family transcription factors were detected in our data and they play a key role in hormone signal transduction and disease resistance [15]. Eight AP2/EREBP transcription factors, including 3 ethylene insensitive transcription factors (contig 1311, 7671 and 8261) and 3 AP2/ERF genes (7472, 8894 and 9486), are key regulatory elements for ethylene signaling and response for biotic or abiotic stresses [16,17].WRKY40 act as negative regulators of defense signaling and have been associated with negatively regulating resistance to P. syringae in Arabidopsis [18].
Plant defenses are regulated through a complex network of transduction pathways [19]. Sixty-nine unique pea ESTs involved in signaling pathways were detected in this study. The signaling pathways were mediated by different signaling molecules, like abscisic acid (ABA), auxin, brassinosteroid, calcium ion, ethylene (ET), gibberellic acid (GA), jasmonic acid (JA), salicylic acid (SA) and small GTPase (Table 3). Those results were consistent with previous studies of signaling pathways involved in plant resistance to Sclerotinia sclerotiorum [20,21].
Expression of downstream proteins, including defenseassociated proteins, was induced through signal transduction and transcription factor regulation after pathogen infection. In this study, 82 unique pea ESTs encoding defense related proteins were detected (Table 4). Four contigs (5422, 6766, 7235 and 9781), encoding putative pathogenesis-related (PR) proteins involved in the response to pathogen attack were prominent. Numbers of cell-wall-related genes were also detected; those contigs involved in the biosynthesis of plant cell wall structures and the disassembly of fungal cell walls. Chitinase (encoding by contig 491, 589, 1622, 2311 and 5742), beta-1, 3glucanase (encoding by 196, 1243 and 4388) and other glycoside hydrolases are known to possess anti-fungal activity by degrading fungal cell walls [22].

S. sclerotiorum ESTs unique to the pea-S. sclerotiorum interaction
Pathogens have evolved a number of strategies to gain entry into the host cell and to overcome the plant defense system. In this study, we identified 996 S. sclerotiorum contigs as specifically expressed during pea-S. sclerotiorum interaction through comparison of EST contigs against S. sclerotiorum ESTs from growth libraries. Ninety-five of 438 annotated contigs were described as being involved in pathogen virulence or pathogenicity (Table 5).
Fungi produce enzymes that degrade the cell wall and wall-associated polymers to penetrate plant cells. There were 39 specifically expressed contigs involved in the penetration of the plant cuticle and cell wall. Contig 6412 encodes an exoglucanase 2 precursor, which has cellulolytic activity [23] and is involved in cellulose degradation; enzymes encoded by 11 contigs (473, 555, 717,  970, 3115, 3513, 5562, 5918, 6759, 6383 and 9655) are involved in hemicellulose degradation; enzymes encoded by 11 contigs (1533, 3147, 3861, 5679, 6431, 6824, 7085, 7661, 7834, 8501 and 10069) are involved with pectin degradation. In addition, carbohydrate esterase encoded by contig 1562 was also involved in plant polysaccharide degradation. Integrity of the fungal cell wall is also very important for pathogenesis and some reports showed the deletion of biosynthetic cell wall enzymes resulted in dramatically reduced virulence [24]. In our data, 18 contigs were identified as affecting biosynthesis and integrity of fungal cell walls. Sclerotinia sclerotiorum differentiates appressoria into infection cushions prior to invasion and we found 12 genes involved in the formation of infection structures. Eight contigs were involved in response to the host immune system, of which 3 efflux transporters encoded by contigs 1220, 4783 and 6180 are responsible not only for export of compounds involved in pathogenesis such as secondary metabolites, but also export of host-derived antimicrobial compounds [25][26][27]. Contig 1769 had similarity to the guanine nucleotide-binding protein (G protein) alpha subunit which is an important signal transducing molecule in cells, essential for growth, asexual and sexual development, and virulence in both animal and plant pathogenic filamentous fungal species [28]. Importin beta-2 encoded by contig 3623 belongs to the importin β family which mediates transport between the nucleus and cytoplasm of macromolecules that contain nuclear import or export signals. All importin β members have the ability to recognize and bind specific cargo involved in the recognition of the host and signaling [29,30].

Secreted/signaling proteins
Proteins secreted by fungi play a key role in the development of plant disease and the evolution of pathogenicity [31]. Some secreted proteins can degrade polymers encountered, such as cellulose, lipid, protein, and lignin, and transport the resulting simple sugars, amino acids, and fatty acids into the growing cell for use [32]. Using the SignalP3.0 program with stringent criteria, 142 contigs encoding putative secreted proteins were identified in the 2,780 S. sclerotiorum contigs. Twenty-one of the 66 annotated contigs were described as involved in pathogen virulence/pathogenicity in previous research (Table 6). Contig 355 encodes an enolase which is usually present on the cell surface or even secreted and is a potential virulence factor. In bacterial systems enolase has been demonstrated to contribute to pathogenicity by binding plasminogen in the infected host, potentially allowing the bacteria to acquire surface-associated proteolytic activity [33][34][35]. The basic leucine zipper transcription factor, encoded by contig 395, is a member of the bZIP family, one bZIP family member (Moatf1) from the rice fungus Magnaporthe oryzae mediates oxidative stress responses and is necessary for full virulence [36]. Contig 1352 encoding fkbp-type peptidylprolyl isomerase, with high homology to the Mip (macro-phage infectivity potentiator) protein, has been shown to be an essential virulence factor in Legionella pneumophila [37][38][39]. Chitin synthase 1 (contig 1434) plays a major role in cell wall biogenesis. Disruption of Botrytis cinerea class I chitin synthase gene Bcchs1 results in cell wall weakening and reduced virulence [40,41]. Autophagy is necessary for turnover of organic matter during the formation of conidia and appressoria and for normal development and pathogenicity in Magnaporthe grisea. Autophagy is required for the virulence of some eukaryotic pathogens [42][43][44]. Contig 6759 encodes endo-β-1,4 xylanase which plays a significant role in the virulence of Magnaporthe oryzae, affects both penetration and expansion of M. oryzae in infected plants [45]. Pectin methylesterase (PME) produced by phytopathogenic bacteria and fungi catalyses the demethoxylation of pectin, a major plant cell wall polysaccharide [46]. The possible role of secreted adenylate kinase (AK), encoded by contig 9219, as a virulence factor is in producing and keeping an intact pool of toxic mixtures of AMP, ADP, and ATP, which allows Pseudomonas aeruginosa to exert its full virulence [47]. Glutathione reductase is important to nitric oxide and macrophage resistance and is essential for virulence [48] and in Candida albicans GRX2, a putative glutaredoxin, is required for virulence in a murine model [49].

Conclusions
Here we present an EST resource that is specific for the pea-S. sclerotiorum interaction. We demonstrate and validate a method to reliably parse host and pathogen ESTs without the need for reference genomes. The ESTs were compared to non-interaction EST libraries to identify candidate resistance and pathogenicity genes. We also catalogued 145 proteins putatively secreted by S. sclerotiorum. The EST dataset will be a useful reference for further plant-fungus interaction studies, particularly for the Sclerotinia and legume research communities. Additionally, the S. sclerotiorum ESTs will be a valuable resource for the annotation of the S. sclerotiorum genome. Although the depth of our sequencing was not sufficient to obtain a global view of transcripts expressed during the pea-S. sclerotiorum interaction, the results are still very useful for the identification of plant resistance, fungal pathogenicity and virulence genes. This study sets the ground work and will be a resource for our current pea-S. sclerotiorum RNAseq expression profiling studies.

Plant, fungal growth and inoculation
Three plants of pea cultivar 'Lifter' (PI628276) were established per 1 gallon plastic pot in Sunshine LA 4 potting mix (Sun Gro Horticulture, Bellevue, WA). The plants were maintained in a greenhouse for 4 weeks with supplemental lighting extending the day length to approximately 14 h (October). Day and night temperatures were 22 ± 2°C and 16 ± 2°C, respectively. S. sclerotiorum isolate WMA-1 was isolated from a diseased pea plant in 2003 from a pea field (Washington, USA) with white mold disease symptoms and stored as air dry sclerotia at room temperature. Isolate WMA-1 (=ATCC MYA-4521) was demonstrated to be genetically representative of eight S. sclerotiorum strains sampled from legume hosts from various geographic locations using randomly amplified polymorphic DNA (RAPD) analysis (Kawabe and Peever, unpublished). Plants were inoculated with a 5 mm plug collected from the leading edge of an actively growing colony on a potato dextrose agar (PDA). The plug was placed fungal side down on the stem between the 4 th and 5 th detectable nodes and held in place by wrapping with Parafilm. Plants were transferred to a growth chamber with a 12 h photoperiod, an approximate 60% relative humidity, temperature of 20 ± 1°C and a 12 h photoperiod, for 72 hours to allow disease lesion development prior to RNA extraction.
Total RNA extraction and purification of mRNA from total RNA A 1 cm stem section was collected from each of 18 infected plants by cutting above and below the lesion front advancing toward the base of the plant. The stem section included both necrotic and green tissue with the advancing lesion front located in the center of the section. Stem sections were snap-frozen in liquid nitrogen and ground to a fine powder with a mortar and pestle. A total of 3 ml of TRIzol (Invitrogen, Carlsbad, CA, USA) was added to the ground tissue and the sample was split in half for column purification with the TRIzol Plus RNA purification kit (Invitrogen, Carlsbad, CA, USA). The additional step of on-column DNA digestion was performed with DNase I (Invitrogen, Carlsbad, CA, USA) to remove contaminating DNA. RNA was eluted in 250 μl of water per spin column. Poly-A RNA was isolated from total RNA with the Oligotex kit using the mRNA spin-column protocol (Qiagen, Valencia, CA, USA). Purified mRNA was eluted in a total of 100 μl of 5 mM Tris (pH 7.5). RNA and mRNA quantity was determined with a spectrophotometer (NanoDrop Technologies Inc., Wilmington, DE, USA). Total RNA and mRNA quality was assessed with an RNA Nano LabChip on an Agilent 2100 Bioanalyzer (Agilent Technologies, Santa Clara, CA, USA).

cDNA synthesis, normalization and 454 pyrosequencing
Purified mRNA was used to construct a full length normalized cDNA pool through the services of Evrogen [50]. Briefly, the service utilized the SMART cDNA cloning methodology to generate a full length cDNA pool [51], which was normalized using a duplex-specific nuclease [52]. The double stranded normalized cDNA pool was sheared by nebulization and prepared for and sequenced as per manufacturer's instructions on a Roche 454 GS FLX sequencer using an entire plate at Washington State University.
Data filtering and de novo assembly 35 Mb of sequence data representing 162,729 reads were generated by 454 sequencing. Quality trimming, adaptor sequence removal and size selection of reads was performed with Galaxy software (http://main.g2.bx.psu.edu/) [53]. After trimming adaptors, 128,720 reads with quality scores over 20 and sequence length longer than 50 bp were assembled with Abyss [54]. Parameters were adjusted for optimal assembly as measured by N50 statistic (a weighted median statistic such that 50% of the entire assembly is contained in contigs or scaffolds equal to or larger than this value).

Virus or viroid contamination detection
To determine whether any viruses or viroids were present in the fungi-infected plant cDNA sample, viroid and virus databases [55], including 41 complete viroid genomes and 2628 virus genomes, were downloaded from NCBI (released in April 2011). All EST contigs were analyzed with tBLASTx against viroid and virus databases. The e-value cutoff threshold was set at 1e -3 . Contigs with a BLAST hit to viroid and virus databases were further analyzed by tBLASTx program against 3 legume genomes database and 7 fungi genomes database individually using the same cutoff threshold (see next section "development of a S. sclerotiorum and P. sativum parsing method").
Development of a S. sclerotiorum and P. sativum parsing method To separate S. sclerotiorum and pea ESTs from the mixed pool, a procedure based on that proposed by Hsiang et al. in 2003 [56] was employed with modifications ( Figure 2). Briefly, the mixed ESTs were compared with tBLASTx (NCBI-BLAST-2.2.24+) to fungal and plant "proxy-reference" genome databases (Table 8). These proxy reference databases were established as the pea genome is not available and the inclusion of additional ascomycetes genomes to S. sclerotiorum (strain 1980) improved the assignment rate. The proxy-fungal genome database was a mixture of Sclerotinia sclerotiorum (strain 1980) and 6 closely related Ascomycete fungi (Botrytis cinerea, Chaetomium globosum, Fusarium graminearum, Magnaporthe grisea, Neurospora crassa and Verticillium dahlia) and a plant genome database including 3 sequenced legume genomes (Glycine max, Lotus japonicus and Medicago truncatula). ESTs that only matched to fungal or plant genome database with an e-value of 1e -03 or better were automatically classified into S. sclerotiorum or pea ESTs, respectively. ESTs, which matched (e-value <1e -3 ) to both fungi and plant databases, were further analyzed by comparing the evalue of best-hit from fungi and plant genome results. An e-value ratio was determined by dividing the best-hit e-value to fungi and plant genomes from the tBLASTx searches. A cutoff ratio were set at > =1e 20 for pea ESTs, <=1e -20 for S. sclerotiorum ESTs and those that fell between 1e -20 and <1e 20 were considered to be ambiguous. To acquire a final sort of results, those ESTs without a BLAST hit or those found to be ambiguous were assigned with BLASTn against known S. sclerotiorum or pea ESTs if their identity was above 95% in similarity across 95% of the sequence length. 81,449 pea ESTs (from flowers, leaves, cotyledons, epi-and hypocotyl, and etiolated and light treated etiolated seedlings) [7] and 57,751 S. sclerotiorum ESTs (from mycelia growing at neutral pH, developing apothecia and developing sclerotia-downloaded from BROAD database) were used to assist in the classification and annotation of contigs.
To verify the feasibility of the EST parsing method, 17,533 S. sclerotiorum ESTs derived from developing S. sclerotiorum libraries were downloaded from BROAD institute and 18,547 P. sativum ESTs were obtained from the GenBank EST database by search keyword 'Pisum sativum'. Vector contamination was removed from the downloaded ESTs by BLAST search with UniVec database (GenBank) in P. sativum and S. sclerotiorum ESTs were trimmed. After vector trimming, tBLASTx analysis of the downloaded ESTs was performed separately against the proxy-reference fungal and plant databases ( Table 8). The following relevant data from tBLASTx output were extracted to an Excel file: query sequence name, query sequence length, fungi database target name, fungi database e-value for top match, total query sequence length for all match to fungi database, plant database target name, plant database top match e-value, total query sequence length for all match to plant database.

PCR to confirm validity of classified contigs
Fifty contigs from S. sclerotiorum and 50 contigs from pea were randomly sampled to check the validity of EST contig classification. Primers were designed for each contig using the program Primer3 [57]. cDNA from pea inoculated with S. sclerotiorum, cDNA from non-inoculated pea, cDNA from S. sclerotiorum growing on PDA medium, and genomic DNA extracted from pea and S. sclerotiorum using DNeasy plant mini kit (Qiagen, Valencia, CA, USA ) were used as template in PCR with primer pairs for each contig. PCR contained 4 μl of 5 × GoTaq PCR Buffer (Promega, Madison, WI, USA), 200 μM each dNTP, 2.5 μM each primer, 0.4 U of GoTaq polymerase, and approximately 50 ng of DNA template in a final volume of 20 μl. PCR were held at 94°C for 2 min; followed by 40 cycles of 94°C for 30 s, 60°C for 30 s, and 72°C for 1 min; with a final extension at 72°C for 10 min. PCR products from each contig were separated on a 1% agarose gel and visualized with ethidium bromide.

Gene annotation and analysis
The biological function of EST contigs was predicted with gene ontology (GO) terms based on BLASTx analysis using the program BLAST2GO [58,59]. Default BLASTx parameters with an e-value threshold of 1e -3 and a high-scoring segment pairs (hsp) filter of 33 were retained so as to assign function to as many contigs as possible while ensuring short matching sequences less than 100 nucleotides were excluded. An annotation configuration with e-value-hit-filter 1.0E -6 , annotation cut off "55" and GO weight "10" was selected.
Prediction of secretory/signal peptides for the S. sclerotiorum ESTs The secretory/signal peptides for each S. sclerotiorum EST were analyzed using prediction algorithms. Firstly,