Genome analyses of the wheat yellow (stripe) rust pathogen Puccinia striiformis f. sp. tritici reveal polymorphic and haustorial expressed secreted proteins as candidate effectors

Background Wheat yellow (stripe) rust caused by Puccinia striiformis f. sp. tritici (PST) is one of the most devastating diseases of wheat worldwide. To design effective breeding strategies that maximize the potential for durable disease resistance it is important to understand the molecular basis of PST pathogenicity. In particular, the characterisation of the structure, function and evolutionary dynamics of secreted effector proteins that are detected by host immune receptors can help guide and prioritize breeding efforts. However, to date, our knowledge of the effector repertoire of cereal rust pathogens is limited. Results We re-sequenced genomes of four PST isolates from the US and UK to identify effector candidates and relate them to their distinct virulence profiles. First, we assessed SNP frequencies between all isolates, with heterokaryotic SNPs being over tenfold more frequent (5.29 ± 2.23 SNPs/kb) than homokaryotic SNPs (0.41 ± 0.28 SNPs/kb). Next, we implemented a bioinformatics pipeline to integrate genomics, transcriptomics, and effector-focused annotations to identify and classify effector candidates in PST. RNAseq analysis highlighted transcripts encoding secreted proteins that were significantly enriched in haustoria compared to infected tissue. The expression of 22 candidate effector genes was characterised using qRT-PCR, revealing distinct temporal expression patterns during infection in wheat. Lastly, we identified proteins that displayed non-synonymous substitutions specifically between the two UK isolates PST-87/7 and PST-08/21, which differ in virulence to two wheat varieties. By focusing on polymorphic variants enriched in haustoria, we identified five polymorphic effector candidates between PST-87/7 and PST-08/21 among 2,999 secreted proteins. These allelic variants are now a priority for functional validation as virulence/avirulence effectors in the corresponding wheat varieties. Conclusions Integration of genomics, transcriptomics, and effector-directed annotation of PST isolates has enabled us to move beyond the single isolate-directed catalogues of effector proteins and develop a framework for mining effector proteins in closely related isolates and relate these back to their defined virulence profiles. This should ultimately lead to more comprehensive understanding of the PST pathogenesis system, an important first step towards developing more effective surveillance and management strategies for one of the most devastating pathogens of wheat.


Background
Wheat yellow rust, also known as stripe rust, is one of the most devastating diseases of wheat worldwide. It is caused by the basidiomycete fungus Puccinia striiformis Westend. f. sp. tritici Eriks. (PST), an obligate pathogen that along with the stem (black) rust fungus Puccinia graminis f. sp. tritici (PGT) threatens worldwide wheat production [1,2]. Historically, the use of major race specific resistance (R) genes in wheat varieties has been an effective method for disease management. However, these approaches are hampered by the evolution of resistancebreaking races of PST. For example, the appearance of PST races that overcome widely deployed R genes (such as Yr2, Yr9, Yr17 and Yr27) has led to destructive pandemics [3]. In recent years, concerns over yellow rust have increased with the emergence of new and more aggressive PST races that have expanded virulence profiles and are capable of adapting to warmer temperatures compared to most previous races [2]. Combined with the intrinsic ability of PST for long distance spore dispersal [4], these new races pose an increasing threat to global wheat production and food security [5].
Biotrophic plant pathogens such as rust pathogens secrete an array of proteins, known as effectors, to modulate plant innate immunity and enable parasitic infection [6]. Some of these effectors translocate inside plant cells probably through specialized infection structures known as haustoria [7][8][9]. Inside plant cells, effectors perturb host processes promoting pathogenesis. However, disease resistance genes in plants, known as R genes, encode immunoreceptors that recognize specific pathogen effector proteins. Once effector proteins are recognized, plants initiate an immune response to block the development of disease, which typically results in a localized hypersensitive reaction and programmed cell death [10,11]. The identification and characterization of these effectors and their cognate R genes is an important first step to understanding the wheat-PST pathosystem and consequently, to our ability to develop sustainable and potentially more durable resistance breeding strategies.
Recent availability of rust pathogen genome sequences has enabled the first steps towards wide-scale cataloguing of putative effector proteins. For instance, Saunders et al. [12] and Duplessis et al. [13] both implemented high throughput computational methods to characterize the effector complements from the fully sequenced rust fungi PGT and Melampsora larici-populina. Recently, Cantu et al. [14] used next-generation sequencing (NGS) to assemble a draft genome of PST isolate 130 (PST-130), annotating 22,185 putative coding sequences and classifying 1,088 of these as predicted secreted proteins. In addition, resources such as cDNA libraries generated from urediniospores and isolated haustoria (to identify PST genes specifically expressed during pathogen infection [15][16][17][18]) are publicly available. Together, they provide the necessary tools to develop a framework for characterization of the putative effector repertoire of PST.
The rapid decrease in sequencing costs now makes it possible to re-sequence multiple PST isolates to further characterize its pathogenicity arsenal. For instance, comparative genome analyses of different isolates of Magnaporthe oryzae, the rice blast pathogen, expanded the knowledge gained from the original reference genome considerably and helped to identify new effector genes with avirulence activity [19]. Similarly, genomic analysis of an epidemic isolate of the potato blight pathogen Phytophthora infestans provided insights into increased aggressiveness and virulence [20].
In this study, we re-sequenced four PST isolates with different virulence profiles and from two distinct geographical regions (the USA and the UK). We identified hetero-and homokaryotic SNPs, providing a first glimpse into PST genetic diversity on a genome wide scale. We performed independent gene discovery and annotation across all isolates to produce a combined PST secretome and identified haustoria-enriched transcripts. We validated the expression of a subset of genes during an infection time course, and revealed distinct temporal expression patterns among them. This data was then integrated using a modified version of the in silico pipeline described in Saunders et al. [12] to classify the putative effector repertoire of PST. Using this information, we identified putative secreted, haustoria-enriched proteins with non-synonymous polymorphisms specifically between the two UK isolates, which only differ in virulence to two known wheat differential varieties. This approach identified five effector candidates among 2,999 predicted secreted proteins that are highly expressed in haustoria and are polymorphic between the UK isolates, PST-87/7 and PST-08/21. These allelic variants are now a priority for functional validation as virulence/avirulence effectors in the corresponding wheat varieties.

Selecting PST isolates with distinct virulence profiles
Four PST isolates from different races with distinct virulence profiles and varied geographic origin were selected for genome sequencing (Table 1). All isolates were initially identified on wheat plants, except PST-21. This isolate was originally isolated from infected triticale plants, but was subsequently shown to be virulent on wheat plants carrying the Yr1 resistance gene [21]. The virulence profiles of the two UK isolates (PST-08/21 and PST-87/7) were examined on a set of European differential lines, complemented by a common set of Avocet 'S' near isogenic lines, and compared to the previously reported profiles for the US isolates (Table 1) [1,22]. The two UK isolates differed in their virulence to wheat varieties Robigus (YrRob) and Solstice (YrSol), but had common virulence to ten Yr genes.
Genome sequencing, quality assessment, and gene prediction We used an Illumina whole-genome shotgun sequencing approach to sequence four PST isolates in addition to the recently sequenced isolate PST-130 [14]. After filtering (see Methods), the total number of PST contigs assembled ranged from 29,178 to 55,502 (Table 2). To assess the completeness of the genome sequences, we aligned the reads to the previously assembled PST-130 contigs [14]. A large proportion of the reads from the newly sequenced PST isolates aligned to the assembled PST-130 contigs (on average 86.2% SD: ±1.6%; Additional file 1), suggesting that most of the PST genome is present in the PST-130 assembly and confirming previous estimates of PST genome size [14]. In addition, a large fraction of publically available PST expressed sequenced tags (ESTs) mapped onto the assembled contigs, ranging from 81.8% in PST-87/7 to 83.2% in PST-43, further supporting a high degree of completeness of the genome assemblies. Taken together, these data suggest the US and UK isolates include between 82 and 88% of the rust genome.
As an independent estimate of the degree of completeness of the assembled gene space, we implemented CEGMA analysis [23]. This protocol maps a set of 248 low copy core eukaryotic genes (CEGs) that are conserved across higher eukaryotes to the assemblies [24]. On average 72.1% (SD: ±14.6%) of the CEGs aligned as complete gene copies to the assembled contigs (Additional file 2), compared to 81.7% (SD: ± 9.7%) that aligned as fragmented partial gene copies (Additional file 2). These values are slightly lower than those of the whole genome sequence of P. graminis f.sp. tritici where 91.1% of the CEGs were mapped as complete copies and 92.7% as fragmented genes (Additional file 2). The levels of complete gene coverage were higher for all US isolates (average 83%, SD: ± 0.6%) and comparable to partial gene coverage (average 88.7%, SD: ± 0%), indicating that few core eukaryotic genes were split across contigs for these isolates. For the two UK isolates (PST-08/21 and PST-87/7) complete gene coverage was reduced (average 56.5%, SD: ± 5.7%) compared to partial gene coverage (71.3%, SD: ± 3.9%), indicating slightly higher levels of fragmentation for these genomes, which is likely attributed to the lower level of genome coverage for these isolates.
To identify open reading frames in the five assembled PST genomes, we applied the MAKER pipeline which used ab initio and homology based predictions and was supported by cDNA evidence generated in the course of this study (see below and Methods; [25]). On average 20,280 (SD: ±1,201) protein-coding genes were identified in the five isolates (Table 3). Over 90% of the extended CEGMA set, which includes 2,748 CEG variants from six eukaryotic genomes, matched peptides predicted in each of the assemblies, with an average of 82.4% of the protein length aligned (SD: ± 8.9%; BLASTP, e-value ≤1e -6 ). This provided an independent estimate of the completeness of the gene catalogue.

Estimation of diversity between PST isolates
Urediniospores constitute asexual dikaryotic spores that contain two independent nuclei. Therefore, to assess genetic variation between the two nuclei in the sequenced PST urediniospores we aligned the sequence reads of a particular isolate to the assembled contigs of the same isolate. This global analysis identified an average single  nucleotide polymorphism (SNP) frequency of 5.98 ± 1.13 SNPs/kb between the two nuclei within a single isolate, referred hereafter as intra-isolate SNPs (Additional file 3).
In addition, we aligned the reads of each isolate to the assembled contigs of the other isolates (Additional files 3 and 4). We classified these inter-isolate SNPs into two classes; heterokaryotic SNPs which refers to a variant position between the two nuclei within a single isolate that is homozygous in other isolates, and homokaryotic SNPs, which refer to homozygous variants that are polymorphic between two independent isolates (Additional file 5). On average, heterokaryotic SNPs across isolates were more frequent (5.29 ± 2.23 SNPs/kb) than homokaryotic SNPs (0.41 ± 0.28 SNPs/kb). The highest levels of diversity were found when reads of isolates PST-21 and PST-130 were mapped onto the two UK isolates (PST-87/7 and PST-08/21) and US PST-43. Here, we observed an average of 7.11 ± 1.14 SNPs/kb for heterokaryotic sites, whereas homokaryotic SNPs had a frequency of 0.64 ± 0.08 SNPs/kb. When the two UK isolates and PST-43 were compared, the heterokaryotic SNP frequency was 2.23 ± 0.53 SNPs/kb, whereas the homokaryotic SNP frequency was 0.02 ± 0.01 SNPs/kb (Additional file 3). We performed a phylogenetic analysis using the homokaryotic SNP data in both coding and non-coding regions. In the associated dendrograms, US isolate PST-43 clustered with the two UK isolates, PST-87/7 and PST-08/21 ( Figure 1A). PST21 was more closely related to PST130 than to the other three isolates, and both where equally distant to the PST-43/UK cluster.

Observed genetic diversity may reflect origin of isolates
To further characterize the genetic diversity between the five sequenced isolates of PST, we assessed the number of homokaryotic SNPs within the gene space (Additional file 6). Using this SNP data, we generated a set of four representative synthetic genes for each reference gene that incorporated the SNP information independently for each isolate to enable downstream analysis of genetic diversity (Additional file 5). Pair-wise comparisons of non-synonymous mutations in these gene sets revealed that when genes representing US isolates PST-21 and PST-130 were compared against other isolates, polymorphisms were more apparent (Figure 1B green circles; Additional file 7). For instance, when using the PST-43 genes as a reference, PST-21 showed between 1,706 and 2,047 polymorphic genes in pair-wise comparisons with all other isolates. Similarly, PST-130 showed between 1, 428 and 2,047 polymorphic genes in pair-wise comparisons with all other isolates, when using this same reference. In contrast, in pairwise comparisons between PST-43, PST-87/7 and PST-08/21 less than 130 genes were shown to be polymorphic when using the PST-43 genes as the reference ( Figure 1B; Additional file 7).
Enrichment in non-synonymous mutations between genes encoding predicted secreted proteins and nonsecreted proteins was assessed using the hypergeometric test. This analysis revealed enrichment in polymorphisms in secreted proteins for nine pairwise comparisons with eight originating from comparisons of PST-21 or PST-130 against other isolates ( Figure 1B, orange highlighted squares). We also calculated rates of synonymous (dS) and non-synonymous (dN) substitutions for each pairwise comparison in each synthetic gene set (Additional file 8). This analysis highlighted more genes with dN/dS > 1 when PST-21 and PST-130 were compared with all other isolates, mirroring the pattern shown in the sequence polymorphism analysis ( Figure 1B, purple circles). For example, for protein PST21_04206 sequence polymorphisms and positive selection were identified between the UK isolates and US PST-43 when compared to the synthetic gene from US PST-130 (Additional file 9). There was no evidence for enrichment in positive selection in genes encoding secreted proteins when compared to those encoding non-secreted proteins.
Another measure of genetic diversity is to assess the number of absent genes in pair-wise comparisons between isolates. The total number of genes classified as absent with no reads aligned, was greater for alignments against the two US isolates PST-21 (161-482 genes) and PST-130 (154-310 genes) when compared to US PST-43, UK PST-87/7 and UK PST-08/21. Less than 100 genes were shown to be absent in PST-43, PST-87/7 and PST-08/21 when alignments between these isolates were considered ( Figure 1C; Additional file 10). Reciprocally, when PST-21 or PST-130 sequence reads were mapped against the genome assemblies of the other three isolates (PST-43, PST-87/7 and PST-08/21), a greater number of genes were noted as absent when compared to the alignment of the other three isolates ( Figure 1C; Additional file 10). A similar pattern was also observed when genes encoding predicted secreted proteins were assessed ( Figure 1C). This analysis confirmed our previous observation that US PST-130 and PST-21 appear more genetically diverse when compared to other isolates in this study, potentially reflecting different evolutionary origins for these isolates. The observed genetic diversity for US PST-21 may reflect adaptation to a different host, triticale, when compared to isolates in this study that were isolated specifically from wheat.

RNA-seq analysis identified transcripts specifically enriched in haustoria
We performed RNAseq analysis of UK PST-08/21 infected wheat leaves and isolated haustoria to identify haustoriaenriched transcripts (Additional file 11). After filtering, we aligned reads to the PST-08/21 assembly and also generated de novo assemblies that were used to support the gene prediction pipeline. We conducted differential expression analysis after normalization, using DESeq to identify genes that were significantly up-regulated (False discovery rate <0.01) in haustoria compared to infected tissue ( Figure 2A; Additional file 12). Within the subset of transcripts encoding for predicted secreted proteins (933 genes, see Methods), we identified 57 genes that were significantly enriched in haustoria (6.1%), compared A B C Figure 1 US PST-43 may belong to the same clonal lineage as the UK isolates PST-87/7 and PST-08/21. A. Dendrograms illustrating that US isolate PST-43 clustered with the two UK isolates PST-87/7 and PST-08/21. Dendrograms were constructed using the homokaryotic SNP information either in the coding or non-coding regions of the genome from all pair-wise alignments. B. Pair-wise comparison of non-synonymous mutations in synthetic gene sets illustrated that polymorphisms were more apparent when corresponding proteins representing PST-21 and PST-130 were compared against other isolates. Each gene for a given reference was taken in turn and any homokaryotic SNPs incorporated for each isolates mapped. The five genes (one reference gene and four synthetic genes) were then subjected to pair-wise polymorphism and positive selection analysis using Yn00. Circle sizes represent the number of proteins with at least one non-synonymous mutation (green circles) or under positive selection (purple circles). Pair-wise comparisons that showed enrichment in non-synonymous mutations in secreted proteins are illustrated with an orange background. C. The total number of genes determined as absent by mapping the sequence reads from each isolate in turn against every other isolate as a reference was greater for alignments against PST-21 and PST-130 when compared to alignments against PST-43, PST-87/7 and PST-08/21. A similar pattern was also observed when genes encoding predicted secreted proteins were assessed. Total number of genes absent from white to blue (0-482) for full gene complement and white to red (0-14) for secretome. to 31 (3.3%) that were significantly depleted ( Figure 2B). A much lower proportion of genes encoding for non-secreted proteins was identified as enriched (2.1%; 411 of 19,703 genes) or depleted (1.7%) using the same analysis ( Figure 2C). These results suggest that this approach is effective in selectively enriching for secreted proteins, which are likely to be haustorial-expressed transcripts.
Defining and classifying the effector repertoire of PST To identify and classify candidate effectors from PST, we implemented a modified version of the bioinformatics pipeline described in Saunders et al. [12] (Figure 3). The five PST proteomes were combined (totaling 100,357 proteins), 5,502 secreted proteins predicted, and redundancy in the combined secretome reduced (see Methods). After consolidation, a total of 2,999 predicted secreted proteins were selected representing the diversity of the five PST secretomes (Figure 3). To enable the identification of any potential conserved rust effectors, secreted proteins were also classified and incorporated in this study from the proteomes of Puccinia graminis f. sp. tritici and Melampsora larici-populina, comprising 1,333 and 1,173 secreted proteins, respectively. The three rust pathogen secretomes were combined and grouped into 1,797 protein tribes based on sequence similarity using Markov clustering [26]. The final filtering step of the bioinformatics pipeline reduced the total to 1,037 tribes, each containing at least one PST secreted protein. Proteins in the 1,037 PST-containing tribes were then annotated with both known effector features from other pathosystems and PST-specific criteria ( Figure 3). The later criteria focused on allelic variation between the five PST isolates and the expression of genes at the plant-pathogen interface as determined by the RNA-seq analysis of PST-08/21infected wheat leaves and isolated haustoria. Expression of genes either during infection or specifically in haustoria was then added as criterion in the effector-mining pipeline ( Figure 3).

Identifying candidate effectors of high interest
To order and cluster protein tribes based on known effector features and PST-specific annotation we implemented the sorting module of the pipeline (see Methods; Figure 3). This resulted in overall scores for each tribe that reflected their likelihood of containing potential effector proteins (Additional file 13). The features associated with the top 100 ranked tribes are displayed in Figure 4.
To select proteins with a high likelihood as candidate effectors we focused on tribes that ranked highly based on our scoring system, are highly expressed and enriched in haustoria, and display sequence polymorphisms between isolates. For example, Tribe 238 is ranked 10 th in our scoring system and is a member of Cluster VII, which reflects the fact that both members, PST21_18221 and PST21_18220, are expressed highly in haustoria and are significantly enriched in haustoria with respect to infected tissue (P = 0.001, 31 and 42-fold, respectively). This tribe represents two distinct proteins that are present in all isolates sequenced in this study. One protein was identified as polymorphic specifically between the US isolates, whereas the second protein was conserved across all isolates ( Figure 5A-B). The two proteins are encoded by genes within close proximity on a single contig in the sequenced genomes ( Figure 5C). This suggests that these A B C Figure 2 Comparison between infected tissue and isolated haustoria RNAseq libraries. Scatter plot of log 2 transformed sequencing counts generated by aligning RNAseq reads to all PST-08/21 genes (A), those that encode predicted secreted peptides (B), or encode non-secreted peptides (C). Red and green colored circles correspond respectively to transcripts that were identified as significantly enriched or depleted in isolated haustoria as determined by DESeq analysis (P ≤ 0.01). Red lines represent the locally weighted polynomial regression (LOWESS method).
genes could have arisen from a duplication event, which is further supported by sequence similarity between the two proteins within the N-terminus. This is consistent with the concept that pathogenicity factors may arise from gene duplication events followed by rapid diversification to evolve new functions [27].
Candidate effectors with sequence polymorphisms between the UK isolates PST-08/21 and PST-87/7 We identified secreted proteins that displayed polymorphisms specifically between the two UK isolates PST-87/7 and PST-08/21, which only differ in virulence to two wheat differential varieties Robigus (YrRob) and Solstice (YrSol). The properties of the 117 proteins identified are displayed in Figure 5D. Of these 117 proteins, 14 were members of tribes that ranked in the top 100 in our scoring system and five were encoded by genes which were significantly enriched and in the top 10% of expressed transcripts in both haustorial samples (Table 4). For all five genes a single amino acid substitution was evident between the sequences for the two UK isolates. For example, for PST130_05023, four amino acid substitutions were identified; one differential between the UK and US isolates, two between UK isolates and US PST-43 when compared to the other US isolates, and one specific substitution between the two UK isolates ( Figure 5E). Focusing on proteins that display polymorphisms between these two isolates may facilitate the detection of avirulence and corresponding virulence effector proteins that are specifically differentially recognized by YrRob and YrSol.
Transcript profiling reveals peaks in gene expression for candidate effectors during infection To further characterize a subset of candidate effectors, we assessed their expression profiles using quantitative RT-PCR across different infection time points (20 hours and 1, 6 and 14 days post inoculation (dpi)) ( Figure 6A; Additional file 14). Twenty-two putative effectors representing nineteen tribes were selected based on their rank as likely effectors and their expression levels determined by mRNAseq analysis of isolated haustoria ( Table 5). The gene models of these effector candidates were further validated during the qRT-PCR analysis using primers that spanned the splice site junctions. Three peaks in expression were observed for the candidate effectors using qRT-PCR, with eight more highly expressed at 1 dpi, four at 6 dpi and nine at 14 dpi ( Figure 6B). Haustoria are already formed at 1 dpi [28] suggesting that some of these candidate effectors might be involved in the very early stages of infection. Two of the five haustoria-enriched polymorphic genes between the UK isolates were amongst those most highly expressed at 1 dpi (PST21_15642 and PST130_00285), whereas one showed highest expression at 14 dpi (PST130_05023). Only one candidate effector (PST130_10194) was stably expressed at all time-points assessed during infection. For tribes 238, 413 and 6, the expression profiles of several members of each tribe were assessed. With the exception of one member of tribe 6 (PST21_11390), members of the same tribe tended to peak in expression at the same time-point, indicating that they may be involved at similar stages during disease progression. These results are consistent with the concept of sequential waves of expression of different sets of effectors during the infection process [29].

Polymorphic secreted proteins as effector candidates
In this study, we re-sequenced four PST isolates to identify effector candidates and relate them to their distinct virulence profiles. Once the complete effector complement was defined, we focused on highlighting polymorphic effector candidates that may reflect rapid  adaptation to specific host targets, such as diseaseresistance proteins amongst others. The list of effector candidates we developed here should prove useful to the rust research community to initiate functional screens for effector activity. One possible strategy would be the induction of hypersensitive cell death by potential avirulence proteins (AVR) on differential wheat lines using an effector delivery system as is routinely conducted in the potato-P. infestans pathosystem [30].
The co-evolutionary arms race between pathogens and plants has led pathogens to respond by mutating or losing AVR effector proteins [31]. A survey of six Melampsora lini isolates identified diversifying selection acting on the Avr567 locus, with substitutions identified in surface A D E B C Figure 5 Secreted proteins with high likelihood as candidate effectors. A. Sequence alignment of PST21_18220, a member of tribe 238, and the corresponding alleles from the other four isolates illustrating sequence polymorphisms specifically between the US isolates, PST-21, PST-43 and PST-130. B. Sequence alignment of the second member of tribe 238, PST21_18221, and its alleles from other isolates illustrating that this protein was highly conserved across isolates. C. The two members of tribe 238, PST21_18220 and PST21_18221, are in close proximity within a single contig in the genome sequence. The corresponding genes were expressed during infection and were also highly expressed and enriched in haustorial samples as determined by mRNA-seq analysis. D. Features displayed by the 117 proteins that were identified as polymorphic between the two UK isolates PST-08/21 and PST-87/7. E. Sequence alignment of PST130_05023 and the synthetic genes that incorporate the SNP information from the other four isolates sequenced, illustrating sequence polymorphisms between isolates. Polymorphic residues are indicated below the sequence by red stars. Genes expressed in the top 10 (a), 5 (b) and 1% (c) in a particular RNAseq dataset.
exposed residues that dictated host recognition [32,33]. Likewise, Yoshida et al. [19] identified three novel Avr genes that were absent from the M. oryzae genome sequence (isolate 70-15) but present in the re-sequenced avirulent isolate Ina168. In PST, we did not find evidence of large scale absence of genes or putative effectors on the scale identified in M. oryzae [19]. Even when comparing between the most distantly related isolates, we found less than 15 predicted secreted protein genes to be absent in any pairwise comparisons (<1.3% of the total predicted secretome, Figure 1C). Although this is a first and limited sampling of the available PST diversity, it suggests that gene loss may not be the main driver for changes in virulence. Therefore, assessing allelic variation in putative effectors between distinct PST isolates will likely be important to identify avirulence/ virulence alleles.

Uncovering the evolutionary origin of PST isolates
Comparative genomic studies are powerful tools for assessing the evolutionary origin of particular races. In this study, the limited genetic diversity found between US isolate PST-43 and the two UK isolates, PST-87/7 and PST-08/21, could be indicative of these isolates belonging to the same clonal lineage. This is supported by a diversity study based on 117 amplified fragment length polymorphism (AFLP) fragments which showed that PST isolates collected before 2000 in the US and Europe clustered together [34], thereby suggesting a common origin.  [5,34] and that PST-21 displays host specificity for triticale [21]. The data generated is not only valuable for addressing these evolutionary questions, but can also be used by the wider rust research community for development of more extensive polymorphic markers for large-scale screening of PST field populations and to complement the current set of dominant AFLP markers.

Utility of haustoria-enriched transcripts in effector mining
As biotrophic fungi and oomycetes secrete effectors from highly specialized structures, known as haustoria, we also undertook transcriptome analyses of infected tissues and isolated PST haustoria to identify potential Avr genes. For example, sequencing ESTs from M. lini isolated haustoria led to the identification of three uncharacterized Avr genes [7], suggesting that haustorial expressed transcripts represent a rich source for Avr effector identification. To identify haustoria enriched transcripts, we compared gene expression levels between infected material and isolated haustoria. Although RNAseq analysis of infected material was limited to a single biological replicate, we used the DESeq approach [35] to assess enrichment as this package is suited for working with partial replicates [36]. This enabled us to estimate levels of variation in the infected  material sample based on that observed in the replicated haustoria samples. Despite the low power of detection with this approach we could identify haustoria enriched genes with sufficient confidence. These transcriptome results helped to further classify and prioritize the PST secretome. It also establishes a robust baseline from which to extend this analysis to other isolates.

Improving existing PST gene models
An additional contribution of the expression studies was to generate more reliable PST gene models, which is critical to identify genes encoding for proteins with secretion signal peptide sequences. This was especially relevant for the downstream effector prediction pipeline since this constitutes the first criteria with which proteins are classified and filtered. We addressed this potential limitation by firstly complementing the homology based and ab initio prediction pipeline [25] with the de novo transcript assemblies generated from the RNAseq data and secondly, by integrating independent gene predictions from all isolates. Whilst assessing the independent gene sets obtained for the different isolates we noted that for many fragmented genes, a non-fragmented copy could be identified in at least one genome. Therefore, it is anticipated that the gene catalog presented here will be a better representation of the PST gene complement, than the previous one which was based on genomic sequences alone [14]. The overall accuracy of the gene predictions herein were also supported by the identification of higher frequencies in nucleotide variants in the intronic regions than in the exons for both heterokaryotic and homokaryotic SNPs (Additional file 3).

Applying comparative genomics to effector mining
The recent release of draft genome sequences for several rust pathogens has provided the first step towards widescale cataloguing of putative effector proteins [13,14]. So far, these studies have been limited to secretome prediction and annotation from a single reference genome. Here, we employed a comparative genomics approach to move beyond the single isolate-directed catalogue and utilise the distinct virulence profiles of the isolates sequenced to identify putative virulence/avirulence effectors. Our aim was to integrate genomics, transcriptomics and effectorfocused annotation to generate a rich source of information that could be utilized to identify effector candidates in PST. However, this raises the challenge of developing methods that can utilize the vast abundance of data to address clearly defined biological questions. The clustering and classification methods used here allowed us to organize the complexity of these large datasets. Providing the data in an easily comprehensible format ( Figure 4) will enhance accessibility and help the wider rust research community to rapidly access effector candidates. Moreover, this method provides a logical framework to prioritize candidate effector genes for functional validation, an approach that will become increasingly powerful as additional races are re-sequenced, more mRNAseq data becomes available and the avirulence activity of candidate effectors is established. The use of association analysis to identify candidate avirulence proteins has been successfully implemented in other pathosystems [19,37]. As a first step in this direction, we examined two UK isolates (PST-87/7 and 08/21) that share common virulence for ten wheat Yr genes (Yr1, Yr2, Yr3, Yr4, Yr6, Yr7, Yr9, Yr17, Yr27, Yr32), but differ in their ability to infect two UK varieties, Robigus (YrRob) and Solstice (YrSol). We found no evidence of gene loss between these isolates. However, focusing on polymorphic variants that were also highly expressed and enriched in haustoria (within the top 10% expressed genes) reduced the dataset from 2,999 secreted proteins to just five polymorphic effector candidates. These genes and their allelic variants are now a priority for functional validation as virulence/avirulence effectors in the wheat varieties Robigus and Solstice.

The next challengefunctional validation of effector candidates
The next step will be to establish functional validation methods for rust effectors in wheat. This crucial phase is still in its infancy [38,39] and will most likely be limited to testing a handful of candidate genes in the initial stages. Several groups have attempted to modify existing heterologous expression systems from other pathosystems to establish a reliable method for testing rust AVR effector candidates in wheat. This includes delivery of effector candidates directly into wheat cells by expressing them in bacterial pathogens for delivery by the type III secretion system [39] or through virus-mediated approaches [40]. Alternatively, host-induced gene silencing could be utilized to transiently silence and test PST pathogenicity or virulence genes. Recently, silencing of three endogenous genes in Puccinia triticina, the wheat leaf rust pathogen, was reported using transient Agrobacteriummediated expression of corresponding RNAi constructs in wheat [38]. If established as a large-scale functional assay system, transient expression could be utilized to enable high-throughput loss-of-function screening of a diverse array of PST effector candidates in wheat.

Conclusions
This study provides valuable information including (i) an estimate of the distribution of genetic diversity within and among PST isolates, (ii) characterization of the expression of PST genes in infected tissue and haustoria using RNAseq analysis, (iii) a bioinformatics pipeline to organize and prioritize candidate effector genes for functional studies and (iv) a list of candidate avirulence genes which display polymorphisms specifically between two UK isolates. As additional races and divergent isolates are re-sequenced, polymorphic effectors will be more easily distinguished from underlying genetic diversity between isolates, streamlining the cataloging of potential avirulence/virulence proteins for testing. The functional validation of these will ultimately lead to a more comprehensive understanding of the PST pathogenesis system, an important step towards developing more effective surveillance and management strategies for one of the most devastating pathogens of wheat.

Genome sequencing, assembly and gene prediction
Genomic DNA was extracted for each isolate from dried urediniospores using the CTAB method as described by Chen et al. [41]. DNA libraries were prepared as described previously for PST-130 [14]. Library quality was confirmed before sequencing using the Agilent 2100 Bioanalyzer (Agilent Technologies, UK). Sequencing was carried out on an Illumina HiSeq machine at the DNA Technologies Service core at UC Davis. Adapter and barcode trimming and quality filtering were carried out using the FASTX-Toolkit [42]. Terminal nucleotides at the 3′ end with sequencing quality below Q20 were removed and reads that after trimming were shorter than 40 nucleotides were discarded. FASTQ files of high-quality trimmed sequences were used for downstream analysis. The pair-end trimmed and filtered reads were de novo assembled using the CLC Genomic Workbench 4.0 software [43]. The following parameters were applied: mismatch, insert, and deletion cost = 3; length fraction = 0.3; similarity = 1.0 no global alignment; conflict resolution = vote; ignore nonspecific matches; min contig length = 300 bp; paired-end distance = 100-600 bp. Contigs with homology with non-fungal sequences in the complete NCBI nt collection were considered contaminant and discarded. Assemblies were deposited at GenBank and the SRA and WGS accession numbers are listed in Additional file 1 together with the general sequencing and assembly metrics. To assess genome completeness reads from each isolate were filtered for contaminants and then mapped to the assembled PST-130 contigs. PST ESTs used for assessment of genome completeness were obtained from GenBank. Gene prediction was undertaken following the MAKER pipeline [25] using PST ESTs, de novo transcript assemblies generated in this work (see below), and PGT and P. triticina (PT) peptide sequences as templates for gene model discovery. The homology based gene prediction implemented in MAKER was integrated with the ab initio prediction program SNAP [44] using as training dataset the above mentioned PST, PGT, and PT datasets.

Assessing genetic diversity
Illumina pair-end genomic sequence reads from each isolate were mapped onto each other isolate used as a reference using Novoalign (version 2.07.18 [45]; parameters used: -R99, -i PE 500,400 -Q30 -o SAM). Custom perl scripts were used to extract the mapping counts from the SAM files and determine the sequencing coverage of each gene. Mapping metrics are reported in Additional file 4. Genes were classified as absent when no reads mapped to the coding region of a particular gene. SAM files were converted into BAM format, sorted, indexed, and analyzed using Picard tools (version 1. 55 [46]). Single nucleotide polymorphisms (SNPs) were determined using the Genome Analyzer Toolkit pipeline (GATK, version 1.65; [47]). The GATK RealignerTargetCreator and IndelRealigner programs were applied to realign the reads mapped on indel sites. The GATK UnifiedGenotyper with parameters --outputmode EMIT_VARIANT_ONLY and -glm SNP was then used to identify SNPs. The alignment of sequencing reads to the correct genomic location is critical for the accurate identification of genetic variants. Thus, the variant calling was restricted to those sites that did not display either too low (minimum 5x coverage) or too high coverage. Genomic regions that display a higher than expected number of aligned reads are likely to be stretches of similar or repetitive sequences that have been assembled together [14]. To reduce the number of false positive identified (i.e. SNPs between non-orthologous region), we imposed a maximum coverage threshold equivalent to 1.5 times the median coverage over the entire assembly. To determine whether the calculated median coverage is a valid proxy of the coverage associated with single copy genes, sequencing reads were remapped on a set of 10 single copy genes previously identified in the PST-130 assemblies [14]. The sequencing coverage on these single copy genes was very similar to the median coverage over the entire assembly [e.g. 63.9x coverage (SD: ±6.7) and 59.9x (SD: ±5.7) in PST-21 and PST-130, respectively]. Assuming that the coverage of a repetitive region increases proportionally with the copy number, the maximum coverage threshold we selected is expected to reduce both Type I and Type II errors of calling a single copy locus to less than 0.001 for all races. Heterokaryotic sites were identified as sites with allelic frequency = 0.5. If a site in the reference had allelic frequency of 0.5 or the mapped reads from another isolate had allelic frequency = 0.5 the site was considered heterokaryotic (Additional File 5). Homokaryotic variants were identified as sites that are homokaryon both in the reference (allelic frequency = 1.0) and in the mapped reads (frequency of the alternative allele = 1.0). The overall ratio of transition over transversion mutations across all five isolates was 2.30 ± 0.17. These values are consistent with human studies [48] and as expected, are higher than the 0.5 ratio that would be obtained if all substitutions were equally probable.
To assess genetic diversity between isolates synthetic gene sets were generated. Each gene for a given reference was taken in turn and any homozygous SNPs incorporated for each isolate mapped (Additional File 5). The five genes (one reference gene and four synthetic copies) were then subjected to pair-wise polymorphism and positive selection analysis using the bioinformatics program Yn00 [49]. Any pair-wise comparisons that yielded a dN value > 0 were classified as polymorphic and those with dN/dS values > 1 were classified as under positive selection. The polymorphism and positive selection analysis was automated using custom Perl scripts.

Preparation of PST infected plant material for mRNAseq analysis
Two sample types were selected for mRNAseq analysis, infected wheat leaf material and purified haustoria. Wheat seedlings (cv Avocet 'S') were infected with PST-08/21 and incubated in the dark at 10ºC with high relative humidity for 24 hours. Plants were then transferred to a 16 h/8 h day/night cycle at 18ºC. For infected leaf samples material was collected at 6 and 14 days postinoculation (dpi) and pooled prior to RNA extraction. Alternatively, haustoria were isolated from infected leaf material 7 dpi. Two independent haustorial isolations were undertaken, each from~2,500 infected wheat seedlings. A total of 20 g of infected leaf segments were collected that displayed localized infection, as determined by small white flecks without rust sporulation. The segments were washed with 70% ethanol for 1 min, followed by water for 1 min, then haustoria isolation performed as described in Song et al. [50] using either three (H1) or two (H2) rounds of sucrose gradient purification.
Total RNA was extracted from both the infected wheat leaf material and purified haustoria using TRIzol reagent (Invitrogen, UK) and treated with DNA™-free DNase (Manufacturer) and the Removal Reagents kit (Ambion, UK) following the manufacturer's instructions. The quantity and quality of RNA extracted was assessed using the Agilent 2100 Bioanalyzer (Agilent Technologies, UK).
Transcriptome sequencing, de novo assembly and alignment cDNA libraries were prepared using the Illumina TruSeq RNA Sample preparation Kit (Illumina, US) for RNA extracted from infected material and the haustoria H2 sample, whereas RNA extracted from the haustoria H1 sample was processed using the Clontech SMARTer Ultra Low Input RNA Kit (Clontech, Takara Bio Europe, France). Library quality was confirmed before sequencing using the Agilent 2100 Bioanalyzer (Agilent Technologies, UK). Sequencing was carried out on an Illumina Genome Analyzer II at The Sainsbury Laboratory. The 76 bp pair-end reads were filtered for quality as described above and aligned to genomic PST assemblies using Bowtie (version 0.12.7 [51]) in global alignment mode allowing a maximum of two mismatches (parameters: -v 2 --best -k 1). SAM output files were parsed with custom Perl scripts to determine the number of reads mapping to a single PST transcript. DEseq [35] was used to normalize raw transcript counts and to compare libraries from PST infected tissue and enriched haustoria preparations in order to determine potential transcript enrichment in isolated haustoria compared to whole infected material (version 1.10.1). The reads were also assembled de novo using the Trinity package [52] with default settings to serve as additional evidence for gene prediction. The Illumina reads were deposited in the National Center for Biotechnology Information's Gene Expression Omnibus (GEO) and are accessible through GEO (GSE42496; [53]).

Secretome prediction and Markov clustering
The predicted proteomes of all five PST isolates were combined and secreted proteins predicted using SignalP2 with parameters described in [54]. Transmembrane domain containing proteins and proteins with mitochondrial signal peptides were removed using TMHMM [55] and TargetP [56], respectively. To reduce redundancy secreted proteins were clustered that displayed 99% sequence identity over 50% of the sequence length, using CD-HIT [57]. A single representative sequence was selected from each protein cluster and used for downstream analysis. Predicted proteomes of M. larici-populina and P. graminis f. sp. tritici were obtained from Duplessis et al. [13]. The secretomes were predicted as above for PST using PexFinder, followed by TMHMM and TargetP analysis. Proteins from the three predicted secretomes were then clustered based on sequence similarity using TribeMCL [26] following methods described in [58].

Annotation of secreted protein tribes with effector features
Automated BlastP-based annotation was performed on proteins included in the secretome tribes using Blast2GO [59] with default parameters. In addition, BlastP analysis of proteins in the secretome tribes was conducted using the haustorial EST database constructed previously [12], with an e-value cutoff of 10 -5 . We searched each protein for the effector motifs [L/I]xAR [19,60], [R/K]CxxCx12H [19], RxLR [61], [Y/F/W]xC [62,63], YxSL[R/ K] [64] and G[I/F/Y][A/L/S/T]R [7] between amino acids 10 to 110 using Perl scripts. Nuclear localisation signals were predicted with PredictNLS [65]. Protein internal repeats were predicted using T-Reks [66]. We also assessed the length of the flanking intergenic regions (FIRs) between genes and the cysteine content of proteins using custom Perl scripts. Proteins cataloged as small and cysteine rich were less than 150 amino acids long and had a cysteine content higher than 3% as defined in Saunders et al. [12].
Scoring secreted protein tribes for the likelihood of containing effector proteins First, the method described in Saunders et al. [12] was used to assign an e-value to each feature within a tribe. The e-value was based on the number of proteins within a tribe that displayed a particular feature, relative to the likelihood of a tribe of the given size containing the same number of proteins with that particular feature by chance (Additional file 13). Next, the individual e-values were log converted into a score and a combined score calculated giving more weight to features that may be more indicative of effector proteins which included high expression in haustoria (a, in top 100 expressed genes; b, in top 500 expressed) or infected material (c, in top 100 expressed genes, d, in top 500 expressed), similarity to previously characterized HESPs or AVR effector proteins (e). A lower weight was given to other effector-related criteria that included the absence of any annotation (f) and the score associated with small cysteine-rich proteins (g).
Combined score ¼ roundðlog a þ b þ c þ d þ e þ 1; 2 ð Þ Ã 1 þ 0:1 Ã f þ g ð Þ ð Þ ; 3Þ The combined score was then used to rank the tribes based on their likelihood of containing potential effector proteins. The features associated with the top 100 ranked secreted protein tribes were visualized using Circos [67].

qRT-PCR of effector candidates during infection
Wheat seedlings (cv Avocet 'S') were infected with PST-08/21 and incubated as described above. Six biological replicates were sampled at each time point (20 hours, 1 dpi, 6 dpi or 14 dpi). For each sample, 2.5 μg of total RNA was extracted and used for cDNA synthesis with the SuperScript First-Strand Synthesis System for RT-PCR (Invitrogen, UK). qRT-PCR was undertaken for 22 candidate effectors described in Additional file 15. Transcript levels were determined on a LightCycler® 480 instrument (Roche Applied Science, UK) using LightCycler 480 SYBR Green I Master (Roche) and the following conditions: 5 minutes at 95ºC; 40 cycles of 15 sec at 95ºC, 15 sec at 60ºC, 20 sec at 72ºC. The PCR amplification specificity was checked by dissociation curve analysis (from 60ºC to 95ºC). Transcript levels were normalized with P. striiformis elongation factor 1 [15,68] and linearised values determined using the 2 -ΔΔCT method [69].