A BAC based physical map and genome survey of the rice false smut fungus Villosiclava virens

Background Rice false smut caused by Villosiclava virens is a devastating fungal disease that spreads in major rice-growing regions throughout the world. However, the genomic information for this fungal pathogen is limited and the pathogenic mechanism of this disease is still not clear. To facilitate genetic, molecular and genomic studies of this fungal pathogen, we constructed the first BAC-based physical map and performed the first genome survey for this species. Results High molecular weight genomic DNA was isolated from young mycelia of the Villosiclava virens strain UV-8b and a high-quality, large-insert and deep-coverage Bacterial Artificial Chromosome (BAC) library was constructed with the restriction enzyme HindIII. The BAC library consisted of 5,760 clones, which covers 22.7-fold of the UV-8b genome, with an average insert size of 140 kb and an empty clone rate of lower than 1%. BAC fingerprinting generated successful fingerprints for 2,290 BAC clones. Using the fingerprints, a whole genome-wide BAC physical map was constructed that contained 194 contigs (2,035 clones) spanning 51.2 Mb in physical length. Bidirectional-end sequencing of 4,512 BAC clones generated 6,560 high quality BAC end sequences (BESs), with a total length of 3,030,658 bp, representing 8.54% of the genome sequence. Analysis of the BESs revealed general genome information, including 51.52% GC content, 22.51% repetitive sequences, 376.12/Mb simple sequence repeat (SSR) density and approximately 36.01% coding regions. Sequence comparisons to other available fungal genome sequences through BESs showed high similarities to Metarhizium anisopliae, Trichoderma reesei, Nectria haematococca and Cordyceps militaris, which were generally in agreement with the 18S rRNA gene analysis results. Conclusion This study provides the first BAC-based physical map and genome information for the important rice fungal pathogen Villosiclava virens. 
The BAC clones, physical map and genome information will serve as fundamental resources to accelerate the genetic, molecular and genomic studies of this pathogen, including positional cloning, comparative genomic analysis and whole genome sequencing. The BAC library and physical map have been opened to researchers as public genomic resources (http://gresource.hzau.edu.cn/resource/resource.html).


Background
Rice false smut, caused by Villosiclava virens (Cooke Tak) (anamorph Ustilaginoidea virens) [1,2], has emerged as a devastating disease of rice, owing to the intensive application of nitrogen and phosphorus fertilizers and the worldwide cultivation of high-yielding semi-dwarf rice cultivars [3]. Rice false smut was previously considered a minor disease because it occurred rarely and in limited regions, but it has spread widely over the last 20 years and now causes severe damage in many major rice-growing regions, including Asia, Africa, the United States, South America and Italy [3,4]. Rice false smut dramatically damaged rice production in 1988 and has continued to occur frequently [5]. The ustiloxin produced by this pathogen in infected plant tissues is a cyclopeptide mycotoxin that inhibits the growth of microtubules and is toxic to humans and livestock [6].
To date, the knowledge of V. virens is still very limited. Ashizawa et al. reported a sensitive method to quantify V. virens pathogens in soil samples using real-time PCR [4]. Ladhalakshmi et al. studied the intensity of rice false smut in India and found that the percentage of false smut-infected tillers ranged from 5% to 85% in the southern states, and 2% to 75% in northern states [7]. Atia et al. first investigated rice false smut in Egypt and reported that the production loss caused by this pathogen ranged from 1.0% to 10.9% [8]. Tanaka et al. established a simple transformation system of this pathogen using electroporation of intact conidial cells [2]. Fu et al. described the morphologic characteristics more precisely [3].
At present, control of rice false smut is far from ideal. To find effective and environmentally friendly control methods, more morphological, molecular, genomic and genetic data on V. virens are required to reveal the infection process, the mechanism of host-pathogen interaction, the genetic variation and diversity, and the genome composition of this species.
BAC libraries, physical maps and BESs serve as important tools in genetic, molecular and genomic studies. BAC libraries are used as templates in targeted or whole genome sequencing, in physical map construction and in functional complementation of genes during positional cloning. Physical maps provide frameworks for genome sequencing and the physical positions of genes and markers. BESs are accurate and inexpensive genome samples [9], from which initial insights into the genome composition and candidate molecular markers can be obtained [10,11]. Combined, the BAC library, physical map and BESs of a genome are even more powerful in these and related research fields. BAC clones are considerably more valuable for targeted genome sequencing and positional cloning once they are placed on a physical map. BESs embedded in a physical map can be used as anchors in genome comparisons to detect sequence assembly errors in the same source genome and large structural changes between phylogenetically close genomes [12,13].
BAC libraries and physical maps have been constructed for several agriculturally important fungal organisms, such as Magnaporthe oryzae [14,15], Blumeria graminis [16], Fusarium graminearum [17], Cryptococcus neoformans [18], Trichoderma reesei [19] and Ustilago maydis [20]. We recently constructed a BAC library for a V. virens strain, UV-2 [21]. The BAC library contains 10,368 clones and has an average insert size of 124.4 kb. However, no physical map was constructed and no BESs were produced with this BAC library. Here we report the construction of a BAC-based physical map and genome survey of the V. virens strain UV-8b. To our knowledge, this is the first physical map and genome sequence information developed for V. virens.

Phylogenetic analysis of strain UV-8b
The V. virens strain UV-8b was derived from a single spore isolated from the japonica rice Zhonghua 11. To analyze the phylogenetic relationship between this strain and other fungal pathogens, we sequenced its 18S rRNA gene and compared it with other fungal 18S rRNA gene sequences. The 18S rRNA gene sequence of UV-8b showed 99% identity to those of other V. virens strains, and the phylogenetic tree constructed with the NJ algorithm clustered UV-8b into the V. virens clade (Additional file 1: Figure S1). The UV-8b strain is also related to members of Metarhizium, Trichoderma and Cordyceps (about 98% identity among 18S rRNA gene sequences).

BAC library construction, fingerprinting and contig assembly
To obtain basic genome resources for V. virens, we constructed a BAC library and a BAC-based physical map of the V. virens strain UV-8b. The BAC library consists of 5,760 clones arrayed in fifteen 384-well plates. Analysis of 180 random BAC clones showed that the library had an average insert size of 140 kb, with a size range of 25 to 190 kb, and an empty-vector rate lower than 1% (Table 1; Additional file 2: Figure S2). The library was calculated to cover 22.7-fold of the UV-8b genome (based on a genome size of 35.5 Mb, Dr. Shaojie Li, personal communication).
Table 1 footnotes: (1) The genome size of V. virens was estimated as 35.5 Mb. (2) The clones that contained 50-200 bands were imported into FPC for contig assembly. (3) The total number of bands of all successfully fingerprinted clones was 283,391.
We fingerprinted 2,688 BAC clones using five restriction enzymes (BamHI, EcoRI, XbaI, XhoI and HaeIII). After quality filtering as described in the methods, the fingerprint profiles of 2,290 BAC clones qualified for FPC assembly. These 2,290 clones represented about 9 genome equivalents and contained an average of 124 bands (consensus bands; CBs) per clone. Based on the average insert size of 140 kb, one CB was estimated to correspond to 1.13 kb (Table 1).
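The coverage and band-size figures above follow from simple arithmetic on the reported clone numbers, insert size and genome size. The short Python sketch below reproduces them; the function and variable names are illustrative only and are not part of the original analysis.

```python
# Figures quoted in the text; all names here are illustrative.
GENOME_SIZE_MB = 35.5          # estimated genome size of V. virens
AVG_INSERT_KB = 140.0          # average BAC insert size
LIBRARY_CLONES = 5_760         # clones in the whole library
FINGERPRINTED_CLONES = 2_290   # clones with usable fingerprints
AVG_BANDS_PER_CLONE = 124      # average consensus bands per clone


def fold_coverage(n_clones: int, insert_kb: float, genome_mb: float) -> float:
    """Genome equivalents = total cloned DNA (Mb) / genome size (Mb)."""
    return n_clones * insert_kb / 1000.0 / genome_mb


library_cov = fold_coverage(LIBRARY_CLONES, AVG_INSERT_KB, GENOME_SIZE_MB)
fingerprint_cov = fold_coverage(FINGERPRINTED_CLONES, AVG_INSERT_KB, GENOME_SIZE_MB)
kb_per_band = AVG_INSERT_KB / AVG_BANDS_PER_CLONE

print(f"library coverage:       {library_cov:.1f}x")
print(f"fingerprinted coverage: {fingerprint_cov:.1f}x")
print(f"size of one band:       {kb_per_band:.2f} kb")
```

Because every value scales with the assumed genome size, a revised genome size estimate would change the coverage figures proportionally.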
The fingerprint data of the 2,290 clones were imported into FPC V9.4 for contig assembly. A series of tests was performed to find optimal assembly parameters. Table 2 shows the assembly results at tolerance 4 and different cutoff values. Based on these tests, we chose 10^-15 as the initial cutoff value for contig assembly. This setting assembled 2,035 clones into 196 contigs containing 111 (4.85%) Q clones, and left 255 (11.14%) clones as singletons. Contig101 (4 clones) was then end-merged to contig76 (18 clones), and contig193 (29 clones) was end-merged to contig1 (3 clones), using the "End to End" function at a terminal cutoff of 10^-12 and a match value of 2, which reduced the number of contigs to 194. This result is referred to as PhaseIA and is used as the standard version. The PhaseIA contigs covered 51.2 Mb in physical length. The discrepancy between the estimated genome size (35.5 Mb) and the total physical length of the contigs may result from redundancy among contigs, which could be detected and merged with more evidence. Using the "End to End" function at a terminal cutoff of 10^-08 and a match value of 1, we merged another 74 contigs. This result is referred to as PhaseIB. The BAC library and the two versions of the physical map have been made available to researchers as public genomic resources (http://gresource.hzau.edu.cn/resource/resource.html).

BAC end sequencing
To perform a genome survey and provide anchor sequences on the physical map for genome comparisons, we sequenced both ends of 4,512 BAC clones, including the clones used in fingerprinting. A total of 6,560 high quality BESs were generated after quality trimming, of which 5,676 (86.52%) were paired-end sequences and 884 (13.48%) were single-end sequences (Table 1). The maximum and average lengths of the BESs were 798 bp and 462 bp, respectively (Figure 2). The total length of the BESs was 3,030,658 bp, representing 8.54% of the whole genome. The GC content was 51.52%. The 6,560 high quality BESs are available in GenBank [GenBank:JY267549 to GenBank:JY274108].
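As a quick consistency check, the BES summary statistics can be recomputed from the reported counts. The sketch below uses only numbers given in the text, together with the 35.5 Mb genome size estimate; the names are illustrative.

```python
# Counts reported in the text; names are illustrative.
TOTAL_BES = 6_560
PAIRED_BES = 5_676
SINGLE_BES = 884
TOTAL_BES_BP = 3_030_658
GENOME_BP = 35_500_000  # 35.5 Mb estimate


def pct(part: float, whole: float) -> float:
    """Percentage of `whole` represented by `part`."""
    return 100.0 * part / whole


paired_pct = pct(PAIRED_BES, TOTAL_BES)
single_pct = pct(SINGLE_BES, TOTAL_BES)
mean_length_bp = TOTAL_BES_BP / TOTAL_BES
genome_pct = pct(TOTAL_BES_BP, GENOME_BP)

print(f"paired-end: {paired_pct:.2f}%  single-end: {single_pct:.2f}%")
print(f"mean BES length: {mean_length_bp:.0f} bp")
print(f"fraction of genome sampled: {genome_pct:.2f}%")
```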

Analysis of repetitive DNA in BESs
Repeat sequences are usually a major component of eukaryotic genomes. To gain an initial insight into the composition of repeat elements in the UV-8b BESs, RepeatMasker was used to identify known repeat elements from existing databases. Known repeat sequences with a total length of 138,502 bp (4.57%) were identified in 1,273 (19.41%) reads, of which only one read was recognized as repeat sequence over its entire length. In terms of repeat category, retroelements were dominant and represented 3.07% of the total BES length, of which the LTR elements Ty1/Copia and Gypsy/DIRS1 accounted for 2.05% and 1.01%, respectively, while the LINE elements accounted for only 0.01% of the total BES length. Small RNA and simple repeats accounted for 0.04% and 0.95% of the total BES length, respectively (Table 3). Interestingly, few DNA transposons were identified in the BESs, in contrast to retroelements. RepeatScout was used to scan the UV-8b BESs de novo for repeat sequences with the criteria described in the methods. A cumulative 682,351 bp (22.51%) were marked as repeat sequences by this pipeline, and were contained in 2,642 (40.27%) reads. Among these reads, 163 (2.48%) were marked as repeat sequence over their entire length. Of the reads in this result, 1,384 were not contained in the RepeatMasker result, and 15 reads in the RepeatMasker result were not contained in this result. After repeat masking, the BESs were self-BLASTed as described in the methods, and no read showed more than three matches to others, indicating the high sensitivity of the RepeatScout pipeline.
Table 2 footnotes: (1) Each fingerprint band was estimated to be 1.13 kb, based on the average insert size of 140 kb and an average of 124 bands per clone. (2) The percentage of Q clones relative to the total of 2,290 clones. (3) The percentage of singleton clones relative to the total of 2,290 clones. (4) The cutoff value of 10^-15 was chosen; contig101 was then end-merged to contig76, and contig193 was end-merged to contig1, by the "End to End" function at cutoff 10^-12.

Comparative analysis of UV-8b with other fungal pathogens through BESs
For functional genomics comparison and evolutionary studies of the UV-8b genome, the following 10 well-characterized fungal pathogen genomes were chosen: Magnaporthe oryzae, Botrytis cinerea, Puccinia spp., Fusarium graminearum, Fusarium oxysporum, Blumeria graminis, Mycosphaerella graminicola, Colletotrichum spp., Ustilago maydis and Melampsora lini. These were voted the most scientifically and economically important fungal pathogens by plant mycologists [22]. Four other fungi, Metarhizium anisopliae, Trichoderma reesei, Nectria haematococca and Cordyceps militaris, were also chosen as related species for this study, because they are evolutionarily close to V. virens and their whole genome sequences were available.
To identify regions of microsynteny between UV-8b and the above genomes, the repeat-masked UV-8b BESs were used in BLAST analysis against the above genome sequences. As shown in Table 4, 0.18-8.34% of the masked BESs matched the top 10 plant fungal pathogen genomes. Among the ascomycete pathogens, F. oxysporum (8.34%) and F. graminearum (8.19%) showed the most hits, followed by C. graminicola (8.13%), and B. graminis (1.07%) showed the fewest hits. Among the basidiomycete pathogens, U. maydis (0.95%) has the smallest genome size but the highest number of hits; P. graminis (0.18%) and Melampsora laricis (0.24%) showed fewer hits. Among the masked BESs, 18.81%, 11.88%, 10.50% and 10.35% matched M. anisopliae, T. reesei, N. haematococca and C. militaris, respectively, and 326 masked BESs matched all four of those species (Figure 3). The similarity results were generally in agreement with the 18S rRNA gene analysis.
Among the BLAST hits, if paired-ends hit to target genomes with the criteria described in [9], the regions were considered to be collinear between UV-8b and the target genomes. The results (Table 4) showed that F. oxysporum and F. graminearum have more collinear regions than the others in the top 10 pathogen genomes. In the four related genomes, an insect fungal pathogen M. anisopliae had the most collinear regions. The higher degree of synteny between UV-8b and M. anisopliae was consistent with the results of the species distribution in the gene annotation step. Since most of the target genomes were not assembled completely (Table 4), the numbers of paired-end BESs potentially collinear with target genomes could be higher than detected.
In order to detect large syntenic regions, we used the SyMAP [23] program based on the BESs embedded in the contigs to anchor UV-8b PhaseIA contigs to the genomes of M. anisopliae, T. reesei, N. haematococca and C. militaris. Under the SyMAP default criteria, M. anisopliae had most anchored contigs, followed by T. reesei (Table 5), consistent with the comparative analysis results mentioned above (Table 4). Figure 4 shows an example of the graphical representation of the collinear regions.

Analysis of simple sequence repeats (SSRs)
SSRs are potential genetic markers due to their high rate of polymorphisms. To investigate the SSR contents and their distribution in UV-8b BESs, we scanned the BES dataset with SciRoko3.4 [24]. First, the CAP3 [25] program was used to reduce the redundancy of BESs; it clustered 1,821 BESs into 803 contigs and left 4,739 reads as singletons. The total length of these reads was 2,719,880 bp. Among these random genome sequences, a total of 1,023 SSR loci were identified from 849 reads with the criterion described in the methods. The SSRs had an average length of 25.17 bp, an average standard deviation of 10.55 bp and a density of 376.12/Mb.
To compare the SSR contents and distribution patterns, the GSS sequences of B. graminis, Fusarium virguliforme, M. oryzae and T. reesei were downloaded from NCBI and scanned for SSRs using the same parameters for the CAP3 and SciRoKo programs. The SSR contents and distribution patterns varied markedly (Figure 5). The SSR densities of the above species were 96.27/Mb, 107.71/Mb, 173.60/Mb and 234.06/Mb, respectively, in contrast to 376.12/Mb in UV-8b. The result indicated that UV-8b and T. reesei, which were closest in phylogenetic distance among these species, had the highest SSR densities. The frequencies of the SSR types also differed among the above species. Mononucleotide types were dominant in UV-8b and M. oryzae, whereas the trinucleotide type was most common in T. reesei. Interestingly, dinucleotides had the lowest frequency in all of the species. As for the frequencies of individual SSR motifs, A was the most common motif in UV-8b, M. oryzae and B. graminis, AG was most common in T. reesei, and AGC was most common in F. virguliforme.

Gene annotation
Before gene annotation, the repeat-masked BESs were pre-processed by the CAP3 [25] program to reduce sequence redundancy. A total of 640 contigs were formed by the CAP3 program and 5,215 reads were left as singletons. The cumulative length of the processed sequences was 2,797,772 bp. An additional 876 (398,742 bp) reads, whose effective lengths were shorter than 100 bp, were removed to improve the result accuracy. The final 4,979 (contigs + singletons) reads, whose total length was 2,399,048 bp, were compared with the EST and NR databases of NCBI to identify coding regions. Figure 6 shows the target species distribution in the NR database. M. anisopliae and M. acridum had the most BLASTX hits. A total of 928 unique GO terms were assigned to 1,324 reads, and each read was associated with 3.37 GO numbers on average. The genes showed a wide range of functional categories (Figure 7; Additional file 5: Table S2). The binding and catalytic activities were most abundant in the molecular function category, whereas the cellular and metabolic processes were most common in the biological process category. A total of 971 reads matched entries in the InterProScan database, providing a reliable dataset for understanding gene function. In addition, 171 unique EC (Enzyme Code) annotations were assigned to 387 reads, and 74 pathways in which these enzymes participate were identified by the KEGG map module of BLAST2GO [27], such as the tricarboxylic acid cycle (TCA cycle). Six enzymes (from 7 reads) of the 171 unique EC annotations were involved in the TCA cycle.
Table 4 footnotes: (1) Percentage relative to the total number of masked BESs (6,560). (2) The two end sequences of the same clone hit the "+" and "-" strands, respectively. (3) Puccinia graminis, Ustilago maydis and Melampsora laricis are basidiomycetes; the other 7 of the top 10 pathogens are ascomycetes. (4) The cumulative length of scaffolds or contigs longer than 300 kb, and its percentage relative to the total length of assembled sequences.

Discussion
Rice is the staple food of more than 50% of people worldwide, and food shortage is becoming more and more severe as the human population expands [28,29]. Rice false smut caused by V. virens has emerged as a devastating disease in rice, and the ustiloxin produced by the pathogen is toxic to humans and livestock [6]. However, little is known about this fungal pathogen to date. In this study, we constructed the first BAC-based physical map and generated a large set of BESs for V. virens. These resources will serve as fundamental tools for molecular, genetic and genomic studies of this pathogen.
Due to the lack of reference sequences and effective molecular markers, the contigs could not be edited. We used PCR with primers derived from the masked BESs to evaluate contig quality. Of a total of 76 clones analyzed, only one was not verified by PCR, indicating that the contig assembly is reliable. The control samples and the use of several primer pairs per contig helped to discriminate false-positive PCR bands.
Transposable elements (TEs) contribute largely to the evolution of fungal genomes [30,31]. In UV-8b, we found that the known repeat elements represented 4.57% of the total BES length and were mainly LTR elements. Few DNA transposons were identified. This may be because retroelements are more abundant than DNA transposons in the V. virens repeat family, or because the DNA transposons of V. virens are less homologous to the repetitive sequences available in the Fungi sub-database of RepeatMasker. In the M. oryzae genome, retroelements were also more common than DNA transposons [32,33].
Table 5 footnotes: (1) The syntenic regions identified by SyMAP. (2) The distance between the two hits of one clone in the reference is within 500 kb. (3) The CB percentage of the anchored contigs relative to the total contigs.
By searching de novo for repetitive sequences in the UV-8b BES dataset with RepeatScout, a total of 682,351 bp (22.51%) of sequence, distributed in 2,642 reads (40.27%), was marked as repeat sequence. In our results, the core repetitive sequences identified by RepeatScout occurred in 4 to 156 BESs, whereas fragments with lower occurrence may be false positives or low-copy repetitive sequences. The percentage of repeat sequences decreased from 22.51% to 16.25% when the threshold for hits in the BESs was set to >5 instead of >3 times (note that the BESs accounted for only 8.54% of the genome sequence). By comparison, 10.3% of the genome sequence was identified as repetitive in the M. oryzae p131 assembly [32].
A total of 1,384 reads identified by RepeatScout were not identified by RepeatMasker, suggesting that they contain new repeat elements that have not yet been collected in the database. Fifteen reads containing known repeat elements were identified by RepeatMasker but not by RepeatScout. It is possible that these elements have high copy numbers in other fungi but low copy numbers in the UV-8b genome, or that they were under-represented in the BESs.
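The comparison between the two repeat annotations is a set comparison on read identifiers. The following Python sketch shows one way such a comparison could be made and checks that the reported counts are mutually consistent; the read names in the toy example are hypothetical, and this is not the script used in the study.

```python
def compare_read_sets(repeatscout_reads, repeatmasker_reads):
    """Count reads flagged by both tools and by each tool alone."""
    rs = set(repeatscout_reads)
    rm = set(repeatmasker_reads)
    return {
        "shared": len(rs & rm),
        "repeatscout_only": len(rs - rm),
        "repeatmasker_only": len(rm - rs),
    }


# Toy example with hypothetical read identifiers.
toy = compare_read_sets({"r1", "r2", "r3"}, {"r3", "r4"})

# Consistency check on the counts reported in the text: the number of
# shared reads must be the same when derived from either tool's total.
REPEATSCOUT_READS = 2_642
REPEATMASKER_READS = 1_273
REPEATSCOUT_ONLY = 1_384
REPEATMASKER_ONLY = 15

shared_from_repeatscout = REPEATSCOUT_READS - REPEATSCOUT_ONLY
shared_from_repeatmasker = REPEATMASKER_READS - REPEATMASKER_ONLY
print(toy, shared_from_repeatscout, shared_from_repeatmasker)
```

Both derivations give the same number of shared reads, so the four reported counts agree with one another.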
Bischoff et al. analyzed the phylogenetic placement of Villosiclavae and claimed that it is related to, but distinct from, the Clavicipitaceae and Hypocreaceae clades [34]. This is in agreement with our phylogenetic analysis, in which most of the strains closely related to V. virens UV-8b belong to Clavicipitaceae and Hypocreaceae. In both genome comparison (Figure 3; Table 4) and gene annotation (Figure 6), the BLAST hit distributions were also consistent with the phylogenetic tree, except for T. reesei. This exception could be because the genome sequences used for genome comparison were draft sequences and fewer genomic resources of T. reesei had been deposited in GenBank for gene annotation.
To date, little is known about the collinearity of chromosome segments among filamentous ascomycete fungi [33] compared with plant and animal genomes. Knowledge of synteny relationships can facilitate studies of genome evolution and dynamics, comparative genomics and phylogeny [35,36]. We compared the repeat-masked BESs with the genomes of the top 10 fungal pathogens to search for microsyntenic regions. The results showed that F. oxysporum and F. graminearum had the most hits. The hosts of F. oxysporum range from arthropods [37] to humans and also include gymnosperm and angiosperm plants [38], whereas F. graminearum is notorious for causing Fusarium head blight. However, M. oryzae, a well-known pathogen of rice, showed few syntenic regions.
Alignments of contigs and BAC clones to target or reference genomes through BESs have been widely used to detect phylogenetic relationships and large structural genomic variations between species, such as expansion, contraction, inversion and rearrangement in plants [12,39]. These alignments can also assist in sequence assembly and detect assembly errors in the genome sequence of the same species [13]. In this study, the numbers of UV-8b contigs aligned to the target genomes were not high. This was most probably due to both the high diversity among fungal genomes and the incompleteness of the target genome sequences.
The SSRs play an important role in genetic diversity analysis and genetic map construction due to their high level of polymorphism, co-dominance and robustness [40]. Before substantial genome sequence becomes available, BESs, as random genome survey sequences, are an important resource for mining SSR markers. We found 1,023 SSRs with an average length of 25.17 bp and a density of 376.12/Mb, and primers have been designed successfully for 836 of these loci. These primers are candidates for genetic analysis by PCR. Interestingly, UV-8b has an SSR content and distribution pattern similar to that of M. oryzae but not to that of the closely related T. reesei.
Table 6: The SSR frequency and distribution in the UV-8b BESs. Footnotes: (1) Type (motif number): all similar and reverse-complementary SSR motifs were grouped together, e.g. "TC", "CT", "AG" and "GA" were grouped into "AG". (2) The number of sequence reads that contain SSRs. (3) The number of SSR loci for which primers were successfully designed.

Conclusions
We constructed the first-generation BAC-based physical map of V. virens and acquired 3,030,658 bp of BAC end sequences, representing 8.54% of the genome. The BAC library was equivalent to 22.7X genome coverage with an average insert size of 140 kb. A total of 2,035 BAC clones were assembled into 194 contigs and 255 clones were left as singletons. The BAC library and physical map provide tools for positional cloning, comparative genomics and whole genome sequencing of V. virens. In addition, the BAC end sequence analysis provides a glimpse into the V. virens genome composition, such as 51.52% GC content, 22.51% repetitive sequences, 376.12/Mb SSR density and approximately 36.01% coding regions.
We believe that all of this information is valuable for expediting genomic and genetic research on the important rice false smut fungus.

The 18S rRNA gene identification and phylogenetic analysis
The 18S rRNA gene of UV-8b was amplified by PCR using the common primers NS1 and NS8 [41]. The sequence was compared with those in the NCBI GenBank database using the BLASTN search tool. The sequences were edited using ClustalX 1.83 software [42]. The phylogenetic tree was constructed with MEGA5 [43] using the neighbor-joining (NJ) algorithm and tested with 1,000 bootstrap replicates.

High molecular weight genomic DNA preparation
The V. virens strain UV-8b was subcultured on PSA medium (1 L: 200 g peeled potato, 20 g sucrose and 15 g agar; natural pH) at 28°C for 5 days. The fresh mycelium was harvested and transferred onto new PSA plates covered with a layer of cellophane to propagate a sufficient amount of fresh mycelium. The fresh mycelium was collected, ground and cultured in liquid complete medium [44,45] at 28°C and 180 rpm for 65 h. The culture was filtered through 2-4 layers of cheesecloth. The collected mycelium pellet was washed twice with sterile ddH2O and twice with 0.7 M NaCl, and then incubated in 0.7 M NaCl solution containing 8 mg/ml Driselase (SIGMA D9515) at 31°C and 100 rpm for 3 h to release protoplasts. The protoplast-containing mixture was filtered twice through one layer of Miracloth. The protoplast-containing solution was centrifuged at 1,500 g for 15 min. The pellet was washed three times with 0.7 M NaCl and once with 1.2 M sorbitol, and then resuspended in a minimal volume (usually ~1 ml) of 1.2 M sorbitol, as a compromise between obtaining a high DNA concentration and obtaining enough DNA plugs for at least one attempt at BAC library construction (it is difficult to obtain a high DNA concentration from rice fungi). The protoplast suspension was mixed with an equal volume of 1% low melting point (LMP) agarose (prepared with 1.2 M sorbitol) at 45°C and then transferred into plug molds (Bio-Rad) to form plugs. The plugs were treated following our published protocol [46].

BAC library construction
BAC library construction was performed as previously described [46-48]. The linearized, dephosphorylated low-copy BAC vector pIndigoBAC536-S was prepared with HindIII from a high-copy composite vector pHZAU-BAC1 as previously described in Shi et al. [49]. Individual BAC clones were arrayed in 384-well plates and stored at −80°C in our laboratory. The insert size of the BAC library was estimated by digesting random BAC clones with I-SceI and analyzing the digested products [...]
[...] [51] was used for contig assemblies. The FPC parameters were adjusted as described [50,52]. A series of cutoff and tolerance values were tested to obtain an optimal assembly, following the principle of decreasing the number of contigs without excessively increasing the number of questionable clones.
After each round, when more than 5 Q clones existed in a contig, the "DQer" function was used to break up the Q contig with a step value of 2. Finally, the tolerance value was set to 4 and the Sulston cutoff value was set to 10^-15. At the end, the contigs were improved using the "End to End" automerge function.
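To illustrate how the tolerance and cutoff interact, the sketch below computes the Sulston score in the form in which it is commonly described for FPC: the probability that two clones share at least a given number of bands purely by chance. Two clones are considered overlapping when this probability falls below the cutoff. The gel length used in the example is a hypothetical value chosen for illustration, not a parameter reported in this study, and the function is a sketch of the published formula rather than FPC's own code.

```python
from math import comb


def sulston_score(n_low: int, n_high: int, shared: int,
                  tolerance: float, gel_length: float) -> float:
    """Probability that at least `shared` of the `n_low` bands of one clone
    match bands of a clone with `n_high` bands by coincidence."""
    if not 0 <= shared <= n_low <= n_high:
        raise ValueError("require 0 <= shared <= n_low <= n_high")
    # Chance that one band lies within the tolerance window of one other band.
    b = 2.0 * tolerance / gel_length
    # Chance that one band matches none of the n_high bands of the other clone.
    p_no_match = (1.0 - b) ** n_high
    p_match = 1.0 - p_no_match
    # Binomial upper tail: `shared` or more coincidental matches out of n_low.
    return sum(
        comb(n_low, m) * p_match ** m * p_no_match ** (n_low - m)
        for m in range(shared, n_low + 1)
    )


TOLERANCE = 4          # value used in this study
CUTOFF = 1e-15         # value used in this study
GEL_LENGTH = 18_000    # hypothetical, for illustration only

few_shared = sulston_score(124, 124, 10, TOLERANCE, GEL_LENGTH)
many_shared = sulston_score(124, 124, 60, TOLERANCE, GEL_LENGTH)
print(few_shared < CUTOFF, many_shared < CUTOFF)
```

A stricter (smaller) cutoff requires more shared bands before two clones are joined, which tends to yield more but more reliable contigs; this is the trade-off explored in the parameter tests described above.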
The primers used to evaluate the quality of the contigs generated at cutoff 10^-15 were designed from masked BESs by primer5, with the exception of contig184, whose 3 pairs of primers were SSR primers generated in the SSR analysis. The conditions of the bacterial liquid PCR reaction were 94°C for 5 min for initial denaturation, followed by 35 cycles of denaturation at 94°C for 30 sec, annealing for 30 sec and extension at 72°C for 40 sec, and a final extension for 10 min. The annealing temperature was selected based on the Tm values of the primers. The PCR products were separated in 1.0% agarose gels, and the presence or absence of bands of the expected sizes was examined. The host cells, empty vector, and the clones U13J12 (contig6), U03H01 (contig133), U03J10 (contig53) and U03A03 (contig100) were randomly selected as control samples.

BAC end sequencing
BAC end sequencing was performed as previously described [53] with some modifications. BAC clones were sequenced at both ends on an ABI 3730 DNA Analyzer using Big-Dye v3.1 (Applied Biosystems, Foster City, California), following the manufacturer's instructions. The two primers BACf (5′aacgacggccagtgaattg3′) and BACr (5′gataacaatttcacacagg3′) were used as forward and reverse sequencing primers, respectively. Sequences were base-called using Phred [54], and the vector and low-quality (Phred value <16) sequences were removed using the program LUCY [55]. The reads less than 100 bp in length were removed. All the trimmed sequences were deposited in the GenBank database [GenBank:JY267549 to GenBank:JY274108].

Analysis of repetitive DNA in BESs
The known classes of repeat elements contained in the UV-8b BAC end sequences were identified by the RepeatMasker v3.3.0 pipeline (http://www.repeatmasker.org) from the Fungi subdatabase in RepBase17.07 [56]. The BAC end sequences were used to search for novel repeat elements with RepeatScout1.0.5 [57]. Only the sequences that were repeated > 3 times and were > 50 bp in length in the BES dataset were kept. Then, the remaining BES sequences were self-BLASTed to search for additional BESs that were repeated > 3 times and were > 50 bp in length [12,58].

Genome comparative analysis
The genome sequences of the fungal pathogens M. oryzae, B. cinerea, P. graminis, F. graminearum, F. oxysporum, C. graminicola and U. maydis were downloaded from the Broad Institute Database (http://www.broadinstitute.org). The genome sequences of the fungal pathogens B. graminis, M. graminicola, M. laricis, M. anisopliae, N. haematococca, T. reesei and C. militaris were downloaded from the NCBI database (http://www.ncbi.nlm.nih.gov). The repeat-masked BESs were compared against the above genome sequences using BLASTN with an E-value cutoff of 10^-5. Matches longer than 50 bp with more than 80% identity were collected and analyzed. The BESs were also used to anchor the corresponding contigs to the genome sequences of M. anisopliae, F. oxysporum, N. haematococca and T. reesei using the SyMAP V3.4 program [23] (http://www.agcol.arizona.edu/software/symap/).
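The hit-filtering step described above can be sketched as follows. The example assumes that the BLASTN results are available in the standard 12-column tabular format (as produced by BLAST+ with `-outfmt 6`); the output format actually used in the study is not stated, and the sequence identifiers in the example are made up.

```python
import csv
import io

MIN_LENGTH_BP = 50        # keep matches longer than 50 bp
MIN_IDENTITY_PCT = 80.0   # keep matches with more than 80% identity
MAX_EVALUE = 1e-5         # E-value cutoff


def filter_hits(tabular_text: str):
    """Return (query, subject, identity, length, evalue) for hits that pass.

    Assumed columns: qseqid sseqid pident length mismatch gapopen
    qstart qend sstart send evalue bitscore.
    """
    kept = []
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank and comment lines
        pident, length, evalue = float(row[2]), int(row[3]), float(row[10])
        if (length > MIN_LENGTH_BP and pident > MIN_IDENTITY_PCT
                and evalue <= MAX_EVALUE):
            kept.append((row[0], row[1], pident, length, evalue))
    return kept


def matched_query_percentage(hits, n_queries: int) -> float:
    """Percentage of query sequences with at least one retained hit."""
    return 100.0 * len({hit[0] for hit in hits}) / n_queries


# Hypothetical example records (identifiers and values are made up).
EXAMPLE = "\n".join([
    "bes_0001\tscaffold_1\t92.5\t310\t20\t1\t1\t310\t5000\t5309\t2e-120\t420",
    "bes_0002\tscaffold_7\t78.0\t200\t40\t2\t1\t200\t900\t1099\t1e-20\t150",
    "bes_0003\tscaffold_2\t95.0\t45\t2\t0\t1\t45\t100\t144\t1e-6\t60",
])

hits = filter_hits(EXAMPLE)
print(hits, matched_query_percentage(hits, n_queries=3))
```

In this toy input only the first record passes: the second fails the identity threshold and the third is too short.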

Analysis of simple sequence repeats
The BES sequences were clustered by the program CAP3 [25] with default parameters to reduce the redundancy of the dataset. The non-redundant sequences were scanned by SciRoKo3.4 [24] to search for potential SSRs, with the criteria of a minimum repeat number of 3 and a minimum total length of 15 bp. The full standardization of SciRoKo, which groups all similar and complementary SSR motifs together, was used for the SSR statistics, e.g. "TC", "CT", "AG" and "GA" were grouped into "AG". The genome sequences (GSS section) of B. graminis, F. virguliforme, M. oryzae and T. reesei were downloaded from NCBI (as of December 2012) and mined for SSRs with the same criteria for comparisons of the SSR contents and distribution patterns. The primers flanking SSRs were designed by standalone primer3 [26] and the DesignPrimers program in the SciRoKo3.4 package [24].
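The search criteria and the motif standardization can be illustrated with a small regular-expression scan. This is a simplified sketch and not the SciRoKo algorithm: it detects only perfect repeats made of whole motif copies, assumes motif sizes of 1-6 bp, and does not resolve overlaps between SSRs of different motif sizes. The final lines recompute the SSR density from the counts reported in the Results.

```python
import re

MIN_REPEATS = 3      # minimum repeat number (from the text)
MIN_TOTAL_BP = 15    # minimum total SSR length (from the text)
MAX_MOTIF_BP = 6     # assumption: motif sizes of 1-6 bp

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def standardize(motif: str) -> str:
    """Group a motif with its rotations and reverse complement,
    e.g. TC, CT, GA and AG all map to AG."""
    reverse_complement = motif.translate(_COMPLEMENT)[::-1]
    candidates = [m[i:] + m[:i]
                  for m in (motif, reverse_complement)
                  for i in range(len(m))]
    return min(candidates)


def _is_primitive(motif: str) -> bool:
    """False if the motif is itself a repeat of a shorter unit (e.g. AGAG)."""
    k = len(motif)
    return not any(k % d == 0 and motif == motif[:d] * (k // d)
                   for d in range(1, k))


def find_ssrs(seq: str):
    """Return sorted (start, length, standardized_motif) for perfect SSRs."""
    seq = seq.upper()
    found = []
    for k in range(1, MAX_MOTIF_BP + 1):
        # Enough whole copies to satisfy both criteria.
        min_rep = max(MIN_REPEATS, -(-MIN_TOTAL_BP // k))
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, min_rep - 1))
        for match in pattern.finditer(seq):
            if _is_primitive(match.group(1)):
                found.append((match.start(), match.end() - match.start(),
                              standardize(match.group(1))))
    return sorted(found)


def ssr_density_per_mb(n_ssrs: int, total_bp: int) -> float:
    """Number of SSR loci per megabase of scanned sequence."""
    return n_ssrs / (total_bp / 1_000_000)


example = find_ssrs("TTG" + "TC" * 8 + "GGA")
density = ssr_density_per_mb(1_023, 2_719_880)
print(example, round(density, 2))
```

Counting only whole motif copies means that, for example, a dinucleotide SSR must reach 16 bp to be reported, so this sketch may be slightly stricter than a tool that also counts partial copies.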