Ontology and diversity of transcript-associated microsatellites mined from a globe artichoke EST database

Background The globe artichoke (Cynara cardunculus var. scolymus L.) is a significant crop in the Mediterranean basin. Despite its commercial importance and its both dietary and pharmaceutical value, knowledge of its genetics and genomics remains scant. Microsatellite markers have become a key tool in genetic and genomic analysis, and we have exploited recently acquired EST (expressed sequence tag) sequence data (Composite Genome Project - CGP) to develop an extensive set of microsatellite markers. Results A unigene assembly was created from over 36,000 globe artichoke EST sequences, containing 6,621 contigs and 12,434 singletons. Over 12,000 of these unigenes were functionally assigned on the basis of homology with Arabidopsis thaliana reference proteins. A total of 4,219 perfect repeats, located within 3,308 unigenes was identified and the gene ontology (GO) analysis highlighted some GO term's enrichments among different classes of microsatellites with respect to their position. Sufficient flanking sequence was available to enable the design of primers to amplify 2,311 of these microsatellites, and a set of 300 was tested against a DNA panel derived from 28 C. cardunculus genotypes. Consistent amplification and polymorphism was obtained from 236 of these assays. Their polymorphic information content (PIC) ranged from 0.04 to 0.90 (mean 0.66). Between 176 and 198 of the assays were informative in at least one of the three available mapping populations. Conclusion EST-based microsatellites have provided a large set of de novo genetic markers, which show significant amounts of polymorphism both between and within the three taxa of C. cardunculus. They are thus well suited as assays for phylogenetic analysis, the construction of genetic maps, marker-assisted breeding, transcript mapping and other genomic applications in the species.


Background
The globe artichoke Cynara cardunculus is a member of the Asteraceae (Compositae) family, and originates from the Mediterranean basin [1]. The species is subdivided into three taxa -the globe artichoke (var. scolymus L.), the cultivated cardoon (var. altilis DC), and their progenitor, the wild cardoon [var. sylvestris (Lamk) Fiori]. The edible part of the globe artichoke plant is provided by its immature inflorescence, referred as a capitulum or head [2], and represents a significant component of the Mediterranean diet. The cultivated cardoon is grown for its fleshy stems, which are used in traditional cuisine. Leaf extracts of the species contain a number of bioactive compounds (e.g., quercetin, rutin, luteolin, gallic acid, di-caffeoylchinic acid, and sesquiterpene lactones) which have been shown to have anti-oxidative and anti-carcinogenic activity, to inhibit cholesterol biosynthesis, and to enhance lipid metabolism [3][4][5][6][7][8]. The antioxidant content per serving of globe artichoke ranks very highly among vegetables [9], while the roots provide a source of inulin, a proven enhancer of human intestinal flora [10,11]. In spite of its economic importance, however, little breeding effort has been applied to date in the globe artichoke.
Progress has been made in the development of DNA marker based genetic maps in globe artichoke [12][13][14]. The most saturated map has been recently developed from a set of F 1 progeny of a cross between a globe artichoke and a cultivated cardoon genotypes [14]. This map consisted of 20 linkage groups comprising 326 loci and spanned ~1500 cM with a mean inter-marker distance of 4.5 cM. A set of loci common to this map and a previously developed one [12] allowed for map alignment and the definition of 17 homologous linkage groups, corresponding to the haploid chromosome number of the species.
It was long assumed that SSRs were primarily associated with non-coding DNA, but it has now become clear that they are more frequent in transcribed than non-transcribed sequences and equally frequent in the transcriptomes of plants with dramatically different nuclear DNA contents [15]. EST databases therefore represent a convenient resource for the identification of microsatellites, some of which, as a result of their presence within coding DNA, have the potential to deliver informative within gene markers, exploitable as COS (conserved orthologous set) for genomic comparative analysis.
Here, we report: i) the unigene assembly based on the globe artichoke EST database deposited in GenBank by the Compositae Genome Project (CGP), ii) the identification of a wide set of EST-based microsatellite markers and iii) the evaluation of the informativeness of a subset of these markers using a panel of C. cardunculus genotypes. Furthermore, we performed a comprehensive functional annotation, inferred from sequence alignment (ISA), as well as a gene ontology categorisation inferred from sequence orthology (ISO) of the SSR-containing unigenes. At last we assessed whether motif type and relative position within CDSs (Coding DNA Sequences)/UTRs (Untraslated Regions) may be preferentially associated with a particular gene ontology term.

Results and Discussion
EST microsatellite discovery, frequency and primer design Globe artichoke ESTs were trimmed, assembled, and annotated using a customized bioinformatic pipeline ( Figure 1) into 19,055 unigenes (6,621 contigs and 12,434 singletons) spanning 15 Mbp. The transcript assembly and unigene consensus sequences are supplied as electronic supplementary materials (See Additional file 1, 2, 3: EST assembly, 19,055 unigenes, ACE assembly file). The unigenes had a mean length of 786 ± 1.7 bp, with a mean GC content of 43.9 ± 0.03% (range 24.7 -67.9%, Figure 2) and a mean ambiguity code ratio of 0.51 ± 0.01. Within the unigene set, 3,308 contained 4,219 uninterrupted tracts of (perfect) di-, tri-, tetra-, penta-, and hexa-nucleotide repeats, delivering a mean microsatellite density of one per 3.6 kb. Comparable microsatellite frequencies and densities have been discovered in the transcriptomes of other Compositae species [16][17][18]. Only perfect repeats were selected, as these appear to be the more prone to strand slippage and, consequently, tend to The schema used for EST assembly, annotation, primer design and amplicon screening Figure 1 The schema used for EST assembly, annotation, primer design and amplicon screening. be more polymorphic than interrupted ones [19]. Sufficient flanking sequence (in terms of both length and read quality) was present in 1,974 of the unigenes, containing 2,311 perfect microsatellites. The resulting PCR primers designed for these loci are given in Additional File 4 (primer pairs designed).

Allelic diversity with the EST microsatellites
A subset of 300 microsatellites (ranging in length from 10 to 84 bp, and representative of a broad spectrum of the repeat types) was surveyed for their informativeness. The targeted amplicon length ranged from 98 to 456 bp and the set was selected to optimize the possibility of multiplexing on the capillary genotyping platform employed. The test germplasm panel consisted of twelve genotypes of globe artichoke, nine of cultivated cardoon, and seven of wild cardoon (Table 1). In all, 238 (79.3%) of the assays were successful; of these, 236 were informative among the taxa, while 215, 216 and 223 were polymorphic within, respectively, globe artichoke, cultivated cardoon and wild cardoon ( Table 2 and Additional file 5: full statistics on 300 SSR-containing loci). A total of 1,546 alleles was generated from the 238 successful assays, giving a mean of 3.8 (range 1-15) alleles per locus. The largest range in amplicon length observed was 196 -252 bp, observed for a TCA n microsatellite (CyEM-171). In 85% of the loci, the assay generated the predicted length of amplicon, in 12.2% the amplicon was longer than expected, and in 2.8% it was shorter. The allelic range (in terms of amplicon length) was greater for the wild cardoon (17.1 ± 1.0 bp) than for globe artichoke (13.6 ± 0.8 bp) or cultivated cardoon (13.7 ± 0.7 bp).
Allelic diversity was, as expected given the breeding history of these taxa [2,20,21], greater in the wild than in the two cultivated forms ( Figure 3A). The frequency of taxonspecific alleles was two fold more in the wild cardoon, and the polymorphic information content (PIC) was higher in the wild cardoon (0.576 ± 0.015) than in either the globe artichoke (0.484 ± 0.013) or the cultivated cardoon Output of the EST assembly (0.466 ± 0.015). The observed heterozygosity level (Ho) was significantly less in the cultivated cardoon than in globe artichoke, presumably because the globe artichoke is primarily a vegetatively propagated plant, and thus able to maintain a high level of heterozygosity over time [20,21]; whereas cultivated cardoon is seed propagated and has been subjected to purifying selection aimed at increasing genetic uniformity for stabilizing its production. We previously developed three mapping population for the development of C. cardunculus genetic maps by crossing one globe artichoke non spiny genotype (common female parent) with a spiny genotype of globe artichoke or cultivated cardoon or wild cardoon. When the parents of the three mapping populations were tested with the set of microsatellites, 214 were informative in at least one combination, while 153 across all the three combinations, thus supplying landmarks for comparative mapping of phenotypic and quantitative trait loci (QTLs). As expected, the most polymorphic combination (198 microsatellites) was the one involving the cross between the most genetically divergent taxa: globe artichoke and wild cardoon ( Figure 3B).

Diversity analysis
The informativeness of the newly developed EST microsatellites was comparable with that described for microsatellite markers in globe artichoke [13,22], sunflower [23,24] and lettuce [25]. A set of five EST microsatellites was sufficient to discriminate between each of the 28 members of the germplasm panel (e.g. CyEM-10, -37, -54, -105, -254). The inferred genetic relationships were in good concordance with those derived from AFLP profiling [20,21,26,27]. Thus, the globe artichoke accessions clustered with one another ( Figure 4A, cluster A), but two subclusters, corresponding to the contrasting capitulum types (i.e. non spiny vs. violet, spiny types), could be recognised. The clade most closely related to the globe artichokes contained the cultivated cardoons ( Figure 4A, cluster B), and among these, the most well differentiated accession was 'Bianco Pieno Migliorato', as previously observed [21]. The Spanish cultivated cardoon accessions were genetically very similar to one another. The wild cardoon accessions formed a discrete, but rather loose group ( Figure 4A, cluster C). A principal co-ordinate analysis further illustrated the genetic relationships between members of the germplasm panel ( Figure 4B). Axes 1 and 2 accounted for ~ 73% of the genetic variation, the former contributing ~ 47%, and the latter ~ 26%. Axis 1 distinguished the globe artichokes from the cultivated and wild cardoons, while Axis 2 separated the two cardoon taxa. As expected, F 1 progenies mapped to intermediate positions with respect to those of their parents ( Figure 4B).

Distribution of microsatellite
Of the 4,219 microsatellites, trinucleotide motifs accounted for 49%, dinucleotides for 33%, hexanucleotides for 13%, tetranucleotides for 3% and pentanucleotides for 2% ( Figure 5). Only ESTs (2,498 of the 3,308) having a non-ambiguous ortholog in Arabidopsis thaliana were taken forward for the purpose of annotation. The position of the microsatellite tract (5'-UTR or 3'-UTR or CDS) was derived from the BlastX result in conjunction with the ORF (Open Reading Frame)-Predictor algorithm [28]. About 33% of the microsatellites were present in the 5'-UTRs, ~ 20% in the 3'-UTRs and ~ 47% in the CDSs ( Figure 5), similar to the relative frequencies in both the A. thaliana and rice genomes [15]. Most of the CDS microsatellites consisted of trinucleotides, while dinucleotides were the most abundant in the 5'-UTRs, and di-and trinucleotides were equally represented in the 3'-UTRs. Tetra-and pentanucleotide motifs were more frequent in the 3'-UTRs than in either the CDSs or the 5'-UTRs (data not shown). Trinucleotide (and hexanucleotide) motifs are expected to predominate in the population of CDS microsatellites, as variation in their repeat number is not  [20,21,26], respectively.     associated with a frame shift event [29]. The most abundant dinucleotide repeat was AG/CT, followed by AC/GT, although AT/TA predominated in the 3'-UTRs. Among the trinucleotides, the most frequent was AAG/CTT, followed by ATC/GAT and CAC/GTG ( Figure 6). This distribution is consistent with the situation in A. thaliana and Brassica spp. orthologs, in which a preference for AG/CT and AAG/ CTT motifs has been identified in the 5'-UTRs, thought to be associated with the cis-acting regulation of transcription [30]. In the globe artichoke 5' UTRs, dinucleotide motifs were over-represented, with AG/CT being the most abundant ( Figure 6), similar to the situation in the 5'-UTRs of many plant (both mono-and dicotyledonous species) genes [31,32], which has been reported to play a role in post-transcriptional gene regulation at the RNA level [33,34]. Dinucleotide motifs were also frequent in the 3'-UTRs, possibly because AT-rich elements are able to act as cis mediators of mRNA turnover [33]. Overall, present data confirm that homopurine/homopyrimidine repeats contribute markedly in 5'-UTR and CDS, as previously reported by Morgante et al [15].

The function of genes containing microsatellites
Microsatellites within coding sequences can have a major effect on gene activity, since the expansion/contraction of triplets within the coding sequence alters the gene product, thereby sometimes causing a significant phenotypic change. In humans, the effects on phenotype due to the presence of SSRs in coding regions of genes playing key roles in neuronal disorders and cancer have been extensively studied [35]. Among the microsatellites in the globe artichoke transcriptome, the six most frequent amino acid stretches present in the CDS were poly-serine (94 unigenes), poly-aspartic acid (75 unigenes), poly-glutamic acid (57 unigenes), poly-lysine (46 unigenes), poly-glycine (45 unigenes) and poly-threonine (35 unigenes). It has been reported that particular amino acid repeats tend to be associated with specific classes of proteins [36]. Acidic and polar amino acid repeats have generally been associated with transcription factors and protein kinases, whereas serine repeats are common within membrane transporter proteins [37]. In the globe artichoke, poly-serine and poly-glycine stretches are particularly frequent in the CDS. Poly-serine linkers are common in eukaryotic genomes, and are thought to provide a flexible interdomain. They are frequently associated with modular proteins, and are involved in complex carbohydrate degradation [38] and the binding of proteins with extracellular matrix components, such as the laminin binding protein.
Poly-glycine (also poly-asparagine and poly-proline) microsatellites may provide a domain for DNA binding or protein-protein interactions, and has been found to be necessary for chloroplast envelope targeting. Polyglutamic and poly-aspartic acid tracts feature in many NLS (nuclear localisation signal) proteins [39], and it has been suggested that both basic karyophilic and acidic clusters can enhance their selective binding to transport machinery components [40]. Poly-glutamic acid stretches have also been implicated in transcription activation/de-activation [41][42][43][44][45], and microsatellite allelic variants of these genes have been identified as the genetic basis of a number of human diseases [46].  To support the occurrence of certain repeated motifs in the assembled unigenes we have exploited sequence alignment and gene ontology in order to annotate their functions and assess whether their motif type and position are preferentially associated with a particular gene ontology term.

Allelic diversity revealed by the set of EST microsatellite markers
In preparation, the set of globe artichoke unigenes was filtered to include only those with a BlastX E-value of < 1e -29 when matched with the A. thaliana reference protein set. In all, 12,783 queries satisfied this criterion (Additional file 6: 12,783 globe artichoke unigenes annotation). The A. thaliana gene accession numbers were used to catego-rize the unigenes using TAIR gene ontology (data not shown). The GoStat2 web interface was then used to identify gene ontology categories which were over-represented. By comparing either the set of microsatellitecontaining unigenes, or subsets of it (e.g.: genes including di-or trinucleotide motifs in their CDS or UTRs) with the complete set of annotated unigenes, it was possible to identify over-representation in gene ontology (GO) categories (Table 3). Microsatellites appeared to be over-represented in loci involved in certain biological processes and functions, while no significant association was found with GO cell components (data not shown).
Most of the unigenes containing trinucleotide motifs in their CDS were associated with nucleic acid metabolic processes (GO:0006139), transcription (GO:0006350) and the regulation of transcription (GO:0006355), consistent with the encoding by the GAT trinucleotide of aspartic acid, since stretches of this residue are characteristic of 'karyophilic' acidic clusters in NLS (nuclear localization signal) proteins. Similarly, the AAG/TTC motif, which occurred frequently in the CDS, encodes polyglutamate, which is thought to be involved in both protein-DNA complex assembly (GO:0065004) and heterocyclic metabolic processes (GO:0046483). Unigenes carrying dinucleotide motifs in their CDS were found to be specifically associated with the response to stimulus (GO:0050896). The AG/CT repeats in the CDSs were overrepresented among genes responding to stress (GO:0006950), involved in DNA repair (GO:0006281) and in nucleic acid binding (GO:0003676). This is con-   GO terms statistically enriched (showed in percentage) for specific SSRs subsets. Fisher's exact test was performed between each SSRs subset versus the whole unigenes categorization; only significant over-represented subset are reported (p < 0,01). Analysis is displayed referring to "biological process" and "molecular function" classifications; "cellular component" is not reported due to the absence of particular enriched subsets. sistent with the presence of domains involved in protein-RNA/protein-protein sticky interactions.

Position and motifs of EST microsatellites
The commonest microsatellite motifs occurring in 5'-UTR of unigenes were dinucleotide repeats (mostly AG/CT). These unigenes were associated with nucleic acid metabolism (GO:0006139), the regulation of gene expression (GO:0010468), transcription (GO:0006350) and the regulation of transcription (GO:0006355). AG/CT repeats were also over-represented in genes involved in the response to ABA (GO:0009737 and GO:0009738). Moreover, trans-acting elements (GO:0003700: "transcription factor activity"), which show an over-representation of trinucleotidic (ATC/GAT) in their CDSs, were also frequently enriched in their 5'UTRs by AG/CT motifs, suggesting a cascade of signal transmission. Trinucleotide motifs were not common in the 5'-UTRs, except in genes involved in the response to auxin stimulus (GO:0009733).

Conclusion
We have demonstrated here the utility of a set of de novo globe artichoke EST-based microsatellite markers for the definition of genetic diversity, phylogeny and genetic mapping. Since EST microsatellites lie within expressed sequences, they have the potential to represent perfect markers for genes underlying phenotypic variation. Most of these assays are fully transferable to other C. cardunculus taxa, providing anchor points for the integration of taxonspecific genetic maps. The functional annotation of these EST sequences increases their utility as a source of genebased markers for the study of synteny and other applications.

EST microsatellites discovery and primer design
A collection of 36,321 EST, generated from the 'Green Globe' variety of C. cardunculus var. scolymus, as part of the output of the Compositae Genome Project http://compg enomics.ucdavis.edu, was downloaded from the NCBI database http://www.ncbi.nlm.nih.gov. To generate a set of unique assemblies, the sequences were first trimmed to remove any remaining vector fragments and polyA tails, using the perl script SeqCleaner, and assembled adopting a second perl script, TGICL, employing the following parameters: p = 95 (identity percentage), l = 40 (minimum overlap length), v = 10 (maximum length of unmatched overhangs); the maximum mismatch overhang was set to 10 bp, since the sequences had already been purged of vector stretches and polyA tails. The two scripts are available at http://compbio.dfci.harvard.edu/ tgi/software. The unigene set was then searched for perfect microsatellite sequences using a modified SSRIT perl script [47], with the minimum number of dinucleotides set as five, of tri-, tetra-and penta-nucleotides set as four, and of hexanucleotides as three. A sample of 300 non redundant microsatellite-containing sequences, selected to include the longer microsatellite motifs, was taken forward for PCR screening. Primer design was carried out using BatchPrimer3 http://probes.pw.usda.gov/ batchprimer with an optimal GC content of 50%, a maximum melting temperature difference of 3°C, variable amplicon size (to allow multiplexing), and all other parameters set to default values. The de novo microsatellite markers were prefixed with 'CyEM' (Cynara Expressed Microsatellite) and numbered sequentially.

Plant materials and genomic DNA isolation
DNA was extracted from young leaves following a modified CTAB method [48]. The primers were used to amplify genomic DNA template extracted from a germplasm panel consisting of twelve globe artichoke genotypes, representative of crops grown in the Mediterranean Basin [20]; nine cultivated cardoon genotypes, representative of both the Spanish and Italian gene pools [21]; and seven wild cardoon genotypes sampled from both Sicily and Sardinia [26]. Full genotypes details are reported in Table 1. The set also included DNA of the four parents of three established mapping populations, i.e. two globe artichoke accessions ['Romanesco C3' (C3) and 'Spinoso di Palermo' (SP)], one cultivated cardoon ('Altilis 41') and one wild cardoon ('Creta 4'); furthermore six F1 individuals from each of the segregating populations (C3 × SP, C3 × Altilis 41 and C3 × Creta 4) were included in the analyses.
Weakly amplified reactions were re-run using 1.5 mM MgCl 2 and applying a final annealing temperature of 55°C. Amplicons were separated on an ABI3730 capillary DNA sequencer (Applied Biosystem Inc., Foster City, CA, USA). Internal ROX-labelled GS500 size standards were included in each capillary. Fragment data were analysed using GeneMapper v3.5 software (Applied Biosystems). The genotypic data were analysed using the GenAlex Excel package [49]. Genetic diversity was calculated separately for the globe artichoke, cultivated cardoon and wild cardoon genotypes on the basis of (1) the mean number of alleles observed per locus (n o ), (2) the effective number of alleles per locus (n e ) as predicted by 1/Σp i 2 where p i is the frequency of the i th allele at the locus, (3) the mean observed heterozygosity (H o ), and (4) the polymorphic information content (PIC), estimated following [49]. A co-phenetic distance matrix for co-dominant markers was generated as described by Smouse and Peakall [50] and used to construct a UPGMA-based dendrogram [51] by means of NTSYS software package v2.10 [52]. Principal co-ordinate analysis was based on the distance matrix, with data standardization provided by the GenAlex package.

Annotation of the unigene set
The unigene set was aligned by a BlastX [53] search against the A. thaliana reference proteins database (NCBI), applying an E-value threshold of e- 29 . The location within the gene sequence of the microsatellite (5'-UTR, CDS or 3'-UTR) was inferred from this alignment, while ORF-Predictor [28] was used to predict the position of the stop codon. The frequencies of peptide repeat tracts within the CDS were used to identify any over-representation of particular triplets. For this purpose, the unigenes were divided into ten subgroups on the basis of the identity of the trinucleotide motif present in the CDS. Each subgroup was then subjected to an analysis based on the ORF-Predictor algorithm, considering only the positive reading frames (+1, +2, +3), since the sequenced transcripts were originally directionally cloned. All the unigenes were assigned a function based on the GeneOntology tool http://www.arabidopsis.org, using the A. thaliana orthologs (from BlastX output) as input (using AGI codes from TAIR). Enrichment within the GO hierarchical levels by mean of different subset of unigenes was estimated using the GoStat2 interface http://gostat.wehi.edu.au/cgibin/goStat2.pl based on Fisher's exact test [54], adopting a threshold p-value of 0.01 and considering terms starting from the 3rd hierarchical level of the DAG (directed acyclic graph; Table 3).