EST assemblies of Lactuca spp.
ESTs for five Lactuca species (L. sativa, L. serriola, L. saligna, L. virosa and L. perennis) generated as part of the Compositae Genome Project (http://compgenomics.ucdavis.edu) were trimmed to phred scores of 15 and vector sequences were removed with custom Perl scripts (https://code.google.com/p/atgc-xyz/source/browse/#svn%2Ftrunk). Trimmed ESTs for each of the five species of Lactuca were assembled separately using CAP3 . To minimize assembly of paralogous sequences, stringent assembly parameters were used for CAP3 (95% identity, 100 nt minimum overlap). The assemblies were combined iteratively using L. sativa as a base and adding unique sequences from individual species in order of genetic distance from L. sativa.
Genomic DNA was extracted following the protocol described by Kozik  with minor modifications. Frozen leaf tissue from three-week old, greenhouse-grown plants (2.5 g fresh weight) was ground in liquid nitrogen. Fifteen ml of 2X extraction buffer (100 mM Tris–HCl pH 8.0, 1.4 M NaCl, 20 mM EDTA, 2% w/v CTAB, 10 μl/ml β-mercaptoethanol) was added to the tissue, mixed and incubated at 65°C. One volume of chloroform:isoamyl alcohol (24:1) (ChIA) was added and mixed, and the sample centrifuged at 3,500 rpm for 20 min twice. The aqueous phase was then transferred and 3 to 3.5 volumes of precipitation buffer (50 mM Tris–HCl pH 8.0, 10 mM EDTA, 1% w/v CTAB) were added. The sample was incubated overnight at room temperature to precipitate the DNA and then centrifuged at 3500 rpm for 15 min. The DNA pellet was washed with dH2O and centrifuged for 10 min; 5 ml of 1.5 M NaCl and 6 μl of 10 mg/ml RNaseA was added to the pellet and incubated at 37°C until completely re-suspended. A chloroform extraction was performed as above to remove RNaseA and any additional contaminants. The aqueous phase was collected and DNA was precipitated and washed with ethanol. Samples were centrifuged again for 10 min at 3,500 rpm. The pellet was allowed to dry then re-suspended in 100 μl ddH2O. The sample was diluted 1:100 and run on a 1% agarose gel with known amounts of uncut lambda DNA to estimate the concentration of the genomic DNA. The 2.5 g of starting material yielded approximately 200 μg of genomic DNA, which was sufficient for hybridization to six chips at 30 μg per chip.
Whole genome amplification
The GenomePlex® Complete WGA kit (WGA, Sigma-Aldrich Corp, St. Louis, MO, USA) was evaluated for whole genome amplification. The maximum amount of genomic DNA that could be accommodated by the WGA kit (20 ng) was amplified and then re-amplified using the GenomePlex WGA re-amplification kit following the manufacturer’s instructions. Ten parallel re-amplification reactions were required to obtain sufficient amounts of amplified DNA for hybridization. The samples were then purified using Qiaquick PCR purification columns (QIAGEN, Valencia, CA, USA), pooled and concentrated using a Millipore Centricon YM-10 (Millipore, Temecula, CA, USA). Concentrated samples were then re-suspended in 37 μl ddH2O and quantified using the ND-1000 (Nanodrop, Wilmington, DE, USA).
DNA preparation, labeling and hybridization
In a 50 μl reaction, fragmentation of DNA was achieved by incubation of 30 μg genomic DNA with 0.0015 U RQ1 DNase I (Promega, Madison, WI, USA) per μg of DNA in the presence of: 0.175 mg/ml BSA, 10 mM Tris-acetate pH 7.5, 10 mM magnesium acetate, and 50 mM potassium acetate. Reactions were carried out for 30 minutes at 37°C followed 15 min at 99°C. Fragmentation profiles were examined by electrophoresis of 1.5 μl of fragmented product on a 2% agarose gel alongside a 50 bp DNA ladder (Fermentas, Glen Burnie, MD, USA) (Figure 3). Fragmented DNAs ranging between 25 to 250 bp were considered acceptable for hybridization.
The 3’ termini were end-labeled by addition of a biotinylated oligonucleotide (Affymetrix 7.5 mM Labeling Reagent) using terminal deoxynucleotidyl transferase (TdT) (Promega, Madison, WI, USA) as per a modified Affymetrix Labeling of Fragmented Double-Stranded DNA instructions . The reaction was scaled up to accommodate 30 μg of fragmented DNA rather than the 7.5 μg fragmented cDNA. The BioPrime DNA Labeling System (Invitrogen, Carlsbad, CA, USA) was also tested for labeling DNA as per the manufacturer’s instructions.
Hybridizations were carried out following the protocol described in Affymetrix GeneChip® Whole Transcript Double-Stranded Target Assay (WTDSTA) Manual (pp. 51–52; Affymetrix 2005) with minor modification. Samples were comprised of 70 μl of fragmented and labeled DNA containing the elevated amount of DNA (30 μg total), the remaining components of the GeneChip® Hybridization Kit (Affymetrix, Santa Clara, CA, USA), adjusted to 1.1x standard volumes, and the 20x hybridization control from the Hybridization Control Kit (Affymetrix, Santa Clara, CA, USA) were added as per instructions for a total volume of 220 μl. After centrifugation, the microarray cassette was injected with 200 μl of the cocktail. Post hybridization, washing and staining were carried out as described in the Affymetrix GeneChip® WTDSTA Manual using program FS450_0001 (pp. 55–61 Affymetrix 2005). Hybridization intensities were measured using an Affymetrix GeneChip® Scanner 3000 with 7 G up-grade.
Data processing and management
The raw .CEL files generated by the GCOS software (Affymetrix, Santa Clara, CA, USA) were subjected to background correction by using the Robust Multiarray Analysis (RMA) method and quantile normalization  of the Affymetrix package  with R and Bioconductor Affymetrix package . A MySQL 5.0 database was developed to curate all the information regarding experiments, probe characteristics, hybridization data, and quality control data (https://pgfmars.ucdavis.edu/phpmyadmin). A web interface was designed using PHP scripts for retrieval of the data. All scripts of our “SPPscan Package for Lettuce GeneChip® Data Management, Processing and Analysis” and detailed descriptions can be obtained from http://chiplett.ucdavis.edu/.
SPPdev detection algorithm
Perl scripts were developed for SPP detection. For hybridization of genomic DNA, the reference hybridization values in the SFPdev formula should be all of the lettuce probes on the microarray with the same GC content rather than those within a unigene as utilized by West et al.
] for cDNA hybridizations. We therefore defined the SFPdev for each probe i
of each unigene from each chip g as:
) is the average value of all the high quality probes on the same hybridized lettuce chip with the same GC content as probe Y
. The high quality probes were defined as those probes whose signal intensities exceeded those of a defined percentile of anti-genomic probes. The threshold value selected for our analysis was 90% (See Results).
The overlapping 2 bp tiling path on the lettuce chip allowed us to utilize the average weighted signal from multiple probes covering individual positions in each unigene. Deviation in hybridization intensities was calculated using a 2 bp sliding window incorporating data from all high quality probes covering each position. For each 2 bp position, s
, in a unigene covered by at least one probe,
we calculated the single position polymorphism deviation (SPPdev) as:
is the weighting factor for the high quality probe k that covers position s in the unigene. The weighting factor takes into account the distance from the central nucleotide of each probe to the evaluated position (Figure 5). The weighting factor was determined empirically by analyzing the hybridization behavior of probes covering a training set of 1,000 SNPs obtained from the alignment of multiple ESTs from L. sativa cv. Salinas with L. serriola acc. US96UC23 (See Results). All scripts are available from http://chiplett.ucdavis.edu/.
Statistical evaluation of putative SPPs
For each evaluated position of a unigene, SPPdev values were obtained for each replicated microarray of the two genotypes (L. sativa
cv. Salinas and L. serriola
acc. US96UC23). To detect putative SPPs within each unigene, we used the test statistic, R
defined as the SPPdev ratio for position s
Where n is the number of replicated microarray chips for each genotype. If the value of R
was much larger or smaller than what was expected by chance, then the position s of the unigene under study was predicted to have a SPP. A p-value was then assigned to each R
according to the null distribution of R
which was estimated by permutation . The follo-wing inference procedure was carried out for each unigene:
1) Arrange rows for all the probes of the unigene and columns for all the replicates of the two genotypes.
2) For the b
th permutation, b = 1, 2,…B:
a) Permute the columns of the data matrix obtained in S1
for each position s, s = 1, 2,…m.
3) After carrying out the B permutations, sort all the
values obtained from Step 2 in increasing order. The permutation p-value for testing the null hypothesis (that no polymorphism exists at position s) is determined by:
the background-corrected and quantile-normalized hybridization data in a matrix with
5) Obtain a FDR-adjusted p-value , P
s for each ps.
6) A position s is claimed to be a putative SPP if P
s < 0.01.
Note that the null distribution of R
for each position s is assumed to be the same. Therefore, the null distribution of R
is estimated by pooling the permutation distributions of R
at all positions.
Alternative SPP detection algorithm (Modified SFP Algorithm; MSA)
An algorithm to identify SFPs in complex genomes was described by Borevitz et al.
] and implemented in R (http://www.r-project.org
). This algorithm was modified to interrogate every 2 bp position on the lettuce GeneChip® instead of individual probes. The MSA algorithm was implemented in Perl because R could not handle our large data set and to facilitate parallelization of analysis on a Linux computer cluster. In MSA, we calculated the D-stat
statistics at every 2 bp position as in Equation 1, instead of calculation of the D-stat
statistics for each probe as in the original algorithm. The FDR calculation is described in Borevitz et al
. (2003; http://naturalvariation.org/methods/
); however, quantile function of R which is used in FDR calculation was implemented in Perl. The MSA Perl scripts can be accessed through (http://chiplett.ucdavis.edu
Mean of parent A or B at a given 2 bp position is natural logarithm (L
) of weighted average intensity of all probes that interrogating a position,
is calculated for each position but a* constant is a constant value for all positions,
is a constant for all positions, which is the median of all S
values for the entire 2 bp positions on the chip (>10,000,000 positions).
RNA was isolated from multiple tissues and treatment conditions (http://cgpdb.ucdavis.edu/cgpdb2/data_info_files/CGP_Library_Construction_V02.html) and pooled in equimolar concentration for each of the two genotypes. cDNA was synthesized and prepared for sequencing following protocols as described . L. sativa cv. Salinas and L. serriola acc. US96UC23 were sheared and selected for fragments around 260 bp and 200 bp respectively. Fragments were normalized using duplex specific nuclease (DSN, Evrogen, Moscow, Russia) after denaturation and re-association for five hours to reduce high copy sequences digesting double stranded DNA. The L. sativa cv. Salinas library was sequenced in one direction while the L. serriola acc. US96UC23 library was sequenced as paired-ends. Sequences are available in the Sequence Read Archive at NCBI (Study numbers SRP004854 and SRP008310).
Using BWA  and SAMtools  to identify a set of high quality SNPs, heterozygous positions, and InDels in the IGA set. Custom scripts were used to identify false positive SPPs identified by each of the SFPdev and MSA methods by comparison to the IGA set, the SNPs mined from the EST assembly used to develop the array, and SPP markers mapped in the core mapping population . A SPP was considered a true positive if SNPs were within 8 bp of the SPP ranges. An 8 bp range on either side of detected SPPs was chosen to account for detection of SPPs with the overlapping 25 bp oligo design and empirically determined sensitivity of oligos (Figure 5). Additionally, if the SPP range in the pair-wise comparison overlapped the SPP range mapped within the core RIL population, it was considered real. SPPs with no IGA sequence were excluded as we were not able to identify if they were true or false SPPs.
In order to identify if the apparent SPP was present in multiple contigs, each SPP plus and minus 8 bases on either side of the SPP range was compared using BLAST to three separate sets of sequences: a lettuce whole genome assembly of shotgun Illumina reads (Michelmore et al., unpublished data), a gene space assembly (Kozik, unpublished data) and the EST assembly used for the array design (http://compgenomics.ucdavis.edu) using BLASTn . To maintain confidence in the BLAST hits there had to be a minimum of 95% of the subject in the query with no more than two mismatches. SPPs identified as duplicated loci were removed from the analysis. SPPs detected in the SFPdev pair-wise comparison method that had a SFPdev ratio greater than one were removed as we empirically determined a 79% false positive rate through our validation procedures.
Diversity panel analysis
As described in Truco et al. the RIL algorithm calculates the SPPdev value for each array and plots that value for each position. The distribution of data points for each position is then evaluated for a bimodal distribution while maximizing the number of individual arrays under each peak. Those that fall between the identified populations of the bimodal distribution, or are completely missing, were treated as a missing data point (−). Arrays with hybridization signals in the higher part of the distribution were designated the A allele while those with lower hybridization values (because they deviated from the probe sequence), were designated the B allele. Data points for each array were then summarized based for each individual or cultivar as follows. Marker calls based on three replicates were designated as A, B, C, D, I (inconsistent), or – (missing). An A marker call resulted from an A in all three replicates. A B marker call resulted from a B in all three replicates. C calls were not A and were missing in one of the three reps i.e. (B/B/-) and D was not B and missing one of the three reps i.e. (A/A/-). An I was any combination of the three chips that contained two missing and an A or B or, any combination that included an A and B call from one of the three reps. – was missing data from all three reps.
Utilizing the package PHYLIP-3.69  the SPP markers were treated as restriction site markers and scored as 1 or 0. The SeqBoot module was used to create 100 re-samples data sets for bootstrap calculation. The resulting 100 replicates were used to create distance matrices subsequently used for construction of phylogenetic trees with the Fitch module. Global rearrangement and randomized input order of species with 10 jumbles for each of the multiple data sets was used to construct 100 trees for use in building a consensus tree with bootstrap values. The Consense module was then used with species L. perennis used as an out-group root. The resulting consensus tree was then visualized with Mega4  and branches were rotated for legibility. See Additional file 9: Table S2.
Principal component analysis
For principal component analysis, markers within contigs (9,513) that were common to the L. sativa cv. Salinas by L. serriola acc. US96UC23 map  were converted to haplotype frequencies using custom Perl scripts. Frequencies were calculated as the number of times the haplotype occurs in the panel for a given contig divided by the total number of genotypes in the panel. A limitation on the number of markers to use with the SAS/STAT software required us to reduce our set of markers to 10,000 markers or less. Creating haplotypes for all markers within a contig allowed us to reduce the number of input data points to 8,381 while retaining more genetic information than a randomly selected marker for each mapped contig. The markers used for calculation of haplotype frequencies in the entire DP were then filtered to include those polymorphic within L. sativa lines only and haplotype frequencies were calculated for 3,311 contigs to resolve L. sativa varieties. Frequencies of haplotypes in the DP panel were used as input for principal component analysis using the SAS/STAT® software’s PRINCOMP procedure (Version 9.1.3, SAS Institute, Cary, NC). Eigenvalues for each of the first three principal components were calculated for each genotype and an analysis of variance (ANOVA) and means separation using Students t-test were performed using JMP (JMP, Version 7. SAS Institute Inc., Cary, NC, 1989–2007) to determine if species and/or classes were significantly different. The two data sets described above were used to differentiate lettuce species and lines. See Additional file 10: Table S3 and Additional file 11: Table S4.