BAC Clone Sequencing and Assembly
Loblolly pine BAC library Pt_7Ba (Clemson University Genomics Institute [CUGI], Clemson, SC) was screened with multiplexed 32P-labeled PCR amplicons from ten genes mapped to quantitative trait loci (QTL) associated with wood chemical traits in loblolly pine [48–50] (Additional file 6, Table S5). Hybridizations identified 256 positive clones, and a random 48-clone subset of the positives was obtained from CUGI. The BAC DNA was isolated with the Colony Fast-Screen™ Kit (Epicentre Biotechnologies, Madison, WI) and sized relative to BAC-Tracker™ Supercoiled DNA Ladder (Epicentre) using SYBR® Gold (Molecular Probes, Eugene, OR) and agarose gel electrophoresis. Subsequently, ten BACs were selected that showed single BAC insert bands and were different from each other. Glycerol stocks were sent to Beckman Coulter Genomics (Danvers, MA) for subclone library construction and sequencing.
For each BAC, a shotgun library was prepared from a single clone inoculated into 500 mL of LB with 12.5 μg/mL chloramphenicol. High molecular weight DNA was produced using the Qiagen (Valencia, CA) Large-Construct Kit. The DNA was randomly sheared using a GeneMachines HydroShear (Genomic Solutions, Ann Arbor, MI). The sheared DNA was end-repaired with the Epicentre End-It™ End-Repair Kit and size-selected for inserts from 2 to 4 kilobases to produce libraries with average insert sizes of 2 Kb, 3 Kb and 3.5 Kb. The insert DNA was ligated to pUC19 high copy plasmid vector (Fermentas, Glen Burnie, MD). The ligations were transformed into DH10B T1r E. coli cells (Invitrogen, Carlsbad, CA) and plated on LB agar with appropriate carbenicillin, X-gal and IPTG concentrations. Transformation mixes were quality controlled via enzyme digest and arrayed into 384-well plates containing LB freezing medium. Subclone DNA templates were sequenced in 384-well format, using BigDye® Version 3.1 reactions on ABI3730xl instruments (Applied Biosystems, Foster City, CA), with forward and reverse (paired-end) reactions run in the same plate to maximize the paired-end rate. Thermal cycling was performed using 384-well Thermocyclers (Applied Biosystems). Sequencing reactions were purified using Agencourt's CleanSeq® dye-terminator removal kit.
All reads were processed using PHRED base-calling software and continuously monitored against quality metrics based on PHRED Q20 [51, 52]. The quality scores for each run were monitored through Agencourt's Galaxy LIMS system. A read passed if it had an average PHRED quality score of 20 or higher for at least 100 bases. Typical average read lengths were 500-600 bp. The Arachne Whole Genome Assembler, coupled with Agencourt's LIMS system, was used to assemble the BAC sequences. Assemblies were viewed in CONSED [54, 55].
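The passing-read criterion can be read in more than one way. The sketch below assumes one interpretation that is not confirmed by the text: a read passes if it contains a contiguous window of at least 100 bases whose mean PHRED score is 20 or higher. The function name and parameters are illustrative and are not part of the sequencing provider's pipeline.

```python
from typing import Sequence


def is_passing_read(quals: Sequence[int], min_q: float = 20.0, min_len: int = 100) -> bool:
    """Return True if some contiguous window of `min_len` bases has mean quality >= `min_q`.

    Assumption: this windowed reading of the criterion is hypothetical.
    """
    if len(quals) < min_len:
        return False
    threshold = min_q * min_len
    # Sum of the first window, then slide one base at a time.
    window_sum = sum(quals[:min_len])
    if window_sum >= threshold:
        return True
    for i in range(min_len, len(quals)):
        window_sum += quals[i] - quals[i - min_len]
        if window_sum >= threshold:
            return True
    return False
```

Under this reading, a read with low-quality ends but a 100-base core averaging Q20 or better would still pass, whereas a read shorter than 100 bases never would.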
Computational Annotation of BAC Assemblies
Annotations for the P. taeda contigs were prepared using the program MAKER, a genome annotation pipeline that identifies repetitive elements, aligns EST and protein homology evidence, prepares ab initio gene predictions, calculates quality control metrics, and synthesizes these data into final genome annotations.
The EST/cDNA sequences used by MAKER were derived from P. taeda and were combined with EST/cDNA sequences from all other Pinaceae species found in dbEST. The UniProt/Swiss-Prot [57, 58] protein database was used as the protein homology database for the MAKER run. Repeat elements were identified using a MAKER internal transposable element database, the Repbase repeat library in conjunction with RepeatMasker, and pre-computed repeats from the program CENSOR, passed to MAKER via the algorithm's GFF3-passthrough option.
The total length of the preliminary contig set (923,817 bp) was too short to accurately train the ab initio gene predictors specifically for the P. taeda genome. Instead, a hybrid approach was taken, using existing training parameters from both monocot and dicot plant species to produce gene predictions in separate MAKER runs. Because MAKER uses evidence alignments to produce "hints" that are sent to the ab initio gene prediction algorithms able to accept them, prediction algorithms that run inside the MAKER pipeline can produce improved gene models even when the training parameters are imperfect. After producing a pool of possible ab initio and "hint-based" gene predictions, MAKER chooses those that are best supported by EST and protein homology evidence alignments using internal quality control metrics [31, 59] and promotes them to the status of genome annotations.
MAKER was first run using the ab initio gene prediction algorithms SNAP, Augustus [61, 62], and GeneMark trained for Arabidopsis thaliana and FGENESH trained for a generic dicot species (the exact species was not specified in the FGENESH documentation). The second run of MAKER was performed using SNAP and GeneMark trained for Oryza sativa in conjunction with Augustus trained for Zea mays. Both sets of MAKER-produced gene models were saved in GFF3 format, and simple intron/exon structure statistics were calculated for them using the program Eval [61, 66]. The MAKER runs were viewed and evaluated using the Apollo Genome Annotation Curation Tool. The peptide sequences corresponding to both sets of MAKER gene predictions were searched for conserved protein domains using InterProScan with default parameters against the InterPro protein signature database.
High-throughput Whole Genome Shotgun Sequencing
Whole genome shotgun sequencing was performed on diploid DNA from the same individual used to construct the BAC library, using the high-throughput Illumina Genome Analyzer II sequencing platform. Genomic DNA library construction was carried out using the Illumina genomic DNA sample preparation kit according to the manufacturer's instructions, except that paired-end-specific oligonucleotides were used instead of the single-read oligonucleotides. Starting material was 80 μl of pine genomic DNA at a concentration of 62.5 ng/μl, sonicated in a Diagenode Bioruptor for 15 cycles of 30 s on maximum power followed by 30 s of rest. Following paired-end adapter ligation, fragments of approximately 400-425 bp were gel purified and PCR amplified using the paired-end Illumina library PCR primers (primers 1.0 and 2.0). After AMPure purification (Beckman Coulter Genomics), the sample was applied to an Agilent Bioanalyzer for quantitation. Based on the Bioanalyzer-reported sample concentration, the library was applied to a flow cell at 5 pM using v1 cluster reagents. Sequencing was performed on an Illumina Genome Analyzer II using version 2 sequencing reagents for 40, 42 and 60 cycles. Basecalling was carried out using the Illumina GA Pipeline v1.3. The WGS sequencing was carried out at the University of California, Davis, Genome Center.
Additional Element Characterization in BAC Assemblies
As previously described, the MAKER automated annotation pipeline was customized for both gene prediction and repeat identification in the ten P. taeda BAC assemblies. MAKER reported simple sequence repeats, as well as similarity to Repbase accessions and the MAKER internal transposable element database. Since only a handful of complex repetitive elements have been characterized in conifers, it is expected that this similarity-based repeat landscape described by MAKER is incomplete.
Several additional methods were included to complete the identification of putative repetitive elements in the BAC assemblies. Tandem Repeats Finder was used to locate tandemly duplicated units of 5-200 bp, Gepard was used to produce dotplots in order to visualize longer direct and inverse repeats within each BAC, and discontiguous megablast (word size 11, match/mismatch = +1/-1, gap open/extension cost = 2/2) was used within each BAC to delineate direct repeats of minimum length 100 bp that span at least 500 bp of putatively noncoding sequence. The resulting pairs of direct repeats are presented in this paper as potential long terminal repeats of uncharacterized LTR retrotransposons. The results of MAKER run with dicot parameters on unmasked pine BACs were also examined for evidence of nongenic open reading frames (ORFs) that may correspond to 'novel' complex repetitive elements such as DNA transposons or LTR retrotransposons.
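The direct-repeat criterion described above can be illustrated with a small filter over self-alignment hits. This is a minimal sketch under stated assumptions, not the script used in the study: hits are assumed to be supplied as 1-based coordinate pairs (as in BLAST tabular output, where a reverse-strand hit has start greater than end), "span at least 500 bp" is assumed to mean that at least 500 bp separate the two copies, and the check that the intervening sequence is noncoding is omitted. The `Hit` type and function name are hypothetical.

```python
from typing import List, NamedTuple, Tuple


class Hit(NamedTuple):
    q_start: int  # query start, 1-based inclusive
    q_end: int
    s_start: int  # subject start; greater than s_end for reverse-strand hits
    s_end: int


Interval = Tuple[int, int]


def candidate_direct_repeats(
    hits: List[Hit], min_len: int = 100, min_gap: int = 500
) -> List[Tuple[Interval, Interval]]:
    """Keep same-strand repeat pairs that are long enough and far enough apart."""
    kept = []
    seen = set()
    for h in hits:
        # Skip inverted repeats: coordinates decrease on one side.
        if h.q_start > h.q_end or h.s_start > h.s_end:
            continue
        left, right = sorted([(h.q_start, h.q_end), (h.s_start, h.s_end)])
        if left == right:
            continue  # the sequence aligned to itself
        if min(left[1] - left[0], right[1] - right[0]) + 1 < min_len:
            continue  # one copy is shorter than the minimum repeat length
        if right[0] - left[1] - 1 < min_gap:
            continue  # copies are too close together (assumed meaning of "span")
        if (left, right) in seen:
            continue  # reciprocal hit already recorded
        seen.add((left, right))
        kept.append((left, right))
    return kept
```

Each retained pair gives the coordinates of the left and right copies, which is the form needed to report them as potential long terminal repeats.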
Regions were identified where at least two MAKER gene-finding tools predicted ORFs, but the sequence failed to show enough similarity to EST and protein databases to be annotated as protein-coding genes. Each putative nongenic ORF element shows significant similarity to at least one known repetitive element and is described using the longest ORF (minimum length 240 bp) among similar predictions.
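The selection of a representative ORF per region can be sketched as follows. This is an illustrative reconstruction under assumptions that the text does not confirm: "similar predictions" are approximated here by coordinate overlap, and the evidence checks (lack of EST/protein support, similarity to a known repetitive element) are assumed to have been applied beforehand. The function names are hypothetical.

```python
from typing import List, Tuple

Prediction = Tuple[str, int, int]  # (tool name, start, end), 1-based inclusive


def _pick(cluster: List[Prediction], min_tools: int, min_len: int) -> List[Prediction]:
    """Return the longest prediction of a cluster if the cluster qualifies."""
    if len({tool for tool, _, _ in cluster}) < min_tools:
        return []
    longest = max(cluster, key=lambda p: p[2] - p[1] + 1)
    return [longest] if longest[2] - longest[1] + 1 >= min_len else []


def representative_orfs(
    preds: List[Prediction], min_tools: int = 2, min_len: int = 240
) -> List[Prediction]:
    """Cluster overlapping predictions; keep the longest ORF per qualifying cluster."""
    reps: List[Prediction] = []
    cluster: List[Prediction] = []
    cluster_end = 0
    for p in sorted(preds, key=lambda p: (p[1], p[2])):
        if cluster and p[1] > cluster_end:
            # No overlap with the current cluster: close it and start a new one.
            reps.extend(_pick(cluster, min_tools, min_len))
            cluster = []
        cluster_end = p[2] if not cluster else max(cluster_end, p[2])
        cluster.append(p)
    if cluster:
        reps.extend(_pick(cluster, min_tools, min_len))
    return reps
```

A region predicted by only one tool, or one whose longest ORF is under 240 bp, yields no representative element.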
Whole Genome Shotgun Sequence Analysis
Two consensus transposons and a putative centromeric tandem repeat were assembled from a pool of 40- and 42-bp WGS reads using nugtohs.pl (unpublished). In order to assess genome-wide occurrence of putative genic and repetitive elements in the BAC assemblies, 60-bp WGS reads were aligned to each BAC sequence with BLASTN and post-processed with a Perl script. This produced two WGS-coverage maps of each BAC: one optimized for WGS-to-BAC alignments showing 99% nucleotide identity (score threshold 55) and one optimized to count alignments at or above 75% nucleotide identity (score threshold 24). The coverage maps are reported in hits per base pair in .sgr format and were initially analyzed using the Integrated Genome Browser. Genome-wide copy numbers of BAC elements were computed by averaging hits per base pair along the length of each element and calculating the ratio of this value to the estimated genome coverage (0.036×) provided by the 60-bp reads.
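The copy-number calculation described above reduces to a ratio of mean depth to expected single-copy depth. The sketch below assumes that alignments have already been filtered by score and are given as 1-based inclusive BAC coordinates; the function names are illustrative and do not reproduce the Perl script used in the study.

```python
from typing import List, Sequence, Tuple


def coverage_map(alignments: Sequence[Tuple[int, int]], bac_length: int) -> List[int]:
    """Count read alignments (hits) covering each base of a BAC."""
    depth = [0] * bac_length
    for start, end in alignments:
        # Clip each alignment to the BAC boundaries before counting.
        for pos in range(max(start, 1), min(end, bac_length) + 1):
            depth[pos - 1] += 1
    return depth


def estimate_copy_number(
    hits_per_bp: Sequence[float], start: int, end: int, genome_coverage: float = 0.036
) -> float:
    """Mean hits per bp over an element divided by the expected single-copy depth."""
    region = hits_per_bp[start - 1:end]
    if not region:
        raise ValueError("element coordinates fall outside the coverage map")
    return (sum(region) / len(region)) / genome_coverage
```

For example, an element with a mean depth of 3.6 hits per base pair would be estimated at roughly 100 copies genome-wide, since a single-copy sequence is expected to show about 0.036 hits per base pair.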
Assessment of Pine Genome for Sequencing and Assembly
To assess the P. taeda genome for sequencing and assembly, the repeat content of the genome was compared to twelve previously sequenced genomes: Caenorhabditis briggsae, Drosophila melanogaster, Chlamydomonas reinhardtii, Arabidopsis thaliana, Oryza sativa, Vitis vinifera, Physcomitrella patens, Populus trichocarpa, Sorghum bicolor, Malus × domestica (Troggio, unpublished), Zea mays, and Homo sapiens [79, 80]. Whole genome shotgun reads of each species were retrieved from the NCBI Trace Archive and converted to 60-bp lengths. A sample of these "reads" amounting to 0.036× genome equivalents was then aligned to 920,000 bp (similar to the total P. taeda BAC sequence) of randomly selected regions of each genome using BLAST. Alignments were categorized into three nucleotide identity groups: 70-84%, 85-97% and 98-100%. The genomic sampling was conducted 10 times and averaged.
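The read conversion, sampling depth and identity grouping can be sketched as below. Several details are assumptions rather than reported methods: reads are converted by cutting them into consecutive non-overlapping 60-bp pieces (the text does not say whether reads were split or trimmed), fractional identities are assigned using half-open boundaries at 85% and 98%, and the genome size used in the usage note is a hypothetical value.

```python
from typing import List, Optional


def chop_read(seq: str, length: int = 60) -> List[str]:
    """Cut a read into consecutive full-length pieces, discarding any remainder."""
    return [seq[i:i + length] for i in range(0, len(seq) - length + 1, length)]


def reads_for_coverage(genome_size: int, coverage: float = 0.036, read_len: int = 60) -> int:
    """Number of reads of `read_len` bp needed to reach the given genome coverage."""
    return round(coverage * genome_size / read_len)


def identity_bin(pct_identity: float) -> Optional[str]:
    """Assign an alignment to one of the three identity groups, or None if below 70%."""
    if 98 <= pct_identity <= 100:
        return "98-100%"
    if 85 <= pct_identity < 98:
        return "85-97%"
    if 70 <= pct_identity < 85:
        return "70-84%"
    return None
```

With a hypothetical 100-Mb genome, `reads_for_coverage(100_000_000)` gives 60,000 reads of 60 bp for 0.036× coverage.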
To simultaneously visualize all elements that were identified in the BAC assemblies, the program gff2ps was used. The following data were formatted into GFF files and used to create Figure 1 and Additional file 1, Figure S1: MAKER dicot and monocot runs on masked and unmasked sequence; simple repeats; tandem repeats; direct repeats (potential LTRs); nongenic ORF elements; and coverage maps of each BAC at 75% identity and 99% identity. The coverage maps are shown in these figures as histograms of average hits per base pair in 50-bp windows. The GFF files are available for interactive browsing or download at http://dendrome.ucdavis.edu/treegenes/gbrowse, where a modified version of the GMOD project GBrowse was implemented in the TreeGenes database to display the annotations [82, 83].
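The 50-bp window averaging used for the coverage histograms can be illustrated with a short function. This sketch assumes non-overlapping windows and averages a final partial window over the bases it actually contains; neither detail is stated in the text, and the function name is illustrative.

```python
from typing import List, Sequence


def window_means(hits_per_bp: Sequence[float], window: int = 50) -> List[float]:
    """Average hits per bp in consecutive non-overlapping windows."""
    means = []
    for i in range(0, len(hits_per_bp), window):
        chunk = hits_per_bp[i:i + window]
        means.append(sum(chunk) / len(chunk))
    return means
```

Each returned value corresponds to one histogram bar, so a 120-kb BAC would be summarized by 2,400 bars.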