Characterization of the genome of bald cypress

Background Bald cypress (Taxodium distichum var. distichum) is a coniferous tree of tremendous ecological and economic importance. It is a member of the family Cupressaceae which also includes cypresses, redwoods, sequoias, thujas, and junipers. While the bald cypress genome is more than three times the size of the human genome, its 1C DNA content is amongst the smallest of any conifer. To learn more about the genome of bald cypress and gain insight into the evolution of Cupressaceae genomes, we performed a Cot analysis and used Cot filtration to study Taxodium DNA. Additionally, we constructed a 6.7 genome-equivalent BAC library that we screened with known Taxodium genes and select repeats. Results The bald cypress genome is composed of 90% repetitive DNA with most sequences being found in low to mid copy numbers. The most abundant repeats are found in fewer than 25,000 copies per genome. Approximately 7.4% of the genome is single/low-copy DNA (i.e., sequences found in 1 to 5 copies). Sequencing of highly repetitive Cot clones indicates that most Taxodium repeats are highly diverged from previously characterized plant repeat sequences. The bald cypress BAC library consists of 606,336 clones (average insert size of 113 kb) and collectively provides 6.7-fold genome equivalent coverage of the bald cypress genome. Macroarray screening with known genes produced, on average, about 1.5 positive clones per probe per genome-equivalent. Library screening with Cot-1 DNA revealed that approximately 83% of BAC clones contain repetitive sequences iterated 103 to 104 times per genome. Conclusions The BAC library for bald cypress is the first to be generated for a conifer species outside of the family Pinaceae. The Taxodium BAC library was shown to be useful in gene isolation and genome characterization and should be an important tool in gymnosperm comparative genomics, physical mapping, genome sequencing, and gene/polymorphism discovery. The single/low-copy (SL) component of bald cypress is 4.6 times the size of the Arabidopsis genome. As suggested for other gymnosperms, the large amount of SL DNA in Taxodium is likely the result of divergence among ancient repeat copies and gene/pseudogene duplication.


Background
The conifer family Cupressaceae contains many remarkable and important trees including junipers, redwoods, sequoias, cypresses, and thujas [1]. One Cupressaceae species that is of tremendous ecological importance to the southeastern U.S. is bald cypress, Taxodium distichum (L.) Rich var. distichum [2]. Bald cypress is the cornerstone species in the aptly named "cypress swamps" where it serves as a source of food and shelter for numerous and sundry organisms [3]. Though native to the U.S. South, bald cypress is a popular ornamental throughout much of the world; indeed, it has been cultivated in Europe since at least the mid 17 th Century [4]. Bald cypress wood is extremely resistant to wind, water, pathogens, and pests, something that perhaps is not surprising when one considers that individual trees may spend their entire life (sometimes > 1500 years - [5]) partially submerged in water. The highly durable wood of bald cypress is used in construction of boats, docks, bridges, and roofing shingles [6], although the tree's relatively slow growth-rate has limited its use as a wood crop. Unlike most conifers, bald cypress is deciduous with leaves that change from light green to brown in the fall. Its attractive appearance and hardiness have made it a popular ornamental throughout the eastern U.S. [2].
The genus Taxodium consists of one to three extant species, depending upon taxonomic preference. The most conservative treatment places all trees in a single species (T. distichum) with three varieties; specifically bald cypress (T. distichum var. distichum), pond cypress (T. distichum var. imbricarium), and Montezuma bald cypress (T. distichum var. mexicanum). While the single species treatment is phylogenetically warranted [7][8][9], sociological reasons have kept the multi-species nomenclature in place -e.g., Montezuma bald cypress is the national tree of Mexico [10].
The bald cypress 1C DNA content is 9731 Mb [11] which places it amongst the smallest of conifer genomes [12]. Bald cypress possesses 2n = 2x = 22 chromosomes [13] and is a diploid like most members of the Cupressaceae [12].
Taxodium has not been the subject of molecular mapping and/or EST sequencing. In the Cupressaceae, molecular research has largely focused on Cryptomeria japonica, and molecular maps based upon EST, RFLP, RAPD, and isozyme markers exist for this species [14][15][16]. However, these genetic maps cover only a small part of the entire Cryptomeria genome.
To advance understanding of Taxodium and the Cupressaceae in general, we utilized three experimental tools. Specifically: (1) Cot analysis -The study of DNA reassociation kinetics in solution is known as Cot analysis [17]. It is one of the earliest means of studying genome structure predating cloning and DNA sequencing techniques by several years (see 18 for review). Cot analysis is based upon the observation that the product of DNA concentration (C 0 ), reassociation time (t), and a "buffer factor" accounting for cation concentration (δ) has a predictable effect on the amount of reassociation occurring in a denatured DNA sample [18]. The major unknown factor influencing reassociation is the underlying sequence composition of the DNA. Consequently, one can indirectly study genome sequence composition by exploring how changes in C 0 tδ (known by the colloquialism "Cot") influence reassociation. Typically, a graph is created in which the fraction of reassociated DNA is plotted against the logarithm of Cot (from Cot~0 to Cot values at which reassociation is essentially complete). The resulting scatter plot is analyzed using nonlinear regression analysis, and a least-squares curve is fit through the data. This graph, known as a Cot curve, provides a visual representation of the genome. Analysis of Cot data provides the number of kinetic components in a genome, the reassociation rate (k) of each component, the fraction of the genome found in each component, the kinetic complexity (i.e., estimated sequence complexity) of each component, and each component's average sequence iteration. Additionally, in some instances the genome size of an organism can also be estimated through comparison of the k value for the single/ low copy component of the organism of interest with the k and genome size of E. coli [19].
(2) Cot filtration (CF) -CF represents a merger between Cot analysis and high-throughput DNA sequencing [19][20][21]. In short, the results of a Cot curve are used to guide fractionation of a genome into its kinetic components, and isolated components are sequenced in full or part. The value of CF and other reduced-representation sequencing techniques lies in their ability to enrich for subsets of genomic DNA of interest [19,22]. Since the majority of genes are single/low-copy in nature, CF has been used to enrich for gene space including the promoters and introns missed by cDNA approaches [20,23,24]. Alternatively, sequencing of a highly repetitive component represents a means of efficiently exploring the repetitive landscape of a genome [25].
(3) BAC library analysis -Bacterial artificial chromosomes (BACs) have been the most popular large-insert cloning vectors for nearly 20 years [26], and ordered BAC libraries (i.e., libraries in which clone is individually archived) remain highly useful tools in modern genomics research [27]. One can efficiently map molecular markers to corresponding BACs via multiplex macroarray hybridization techniques or by using multiplex PCR strategies [28][29][30]. By combining macroarray/PCR mapping data with data from BAC end sequencing, DNA fingerprinting, and/or sequencing of BAC pools, one can generate highly accurate physical (DNA sequence) maps and identify minimum BAC tiling paths representing whole or nearly whole chromosomes [27]. Though construction of BAC minimum tiling paths will likely become less important as DNA sequencing becomes cheaper and faster, BAC libraries will likely remain a key means of bridging gaps and resolving anomalies in shotgun sequence-based scaffolds [31].
The bald cypress BAC library generated in this study affords 6.7 genome equivalent coverage of the bald cypress genome. Though BAC libraries exist for several Pinaceae conifers [31][32][33], to our knowledge the bald cypress BAC library is the first constructed for a Cupressaceae species. The library was shown to be useful in gene isolation and genome characterization. The Cot and CF-based repeat analyses suggest that highly diverged, low-copy repeats account for much of bald cypress' genomic DNA.

Cot analysis
The CotQuest [34] nonlinear regression model providing the best fit of the renaturation kinetics data was a threecomponent fit in which outliers had been removed (using CotQuest's built in outlier detection) and the reassociation rate (k) of the slowest reassociating component had been fixed based upon the genome size of bald cypress. This best fit Cot curve is shown in Figure 1 while the major biological characteristics obtained from curve analysis are shown in Table 1. Of note, the curve is composed of highly repetitive (HR), moderately repetitive (MR), and single/low-copy (SL) components accounting for 47.0, 41.1, and 7.4% of the genome, respectively. Because the best fit was obtained when the reassociation rate of the SL component was fixed based upon genome size, the curve cannot be used to produce an estimate of DNA content. Assuming that the SL component has a repetition frequency of 1, the average repetition frequency of the DNA in other components can be estimated by dividing their k values by the k value of the SL component [35]. The mean predicted repetition frequencies of sequences in the MR and HR components are 61 and 2054, respectively.
For each component, 80% of the sequences in that component will reassociate in the "two Cot decade region" (TCDR) flanking the component's Cot1/2 value, i.e., if a component's Cot1/2 is y, then 80% of sequences can be found between 0.1y and 10y. Because k (and hence Cot1/2 which is the inverse of k) can be directly related to sequence copy number, one can use the TCDR to predict the range in sequence iteration for 80% of the sequences in a particular component. For example, the bald cypress MR component has a mean repeat frequency of 61 while 80% of the MR sequences are repeated from 6.1 to 610 times. Likewise, for the HR component which has a mean repetition frequency of Cot Figure 1 Cot curve for bald cypress. (A) A least-squares curve (black line) was fit through the data points (gray circles) using the CotQuest program of Bunge et al. [34]. The curve consists of highly repetitive (HR), moderately repetitive (MR) and single/low copy (SL) components, characterized by fast, intermediate, and slow reassociation, respectively. Blue, green, and red diamonds mark the Cot1/2 values of the HR, MR, and SL components. The brackets centered at a particular Cot1/2 marker show the "two Cot decade region" in which 80% of the sequences in that component will renature [63]. The blue shaded region at the top of the curve shows the double-stranded HR DNA (plus foldback sequences) isolated for sequencing. Biological data obtained from curve analysis is shown in Table 1

Cot filtration
743 high-quality HR capillary sequence reads were analyzed using the Sequence Read Classification Pipeline (SRCP) of Chouvarine et al. [36]. As shown in Figure 2, the vast majority of reads (72.49%) showed no obvious (S' ≥ 60) homology to previously characterized repeat sequences, gene sequences, and/or each other. Chloroplast DNA, rDNA, and "Probable Repeats" (i.e., sequences that appear to be repetitive based upon their relative frequency in the HR sequence set) each account for about 6-8% of the HR reads. Only 0.54% of sequenced HR sequences shared significant sequence identity with known mobile elements. The program MUST [37] was used to identify potential miniature inverted-repeat transposable elements (MITEs) in the HR sequences. MITEs are non-autonomous DNA elements characterized by terminal-inverted repeats, target-site duplications (direct repeats), and an internal region with no coding sequence [38]. Those potential MITEs found in sequence reads classified as "Unknown" or "Probable Repeats" by the SRCP are provided in Additional file 1. Using the criteria described in the Materials and Methods, 53 potential MITEs were identified. The mean MITE length was 246 bp, the mean direct repeat length was 2.3 bp, and the mean terminal-inverted repeat length was 8.8 bp.

BAC library construction and characterization
The bald cypress BAC library consists of 606,336 individually-archived clones in 1579 384-well microtiter plates. The library was given the designation TDD_Ba in accordance with the library naming standards used by the Mississippi Genome Exploration Laboratory (MGEL; see http://www.mgel.msstate.edu/dna_libs.htm), the Clemson University Genomics Institute (http://www. genome.clemson.edu), and the Arizona Genomics Institute (http://www2.genome.arizona.edu). The library and its associated products are distributed by MGEL (http:// www.mgel.msstate.edu).
Periodically during BAC library construction, pulsedfield gel electrophoresis analysis was used to check the insert size of randomly-selected NotI-digested BACs. Inserts ranged in size from 10 kb to 202 kb ( Figure 3A), and the mean insert size for the library was 113 kb ( Figure  3B). The origin of the BAC insert DNAs was confirmed by Southern hybridization using radiolabeled bald cypress genomic DNA as a probe. As shown in Figure 3C, there is considerable variation in hybridization intensity between BAC inserts ( Figure 3C) that cannot be accounted for by differences in DNA amount per lane (compare Figures 3B  and 3C). Those clones with the strongest hybridization signals ostensibly contain more repetitive DNA than those with weaker signals.
Approximately 1.8% of clones in the library were false positives (i.e., they possessed a vector plasmid but had no discernible insert). The proportion of false positive clones was greater earlier in library construction -e.g., the first 660 plates had a false positive percentage of 3.05 while the remaining 919 plates had a false positive percentage of 0.95. The addition of a "pre-electrophoresis" size-selection step (see Materials & Methods) was likely responsible for the reduced number of false positives in the latter threefifths of the library.
The level of chloroplast DNA contamination in the library was estimated by macroarray hybridization. Four loblolly pine (Pinus taeda L.) sequences representing different regions of the chloroplast genome [31] were used to probe five 4 × 4 double-spotted bald cypress macroarrays. We found that 2348 (approximately 2.5%) inserts of the 92,160 Taxodium clones represented on the macroarrays contained chloroplast DNA. This is the highest level of chloroplast contamination for any plant BAC library we have constructed, but still within the range of reported values (see reference [31] for discussion).
While macroarray screening with mitochondrial DNA was not performed, sequencing of nuclear DNA prepared using our nuclear DNA isolation protocol suggests that mitochondrial DNA contamination is 10 to 100 times U Figure 2 Categorization of bald cypress HR sequences. The HR sequence reads were classified using the Sequence Read Classification Pipeline of Chouvarine et al. [36] with minor modifications detailed in the Materials and Methods. Note that the majority of reads (72.49%) show no significant to homology to known gene or repeat sequences, nor are they categorized as "Probable Repeats" by the SRCP's ab initio repeat analysis functions.
less frequent than contamination from chloroplast DNA (reference 31 and Figure 2). After subtracting the fractions of false positive clones and chloroplast DNA-containing clones, the number of clones containing bald cypress nuclear DNA was estimated to be 580,263. Thus the fraction of clones in the library containing nuclear DNA was 0.957 (i.e., 606,336 ÷ 580,263 = 0.957), and the number of nuclear DNAcontaining clones on each macroarray was determined to be 17,639 (i.e., 18,432 × 0.957). Using the average insert size of 113 kb, the library contains approximately 6.74 genome equivalents of bald cypress nuclear DNA [(580,263 × 113 kb) ÷ 9731 Mb = 6.74)]. This level of genome equivalent coverage affords a 99.88% chance that a particular genomic sequence of interest will be found at least once in a library [see 39].
To further assess the quality and the utility of the library, eight T. distichum var. distichum single-copy gene sequences, employed previously in molecular phylogenetics research [40][41][42], were used to screen five 4 × 4 macroarrays. The five macroarrays collectively represented 1.02 (rounded to 1.0 for simplicity) genome equivalents of bald cypress nuclear DNA. The gene sequences (Table 2) were used to design PCR primers and/or overgo probes for each gene (Additional file 2, Tables S1 and S2), and macroarrays were separately screened with a pool containing the radiolabeled overgos and a pool containing the radiolabeled PCR amplicons. As shown in Table 3   clones recognized by a probe was 1.4 for the overgo pool and 1.5 for the amplicon pool. To verify the utility of the library in gene isolation, positive clones 1069I8, 1052C3 and 1137O7 were spotted onto nylon membranes and probed with various combinations of the overgo probes for the Chi1, Ferr, Pat, HemA and Lcyb genes. The results of these dot-blot hybridizations revealed that clones 1069I8, 1052C3, and 1137O7 contained the Chi1, HemA, and Pat genes (in full or part), respectively.

Library screening with repetitive sequences of bald cypress
For estimation of repetitive DNA content, blots prepared from pulsed-field gels of random NotI-digested BAC clones were probed with radiolabeled Cot ≤ 1 M·s (i.e., Cot-1) DNA. Based on the Cot data, Cot-1 DNA should contain sequences repeated, on average, 9523 times. Among the 328 clones on the various Southern blots, 273 showed obvious hybridization signals using the Cot-1 probe, i.e., approximately 83.2% of the BAC clones in the library appear to have inserts containing repetitive elements found hundreds to thousands of times in the genome (data not shown).
To study the distribution of a particular retroelement in the Taxodium genome, a 4 × 4 macroarray was screened with an overgo probe (Additional file 2, Table S3) designed from a bald cypress HR Cot-filtered sequence [GenBank: ET185333] with high sequence identity (S' = 128) to two Gingko biloba copia-like retroelement reverse transcriptase sequences [GenBank: DQ054445, GenBank: DQ054446]. Of the 17,639 nuclear DNA-containing clones on the macroarray, 888 (i.e., 10.1%) exhibited hybridization to the overgo which represents a reverse transcriptase gene we have deemed TdCRT1 for Taxodium distichum copia-like reverse transcriptase 1 (Figure 4). Densitometer analysis of the macroarray suggests that TdCRT1 is found in roughly 23,892 copies in the bald cypress genome. An initial glance at TdCRT1 hybridization to macroarrays suggests that the element is found in clusters, i.e., it is not distributed randomly throughout the genome. To test this hypothesis, we used the probabilistic "urn model" method of Holst [43] as detailed in Shan et al. [44]. Using an average BAC insert size of 113 kb, an estimated 17,639 nuclear DNA-containing clones per macroarray, and a genome size of 9731 Mb, we calculated that each macroarray represents about 0.205X coverage of the Taxodium genome [i.e., 17,639 · 113 kb)/9731 Mb = 0.205]. Thus one would predict that one macroarray should contain 4,898 copies of TdCRT1 (i.e., 23,892 copies · 0.205 = 4,898). If TdCRT1 distribution were indeed random, we would expect that the distribution of clones lacking TdCRT1 (i.e., lacking hybridization signal) would approximate normality. The mean number of clones expected to lack a TdCRT1-containing element and the theoretical standard deviation (SD) can be estimated using Theorem 2 of Holst [43]. Specifically, where N is the number of nuclear DNA-containing clones on the macroarray (i.e., 17,639), n is the expected number of clones showing TdCRT1 hybridization (i.e., 4,898), and p k is the probability of any copy of the element "falling" within a clone (i.e., 1/17,639). Insertion of the preceding values into the equations produced a mean of 13,362 clones per macroarray with a standard deviation of 26. However, we determined that 16,751 of the 17,639 clones on the macroarray did not exhibit TdCRT1 hybridization. Consequently, the observed number of clones lacking TdCRT1 hybridization is 642 times (16,751/26 = 642) the expected standard deviation for a normal distribution, strongly supporting the premise that the distribution of TdCRT1 is not random.
Of note, nine overgos representing common Ty1copia-like or Ty3-gypsy-like repeats from loblolly pine were hybridized to bald cypress macroarrays. None of the loblolly pine repeat-based overgos hybridized to the Taxodium filters (data not shown).

Discussion
The genome of bald cypress, a Cupressaceae conifer, is three times the size of the human genome but only half the size of the genomes of most Pinaceae conifers [12]. Bald cypress appears to be a diploid like its Pinaceae cousins and, since evidence suggests that conifers do not likely require or utilize more genes than other seed plants [45], a natural hypothesis is that the bald cypress genome contains lots of repetitive, non-coding DNA. Our Cot analysis supports this general tenet as the repetitive fraction of the bald cypress genome is substantial (HR + MR = 90.5%). However, the HR and MR components of bald cypress have relatively low mean repetition frequencies (2054 and 61, respectively). The idea that bald cypress repeats tend to be found in low copy numbers is supported by analysis of the HR sequences with the de novo repeat detection functions of the Sequence Read Classification Pipeline (SRCP) [36]. SRCP analysis reveals that only 6.2% of 743 bald cypress HR reads showed enough sequence identity with each other to be grouped by the SRCP into the "Probable Repeats" category. In comparison, SRCP evaluation of random Sanger sequence reads from Sorghum bicolor (sorghum), a plant with a genome one-thirteenth the size of bald cypress [46], resulted in categorization of 27% of Sorghum reads as "Probable Repeats" despite the fact that the Sorghum sequences were not specifically enriched for repeats [36].
Why the Taxodium genome has so much low-copy repetitive DNA is unknown. A reasonable guess is that the majority of mobile element amplification events in bald cypress occurred tens to hundreds of millions of years ago, and since then element proliferation/activity has been severely restricted. If mobile element amplification was suppressed long ago, the copies of each repetitive element would presumably begin to diverge until they were eventually no longer readily recognizable as being derived from the same source. Indeed, the limited studies of mobile elements in conifers do indicate that there is considerable sequence divergence among members of retroelement families [31,47,48]; moreover, much of the DNA within sequenced regions of conifer genomes lacks features characteristic of known mobile elements, low-complexity repeats, or genes [31,49]. Since all conifers possess enormous genomes, it is possible that some of the mobile element amplification events that underlie the behemoth C-values of conifers occurred prior to the divergence of the major conifer families; this idea should be much easier to test once genome sequences become available for conifers.
Only 0.54% of reads (i.e., four reads total) showed reasonable homology (S' = 60) to previously annotated transposons; top hits were to copia-like retroelements from Ginkgo biloba (GenBank: DQ054445, GenBank: DQ054446), Sequoiadendron giganteum (AJ290723), and Picea glauca [GenBank: AF229252], and a gypsy-like element from Abies veitchii [GenBank: AJ002621]. The HR read with the highest level of similarity to a previously characterized retroelement [GenBank: ET185333] was used to design an overgo probe (see Additional file 2, Table S3) that was hybridized to bald cypress macroarrays. The element represented by the overgo, i.e., TdCRT1, was estimated to be present in approximately 23,892 copies per 1C genome. As described above, 80% of HR sequences are present in copy numbers between 205.4 and 20,540. If half the remaining 20% of sequences in the HR component reassociate after 10·Cot1/2, then we would predict that the remaining half (i.e., 10%) would reassociate prior to 0.1·Cot1/2. In other words, we would expect less than 10% of HR sequences to have an iteration frequency greater than 20,540. Since TdCRT1 has a frequency of 23,892 copies per genome, it seems reasonable to assume that few, if any, genome sequences are more redundant than TdCRT1. Statistical evaluation of the distribution of TdCRT1 among clones on a macroarray suggests that it is found in a decidedly non-random distribution throughout the bald cypress genome. That one of the most redundant sequences in the 9.7 Gb Taxodium genome is found in only 23,892 copies and is distributed in a decidedly non-random fashion initially struck us as a bit surprising. However, as Baucom et al. [50] note, "Two of the many misconceptions about TE (transposable element) properties in higher eukaryotes are that they are highly repetitive and are randomly scattered about the genome." Our previous studies on the loblolly pine genome suggest that, as in bald cypress, the majority of pine repetitive DNA sequences are highly diverged and ostensibly ancient. However, in contrast to bald cypress, there appear to be at least a few retroelement families in loblolly pine of more pronounced conservation (and ostensibly of more recent origin). For example, the retroelement IFG7 [51] is found in 210,557 copies and accounts for about 5.8% of the pine genome [31] while the Athila-like retroelement Gymny is found in approximately 21,700 copies and accounts for 1.3% of the loblolly pine genome [47]. Together, IFG7 and Gymny make up 1.54 Gb of the loblolly pine genome, a value 9.8 times that of the Arabidopsis thaliana genome [52]. As the pine genome is roughly twice the size of that of bald cypress, it is possible that the activity of IFG-7, Gymny, and other mobile element families are, in part, responsible for the larger genome of loblolly pine (and other Pinaceae conifers) compared to bald cypress.
The SL component of bald cypress accounts for 7.4% of its genome. This is an amount 4.6 times the size of the Arabidopsis genome. Because conifers appear to have functional gene numbers similar to those of diploid angiosperms [45], it is possible that much of the SL DNA of Taxodium is composed of repetitive elements that have diverged to the point that they are now "singlecopy" in nature, and indeed, studies on other conifers have suggested that the high sequence complexity of their low-copy DNA is likely due, in part, to repeat divergence [31,47,48]. Duplication of genes and/or pseudogene production may also contribute to the size of the SL component. It is interesting to note that the macroarray hybridization results using probes for Taxodium "singlecopy" genes (Table 2) indicate that there are 1.4-1.5 positive clone hits per probe per estimated 1C genome equivalent. This result could indicate that BAC library coverage is actually greater than what we calculated based on insert size and clone number or that the genome size of bald cypress is actually smaller than previously reported. Alternatively, some of our probes may be hybridizing to multiple loci. Because our analysis was based upon macroarray hybridization and not PCR, it is also possible that some of the amplicon and/or overgo probes we developed contain regions with significant sequence identity to loci not amplified using the primers from the previous phylogenetic studies. Such loci could be paralogous genes or pseudogenes. Of note, pseudogenes have been described in a number of Pinaceae conifers including pines [49,[53][54][55][56], larches [57], and spruces [58,59]. In a recent examination of ten BAC sequences from loblolly pine, Kovach et al. [49] found that pseudogenes were five times more common than genes with potential protein coding functions -whether such a high pseudogene level holds for the genome as a whole is unknown. To our knowledge, pseudogenes have not been reported in Cupressaceae conifers, although the amount of sequence information for Cupressaceae is minute compared to that for Pinaceae. As a means of exploring whether pseudogenes are common in bald cypress, we plan to sequence Taxodium BAC clones including those recognized by the probes listed in Table 2.
Hybridization of loblolly pine retroelement sequence probes to macroarrays of Taxodium produced no positive signals. This is not particularly surprising as fossil evidence indicates that the Pinaceae and the Cupressaceae have been separate lineages for roughly 250 million years [1].
Analysis of the bald cypress HR sequence with MUST [37] resulted in detection of 53 potential MITEs (Additional file 1). None of the MITEs shared terminalinverted repeats, and hence each potential MITE was considered a member of its own MITE family. Further analysis of the potential MITEs will be facilitated by sequencing of bald cypress BACs and random genomic DNA.

Conclusions
We have explored the genome of bald cypress using several molecular techniques. Of particular note, we have generated a BAC library for bald cypress. This is the first large-insert library for a member of the Cupressaceae and a logical tool in eventual sequencing and assembly of the bald cypress genome. With regard to genome biology, the nuclear DNA of bald cypress appears to be largely composed of relatively low-copy repeats. These low-copy sequences may have arisen from ancient mobile element amplification events followed by millions of years of mobile element quiescence. Sequencing of BACs should provide key information on gene structure while helping to define the roles that gene duplication, pseudogenes, and repeat sequence divergence have played in shaping the Taxodium genome.

Plant materials and DNA isolation
Young leaves were collected from a bald cypress tree located at 33°27.2729' N, 88°47.5898' W on the Mississippi State University campus, and nuclear DNA was isolated according to the protocol described in Additional file 3. The DNA was sonicated into fragments with a mean length of 450 bp (Additional file 4), and metal ions were removed from the DNA solution using Chelex-100 (Additional file 5). The fragmented DNA was ethanol precipitated and re-dissolved in 0.5 M sodium phosphate buffer (Additional file 6) for Cot analysis or 10 mM Tris-HCl (pH 8.0) for other uses.

Melting curve and Cot analysis
DNA preparation and melting analyses were performed as described previously [60]. Cot analysis was performed according to Peterson et al. [20,60]. Cot data was analyzed using the program CotQuest [34].

Cot filtration and highly repetitive DNA library construction
From the Cot analysis we determined that the Cot1/2 value of the HR component was 4.64 M·s (Table 1). To isolate the HR component and foldback DNA, a bald cypress DNA sample was denatured and allowed to reassociate to Cot 6 M·s (rounded up from 4.64 M·s for simplicity). The resulting double-stranded DNA (Cot ≤ 6 M·s) was isolated using hydroxyapatite chromatography and ligated into either the pGEM T-Easy (Promega) or TOPO TA (Invitrogen) cloning vector. The resulting recombinant molecules were used to transform ElectroMAX DH10B T1 Phage-Resistant E. coli cells (Invitrogen) according to Peterson et al. [20]. HR clones were end-sequenced (single-pass) using an ABI 3730×l DNA analyzer. We obtained high-quality reads for 743 clones. The HR clone sequences were deposited in GenBank (GenBank: ET185231-ET185973).

HR sequence analysis
The HR sequence reads were analyzed using the Sequence Read Classification Pipeline (SRCP) previously described in Chouvarine et al. [36]. Additionally, the sequences were evaluated using the miniature inverted-repeat transposable element (MITE) identification tool MUST [37] with the following parameters: minimum TIR length = 8 nt; maximum TIR length = 50 nt; minimum DR length = 2 nt; maximum DR length = 30 nt; minimum MITE length = 100 nt; maximum MITE length = 1000 nt; minimum incluster identity = 1. In Figure 2, potential MITEs detected by MUST were only included under the MITE heading if they were categorized as either "Unknown" or "Probable Repeat" by the SRCP. For a putative MITE in an SRCPclassified "Unknown" or "Probable Repeat" read, only the length of the MITE itself (TIR and inter-TIR sequence) was considered MITE sequence; the remainder of the bases in the read were classified based upon the read's SRCP categorization.

BAC library construction and characterization
Young leaves were obtained from a bald cypress tree located at 33°27.4330' N, 88°47.3777' W on the Mississippi State University campus. We have designated this tree MSSTATE#1, and rooted cuttings from the tree have been sent to the USDA's Southern Institute for Forest Genetics (Saucier, MS) for maintenance and distribution. A BAC library was produced from MSSTATE#1 leaf nuclear DNA according to Peterson et al. [61]. A "preelectrophoresis" step, as detailed in Magbanua et al. [31], was added to the BAC library protocol during the sizeselections preceding the production of plates 661-919. Introduction of the "pre-electrophoresis" step appears to have led to a decrease in false positives clones in the latter 60% of the library compared to the first 40% (see Results).
Preparation of radiolabeled probes and macroarray hybridization were conducted as described in Magbanua et al. [31]. Southern hybridization was conducted using standard techniques [62].