In this analysis, we have characterized homeologous sequences from the paleopolyploid soybean genome and studied the effect of conserved duplicate regions on sequence assembly. Identified BACs map to 11 of the 20 soybean linkage groups representing a broad sampling of potential homeologous regions across the soybean genome. Previous analyses have shown fairly extensive sequence conservation between homeologous blocks in soybean [8, 25]. Sequenced BACs identified as containing transcribed duplicate genes show a range of gene conservation (Figure 1; Additional Files 1, 2, 3, 4, 5: Supplemental Figures 1, 2, 3, 4).
Early analysis of the structure and organization of a paleopolyploid genome have been in maize. The "maize model" suggests that the present maize genome is a result of extensive reciprocal deletions as well as major transposable element insertions causing genome expansion and contraction resulting in homeologous regions that are not well conserved [5, 7, 9, 10]. Conversely, in cotton, a relatively recent allotetraploid, the homologs studied were highly conserved with only small indels and transposable element insertions differing between regions . The "cotton model" suggests strong duplicate gene conservation that extends well into the intergenic regions. In this analysis we find that the soybean genome is a mosaic of these two models with a range of conservation spanning from gene for gene retention  to moderately conserved regions with 25 to 50% gene retention  and highly divergent regions with a single gene conserved (Figure 1).
Coalesence estimates suggest that the most of the regions diverged approximately 9.6 Mya. This value falls within the range of what has previously been observed [8, 25]. On the extreme end, however, five BACs contain highly divergent duplicate genes. These may indeed be the result of gene translocation, segmental or single gene duplication and not the result of polyploidy. While in the absence of the whole genome sequence we cannot be certain of the mechanism by which these genes duplicated, some support for at least a larger duplication event is found from the genetic map. Mapping of duplicate RFLP markers in soybean provided early evidence for a major genome duplication event . Utilizing the most recent genetic map , linkage groups D1a and D1b (where gmw1-57d24 and gmw1-27d20 map, respectively) were found to contain an RFLP A725 that is duplicated between these linkage groups. In addition, D1b and O (where gmw1-27d20 and gmw1-58k3 map, respectively) both contain the RFLP K011 duplicated between linkage groups. While the linkage positions of these markers are separated by many centimorgans (data not show), it does lend credence to these linkage groups having a shared ancestry. A similar comparison for gmw1-13o17 and gmw1-8g7 could not be done because gmw1-8g7 is unmapped. Regardless of the mechanism, in soybean, there are regions of paleoduplicated chromosomes that have diverged greatly since duplication while others have not (Figure 1) [see Additional files 1, 2, 3, 4, 5].
Size differences between duplicate genes were observed on many of the BACs (Table 2). Even though on average 59% of the predicted genes had some EST support, the reliance on ab initio predictions results in variation between duplicate genes in gene structure predictions. A similar issue is observed with the PPR-like genes on gmw1-103e11 that are a potential source of batch assembly error. In addition, the varying levels of protein identity in homeologous regions may be the result of unsupported gene structure predictions. This analysis clearly shows that for improved annotation of the whole genome assembly, more transcript (EST, cDNA, etc.) sequences will be necessary to verify predicted gene structures.
Most plant genome sequencing efforts have been BAC-based using highly inbred plants with pseudo-monoploid genomes (diploid or polyploid plants with identical paleoduplicated genomes). As a result, plant genome assemblies have not been confounded by the effects of retained homeology in paleopolyploid regions of the genome. Conversely, many of the non-plant eukaryotic sequencing efforts have been WGS such as Fugu rubripes , mouse [41, 42], and the Celera version of the human genome [43, 44] to name only a few. Comparisons between the WGS project and BAC-based sequencing project in humans have found that while the WGS provides more accurate gene coverage more quickly, the BAC-based sequencing has much better coverage of repetitive sequences, especially highly conserved repeats and in the long run is more accurate in both order and orientation of genes [44–47]. A somewhat similar comparison between the Oryza sativa L. ssp. indica  and Oryza sativa L. ssp. japonica  sequencing projects concluded that the major differences in sequence assemblies are due to regions with large transposable elements .
The soybean genome is a well-documented paleopolyploid [12, 51] as are all sequenced plants, e.g., Arabidopsis , rice [48, 49, 53, 54] and most recently Poplar . Although homeologous blocks could be identified in each of these species, even the most recent polyploidy events are thought to be more ancient than what has been described in soybean [19, 20]. The often high levels of sequence conservation in homeologous regions in soybean [8, 25] has raised the question of what effect this will have on the assembly of the whole-genome shotgun sequence effort (WGS) currently underway.
The reassembly of 17 homeologous BACs in soybean provides the first look at the effects a relatively conserved paleopolyploid genome on WGS assembly. The most identical homeologous BACs sequenced, gmw1-105h23 and gmw1-15k6 are just under 95% identical across both the BAC coding and noncoding regions (Table 2) . Reassembly of these two BACs showed no misassembly of the BACs and no cross-assembly of trace files from one BAC in the other BAC (Figure 2). In the context of the WGS assembly, this is good news for homeologous regions that share less than 95% sequence identity. Under standard assembly parameters using Phrep/Phrap, paleoduplicate homeologous regions should be resolvable.
When all 17 BACs are reassembled in batch, observed assembly errors are the results of tandem duplications and simple sequence repeats. Analysis of the re-assembled BAC gmw1-103e11 shows that tandem duplications of genes such as the PPR-like genes with sequence identity greater than 95% may cause assembly issues. Using a standard set of parameters, clone pairs cannot be distinguished, especially when the repeat is larger than the sequence reads (generally over 500 bp). The parameter set that better resolves tandem repeats may not be the appropriate parameter set for all assemblies; as a result, hand assembly of these regions may be necessary for completion of genome assembly. Similarly, large simple sequence repeats may cause incorrect merging of regions. It should be noted however, if there are homeologous regions of the soybean genome that are conserved with greater than 95% sequence identity, they will likely behave in a manner similar to tandem duplications and may be more difficult to distinguish.
What was not observed in the batch reassembly was errors caused by retrotransposon sequences. In soybean, many of the potential retrotransposons have not been characterized although a number of studies are underway to identify repetitive sequences in soybean Marek et al. unpublished results . This analysis, with one exception, did not identify BACs that contained numerous repetitive sequences; instead they were found to be gene rich. BAC gmw1-45m6  does contain numerous LTR retrotransposons, but re-assembly of this BAC showed few errors. Cytogenetic studies have shown that the high-copy sequences in soybean are highly concentrated to centromeric and pericentromeric regions [24, 56]. In addition, ongoing analysis of repetitive sequence in soybean shows that it is primarily in the centric, telomeric and nucleolar organizing regions of the genome (Gill et al. unpublished results) . Contrary to maize or some species of rice [10, 57], no evidence for a large burst of retrotransposon activity has been found in soybean. It is likely then, that in the context of WGS assembly, retrotransposon sequences in most cases will not affect assembly of genic regions.
Preliminary analysis of contigs generated from JGI trace files give an estimation of what repetitive sequences will need to be screened for during WGS assembly (Figure 4). Even though the 80,000 JGI traces were prescreened against characterized soybean repeats, those trace files that contain a fragment of a repeat are passing through the screening process. Further, there are enough sufficient sequences that assemble to regenerate the original repetitive sequence into a contig, or at least enough of the sequence to match back to characterized repeats. One previously noted consequence of WGS assembly is that the exclusion of transposable element sequences and repetitive sequences during assembly has the effect of eliminating genes that might be found in these regions . In this case, genic sequences that flank or are contained in repetitive regions may be able to pass through the repeat screening such that they become part of the assembly. A balance between screening for repetitive sequences during WGS assembly while not excluding genic information will need to be found.