A multiway analysis for identifying high integrity bovine BACs
© Ratnakumar et al. 2009
Received: 10 June 2008
Accepted: 23 January 2009
Published: 23 January 2009
In large genomics projects involving many different types of analyses of bacterial artificial chromosomes (BACs), such as fingerprinting, end sequencing (BES) and full BAC sequencing, there are many opportunities for the identities of BACs to become confused. However, by comparing the results from the different analyses, inconsistencies can be identified and a set of high integrity BACs preferred for future research can be defined.
The locations of each bovine BAC in the BAC fingerprint-based genome map and in the genome assembly were compared based on the reported BESs and, for a smaller number of BACs, the full sequence. BACs with consistent positions in all three datasets, or, if the full sequence was not available, in both the fingerprint map and BES-based alignments, were deemed to be correctly positioned. BACs with consistent BES-based and fingerprint-based locations, but with conflicting locations based on the fully sequenced BAC, appeared to have been misidentified during sequencing, and included a number of apparently swapped BACs. Inconsistencies between BES-based and fingerprint map positions identified thirty-one plates from the CHORI-240 library that appear to have suffered substantial systematic problems during the end-sequencing of the BACs. No systematic problems were identified in the fingerprinting of the BACs. Analysis of BACs overlapping in the assembly identified a small overrepresentation of clones with substantial overlap in the library and a substantial enrichment of highly overlapping BACs on the same plate in the CHORI-240 library. More than half of these BACs appear to have been present as duplicates on the original BAC-library plates and thus should be avoided in subsequent projects.
Our analysis shows that ~95% of the bovine CHORI-240 library clones with both a BAC fingerprint and two BESs mapping to the genome in the expected orientations (~27% of all BACs) have consistent locations in the BAC fingerprint map and the genome assembly. We have developed a broadly applicable methodology for checking the integrity of BAC-based datasets even where only incomplete and partially assembled genomic sequence is available.
BAC libraries are a key component of many large genomics projects. They are used in the construction of maps of regions of genomes (see [1–5] for examples for the bovine genome), in the construction of maps of complete genomes [6–11], to provide a framework for the sequencing of genomes [12, 13], and in comparative genomic hybridisation to study genome rearrangements [14, 15]. Many projects undertake fingerprint and BES analyses to construct physical maps of the target genome; this information can also be used to identify a tiling path of BACs to be sequenced as part of a genome sequencing strategy. To enable a range of different analyses to be undertaken by different groups, several copies of the BAC library may be created or subsets re-arrayed, with a number of different organisations undertaking various parts of the fingerprinting, BAC end-sequencing and full BAC sequencing, thereby potentially increasing the chances of BAC assignment errors.
Throughout the course of the bovine genome project the CHORI-240 library was replicated a number of times and different methods were used by several research groups at varying times on independent equipment. As part of the processing in these laboratories, clones were re-arrayed several times from 384 to 96 well plates for growth of cells prior to preparation of DNA and further split onto two 96 well plates for sequencing of the two ends of each BAC clone. Although such re-arraying is a frequent event, there have been relatively few studies of the impact of these processes on the integrity of BAC assignments within large genomics projects. In an early assessment of the Human Genome Project, analyses of the association between clone name and sequence for the human BESs found a match rate for some sets of BACs of only 30%. In a specific test of integrity, 91% of clones contained the same BESs when determined at two different centres. More recently, during the construction of a set of BAC clones spanning the human genome, approximately 7% of clones reanalysed did not generate the same fingerprint as generated in the original fingerprinting of the clones. In a study using the mouse genome BAC-libraries, a consistency rate of 95% for repeat sequencing of BAC ends was observed. The authors proposed that the high levels of automation in the processing pipelines should further increase the integrity of the datasets being generated. Similar analyses of EST datasets also indicate a range of tracking error rates: almost 38% in a sample of the IMAGE cDNA clones, 11.1% in a set of bovine cDNAs and ~7.6% in a set of honey bee cDNA clones. In contrast, lane tracking errors during sequencing appear to be generally low, around 0.5% in a survey of a number of EST libraries.
A number of genome projects have used fingerprint maps, BESs and genome sequence data to identify sets of reliable BAC clones spanning the genome or to build a BAC-based map of the genome. In these projects the consistency between the BAC fingerprint and BES based positions on the genome was used to include or exclude BACs from either the set of BACs or the map. In the set of 73,305 paired end sequenced BACs positioned on the rat genome assembly, 2% were assigned to different chromosomes by their fingerprint and BESs. However, the source of the discrepancy (incorrect BAC fingerprint or incorrect BES(s)) was not reported.
Table: Numbers of fingerprinted and end sequenced BACs. Row labels: included in BAC fingerprint map; in contigs in BAC fingerprint map; BAC end attempted; at least 1 BAC end sequence; at least 2 BAC end sequences; at least one BAC end sequence and fingerprint.
Here we undertake the three way comparison of BAC-based genomic datasets using the International Bovine BAC Mapping and Genome Sequencing Consortia datasets.
Table: Mapping bovine CHORI-240 BAC-end sequences to the bovine genome. Labels: BAC end sequences; correct and consistent; expected orientations and within size limit (tail-to-tail); conflicting orientations and/or outside size limit, same chromosome; two different chromosomes (breaks); only one BAC end placed on genome (unpaired).
We then examined the possibility that systematic errors may have occurred at some point during the sequencing of the BACs. The BACs sequenced at BCM have been allocated internal sequencing project codes, which are included in the GenBank entries for the BAC sequences. A number of examples of pairs and up to six potentially swapped BACs with consecutive, or close to adjacent, BCM sequencing project codes were identified, suggestive of swapping of BACs during sequencing. Detailed examination of one of the groups of apparently swapped BACs identified an inverted relationship between the apparently swapped BACs and the BCM project codes: i.e. CHORI-240 BAC 66L13 (BCM project code FDVO) appeared to be swapped with 66O17 (FDVV), 66L18 (FDVP) appeared to be swapped with 66O12 (FDVU) and 66L21 (FDVR) appeared to be swapped with 66M24 (FDVS). However, many of the potentially incorrect BACs did not appear to have a simple reciprocal relationship with a single other BAC, for example many of these BACs apparently contained DNA derived from two different BACs (data not shown). These relationships are potentially very complex and as many of the BAC sequences are still only draft sequences we have not attempted to resolve all of the discrepancies in this dataset.
Since only a small number of BACs have been sequenced, a much broader analysis was required to identify the range of potential issues with the integrity of the unsequenced BACs. To undertake this analysis all of the BES mapped positions on the bovine genome were converted to approximate locations on the BAC fingerprint based map as described in the methods. BACs with equivalent positions in both maps were called consistent, and BACs with apparently different positions in the two maps were called potentially inconsistent. There are many ways that BACs could be in the potentially inconsistent set, for example: errors in our classification pipeline, errors in the fingerprinting or BAC end sequencing, errors in the assembly of the BAC fingerprint map, or errors in the positioning of the BES(s) on the bovine genome assembly. The latter is clearly likely to be a significant problem for BACs with only one end positioned (unpaired), or with their ends positioned on two different chromosomes (breaks). Indeed, based on just one end sequence (unpaired), or testing both end sequences (breaks), BACs in both of these groups contained a significantly higher percentage of inconsistent BACs, 14–19%, than the rate of 5.3% in the BACs that map with the expected organisation. However, this still suggests that the locations of most unpaired BESs in the bovine genome assembly are indeed correct and that, for the majority of the BACs with BESs mapped to different chromosomes, one end is correct and the location of the BAC in the fingerprint map can be used to identify the correct position in the genome assembly.
However, inconsistent BACs may have arisen by either sporadic or systematic problems. In order to identify systematic issues, the distribution of the proportions of consistent BACs for the set of tail-to-tail BACs and the set of unpaired BACs were plotted against each other (Fig. 3B). Overall the two sets of values are highly correlated; for the majority of the plates almost all of the tail-to-tail BAC clones were consistent. However, there are clearly a number of plates which contain much higher numbers of potentially inconsistent BACs; for example, almost all BACs on plates 285 and 286 appeared to be potentially inconsistent (Fig. 3B). The plates lying outside the bulk of the data points were examined in more detail. To facilitate the analysis, and to identify if systematic errors were involved, the locations of the consistent and potentially inconsistent BACs were plotted on graphics of the plates; examples are shown in Fig. 3C. The results of the complete analyses are shown in Additional Files 1 (Table 1). In all cases errors during the BAC-end sequencing appear to have been the major cause for the conflicts between the BAC fingerprint map and BES positions. This was determined based on the category of the sequenced BACs which lay in a row or column of potentially inconsistent BACs on a plate with systematic errors. If these BACs were predominantly in the correct group (i.e. BESs matching full sequence) then it is likely that the discrepancy between the BES and the fingerprint was due to errors in the fingerprinting; conversely, if none of these BACs were in the "correct" group then it is most likely that the errors occurred during the BAC-end sequencing.
In summary, for this set of BACs both end sequences were derived from a different BAC from the one intended to be end sequenced, but each end of each BAC was sequenced only once.
During the calculations of the ratios described above we observed that nine plates of BACs did not contain any BACs in the tail-to-tail group. On seven of these plates (10, 345, 477, 478, 520, 521 and 522) only sequences from one end had been deposited in GenBank and therefore BACs could not be in the tail-to-tail set. However, BESs from both ends of the BACs on plates 446 and 447 had been deposited in GenBank; further analysis (comparing the locations of the BAC ends and the BACs in the BAC fingerprint based map) suggested that the TARBAC13P2 primed sequence reads have been swapped between these two plates (Additional Files 1: Table 1).
In summary, one BES from the pair of BESs that correspond to each BAC on plate 446 was swapped with one BES from each pair of BESs for each equivalently positioned BAC on plate 447.
In the above analysis we assumed that both BESs for potentially inconsistent BACs were "unique" in the library, barring the occasional use of exactly the same restriction enzyme site at the vector-clone junction. However, it is possible that single sets of BESs could have been determined twice and assigned to two different sets of BACs. The way in which the analysis was undertaken, in particular allowing either end of a BAC with end sequences mapping to two different chromosomes to be consistent with the BAC fingerprint map location, would have obscured these cases (Fig. 2). In order to address this issue the set of locations of CHORI-240 BESs mapped to the bovine assembly was scanned for overlapping BESs. In total, 17,577 pairs of BACs with at least one BES overlapping BESs from one or two other BACs were observed; in 2,628 of these BAC pairs both BACs were derived from the same CHORI-240 plate, about 70 times more than expected from a random distribution of overlapping BESs. Of the BACs in these 2,628 pairs, 3,367 were in the consistent set, and 1,218 of the BAC pairs from the same plate with at least one overlapping BES also had overlapping fingerprints.
In summary, in this set of BACs one end sequence was derived from a different BAC from the one intended to be end sequenced and one end of many of the BACs was sequenced twice or the data was erroneously deposited in the database twice.
Although the BAC fingerprint based map of the genome assembly displays details for a very large number of BAC clones the lengths of clones and the extent of the overlaps are not highly accurate. This can be illustrated by a comparison of the overlaps of pairs of clones based on the mapping of the BAC ends to the genome assembly (tail-to-tail clones only) and from the BAC fingerprint based map (Fig. 5C). Because the analysis uses only BAC clones for which both sets of data are available a much reduced number of BAC clones are included. Overall there is a trend with a correlation coefficient of ~0.51, but clearly the BAC-fingerprint overlap information is only indicative of the true extent of the overlap. The distribution of the BAC overlaps based on the genome assembly also shows a pronounced peak at 85–88% average overlap (Fig. 5D). Of the overlapping BACs derived from the same plate 56% had overlaps of 99% or more based on the positions of their BESs in the genome assembly.
To further investigate the apparent systematic relationship between pairs of overlapping clones observed on some plates based on the observation above and elsewhere in the analyses described here, we plotted the frequency of overlapping BACs in the same row one or two columns apart, in the same column, one or two rows apart and the corresponding diagonal relationship (Fig. 6B). This identified a number of other plates with apparently systematic relationships between overlapping clones. The nature of the patterns suggests that these relationships were generated during transfer between 384 and 96 plates or vice versa. However, in the majority of cases the relationships appeared to be random suggesting that in the cases of substantial overlap these were more likely to be due to multiple clone picks rather than contamination per se.
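The positional relationships counted above can be illustrated with a small sketch. It is not the authors' code: the well-label format (row letter followed by column number), the function names and the toy pairs are assumptions made for the example.

```python
from collections import Counter


def parse_well(well):
    """Split a well label such as 'L13' into zero-based (row, column)."""
    row = ord(well[0].upper()) - ord("A")
    col = int(well[1:]) - 1
    return row, col


def relationship(well_a, well_b):
    """Classify two wells on one plate by their relative position."""
    ra, ca = parse_well(well_a)
    rb, cb = parse_well(well_b)
    drow, dcol = abs(ra - rb), abs(ca - cb)
    if drow == 0 and dcol in (1, 2):
        return "same row"       # one or two columns apart
    if dcol == 0 and drow in (1, 2):
        return "same column"    # one or two rows apart
    if drow == dcol and drow in (1, 2):
        return "diagonal"       # the corresponding diagonal offset
    return "other"


# Hypothetical same-plate pairs of overlapping BACs, given by well label only.
toy_pairs = [("A1", "A2"), ("A1", "A3"), ("C5", "E5"), ("B2", "D4"), ("A1", "P24")]
counts = Counter(relationship(a, b) for a, b in toy_pairs)
print(counts)
```

A plate on which one category dominates would be a candidate for a systematic transfer problem, whereas a spread dominated by "other" is more in keeping with random duplicate picks.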
Another interesting, but very infrequent pattern, that was observed by linking BACs with overlapping BAC-end sequences on the plate was star-like (Fig. 6C). Although only a small proportion of the BACs on plates were affected, both end sequences of one of the BACs involved overlapped with the equivalent end sequences of the other BAC. In almost all cases BAC fingerprint data was not available for both BACs involved, so the point at which the apparent contamination occurred cannot be determined. The nature of the relationships appears to indicate a 180 degree rotation of the original plate, or copies of the plate, or perhaps a lid, that ultimately affected the BAC-end sequences from only a small number of the wells.
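A 180 degree rotation maps each well to its point-symmetric partner, which is what produces a star-like pattern when the affected pairs are joined. The sketch below assumes a standard 384-well layout (16 rows A to P, 24 columns) by default; the text does not state which plate format was involved, so the dimensions are parameters and the function name is hypothetical.

```python
def rotate_180(well, n_rows=16, n_cols=24):
    """Return the well that `well` lands on after a 180 degree plate rotation."""
    row = ord(well[0].upper()) - ord("A")
    col = int(well[1:]) - 1
    new_row = n_rows - 1 - row
    new_col = n_cols - 1 - col
    return f"{chr(ord('A') + new_row)}{new_col + 1}"


print(rotate_180("A1"))                       # P24 on a 384-well plate
print(rotate_180("A1", n_rows=8, n_cols=12))  # H12 on a 96-well plate
```

Pairs of BACs whose wells satisfy `rotate_180(a) == b` would all cross at the plate centre, giving the star-like appearance described.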
In summary, the majority of overlapping BACs located on the same plate in the BAC library are likely to be two examples of exactly the same BAC clone, perhaps arising from replication of clones prior to plating or picking of the same clone twice during the transfer to the clones.
Since it will never be possible to identify all of the issues, a list of all BACs with questionable identities will never be complete. However, using the approach above we can define a set of bovine BACs with a high probability of being reliably recovered from copies of the CHORI-240 library using a few rules:
Avoid using BACs that do not have both a fingerprint and at least one BES available (where both methods have been used) - check that the fingerprint and genome positions are equivalent
Do not use more than one of a set of BACs with substantial overlaps in the BAC fingerprint based map, in the genome assembly, or BESs, where the BACs are derived from the same library plate, as it is likely that the BACs are identical
Do not use BACs from library plates with systematic issues
Never use BACs with conflicts between the position in the fingerprint-based map and genome assemblies, and/or DNA sequence if available
These rules are based on the assumption that at least the fingerprinting and end sequencing of the BACs has been undertaken directly from the original library plates, or replicates that maintain the original relationships.
The most preferable BACs are those where the DNA sequence and BES-based positions on the genome assembly and the fingerprint-based map positions are all consistent
If no sequenced BACs are available for a region of interest only use BACs where the BES (tail-to-tail) and fingerprint positions are available and are consistent.
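As an illustration only, the rules above could be applied to a table of per-BAC annotations along the following lines. The record fields, the `duplicate_group` label for same-plate BACs with substantial overlap, and the toy records are all hypothetical; they stand in for whatever annotations a given project has available.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BacRecord:
    name: str
    has_fingerprint: bool
    n_bes: int                   # number of BESs placed on the genome
    positions_consistent: bool   # fingerprint map vs genome (and sequence, if any)
    problem_plate: bool          # plate flagged with systematic issues
    duplicate_group: Optional[str] = None  # same-plate, highly overlapping set


def select_high_integrity(records):
    """Return names of BACs that pass every rule, one per duplicate group."""
    kept = []
    used_groups = set()
    for rec in records:
        if not rec.has_fingerprint or rec.n_bes < 1:
            continue  # needs both a fingerprint and at least one BES
        if not rec.positions_consistent:
            continue  # conflicting map, assembly or sequence positions
        if rec.problem_plate:
            continue  # plate with systematic issues
        if rec.duplicate_group is not None:
            if rec.duplicate_group in used_groups:
                continue  # keep only one BAC per likely-identical set
            used_groups.add(rec.duplicate_group)
        kept.append(rec.name)
    return kept


records = [
    BacRecord("bac1", True, 2, True, False),
    BacRecord("bac2", True, 0, True, False),
    BacRecord("bac3", True, 2, False, False),
    BacRecord("bac4", True, 1, True, True),
    BacRecord("bac5", True, 2, True, False, "dupA"),
    BacRecord("bac6", True, 2, True, False, "dupA"),
]
selected = select_high_integrity(records)
print(selected)  # ['bac1', 'bac5']
```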
During fingerprint map construction BAC clones with extremes of insert size and number of restriction fragments were excluded from the analysis to screen out potentially mixed wells; thus clones incorporated into the map are unlikely to be heavily contaminated. Therefore, BACs excluded from the final fingerprint map should not be used for further characterisation, even when they have BES data.
It has been reported that the draft Btau3.1 assembly of the bovine genome has a number of problems and could be substantially improved [21, 31]. This leads to the question: have issues with the identities of the BACs contributed to errors in the assembly and/or to the perception of the extent of the potential errors in the assembly? Of the 12,671 CHORI-240 library BAC clones with sequence data deposited in GenBank and used in this study, 413 were from wells identified as likely to contain BACs with one or more end-sequence that was not derived from the same BAC as the fingerprint. Since these appear to have resulted from systematic errors during the BAC-end sequencing, the identities of the BACs sequenced during the genome sequencing project should be correct. Since the assembly of the genome is based on the sequence and not the identity of the BACs per se, these will not have impacted on the assembly of the genome sequence. Only sixteen BACs with one apparently correct and one apparently incorrect end sequence were included in the genome sequencing. A much larger number of such BACs, contributing incorrect mate pair information, were included in the set of BAC end sequences in the initial pool of sequences used in the assembly. Since multiple consistent mate pair links are required to link contigs within scaffolds during genome assembly, it is unlikely that this relatively small set will have significantly contributed to assembly problems; rather, it will just have added to the background noise. Due to the relatively small numbers of BACs involved it is also likely that these had little impact on the comparison of the genome sequence and BAC fingerprint based map. However, our analysis suggests that only a single BAC mate pair link should be allowed from any one plate in the BAC library, to avoid the high rate of apparently the same clone on the same plate contributing to the generation of incorrect links.
A detailed analysis of the large number of consistent non-tail-to-tail BACs has demonstrated that most of these resulted from the limitations of the draft assembly of the bovine genome used in the analysis, Btau3.1. During the course of this work a revised assembly of the bovine genome, Btau4.0, was released, which has resolved many of these limitations. However, our analysis methodology was designed to be robust, allowing for assembly errors by using non-tail-to-tail BACs, requiring the position of only one BES to be consistent with other data sets and allowing 500 kb windows for matches. Preliminary investigations indicate that the new bovine assembly has had no impact on the identification of the problematic BES plates. Small numbers of BACs currently identified as incorrect may now be classified as correct using Btau4.0; however, BACs previously identified as correct and consistent remain correct or consistent.
There are many places where errors can arise in large genome sequencing projects. However, as we have outlined, it is also possible to identify problems and correct errors. Overall, the error rate for bovine BACs with both fingerprints and end sequences appears to be less than 5%, consistent with expectations. By applying the rules described here the risk of inadvertently characterising an incorrect BAC clone can be reduced to virtually zero. The rules and methods described here are applicable to all such datasets and analyses.
The set of bovine BAC-end sequences downloaded from GenBank was filtered as described previously and aligned to the bovine genome, build Btau3.1, using MegaBLAST with the following parameters: -F "m D" -U T -D 2 -m 8. The bovine genome sequence assemblies were obtained from the UCSC Genome Bioinformatics site [32, 33]. The BESs were grouped into tail-to-tail, tail-to-head etc. as previously described. The BAC sequences were aligned to the bovine genome using MegaBLAST with the parameters: -D 3 -W 32 -F m -U T -e 1e-100. A subset of BESs was also mapped to the available BAC sequences using MegaBLAST with parameters -D 3 -W 32 -F m -U T -e 1e-100.
The coordinates from the mapping of the BESs and BACs to the bovine genome were compared. Sequenced BACs with BAC ends that mapped within 250 kb either side of the region of the bovine genome containing the corresponding BAC sequence were called correct. Sequenced BACs with BAC ends that did not meet this criterion were called potentially incorrect, and the process below was used to determine whether they were consistent or potentially inconsistent. Reciprocal swaps were identified by comparing the locations of the BESs from the potentially incorrect BACs, allowing 250 kb either side, with the locations of the BAC sequences for the same set of potentially incorrect BACs. If at least 1 BES of BAC A mapped to a location within 250 kb of the location of BAC B (other than itself), and at least 1 BES from BAC B mapped to within 250 kb of the position of BAC A, then BACs A and B were called potentially reciprocally swapped BACs.
The BAC fingerprint map data (version May 1, 2006) was downloaded from BCGSC. Consistent BACs were identified by taking the BES positions for the potentially incorrect BACs from the previous step and finding all sequenced BACs and BACs with BESs positioned tail-to-tail that overlapped either BES (with 250 kb leeway) of each potentially incorrect BAC. The BAC fingerprint map positions of these overlapping BACs were then determined. Where more than one overlapping BAC came from the same fingerprint map contig, the fingerprint map location was taken as the minimum start and maximum end coordinates for the set of BACs. These fingerprint map positions were used to determine which other BACs overlapped with them in the fingerprint map. If any member of the final list of overlapping BACs in the fingerprint map was the same as the initial potentially incorrect BAC, the initial BAC was called consistent. However, if none of the overlapping BACs was the same as the initial BAC, the BES and fingerprint positions for the initial BAC were not consistent, and we called these BACs putative non-consistent.
The size and number of the overlaps between all of the CHORI-240 BESs mapped to the bovine genome were calculated using Perl scripts. Each of the positioned BES locations was compared to all of the other BES locations, and cases where there was overlap between BESs from 2 different BACs were recorded. The results for BACs with overlapping BESs from the same plate were determined by comparing the plate number embedded in the BAC clone ids.
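The 250 kb window test and the reciprocal-swap rule can be sketched as below. This is a minimal illustration rather than the pipeline used in the study: the `Hit` record, the function names and the toy coordinates are assumptions, and the sketch calls a BAC correct if at least one of its BESs falls in the window, which is one possible reading of the criterion.

```python
from collections import namedtuple

# Hypothetical record for one placement on the genome assembly.
Hit = namedtuple("Hit", "chrom start end")

WINDOW = 250_000  # 250 kb either side, as stated in the text


def near(bes, bac_seq, window=WINDOW):
    """True if a BES lies within `window` bp either side of a BAC sequence hit."""
    return (bes.chrom == bac_seq.chrom
            and bes.end >= bac_seq.start - window
            and bes.start <= bac_seq.end + window)


def classify(bes_hits, seq_hits):
    """Label each sequenced BAC 'correct' or 'potentially incorrect'.

    Assumption: one BES inside the window is enough to call the BAC correct.
    """
    labels = {}
    for bac, seq in seq_hits.items():
        ok = any(near(b, seq) for b in bes_hits.get(bac, []))
        labels[bac] = "correct" if ok else "potentially incorrect"
    return labels


def reciprocal_swaps(bes_hits, seq_hits, labels):
    """Pairs (A, B) where a BES of A maps near B's sequence and vice versa."""
    suspects = sorted(b for b, lab in labels.items() if lab == "potentially incorrect")
    swaps = []
    for i, a in enumerate(suspects):
        for b in suspects[i + 1:]:
            a_near_b = any(near(h, seq_hits[b]) for h in bes_hits.get(a, []))
            b_near_a = any(near(h, seq_hits[a]) for h in bes_hits.get(b, []))
            if a_near_b and b_near_a:
                swaps.append((a, b))
    return swaps


# Toy data: A and B carry each other's end sequences, C is fine.
seq_hits = {
    "A": Hit("chr1", 1_000_000, 1_150_000),
    "B": Hit("chr5", 2_000_000, 2_160_000),
    "C": Hit("chr2", 500_000, 650_000),
}
bes_hits = {
    "A": [Hit("chr5", 2_000_100, 2_000_700)],
    "B": [Hit("chr1", 1_149_000, 1_149_600)],
    "C": [Hit("chr2", 500_050, 500_700)],
}
labels = classify(bes_hits, seq_hits)
swaps = reciprocal_swaps(bes_hits, seq_hits, labels)
print(labels)
print(swaps)
```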
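The consistency test can be pictured with the following sketch. It is a simplified reading of the procedure, not the original scripts: the record types, names and toy coordinates are hypothetical, the query BAC is excluded from its own anchor set (an assumption made so that a BAC cannot confirm itself), and the final membership test is expressed as "the query overlaps the merged anchor region in the same fingerprint contig".

```python
from collections import namedtuple

GenomeHit = namedtuple("GenomeHit", "chrom start end")  # placement on the assembly
MapPos = namedtuple("MapPos", "contig start end")       # placement in the fingerprint map

WINDOW = 250_000  # leeway around each BES, as in the text


def overlaps(a_start, a_end, b_start, b_end, pad=0):
    """Closed-interval overlap test with optional padding."""
    return a_start <= b_end + pad and b_start <= a_end + pad


def is_consistent(query, query_bes, anchors, fp_map):
    """True if the fingerprint-map position of `query` agrees with its BES position(s).

    anchors: {bac: GenomeHit} for sequenced or tail-to-tail BACs.
    fp_map:  {bac: MapPos} for BACs in the fingerprint map.
    """
    # 1. Anchor BACs whose genome span lies within 250 kb of any query BES.
    nearby = [
        bac for bac, span in anchors.items()
        if bac != query and any(
            b.chrom == span.chrom
            and overlaps(b.start, b.end, span.start, span.end, WINDOW)
            for b in query_bes)
    ]
    # 2. Merge their fingerprint-map positions per contig (min start, max end).
    regions = {}
    for bac in nearby:
        pos = fp_map.get(bac)
        if pos is None:
            continue
        lo, hi = regions.get(pos.contig, (pos.start, pos.end))
        regions[pos.contig] = (min(lo, pos.start), max(hi, pos.end))
    # 3. The query is consistent if it overlaps one of these regions in the map.
    q = fp_map.get(query)
    if q is None or q.contig not in regions:
        return False
    lo, hi = regions[q.contig]
    return overlaps(q.start, q.end, lo, hi)


anchors = {
    "N1": GenomeHit("chr3", 900_000, 1_050_000),
    "N2": GenomeHit("chr3", 1_100_000, 1_260_000),
}
fp_map = {
    "N1": MapPos("ctg12", 100, 140),
    "N2": MapPos("ctg12", 130, 175),
    "Q1": MapPos("ctg12", 150, 190),
    "Q2": MapPos("ctg40", 10, 50),
}
bes = [GenomeHit("chr3", 1_000_000, 1_000_600)]
print(is_consistent("Q1", bes, anchors, fp_map))  # True
print(is_consistent("Q2", bes, anchors, fp_map))  # False
```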
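The original analysis used Perl scripts that are not reproduced here; the Python sketch below only illustrates the idea. The clone id pattern (plate number, row letter, column number, as in names such as 66L13), the tuple layout and the toy placements are assumptions.

```python
import re

# Assumed clone id layout: plate number, row letter (A-P), column number.
CLONE_RE = re.compile(r"(\d+)([A-P])(\d+)$")


def plate_of(clone_id):
    """Extract the plate number embedded in a clone id."""
    m = CLONE_RE.search(clone_id)
    if m is None:
        raise ValueError(f"unrecognised clone id: {clone_id!r}")
    return int(m.group(1))


def overlapping_bac_pairs(bes):
    """Return BAC pairs with at least one overlapping BES.

    bes: iterable of (bac, chrom, start, end) with closed coordinates.
    """
    by_chrom = {}
    for bac, chrom, start, end in bes:
        by_chrom.setdefault(chrom, []).append((start, end, bac))
    pairs = set()
    for hits in by_chrom.values():
        hits.sort()
        active = []  # earlier hits that may still overlap later ones
        for start, end, bac in hits:
            active = [h for h in active if h[1] >= start]
            for _, _, other in active:
                if other != bac:
                    pairs.add(tuple(sorted((bac, other))))
            active.append((start, end, bac))
    return pairs


# Hypothetical BES placements.
toy_bes = [
    ("12A01", "chr1", 100, 700),
    ("12C07", "chr1", 650, 1200),
    ("305H11", "chr1", 1150, 1700),
    ("12A01", "chr2", 50, 600),
    ("200B05", "chr2", 5000, 5600),
]
pairs = overlapping_bac_pairs(toy_bes)
same_plate = {p for p in pairs if plate_of(p[0]) == plate_of(p[1])}
print(sorted(pairs))
print(sorted(same_plate))
```

Sorting the hits by start coordinate keeps the comparison close to linear in practice, whereas a literal all-against-all comparison grows quadratically with the number of BESs.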
The number and the average percent of overlap between all pairs of the CHORI-240 BACs in the bovine BAC fingerprint map were calculated from the coordinates in the BAC contig file using Perl scripts. For each pair of overlapping BACs the average percent overlap was calculated from the individual percent overlaps calculated for each of the two BACs. The same scripts were used to calculate the set of average percent overlaps derived from the bovine genome assembly (Btau4.0). A filtered set of overlaps between BACs derived from the same plate in the BAC library was also generated by comparing the plate number embedded in the BAC clone ids. To calculate the expected frequency of overlapping BACs on the same plate the BAC-names in the full dataset were randomised independently 10 times and the background frequency of within plate overlaps calculated by comparing the plate number embedded in the BAC clone ids.
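The average percent overlap and the randomisation used to estimate the background can be sketched as follows. Again this is an illustration under stated assumptions rather than the original Perl code: intervals are taken as (start, end) pairs on the same contig or chromosome, plate membership is passed in as a dictionary instead of being parsed from clone ids, and the fixed seed is only there to make the example repeatable.

```python
import random


def average_percent_overlap(a, b):
    """Mean of the two per-BAC percent overlaps for intervals a and b."""
    shared = min(a[1], b[1]) - max(a[0], b[0])
    if shared <= 0:
        return 0.0
    pct_a = 100.0 * shared / (a[1] - a[0])
    pct_b = 100.0 * shared / (b[1] - b[0])
    return (pct_a + pct_b) / 2.0


def expected_same_plate(pairs, plates, rounds=10, seed=0):
    """Mean number of same-plate pairs after shuffling BAC names.

    pairs:  list of (bac_a, bac_b) overlapping pairs.
    plates: {bac: plate number} for every BAC in the dataset.
    """
    rng = random.Random(seed)
    names = sorted(plates)
    total = 0
    for _ in range(rounds):
        shuffled = names[:]
        rng.shuffle(shuffled)
        relabel = dict(zip(names, shuffled))  # random reassignment of names
        total += sum(plates[relabel[a]] == plates[relabel[b]] for a, b in pairs)
    return total / rounds


print(average_percent_overlap((0, 100), (50, 250)))  # 37.5
```

The observed count of same-plate overlapping pairs divided by this background gives the fold enrichment.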
The high integrity BACs are displayed on the Btau4.0 genome browser along with the mapping of the bovine BAC-ends to the bovine genome assembly. A list of high integrity BACs and the images of the BAC-end sequence overlaps and BAC fingerprint overlaps are also available from the livestock genomics website.
BAC: Bacterial Artificial Chromosome
BES: BAC end sequence(s)
BCM: Baylor College of Medicine.
The authors would like to thank the relevant researchers from the member organisations of the IBBMC (AgResearch, BCGSC, CHORI, CSIRO, EMBRAPA, Roslin, TAMU, TIGR, UIUC and USDA) for generating and providing public access to fingerprinting and BAC end sequencing data. The authors also gratefully acknowledge the early pre-publication access under the Fort Lauderdale conventions to the draft bovine genome sequences provided by the Baylor College of Medicine Human Genome Sequencing Center and the Bovine Genome Sequencing Project Consortium. This work was partly funded by the USDA-NRI, the CRC for Innovative Dairy Products, the CRC for Beef Genetic Technologies and Sheep Genomics (a joint venture of Meat and Livestock Australia and Australian Wool Innovation).
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.