Numerous whole genome sequence projects have been completed or are in progress, spanning a wide range of species among different orders. The genome sequences are providing novel insights into evolution and gene regulation that would have been impossible without these large-scale sequencing efforts. While a variety of sequencing strategies have been applied, the most common currently in use and the strategy chosen for the bovine genome relies mainly on whole genome shotgun (WGS) sequencing and assembly of the sequencing reads based on sequence similarity overlap. The bovine assembly will be supplemented by a much lower coverage of sequence from large-insert clones (Bacterial Artificial Chromosome, BAC) to provide connections between non-overlapping sequence contigs that represent chromosomal locations in close proximity to one another. A more comprehensive build of the genome sequence adds information from physical and genetic maps to WGS and BAC sequence to order contigs on a larger scale. An intermediate level of resolution and a critical check on the accuracy of the other methods can be provided by determining if the proper orientation, order, and spacing of exons in known expressed genes are maintained in the build. This approach requires knowledge of expressed transcript sequence to compare to the genome build.
Another use of transcript sequence is in annotation, a key to the utility of whole genome sequencing. Previous full-length cDNA sequencing projects have established the importance of experimentally derived mRNA sequences to produce gene models that establish accurate exon-intron boundaries [1–5]. These projects provided vital information about alternate splice forms of gene products that generate variation in form and function thought to be a key contributor to diversity in expression and phenotype. FLIC sequences also assisted in discriminating between alternative splicing and gene duplication or pseudogenes, a procedure that is difficult and error prone if based solely on clustered EST sequences.
The other main use of FLIC sequences has been generation of predicted protein sequence, providing a resource to support proteomic approaches and comparative analysis to reveal details of protein function. This goal requires accurate reconstruction of CDS portions of the bona fide transcripts expressed in the target tissues, which may be problematic with clustered EST as mentioned above.
The present effort was undertaken to support all of the potential uses of bFLIC data. The International Bovine Genome Sequencing Consortium  led by Baylor College of Medicine recently released the second, 6-fold coverage genome assembly (Worley, K. personal communication). Refinement of the assembly will be facilitated by incorporating bFLICs in the gene modeling and assembly process, similar to their utility in the assembly of genomes of other organisms. The bFLICs will also support efforts at NCBI and ENSEMBL to derive accurate gene models, and derive predicted protein sequence databases. In this sense, the present study is similar to previous full-length cDNA projects carried out for humans , mice , and other species [5, 7]. However, a different approach was used to generate the data than in previously described efforts, as the first step of this project employed sequencing of pooled-tissue, normalized libraries [8, 9] that had not been constructed by procedures to enrich for full-length clones, since such procedures could potentially introduce bias that would decrease the diversity of observed mRNA. Moreover, a primary goal of the project was to develop a method to consistently select full-CDS clones from these libraries based on comparison of the single-pass, 5' end sequences to the human Reference Sequence  (RefSeq) mRNA database.
This report characterizes the sequences of bovine full-CDS clones selected with a method using 5' end EST sequence data as input. This method efficiently identified apparent bovine homologs of human RefSeq mRNA sequences, collected the full insert sequence, and annotated the resulting bFLICs with GeneIDs, product, repetitive elements, and predicted protein sequences. The method described should be particularly useful for generating full-CDS and predicted protein sequences for organisms with mature databases of sequence from other species in the order (e.g. other mammals) but not included in complete genome sequence projects. The success of the method was characterized by comparison of the bFLIC sequences to human Refseq mRNA and mammalian UTRdb, . Because the investigation was initiated prior to release of the assembled bovine genome, direct comparison between bovine genomic and bFLIC sequence was problematic.
Without available genomic or full-CDS cDNA sequence, it is common practice to rely on gene clusters such as Unigene  or TIGR Gene Indices [8, 9, 13, 14] for transcript predictions. These computational derived consensus assemblies containing open reading frames (ORFs) are generated from single pass reads through cDNA libraries. These clusters provide a very important resource for putative gene models and products. The TIGR Bos taurus Gene Index (BtGI) was compared to bovine full-CDS sequences to confirm the existence of experimentally determined transcripts in the computed clusters. This characterization of gene clusters to full-CDS sequences may assist investigators to interpret the significance of their searches against gene cluster databases.