Characterization of 954 bovine full-CDS cDNA sequences
© Harhay et al; licensee BioMed Central Ltd. 2005
Received: 04 July 2005
Accepted: 23 November 2005
Published: 23 November 2005
Genome assemblies rely on the existence of transcript sequence to stitch together contigs, verify assembly of whole genome shotgun reads, and annotate genes. Functional genomics studies also rely on transcript sequence to create expression microarrays or interpret digital tag data produced by methods such as Serial Analysis of Gene Expression (SAGE). Transcript sequence can be predicted based on reconstruction from overlapping expressed sequence tags (EST) that are obtained by single-pass sequencing of random cDNA clones, but these reconstructions are prone to errors caused by alternative splice forms, transcripts from gene families with related sequences, and expressed pseudogenes. These errors confound genome assembly and annotation. The most useful transcript sequences are derived by complete insert sequencing of clones containing the entire length, or at least the full protein coding sequence (CDS) portion, of the source mRNA. While the bovine genome sequencing initiative is nearing completion, there is currently a paucity of bovine full-CDS mRNA and protein sequence data to support bovine genome assembly and functional genomics studies. Consequently, the production of high-quality bovine full-CDS cDNA sequences will enhance the bovine genome assembly and functional studies of bovine genes and gene products. The goal of this investigation was to identify and characterize the full-CDS sequences of bovine transcripts from clones identified in non-full-length enriched cDNA libraries. In contrast to several recent full-length cDNA investigations, these full-CDS cDNAs were selected, sequenced, and annotated without the benefit of the target organism's genomic sequence, by using comparison of bovine EST sequence to existing human mRNA to identify likely full-CDS clones for full-length insert cDNA (FLIC) sequencing.
The predicted bovine protein lengths, 5' UTR lengths, and Kozak consensus sequences from 954 b ovine FLIC sequences (bFLICs; average length 1713 nt, representing 762 distinct loci) are all consistent with previously sequenced mammalian full-length transcripts.
In most cases, the bFLICs span the entire CDS of the genes, providing the basis for creating predicted bovine protein sequences to support proteomics and comparative evolutionary research as well as functional genomics and genome annotation. The results demonstrate the utility of the comparative approach in obtaining predicted protein sequences in other species.
Numerous whole genome sequence projects have been completed or are in progress, spanning a wide range of species among different orders. The genome sequences are providing novel insights into evolution and gene regulation that would have been impossible without these large-scale sequencing efforts. While a variety of sequencing strategies have been applied, the most common currently in use and the strategy chosen for the bovine genome relies mainly on whole genome shotgun (WGS) sequencing and assembly of the sequencing reads based on sequence similarity overlap. The bovine assembly will be supplemented by a much lower coverage of sequence from large-insert clones (Bacterial Artificial Chromosome, BAC) to provide connections between non-overlapping sequence contigs that represent chromosomal locations in close proximity to one another. A more comprehensive build of the genome sequence adds information from physical and genetic maps to WGS and BAC sequence to order contigs on a larger scale. An intermediate level of resolution and a critical check on the accuracy of the other methods can be provided by determining if the proper orientation, order, and spacing of exons in known expressed genes are maintained in the build. This approach requires knowledge of expressed transcript sequence to compare to the genome build.
Another use of transcript sequence is in annotation, a key to the utility of whole genome sequencing. Previous full-length cDNA sequencing projects have established the importance of experimentally derived mRNA sequences to produce gene models that establish accurate exon-intron boundaries [1–5]. These projects provided vital information about alternate splice forms of gene products that generate variation in form and function thought to be a key contributor to diversity in expression and phenotype. FLIC sequences also assisted in discriminating between alternative splicing and gene duplication or pseudogenes, a procedure that is difficult and error prone if based solely on clustered EST sequences.
The other main use of FLIC sequences has been generation of predicted protein sequence, providing a resource to support proteomic approaches and comparative analysis to reveal details of protein function. This goal requires accurate reconstruction of CDS portions of the bona fide transcripts expressed in the target tissues, which may be problematic with clustered EST as mentioned above.
The present effort was undertaken to support all of the potential uses of bFLIC data. The International Bovine Genome Sequencing Consortium  led by Baylor College of Medicine recently released the second, 6-fold coverage genome assembly (Worley, K. personal communication). Refinement of the assembly will be facilitated by incorporating bFLICs in the gene modeling and assembly process, similar to their utility in the assembly of genomes of other organisms. The bFLICs will also support efforts at NCBI and ENSEMBL to derive accurate gene models, and derive predicted protein sequence databases. In this sense, the present study is similar to previous full-length cDNA projects carried out for humans , mice , and other species [5, 7]. However, a different approach was used to generate the data than in previously described efforts, as the first step of this project employed sequencing of pooled-tissue, normalized libraries [8, 9] that had not been constructed by procedures to enrich for full-length clones, since such procedures could potentially introduce bias that would decrease the diversity of observed mRNA. Moreover, a primary goal of the project was to develop a method to consistently select full-CDS clones from these libraries based on comparison of the single-pass, 5' end sequences to the human Reference Sequence  (RefSeq) mRNA database.
This report characterizes the sequences of bovine full-CDS clones selected with a method using 5' end EST sequence data as input. This method efficiently identified apparent bovine homologs of human RefSeq mRNA sequences, collected the full insert sequence, and annotated the resulting bFLICs with GeneIDs, product, repetitive elements, and predicted protein sequences. The method described should be particularly useful for generating full-CDS and predicted protein sequences for organisms with mature databases of sequence from other species in the order (e.g. other mammals) but not included in complete genome sequence projects. The success of the method was characterized by comparison of the bFLIC sequences to human Refseq mRNA and mammalian UTRdb, . Because the investigation was initiated prior to release of the assembled bovine genome, direct comparison between bovine genomic and bFLIC sequence was problematic.
Without available genomic or full-CDS cDNA sequence, it is common practice to rely on gene clusters such as Unigene  or TIGR Gene Indices [8, 9, 13, 14] for transcript predictions. These computational derived consensus assemblies containing open reading frames (ORFs) are generated from single pass reads through cDNA libraries. These clusters provide a very important resource for putative gene models and products. The TIGR Bos taurus Gene Index (Bt GI) was compared to bovine full-CDS sequences to confirm the existence of experimentally determined transcripts in the computed clusters. This characterization of gene clusters to full-CDS sequences may assist investigators to interpret the significance of their searches against gene cluster databases.
Results and Discussion
Strategy for bovine full-CDS selection and sequencing
The majority of clones were selected to represent unique loci as defined by human GeneID, and in cases where multiple EST clones were available for a given GeneID the clone with the longest predicted clone length was chosen. Additional criteria were also used relative to the predicted length of insert based on human cDNA length, in order to avoid clones of relatively short insert length. Specifically, clones were selected in size categories between 1,000 and 5,000 bp. A minority of clones were then chosen that were redundant to previously targeted GeneID to ascertain the impact of alternative splicing on EST cluster-based sequence databases. This clone selection yielded full-CDS bFLICs cDNAs with 80% efficiency, which was limited in part by the method of library construction that incorporated a digestion with restriction enzyme NotI following second-strand cDNA synthesis to generate a compatible cloning site on the 3' end of the cDNA [8, 9]. Of the 20% failures, 45% are due to NotI sites within the transcript sequence that caused premature termination of the cDNA representations of the transcripts. This is a much higher rate than anticipated based on the average occurrence of NotI sites in genomic DNA and probably reflects a higher percentage of cytidine (C) and guanosine (G) in mRNA sequence (the recognition site for NotI is GCGGCCGC). Hopefully, recent advances in cDNA library production that avoid this type of difficulty will reduce the incidence of truncated clones in future efforts.
Putative full-CDS FLICs selected were sequenced with a "primer walking" procedure in which each sequence read was used to design a primer to extend sequence in the 3' direction. The reads were assembled into contigs, screened for polyA tail and vector, and compared to the human RefSeq transcripts after every walk. Once the 3' end of the insert was encountered (polyA tail or vector), the contig was manually checked for low quality base calls; 5' and 3' finishing primers were used to improve these low quality regions before they subjected to annotation. For each bFLIC, the translated longest ORF (putative protein coding sequence) of the bFLIC was compared to the RefSeq protein database using BLASTP. The bovine protein-human protein comparison served as consistency check with respect to the annotators' association of the bFLIC to human RefSeq. The bFLIC nucleotide sequence comparison to human RefSeq protein sequence (BLASTX) exposed potentially artificial frameshifts/insertion/deletions if present. Only when there was agreement between the annotators' annotation and the computational comparisons were the bFLICs submitted to GenBank.
Summary and length distributions of the bFLICs
Summary of full-CDS bFLICs
Number bFLICs submitted
Number unique loci
Average length (nt)
Success rate (number full-CDS Sequence/number clones sequenced)
Number Bt full-CDS bFLICs used as source clones for GenBank Bt gene models (Entrez Gene)
Comparison of bFLICs to human RefSeq mRNA and protein
Comparison of bFLICs 5' UTR to mammalian 5' UTR – verifying CDS start statistics
Comparison of bFLICs to mammalian Kozak consensus sequences
Comparison of bFLICs to TIGR BtGI
The sequences from all EST libraries used for this study have been previously incorporated into the TIGR Bt GI. This presented an opportunity to verify the TCs (Tentative Consensus sequences) constructed with single pass reads of source clones by comparing them to contigs built from multi-pass full-length sequencing of the same source clones.
The TCs of TIGR Bt GI (Release 11, September 28, 2004) were compared to the full-CDS bFLICs using BLAT. A threshold of 300 or more identities, 1/2 the size of our shortest bFLIC, was chosen to minimize short matches. After the identities threshold was applied, a total of 1346 distinct TCs were found to be similar to 933 of the original 954 bFLICs. If only bFLICs that are members of TCs were considered, 1250 TCs were found to be similar to 855 distinct bFLICs. If there was a further constraint that only matches between a (query) TC and it's (subject) member source clones be considered, then 740 distinct TCs were found to be similar to their source member bFLICs. In the latter analysis, 1 TC can match multiple bFLICs, but not vice versa. This number is quite close to the 762, the number of distinct loci associated with our 954 bFLICs generated in the annotation pipeline. 92 full-CDS bFLICs are not members of a TC.
The analysis of the BLAT similarities between the TIGR Bt GI and bFLICS is complicated by the fact that because multiple TCs can represent a single locus by virtue of alternative splice forms, mis-assembly, or other aspects of shared gene structure, a single bFLIC may be similar to multiple TCs besides its parent TC. Accordingly, the BLAT analysis was segregated into two groups. The first group (A) was the comparison of the bFLICs to all 40,810 TCs, where in general, and given our BLAT threshold, a bFLIC will be similar to more than 1 TC. This comparison results in BLAT hits to 1346 TCs. The second group (B) was a comparison of 855 full-CDS bFLICs to only those TCs that the bFLICs are members of, a smaller set (740) of TCs than the first group. Group B TCs represent the minimum number of TCs that span the "transcription potential" of 855 bFLICs.
BLAT Results: TIGR Bt GI TCs vs. full-CDS bFLICs with identities > = 300
No bFLIC TC membership requirement
bFLIC required to be member in query TC
933 total bFLICs in all alignments
855 total bFLICs in all alignments
Fractional coverage of bFLICs by any single TC
Number of bFLICS
Average Contig Length
Number of bFLICS
Average Contig Length
> = .95
Single pass 5' and 3' reads for 169 full-CDS bFLICs were previously incorporated into the TIGR BtGI. The 5' and 3' single pass reads for 94 (56%) were assembled into the same TCs, while 75 (44%) single pass end reads were placed in different TCs. Using the admittedly limited dataset of 169 bFLICS, it is observed that about 1/2 of the TCs were self-consistently constituted from their source clone sequences. It is likely that the TCs not self-consistently constituted were assembled without adequate data linking the two ends from the single source clone.
A total of 195,443 5' end sequence reads from the 1BOV, 2BOV, 3BOV, 4BOV, and 5 BOV [8, 9] cDNA libraries were masked for repeats with RepeatMasker and compared to human RefSeq mRNA using BLAST  yielding 146,741 distinct bovine clones with BLAST hits, 116,911 of which had 300 or more bases with phred quality score greater than or equal to 20. The bovine cDNAs were associated with the human RefSeq mRNAs with the highest bit score, and through the RefSeqs, the bovine cDNAs were associated with human GeneIDs. Based on the clone sequence similarity to the beginning of CDS of human RefSeq mRNAs, 9,989 potential full-CDS clones were identified, associated with 3,482 distinct human loci. Predicted clone length was defined by the sum of the length of the bovine clone upstream of the beginning of the BLAST match plus the length of the human RefSeq sequence downstream from the beginning of the BLAST match. The putative full-CDS clones were grouped by loci (human GeneID) and predicted clone length.
The clones were sequenced, assembled, and annotated in a semi-automated pipeline involving a database that stored and provided sequencing, primer, and annotation information for every clone. Perl scripts were used to process reads, place them in the appropriate directories, instantiate phredPhrap for contig assembly, automate the detection of polyA, vector, pick walking primers and update the database. bFLIC clones were sequenced 5' -> 3' until polyA or vector was encountered. In regions of low quality sequence, reverse read primers were manually picked. Perl scripts were also use to automate BLAT comparisons of bFLICs with human RefSeq mRNAs.
Human annotators assigned bFLICs to human RefSeq mRNA homologs and determined whether or not the bFLIC was a full-CDS clone. Gbrowse  was used to display bovine/human alignments and Artemis [26, 27] was used for manual annotation of sequence when required. A clone was deemed to be full-CDS if the BLAT query bFLIC region encompassed the entire CDS of its human homolog, and/or the BLAT query region encompassed the beginning of the human homolog's start methionine and exhibited a polyA stretch of at least 13 adenosines on the 3' end. Subsequently, each masked bFLIC was processed through the quality check/assurance portion of the pipeline where the largest translated ORF was compared with BLASTP  to human RefSeq proteins and the entire nucleotide sequence of the bFLIC was compared to RefSeq proteins with BLASTX . A bFLIC was flagged for GenBank submission only if the highest scoring BLASTX and BLASTP hits originated from the same RefSeq mRNA and was identical to the transcript assigned through human review.
- species cow -xsmall. Repeats are masked with lower case letters using the cattle specific repeat library.
- U -F "m S" -I T -f 14 -e 1e-20 -a 2 -b 15. Use RepeatMasker output as input. Allow for extension through repetitive regions, but alignment isn't seeded in repetitive region (soft masking).
- v 1 -b 1 -f 14 -e 1e-20 -a 2
The GenBank Accessions for the bFLIC clones are: [GenBank:BT020623 GenBank:BT021084, GenBank:BT021145 ... GenBank:BT021203, GenBank:BT021479 ... GenBank:BT021911].
Kozak consensus sequence
The sequence spanning from 6 nucleotides upstream to 3 nucleotides downstream of the adenosine of the start ATG was extracted from each bFLIC and aligned with clustalw. The alignment file was used as input to WebLogo.
Steve Simcox performed sequencing reaction setup for the majority of sequences and made critical improvements to quality control procedures. Renee Godtel provided expert annotation assistance. Tina Sphon assisted in sequence reaction setup.
- Imanishi T, Itoh T, Suzuki Y, O'Donovan C, Fukuchi S, Koyanagi KO, Barrero RA, Tamura T, Yamaguchi-Kabata Y, Tanino M, Yura K, Miyazaki S, Ikeo K, Homma K, Kasprzyk A, Nishikawa T, Hirakawa M, Thierry-Mieg J, Thierry-Mieg D, Ashurst J, Jia L, Nakao M, Thomas MA, Mulder N, Karavidopoulou Y, Jin L, Kim S, Yasuda T, Lenhard B, Eveno E, Suzuki Y, Yamasaki C, Takeda J, Gough C, Hilton P, Fujii Y, Sakai H, Tanaka S, Amid C, Bellgard M, Bonaldo Mde F, Bono H, Bromberg SK, Brookes AJ, Bruford E, Carninci P, Chelala C, Couillault C, de Souza SJ, Debily MA, Devignes MD, Dubchak I, Endo T, Estreicher A, Eyras E, Fukami-Kobayashi K, Gopinath GR, Graudens E, Hahn Y, Han M, Han ZG, Hanada K, Hanaoka H, Harada E, Hashimoto K, Hinz U, Hirai M, Hishiki T, Hopkinson I, Imbeaud S, Inoko H, Kanapin A, Kaneko Y, Kasukawa T, Kelso J, Kersey P, Kikuno R, Kimura K, Korn B, Kuryshev V, Makalowska I, Makino T, Mano S, Mariage-Samson R, Mashima J, Matsuda H, Mewes HW, Minoshima S, Nagai K, Nagasaki H, Nagata N, Nigam R, Ogasawara O, Ohara O, Ohtsubo M, Okada N, Okido T, Oota S, Ota M, Ota T, Otsuki T, Piatier-Tonneau D, Poustka A, Ren SX, Saitou N, Sakai K, Sakamoto S, Sakate R, Schupp I, Servant F, Sherry S, Shiba R, Shimizu N, Shimoyama M, Simpson AJ, Soares B, Steward C, Suwa M, Suzuki M, Takahashi A, Tamiya G, Tanaka H, Taylor T, Terwilliger JD, Unneberg P, Veeramachaneni V, Watanabe S, Wilming L, Yasuda N, Yoo HS, Stodolsky M, Makalowski W, Go M, Nakai K, Takagi T, Kanehisa M, Sakaki Y, Quackenbush J, Okazaki Y, Hayashizaki Y, Hide W, Chakraborty R, Nishikawa K, Sugawara H, Tateno Y, Chen Z, Oishi M, Tonellato P, Apweiler R, Okubo K, Wagner L, Wiemann S, Strausberg RL, Isogai T, Auffray C, Nomura N, Gojobori T, Sugano S: Integrative annotation of 21,037 human genes validated by full-length cDNA clones. PLoS Biol. 2004, 2 (6): e162-10.1371/journal.pbio.0020162.PubMedPubMed CentralView Article
- Castelli V, Aury JM, Jaillon O, Wincker P, Clepet C, Menard M, Cruaud C, Quetier F, Scarpelli C, Schachter V, Temple G, Caboche M, Weissenbach J, Salanoubat M: Whole genome sequence comparisons and "full-length" cDNA sequences: a combined approach to evaluate and improve Arabidopsis genome annotation. Genome Res. 2004, 14 (3): 406-413. 10.1101/gr.1515604.PubMedPubMed CentralView Article
- Hayashizaki Y: The Riken mouse genome encyclopedia project. C R Biol. 2003, 326 (10-11): 923-929.PubMedView Article
- Kikuchi S, Satoh K, Nagata T, Kawagashira N, Doi K, Kishimoto N, Yazaki J, Ishikawa M, Yamada H, Ooka H, Hotta I, Kojima K, Namiki T, Ohneda E, Yahagi W, Suzuki K, Li CJ, Ohtsuki K, Shishiki T, Otomo Y, Murakami K, Iida Y, Sugano S, Fujimura T, Suzuki Y, Tsunoda Y, Kurosaki T, Kodama T, Masuda H, Kobayashi M, Xie Q, Lu M, Narikawa R, Sugiyama A, Mizuno K, Yokomizo S, Niikura J, Ikeda R, Ishibiki J, Kawamata M, Yoshimura A, Miura J, Kusumegi T, Oka M, Ryu R, Ueda M, Matsubara K, Kawai J, Carninci P, Adachi J, Aizawa K, Arakawa T, Fukuda S, Hara A, Hashizume W, Hayatsu N, Imotani K, Ishii Y, Itoh M, Kagawa I, Kondo S, Konno H, Miyazaki A, Osato N, Ota Y, Saito R, Sasaki D, Sato K, Shibata K, Shinagawa A, Shiraki T, Yoshino M, Hayashizaki Y, Yasunishi A: Collection, mapping, and annotation of over 28,000 cDNA clones from japonica rice. Science. 2003, 301 (5631): 376-379. 10.1126/science.1081288.PubMedView Article
- Gerhard DS, Wagner L, Feingold EA, Shenmen CM, Grouse LH, Schuler G, Klein SL, Old S, Rasooly R, Good P, Guyer M, Peck AM, Derge JG, Lipman D, Collins FS, Jang W, Sherry S, Feolo M, Misquitta L, Lee E, Rotmistrovsky K, Greenhut SF, Schaefer CF, Buetow K, Bonner TI, Haussler D, Kent J, Kiekhaus M, Furey T, Brent M, Prange C, Schreiber K, Shapiro N, Bhat NK, Hopkins RF, Hsie F, Driscoll T, Soares MB, Casavant TL, Scheetz TE, Brown-stein MJ, Usdin TB, Toshiyuki S, Carninci P, Piao Y, Dudekula DB, Ko MS, Kawakami K, Suzuki Y, Sugano S, Gruber CE, Smith MR, Simmons B, Moore T, Waterman R, Johnson SL, Ruan Y, Wei CL, Mathavan S, Gunaratne PH, Wu J, Garcia AM, Hulyk SW, Fuh E, Yuan Y, Sneed A, Kowis C, Hodgson A, Muzny DM, McPherson J, Gibbs RA, Fahey J, Helton E, Ketteman M, Madan A, Rodrigues S, Sanchez A, Whiting M, Madari A, Young AC, Wetherby KD, Granite SJ, Kwong PN, Brinkley CP, Pearson RL, Bouffard GG, Blakesly RW, Green ED, Dickson MC, Rodriguez AC, Grimwood J, Schmutz J, Myers RM, Butterfield YS, Griffith M, Griffith OL, Krzywinski MI, Liao N, Morrin R, Palmquist D, Petrescu AS, Skalska U, Smailus DE, Stott JM, Schnerch A, Schein JE, Jones SJ, Holt RA, Baross A, Marra MA, Clifton S, Makowski KA, Bosak S, Malek J: The status, quality, and expansion of the NIH full-length cDNA project: the Mammalian Gene Collection (MGC). Genome Res. 2004, 14 (10B): 2121-2127. 10.1101/gr.2596504.PubMedView Article
- Gibbs R, Weinstock G, Kappes S, Schook L, Skow L, Womack J: Bovine Genomic Sequencing Initiative. 2002, National Human Genome Research Institute
- Stapleton M, Liao G, Brokstein P, Hong L, Carninci P, Shiraki T, Hayashizaki Y, Champe M, Pacleb J, Wan K, Yu C, Carlson J, George R, Celniker S, Rubin GM: The Drosophila gene collection: identification of putative full-length cDNAs for 70% of D. melanogaster genes. Genome Res. 2002, 12 (8): 1294-1300. 10.1101/gr.269102.PubMedPubMed CentralView Article
- Smith TP, Grosse WM, Freking BA, Roberts AJ, Stone RT, Casas E, Wray JE, White J, Cho J, Fahrenkrug SC, Bennett GL, Heaton MP, Laegreid WW, Rohrer GA, Chitko-McKown CG, Pertea G, Holt I, Karamycheva S, Liang F, Quackenbush J, Keele JW: Sequence evaluation of four pooled-tissue normalized bovine cDNA libraries and construction of a gene index for cattle. Genome Res. 2001, 11 (4): 626-630. 10.1101/gr.170101.PubMedPubMed CentralView Article
- Sonstegard TS, Capuco AV, White J, Van Tassell CP, Connor EE, Cho J, Sultana R, Shade L, Wray JE, Wells KD, Quackenbush J: Analysis of bovine mammary gland EST and functional annotation of the Bos taurus gene index. Mamm Genome. 2002, 13 (7): 373-379. 10.1007/s00335-001-2145-4.PubMedView Article
- Pruitt KD, Tatusova T, Maglott DR: NCBI Reference Sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res. 2005, 33 (Database issue): D501-4. 10.1093/nar/gki025.PubMedPubMed CentralView Article
- Mignone F, Grillo G, Licciulli F, Iacono M, Liuni S, Kersey PJ, Duarte J, Saccone C, Pesole G: UTRdb and UTRsite: a collection of sequences and regulatory motifs of the untranslated regions of eukaryotic mRNAs. Nucleic Acids Res. 2005, 33 (Database issue): D141-6. 10.1093/nar/gki021.PubMedPubMed CentralView Article
- Wheeler DL, Barrett T, Benson DA, Bryant SH, Canese K, Church DM, DiCuccio M, Edgar R, Federhen S, Helmberg W, Kenton DL, Khovayko O, Lipman DJ, Madden TL, Maglott DR, Ostell J, Pontius JU, Pruitt KD, Schuler GD, Schriml LM, Sequeira E, Sherry ST, Sirotkin K, Starchenko G, Suzek TO, Tatusov R, Tatusova TA, Wagner L, Yaschenko E: Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 2005, 33 (Database issue): D39-45. 10.1093/nar/gki062.PubMedPubMed CentralView Article
- Lee Y, Tsai J, Sunkara S, Karamycheva S, Pertea G, Sultana R, Antonescu V, Chan A, Cheung F, Quackenbush J: The TIGR Gene Indices: clustering and assembling EST and known genes and integration with eukaryotic genomes. Nucleic Acids Res. 2005, 33 (Database issue): D71-4. 10.1093/nar/gki064.PubMedPubMed CentralView Article
- Pertea G, Huang X, Liang F, Antonescu V, Sultana R, Karamycheva S, Lee Y, White J, Cheung F, Parvizi B, Tsai J, Quackenbush J: TIGR Gene Indices clustering tools (TGICL): a software system for fast clustering of large EST datasets. Bioinformatics. 2003, 19 (5): 651-652. 10.1093/bioinformatics/btg034.PubMedView Article
- Caldwell RB, Kierzek AM, Arakawa H, Bezzubov Y, Zaim J, Fiedler P, Kutter S, Blagodatski A, Kostovska D, Koter M, Plachy J, Carninci P, Hayashizaki Y, Buerstedde JM: Full-length cDNAs from chicken bursal lymphocytes to facilitate gene function analysis. Genome Biol. 2005, 6 (1): R6-10.1186/gb-2004-6-1-r6.PubMedPubMed CentralView Article
- Kozak M: An analysis of 5'-noncoding sequences from 699 vertebrate messenger RNAs. Nucleic acids research. 1987, 15 (20): 8125-PubMedPubMed CentralView Article
- Kozak M: Compilation and analysis of sequences upstream from the translational start site in eukaryotic mRNAs. Nucleic acids research. 1984, 12 (2): 857-PubMedPubMed CentralView Article
- Iacono M, Mignone F, Pesole G: uAUG and uORFs in human and rodent 5'untranslated mRNAs. Gene. 2005, 349: 97-105. 10.1016/j.gene.2004.11.041.PubMedView Article
- Pesole G, Gissi C, Grillo G, Licciulli F, Liuni S, Saccone C: Analysis of oligonucleotide AUG start codon context in eukariotic mRNAs. Gene. 2000, 261 (1): 85-10.1016/S0378-1119(00)00471-6.PubMedView Article
- Kent WJ: BLAT--the BLAST-like alignment tool. Genome Res. 2002, 12 (4): 656-664. 10.1101/gr.229202. Article published online before March 2002.PubMedPubMed CentralView Article
- Smit AFA, Hubley R, Green P: RepeatMasker Open 3.0. 2005
- Altschul SF, Gish W, Miller W, Meyers EW, Lipman DJ: Basic Local Alignment Search Tool. Journal of Molecular Biology. 1990, 215 (3): 403-10.1006/jmbi.1990.9999.PubMedView Article
- Gordon D, Desmarais C, Green P: Automated Finishing with Autofinish. Genome Res. 2001, 11 (4): 614-625. 10.1101/gr.171401.PubMedPubMed CentralView Article
- Gordon D, Abajian C, Green P: Consed: A Graphical Tool for Sequence†Finishing. Genome Res. 1998, 8 (3): 195-202.PubMedView Article
- Stein LD, Mungall C, Shu S, Caudy M, Mangone M, Day A, Nickerson E, Stajich JE, Harris TW, Arva A, Lewis S: The generic genome browser: a building block for a model organism system database. Genome Res. 2002, 12 (10): 1599-1610. 10.1101/gr.403602.PubMedPubMed CentralView Article
- Berriman M, Rutherford K: Viewing and annotating sequence data with Artemis. Brief Bioinform. 2003, 4 (2): 124-132.PubMedView Article
- Rutherford K, Parkhill J, Crook J, Horsnell T, Rice P, Rajandream MA, Barrell B: Artemis: sequence visualization and annotation. Bioinformatics. 2000, 16 (10): 944-945. 10.1093/bioinformatics/16.10.944.PubMedView Article
- Gish W, States DJ: Identification of protein coding regions by database similarity search. Nat Genet. 1993, 3 (3): 266-10.1038/ng0393-266.PubMedView Article
- Crooks GE, Hon G, Chandonia JM, Brenner SE: WebLogo: a sequence logo generator. Genome Res. 2004, 14 (6): 1188-1190. 10.1101/gr.849004.PubMedPubMed CentralView Article
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.