Protein genes in repetitive sequence—antifreeze glycoproteins in Atlantic cod genome
© Zhuang et al.; licensee BioMed Central Ltd. 2012
Received: 15 December 2011
Accepted: 15 June 2012
Published: 2 July 2012
Highly repetitive sequences are the bane of genome sequence assembly, and the short read lengths produced by current next generation sequencing technologies further exacerbate this obstacle. A commonly adopted practice is to exclude repetitive sequences in genome data assembly, as the majority of repeats lack protein-coding genes. However, this could result in the exclusion of important genotypes in newly sequenced non-model species. The absence of the antifreeze glycoprotein (AFGP) gene family in the recently sequenced Atlantic cod genome serves as an example.
The Atlantic cod (Gadus morhua) genome was assembled entirely from Roche 454 short reads, demonstrating the feasibility of this approach. However, a well-known major adaptive trait, the AFGP, essential for survival in frigid Arctic marine habitats, was absent from the annotated genome. To assess whether this resulted from population difference, we performed Southern blot analysis of genomic DNA from multiple individuals of the North East Arctic cod population to which the sequenced cod belonged, and verified that the AFGP genotype is indeed present. We searched the raw assemblies of the Atlantic cod using our G. morhua AFGP gene, and located partial AFGP coding sequences in two sequence scaffolds. Through comparative sequence analyses with our newly assembled genomic AFGP locus of the related polar cod, Boreogadus saida, we found that these two scaffolds constitute a partial genomic AFGP locus. By examining the sequence assembly and annotation methodologies used for the Atlantic cod genome, we deduced that the primary cause of the absence of the AFGP gene family from the annotated genome was the removal of all repetitive Roche 454 short reads before sequence assembly, which would exclude most of the highly repetitive AFGP coding sequences. Secondarily, the model teleost genomes used in projection annotation of the Atlantic cod genome have no antifreeze trait, so the absence of the AFGP gene family went unnoticed.
We recovered some of the missing AFGP coding sequences and reconstructed a partial AFGP locus in the Atlantic cod genome, bringing to light that not all repetitive sequences lack protein coding information. Also, reliance on genomes of model organisms as reference for annotating protein-coding gene content of a newly sequenced non-model species could lead to omission of novel genetic traits.
Massively parallel deep coverage next generation sequencing (NGS) technologies have stimulated efforts of de novo genome sequence assembly in recent years. While NGS data production advances at phenomenal rates, accurate genome assembly and annotation remain challenging, and the extent of what may be missing in these de novo assembled genomes is an ongoing matter of concern. All genome assembly efforts face the challenge of accurately assembling tandem and sparse repeat sequences. Current assembly algorithms collapse identical or very similar repeats, leading to potential reduction or loss of genomic complexity. The short reads produced by current NGS technologies are especially prone to this problem, as repetitive sequences can be resolved only if the reads are long enough to span the repetitive region. A frequent practice is to exclude highly repetitive sequences in genome assembly and annotation, on the assumption that they lack protein-coding genes. While this assumption is generally valid, here we use the case of the Atlantic cod genome to provide a clear example that excluding what appear to be simple repetitive sequences can result in the exclusion of an important fitness genotype.
Atlantic cod is a key commercial fishery species in the cold waters of the north Atlantic seas and a prime target for domestication by countries across the north Atlantic Ocean. Star et al. recently reported the annotated genome of a specimen (NEAC_001) from the cold-adapted North East Arctic cod (NEAC) stock from the Barents Sea. The NEAC_001 genome was among the first complex vertebrate genomes assembled entirely from short reads obtained using the Roche 454 GS FLX Titanium platform. The report discussed the interesting potential thermal adaptive properties of the hemoglobins of the cod, as well as its unique adaptive immune system as related to organismal fitness. While not the focus of the discussion, it is nevertheless surprising that a crucial fitness trait, the antifreeze glycoprotein (AFGP) [4–6], which has clear relevance for aquaculture of the species in cold, northerly latitudes, was absent from genome annotations and predicted transcripts and proteins. Neither was any allusion made to its presence in the extensive Supplementary Notes accompanying the report.
AFGP is one of the diverse, novel antifreeze proteins that evolved in various polar and subpolar marine teleost lineages, enabling their survival in freezing, icy seawater [7, 8]. The presence of AFGPs has long been known in a number of northern and Arctic Atlantic cod populations [4–6]. The near-identical AFGPs in the Arctic/northern cods (family Gadidae) and in the unrelated Antarctic notothenioid fishes endemic to the Southern Ocean are an established prime example of convergent evolution, and a rare one at the protein sequence level. Antifreeze proteins recognize environmental ice crystals that enter the fish, bind to them and stop their expansion, thereby preventing the fish body fluids, which have less salt and thus a higher freezing point than seawater, from freezing. Absence of the AFGP genotype in the cold-adapted NEAC_001 would seem inconsistent with its frigid Arctic habitats.
AFGPs are highly repetitive in protein sequence, and particularly in coding sequence, because they are encoded as large polyprotein precursors. Thus the possibility exists that the repetitive AFGP coding sequences might have been inadvertently excluded along with other non-protein coding repetitive sequences during genome sequence assembly. Here we report investigations of the Atlantic cod genome data leading to our discovery that AFGP coding sequences exist in NEAC_001. We provide experimental evidence as well as bioinformatic evidence from comparative genomic sequence analyses, which support the presence of an AFGP gene family in Atlantic cod. From examining in detail the published methodologies in the Atlantic cod genome assembly and annotation process, we deduced the probable causes leading to the exclusion of this major genetic trait.
Results and discussion
AFGP gene family in Atlantic cod
AFGP locus in Atlantic cod genome
We envisioned that AFGP coding sequences (cds) would still be in the raw genome data of the sequenced NEAC_001, thus we BLAST-searched the raw assembly ATLCOD1A with the Gm1-1 AFGP sequence as query. (ATLCOD1A was assembled with Newbler, and subsequently repeat-masked and used for projection genome re-ordering and annotation.) The search yielded sequence similarities in two sequence scaffolds, ATLCOD1As00125 and ATLCOD1As03479. We identified a total of seven partial AFGP genes/coding regions in these two scaffolds: five in ATLCOD1As00125 and two in ATLCOD1As03479. The majority of the repetitive AFGP cds was missing, thus these AFGP genes contain gaps in the middle, but the available sequence lengths at the ends flanking the gaps are sufficient for gene identification. An alignment of the ATLCOD1A partial AFGP cds with the full AFGP coding region in the Gm1-1 AFGP gene is shown in Additional file 1.
Mature AFGPs occur as a family of size isoforms composed of four to tens of repeats of the tripeptide (Ala/Pro-Ala-Thr), with glycosylation (the disaccharide galactose-N-acetylgalactosamine) on the Thr residues. The alignment (Additional file 1) shows that the partial AFGP genes of NEAC_001 encode the characteristic tandem tripeptide repeats of the AFGP peptide backbone, as well as the conserved C-terminus sequence, AAAVL*. The aligned nucleotide sequences between ATLCOD1A_AFGP5 and Gm1-1 are 99.8% identical, thus the two are quite clearly counterparts of each other in NEAC_001 and the Øresund individual. These high sequence identities indicate that at least some of the NEAC_001 partial AFGP genes we identified from the two ATLCOD1A sequence scaffolds are intact/functional genes.
Together, the results from the genomic Southern blot (Figure 1) and comparative sequence analyses (Additional file 1 and Figure 2) clearly support that an AFGP genomic locus with intact genes is present in the sequenced NEAC_001.
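The tandem tripeptide repeat pattern described above can be illustrated with a minimal sketch, assuming a translated protein sequence in one-letter code. The function name, the minimum run length of four units (taken from "four to tens of repeats"), and the toy peptide are illustrative assumptions, not part of the analysis reported here.

```python
import re

# Tripeptide unit from the text: (Ala or Pro)-Ala-Thr, i.e. AAT or PAT in one-letter code.
# MIN_REPEATS = 4 is an assumption based on "four to tens of repeats".
MIN_REPEATS = 4
REPEAT_RUN = re.compile(r"(?:[AP]AT){%d,}" % MIN_REPEATS)

def find_repeat_runs(protein: str):
    """Return (start, end, n_repeats) for each tandem run of [A/P]-A-T units."""
    runs = []
    for match in REPEAT_RUN.finditer(protein.upper()):
        n_repeats = (match.end() - match.start()) // 3
        runs.append((match.start(), match.end(), n_repeats))
    return runs

# Hypothetical toy peptide (not a real AFGP sequence):
# a short leader, five tripeptide units, then the C-terminal AAAVL.
toy = "MKS" + "AAT" * 3 + "PAT" + "AAT" + "AAAVL"
print(find_repeat_runs(toy))  # [(3, 18, 5)]
```

A scan of this kind only flags candidate repeat runs; it does not by itself establish that a gene is intact or functional.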
Possible cause of AFGP exclusion from the annotated cod genome
Through detailed examination of the assembly and annotation process described by Star et al. in their Supplementary Notes, we deduced the possible cause of AFGP exclusion from the annotated Atlantic cod genome to be two-fold. The primary cause is the removal of repetitive sequences in the initial steps of the bioinformatics pipeline, and the secondary cause is the use of genomes of non-AFGP bearing model teleosts as reference for annotating protein gene content in Atlantic cod.
[Table: Codon usage bias in the 141 9-nt tripeptide repeat coding sequences in G. morhua AFGP gene Gm1-1. Columns: Codon 1 (Ala/Pro), Codon 2 (Ala), Codon 3 (Thr/Arg).]
[Table: Trinucleotide equivalent of the biased 9-nt tripeptide repeat coding sequences in G. morhua AFGP gene Gm1-1. Column: Single codon/3-nt equivalent (Ala/Pro/Thr/Arg).]
Supplementary Note 5 of Star et al. indicated that before assembling the Roche 454 reads, they “excluded highly repetitive, non-informative reads” from their data set. They stated that these included shotgun reads encompassing short tandem repeats (STRs) and simple sequence repeats (SSRs), and sets of paired-end reads if one or both ends consisted of STRs or SSRs. This data reduction likely eliminated the STR- or SSR-like AFGP cds almost completely, sparing only the 5′ and 3′ ends of the AFGP coding exon immediately adjacent to non-repetitive upstream and downstream sequence (Additional file 1), and leading to the gaps in the middle of the AFGP genes we identified in the two ATLCOD1A scaffolds (Additional file 1; Figure 2B, C). The STR/SSR-culled read data were assembled using Newbler and the Celera assembler, generating ATLCOD1A and ATLCOD1B respectively (Supplementary Notes 5, 6 of Star et al.). The Newbler assembly ATLCOD1A was chosen for genome annotation. Repeat masking was applied to ATLCOD1A using existing transposable element (TE) libraries and a de novo created custom library for cod (Supplementary Note 16 and Supplementary Table 6 of Star et al.), resulting in the masking of 25.4% of the assembly. The repeat-masked ATLCOD1A then underwent whole-genome structural alignment and re-ordering using the three-spined stickleback genome as reference, followed by projection annotation (Supplementary Note 17 of Star et al.). We compared the unmasked and repeat-masked versions of ATLCOD1A, and found that most of the AFGP cds that survived the STR/SSR culling (shown in Figure 2B, C) became masked (highlighted in Additional file 1), rendering the AFGP genotype essentially non-existent prior to protein gene annotation. Annotation was carried out by projecting protein-coding gene models from three-spined stickleback through the whole genome alignments onto the re-ordered cod genomic regions. Additional protein-coding gene models from other teleost genomes (medaka and zebrafish) were mapped onto cod genome regions having no alignment with stickleback (Supplementary Note 17 of Star et al.).
None of these model teleosts requires antifreeze protection in its temperate or tropical habitat, and none has evolved the novel antifreeze genotype; thus the exclusion of a major genetic trait from the assembled Atlantic cod genome remained unrecognized.
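Why a read made of AFGP coding repeats would be caught by a simple-repeat cull can be shown with a toy periodicity check. This is only a sketch under stated assumptions: it is not the filter used by Star et al., the function names, the maximum period of 9 nt and the 0.9 threshold are hypothetical, and the codons in the example read are chosen for illustration rather than taken from Gm1-1.

```python
def periodic_fraction(read: str, period: int) -> float:
    """Fraction of bases identical to the base `period` positions upstream."""
    if len(read) <= period:
        return 0.0
    matches = sum(1 for i in range(period, len(read)) if read[i] == read[i - period])
    return matches / (len(read) - period)

def looks_like_simple_repeat(read: str, max_period: int = 9, threshold: float = 0.9) -> bool:
    """Flag a read if any period from 1 to max_period explains most of its bases.

    max_period and threshold are hypothetical values chosen for this sketch.
    """
    read = read.upper()
    return any(periodic_fraction(read, p) >= threshold for p in range(1, max_period + 1))

# Hypothetical reads. The first tiles a 9-nt unit encoding Ala-Ala-Thr
# (illustrative codons); the second is an arbitrary non-repetitive 20-mer.
afgp_like_read = "GCAGCAACA" * 12
non_repetitive_read = "ACGTTGCAAGCTTAGGCATC"

print(looks_like_simple_repeat(afgp_like_read))       # True
print(looks_like_simple_repeat(non_repetitive_read))  # False
```

Under a criterion of this kind, a read lying wholly inside the tripeptide repeat region is indistinguishable from a non-coding short tandem repeat, whereas a read that overlaps the non-repetitive flanks can survive, which is consistent with the gapped partial genes described above.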
Gene-rich repetitive sequences
Exclusion of the well-known, prominent and important AFGP trait from the Atlantic cod genome annotation brings to light that common assumptions of repetitive sequences as gene-less or gene-poor do not always apply. Atlantic cod AFGP genes are by no means a lone case of gene-rich repetitive sequences representing a major and/or novel trait. Other prominent proteins composed of short repetitive sequences are present in a variety of organisms, including the convergently evolved AFGPs in the Antarctic notothenioids [9, 15], other antifreeze proteins in fish, insects and plants, fibrous silk fibroins in spiders and moths, amelogenin in primates, human dentin sialophosphoprotein, involucrin, collagens, and others. Exclusion of the repetitive coding sequences of these prominent proteins from the genome assembly of the respective organism would be a major blunder. Missing repetitive and duplicated sequences due to limitations in sequence assemblers have also resulted in omission of more subtle coding exons in human genomes despite the availability of a highly refined reference genome [25, 26].
We have “resurrected” some of the missing AFGP genes from the Atlantic cod raw genome assemblies, and reconstructed a partial genomic AFGP locus for this species. While it is well appreciated that all genome assemblies have gaps of information due to bioinformatics limitations, missing a known and prominent fitness trait leads to confusion in the field. We therefore suggest that this biologically relevant trait be restored to the cod genome annotations. The gadid AFGP trait is relevant not only to the biology and culturing potential of the cod species, but, as an evolutionary innovation, also has broad relevance to understanding the molecular mechanisms by which new genes and functions arise. For the ever expanding efforts at de novo assembly of new genomes, the case of the missing Atlantic cod AFGP genotype will hopefully promote vigilance in avoiding the categorical assumption that all repetitive sequences lack protein coding information. While we await improved algorithms for accurately assembling repetitive sequences, longer reads from traditional sequencing methods such as Sanger still have their place, and the painstaking approach of assembling repeat sequences with extensive manual inspection and validation is unavoidable. This approach has been successfully applied to assemble the ~400 kbp highly repetitive and polymorphic AFGP genomic locus that convergently evolved in the Antarctic notothenioid fish. Lastly, while projection annotation of de novo assembled genomes of new species using model genomes as reference certainly has great utility, novel traits might have evolved since the species diverged. Thus an appreciation of the evolutionary history and major known biological traits of a new species targeted for genome sequencing is necessary.
Specimens, DNA isolation and Southern blot hybridization
Atlantic cod G. morhua individuals of the Norwegian coastal cod (NCC) stock and North East Arctic cod (NEAC) stock were caught by trawl from the Finnmark coast and marginal Barents Sea sites. Individuals from outer Øresund, Denmark were caught with hook and line. Polar cod B. saida were obtained by trawling near Spitzbergen, and the freshwater cod Lota lota, which does not have the AFGP trait, was obtained from Oneida Lake, New York. DNA was isolated from liver or gill tissues using standard Tris-HCl/SDS lysis and phenol/chloroform extractions. About 10–15 μg of DNA digested with Taq I (NEB) was vacuum blotted onto Hybond-N membrane (GE Health Science). Hybridization to a 32P-labeled B. saida AFGP coding sequence probe (AFGP gene Bs3-1) was carried out in PerfectHyb (Sigma) at 55°C. The blot was washed thoroughly in 0.1× SSC/0.5% SDS at 55°C and autoradiographed using a phosphor storage screen and the STORM phosphorimager (Molecular Dynamics).
Pan I genotyping of NEAC and NCC
A 773-base pair fragment of the pantophysin gene was PCR-amplified from DNA and scored for the presence or absence of a Dra I site, representing the Pan IB and Pan IA allelic classes respectively, following a published protocol.
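The scoring rule above can be sketched in silico for a single allele sequence, using the known Dra I recognition sequence TTTAAA (blunt cut between TTT and AAA). The function names and toy amplicons are hypothetical, and in practice genotypes are read from the banding pattern of digested PCR product, where a heterozygous individual shows both patterns.

```python
DRA_I_SITE = "TTTAAA"  # Dra I recognition sequence; cuts blunt between TTT and AAA

def dra_i_fragments(amplicon: str):
    """Return fragment lengths after complete Dra I digestion of a linear amplicon."""
    seq = amplicon.upper()
    cut_positions = []
    start = seq.find(DRA_I_SITE)
    while start != -1:
        cut_positions.append(start + 3)  # the cut falls after the TTT half-site
        start = seq.find(DRA_I_SITE, start + 1)
    bounds = [0] + cut_positions + [len(seq)]
    return [right - left for left, right in zip(bounds, bounds[1:])]

def pan_i_class(amplicon: str) -> str:
    """Pan IB if a Dra I site is present, Pan IA if absent (as stated in the text)."""
    return "Pan IB" if len(dra_i_fragments(amplicon)) > 1 else "Pan IA"

# Hypothetical toy amplicons (not the real 773-bp pantophysin fragment).
with_site = "GC" * 10 + "TTTAAA" + "GC" * 10
without_site = "GC" * 23

print(dra_i_fragments(with_site), pan_i_class(with_site))        # [23, 23] Pan IB
print(dra_i_fragments(without_site), pan_i_class(without_site))  # [46] Pan IA
```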
Isolation and sequencing of Øresund G. morhua AFGP gene
A partial genomic DNA library enriched for AFGP genes was constructed using the λZAP Express vector (Stratagene). Genomic DNA was completely digested with Mbo I, and DNA fragments within the size range that hybridized to the 32P-labeled B. saida AFGP probe, as determined in separate Southern blot experiments, were recovered from agarose gel and ligated to the compatible ends of Bam HI-digested λZAP Express vector. The ligation was packaged using Gigapack Gold III (Stratagene) to form the phage library, which was then screened with the AFGP probe by plaque lift filter hybridization. Positive phage clones were screened to homogeneity and excised to produce the phagemid (pBK-CMV) DNA with the ExAssist helper phage (Stratagene). One phagemid clone, Gm1-1, was selected for sequencing. A nested set of unidirectional deletion clones of Gm1-1 phagemid DNA containing the repetitive AFGP cds was generated using the Erase-a-Base system (Promega), and sequenced using BigDye v2.0 chemistry (ABI).
Isolation and sequencing of B. saida AFGP genomic locus
A BAC (Bacterial Artificial Chromosome) genomic DNA library was constructed for a B. saida individual following published protocols [15, 27] with modifications. Briefly, agarose-plug immobilized red blood cell DNA was treated with CTAB (cetyl trimethylammonium bromide) (Teknova) to remove blood cell glycoproteins. The treated plugs were then partially digested with EcoR I in the presence of EcoR I methylase and resolved by pulsed field electrophoresis using CHEF Mapper XL (BioRad). Fragments of 100–200 kbp were electroeluted and ligated to the EcoR I-digested pCC1BAC vector (Epicentre), and the ligation was electroporated into E. coli DH10B (Invitrogen). Recombinant clones were robotically archived and printed on nylon hybridization membrane as a macroarray. The arrayed library was screened with the 32P-labeled Bs3-1 AFGP probe. Details of library statistics and fingerprinted contig analyses of AFGP-positives will be published elsewhere. The partial AFGP genomic locus used in this study comprised two overlapping AFGP-positive BAC clones that were sequenced on the Roche 454 GS FLX Titanium platform using both shotgun libraries constructed with the Nextera kit (Epicentre) and 3-kbp paired-end libraries. The BAC insert sequences were assembled using Roche GS De Novo Assembler (Newbler) V2.6 with extensive manual inspection.
AFGP sequence search in Atlantic cod genome data
The Atlantic cod genome data were downloaded from http://codgenome.no/data/. The genome assemblies we used for searching AFGP cds and other analyses were: (i) the Newbler assembly ATLCOD1A, unmasked and repeat-masked; (ii) the Celera assembly ATLCOD1B; (iii) ATLCOD1C, the assembly after processing ATLCOD1A through the projection pipeline; and (iv) ATLCOD1_ANN, genome annotations including all predicted transcripts and proteins. We used our Atlantic cod Gm1-1 and polar cod AFGP sequences as queries to search for similar nucleotide sequences using BLASTN, and for similar translated sequences using TBLASTN. Searches utilized the default parameter settings of BLAST 2.2.24+.
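Tallying search hits per scaffold, as done when locating the AFGP-containing scaffolds, can be sketched as follows. The sketch assumes hits were saved in BLAST+ tabular format (-outfmt 6 with its standard twelve columns); the output format actually used is not stated here, and the E-value cutoff, function name, scaffold labels and hit rows are hypothetical.

```python
from collections import defaultdict

# Standard column order of BLAST+ tabular output (-outfmt 6)
FIELDS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
          "qstart", "qend", "sstart", "send", "evalue", "bitscore"]

def hits_by_scaffold(tabular_text: str, max_evalue: float = 1e-5):
    """Group subject intervals by scaffold, keeping hits with E-value <= max_evalue.

    max_evalue is a hypothetical cutoff chosen for this sketch.
    """
    grouped = defaultdict(list)
    for line in tabular_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        row = dict(zip(FIELDS, line.split("\t")))
        if float(row["evalue"]) > max_evalue:
            continue
        s1, s2 = int(row["sstart"]), int(row["send"])
        strand = "+" if s1 <= s2 else "-"  # reversed subject coordinates mean minus strand
        grouped[row["sseqid"]].append((min(s1, s2), max(s1, s2), strand, float(row["pident"])))
    return {scaffold: sorted(hits) for scaffold, hits in grouped.items()}

# Hypothetical hit rows for illustration only (not results from this study).
example = "\n".join([
    "query1\tscaffold_A\t98.5\t200\t3\t0\t1\t200\t5000\t5199\t1e-90\t350",
    "query1\tscaffold_A\t97.0\t150\t4\t1\t900\t1049\t1300\t1151\t2e-60\t250",
    "query1\tscaffold_B\t80.0\t30\t6\t0\t10\t39\t700\t729\t0.5\t30",
])
print(hits_by_scaffold(example))
# {'scaffold_A': [(1151, 1300, '-', 97.0), (5000, 5199, '+', 98.5)]}
```

Sorting the intervals by position within each scaffold makes it easier to see separate partial coding regions and the gaps between the hits flanking them.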
We thank John Fleng Steffensen for assistance in collecting Øresund Atlantic cod, and Tony VanDeValk for collecting Lota lota. The work was supported by NSF DEB grant 09-19496 ARRA to CHCC.
- Green P: Whole-genome disassembly. Proc Natl Acad Sci USA. 2002, 99 (7): 4143-
- Finotello F, Lavezzo E, Fontana P, Peruzzo D, Albiero A, Barzon L, Falda M, Di Camillo B, Toppo S: Comparative analysis of algorithms for whole-genome assembly of pyrosequencing data. Brief Bioinform. 2011, 13 (3): 269-280.
- Star B, Nederbragt AJ, Jentoft S, Grimholt U, Malmstrom M, Gregers TF, Rounge TB, Paulsen J, Solbakken MH, Sharma A: The genome sequence of Atlantic cod reveals a unique immune system. Nature. 2011, 477 (7363): 207-210.
- Goddard SV, Kao MH, Fletcher GL: Population differences in antifreeze production cycles of juvenile Atlantic cod (Gadus morhua) reflect adaptations to overwintering environment. Can J Fish Aquat Sci. 1999, 56: 1991-1999.
- Goddard SV, Wroblewski JS, Taggart CT, Howse KA, Bailey WL, Kao MH, Fletcher GL: Overwintering of adult northern Atlantic cod (Gadus morhua) in cold inshore waters as evidenced by plasma antifreeze glycoprotein levels. Can J Fish Aquat Sci. 1994, 51: 2834-2842.
- Hew CL, Slaughter D, Fletcher GL, Joshi S: Antifreeze glycoproteins in the plasma of Newfoundland Atlantic cod (Gadus morhua). Can J Zool. 1981, 59: 2186-2192.
- Cheng C-HC: Evolution of the diverse antifreeze proteins. Curr Opin Genet Dev. 1998, 8: 715-720.
- Fletcher GL, Hew CL, Davies PL: Antifreeze proteins of teleost fishes. Annu Rev Physiol. 2001, 63: 359-390.
- Chen L, DeVries AL, Cheng C-HC: Convergent evolution of antifreeze glycoproteins in Antarctic notothenioid fish and Arctic cod. Proc Natl Acad Sci USA. 1997, 94 (8): 3817-3822.
- DeVries AL, Cheng C-HC: Antifreeze proteins and organismal freezing avoidance in polar fishes. The physiology of polar fishes. vol. 22. Edited by: Farrell AP, Steffensen JF. 2005, Elsevier Academic Press, San Diego, 155-201.
- Pogson GH, Fevolden SE: Natural selection and the genetic differentiation of coastal and Arctic populations of the Atlantic cod in northern Norway: a test involving nucleotide sequence variation at the pantophysin (PanI) locus. Mol Ecol. 2003, 12 (1): 63-74.
- Sarvas TH, Fevolden SE: Pantophysin (Pan I) locus divergence between inshore v. offshore and northern v. southern populations of Atlantic cod in the north-east Atlantic. J Fish Biol. 2005, 67: 444-469.
- Johansen SD, Coucheron DH, Andreassen M, Karlsen BO, Furmanek T, Jørgensen TE, Emblem Å, Breines R, Nordeide JT, Moum T: Large-scale sequence analyses of Atlantic cod. New Biotechnol. 2009, 25 (5): 263-271.
- Frazer KA, Pachter L, Poliakov A, Rubin EM, Dubchak I: VISTA: computational tools for comparative genomics. Nucleic Acids Res. 2004, 32: W273-W279.
- Nicodemus-Johnson J, Silic S, Ghigliotti L, Pisano E, Cheng CHC: Assembly of the Antifreeze Glycoprotein/Trypsinogen-Like Protease Genomic Locus in the Antarctic fish Dissostichus mawsoni (Norman). Genomics. 2011, 98: 194-201.
- Scott GK, Davies PL, Kao MH, Fletcher GL: Differential amplification of antifreeze protein genes in the pleuronectinae. J Mol Evol. 1988, 27 (1): 29-35.
- Graham LA, Davies PL: Glycine-rich antifreeze proteins from snow fleas. Science. 2005, 310 (5747): 461-
- Middleton AJ, Brown AM, Davies PL, Walker VK: Identification of the ice-binding face of a plant antifreeze protein. FEBS Lett. 2009, 583 (4): 815-819.
- Gatesy J, Hayashi C, Motriuk D, Woods J, Lewis R: Extreme diversity, conservation, and convergence of spider silk fibroin sequences. Science. 2001, 291 (5513): 2603-
- Regier JC: Evolution and higher-order structure of architectural proteins in silkmoth chorion. EMBO J. 1986, 5 (8): 1981-
- Lacruz RS, Lakshminarayanan R, Bromley KM, Hacia JG, Bromage TG, Snead ML, Moradian-Oldak J, Paine ML: Structural analysis of a repetitive protein sequence motif in strepsirrhine primate amelogenin. PLoS One. 2011, 6 (3): e18028-
- MacDougall M, Simmons D, Luan X, Nydegger J, Feng J, Gu TT: Dentin phosphoprotein and dentin sialoprotein are cleavage products expressed from a single transcript coded by a gene on human chromosome 4. J Biol Chem. 1997, 272 (2): 835-
- Eckert RL, Green H: Structure and evolution of the human involucrin gene. Cell. 1986, 46: 583-589.
- Yamada Y, Avvedimento VE, Mudryj M, Ohkubo H, Vogeli G, Irani M, Pastan I, de Crombrugghe B: The collagen gene: evidence for its evolutionary assembly by amplification of a DNA segment containing an exon of 54 bp. Cell. 1980, 22: 887-892.
- Alkan C, Sajjadian S, Eichler EE: Limitations of next-generation genome sequence assembly. Nat Methods. 2011, 8: 61-65.
- Kidd JM, Sampas N, Antonacci F, Graves T, Fulton R, Hayden HS, Alkan C, Malig M, Ventura M, Giannuzzi G: Characterization of missing human genome sequences and copy-number polymorphic insertions. Nat Methods. 2010, 7 (5): 365-371.
- Miyake T, Amemiya CT: BAC libraries and comparative genomics of aquatic chordate species. Comp Biochem Physiol C. 2004, 138: 233-244.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.