A genomic overview of short genetic variations in a basal chordate, Ciona intestinalis
© Satou et al.; licensee BioMed Central Ltd. 2012
Received: 19 December 2011
Accepted: 30 May 2012
Published: 30 May 2012
Although the Ciona intestinalis genome contains many allelic polymorphisms, only limited data have been analyzed systematically. Establishing a dense map of genetic variations in C. intestinalis is necessary not only for linkage analysis, but also for other experimental biology including molecular developmental and evolutionary studies, because animals from natural populations are typically used for experiments.
Here, we identified over three million candidate short genomic variations within a 110 Mb euchromatin region among five C. intestinalis individuals. The average nucleotide diversity was approximately 1.1%. Genetic variations were found at a similar density in intergenic and gene regions. Non-synonymous and nonsense nucleotide substitutions were found in 12,493 and 1,214 genes, accounting for 81.9% and 8.0% of the entire gene set, respectively, and over 60% of genes in a single animal encode non-identical proteins between maternal and paternal alleles.
Our results provide a framework for studying evolution of the animal genome, as well as a useful resource for a wide range of C. intestinalis researchers.
Keywords: Ciona intestinalis; Short genetic variations; Single nucleotide polymorphisms (SNPs); Ascidian
Ciona intestinalis is a chordate with a simpler and more compact genome than those found in vertebrates or even cephalochordates, making it a useful model for studying chordate evolution. In 2002, we decoded the genome of a single C. intestinalis animal from a natural population (Half Moon Bay, California, USA). That study showed that the 160 Mb C. intestinalis genome contains 110 Mb of euchromatin regions that encode approximately 16,000 genes, yielding an average gene density of one gene per 6.8 kb of DNA.
Based on the frequency of allelic polymorphisms of this single animal, the difference between two haplotypes was estimated at 1.2%. In vertebrates, millions of single nucleotide polymorphisms (SNPs) and insertions/deletions (indels) have also been reported. For example, genome sequencing of multiple human individuals has identified approximately three million SNPs per individual, which account for 0.1% of the haploid genome size [2–4]. In zebrafish and fugu, allelic polymorphisms within a single individual comprise 0.4% of the genome [5, 6]. However, higher heterozygosity within individuals has been reported in several invertebrate deuterostomes, including sea urchin (4–5%), amphioxus (4.0%) and two other tunicates (4.5% in Ciona savignyi and 2.2% in Oikopleura dioica) [9–13].
C. intestinalis has been used for a wide range of biological studies including developmental and evolutionary studies. Establishing a dense map of genetic variations in C. intestinalis is essential for these studies, because no inbred strains are available and animals from natural populations are used for experiments. Genetic variation might affect reproduction of results obtained from different animals, and thus the map would be useful for designing and interpreting a wide range of molecular biology experiments. At the same time, genetic variation might be useful for annotating the genome, because functional elements including tissue-specific enhancers are thought to include less variation.
In most genome projects, DNA from only one individual was sequenced, and systematic genome-wide analyses of multiple individuals were rarely performed. As part of the C. intestinalis genome project, we also sequenced the genomes of three individuals collected from a natural population in Onagawa, Miyagi, Japan. However, these reads were not incorporated into the final assembly due to the high degree of polymorphism. In the present study, we re-map all of the whole-genome shotgun reads generated in that project and identify short genetic variations to better understand the nature of highly polymorphic genomes. This information will also be useful for experimental biologists, who typically use animals whose genetic backgrounds differ from the reference sequence.
Results and discussion
Identification of short genetic variations
In the present study, we used only high quality regions (quality value ≥ 25) of raw shotgun reads from three different sets of genomic DNAs, which we used for determining the draft genome sequence of C. intestinalis. All of the sequence data were generated by the Sanger method. The first dataset (US1) is from a single individual from Half Moon Bay, California, USA, and gave rise to most of the sequences incorporated into the draft assembly. The second small set (US2) is from a different individual from the same population and was also included in the assembly. The third set (JP) is from a mixture of three individuals from a natural population in Japan. Because a preliminary inspection of this third dataset revealed substantial differences from the first and second sets, these sequences were not incorporated into the first draft assembly.
[Table: Sequences used in the present study. Rows: No. sequence fragments; No. sequence fragments mapped.]
[Table: Summary of candidate short genetic variations found in the present study. Rows: Candidate sites (total); Sites represented by only one read; Sites represented by multiple reads.]
Among all SNPs, 59% were A:T↔G:C transitions. Among the remaining 41%, C:G↔G:C transversions (6%) were much less frequent than A:T↔C:G (19%) and A:T↔T:A (16%) transversions.
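As an illustration of how substitutions fall into these four classes, the following minimal Python sketch assigns a pair of differing bases to a class and tallies the results. The function names `classify_substitution` and `tally_substitutions` and the input format (pairs of reference and variant bases) are assumptions made for this example; they do not describe the pipeline used in the study.

```python
from collections import Counter

BASES = set("ACGT")
PURINES = {"A", "G"}


def classify_substitution(base1: str, base2: str) -> str:
    """Assign an unordered pair of differing bases to one of four classes."""
    b1, b2 = base1.upper(), base2.upper()
    if b1 == b2 or b1 not in BASES or b2 not in BASES:
        raise ValueError(f"not a substitution between two bases: {base1}/{base2}")
    # Transition: purine<->purine (A/G) or pyrimidine<->pyrimidine (C/T).
    if (b1 in PURINES) == (b2 in PURINES):
        return "A:T<->G:C transition"
    pair = frozenset((b1, b2))
    if pair == frozenset("CG"):
        return "C:G<->G:C transversion"
    if pair == frozenset("AT"):
        return "A:T<->T:A transversion"
    # The remaining pairs (A/C and G/T) are the same change seen on opposite strands.
    return "A:T<->C:G transversion"


def tally_substitutions(pairs):
    """Count substitution classes over an iterable of (reference, variant) bases."""
    return Counter(classify_substitution(ref, alt) for ref, alt in pairs)


if __name__ == "__main__":
    # Hypothetical toy input, not data from the study.
    toy = [("A", "G"), ("C", "T"), ("G", "T"), ("C", "G"), ("A", "T")]
    for name, count in tally_substitutions(toy).items():
        print(name, count)
```

Because each substitution is folded together with its reverse-strand equivalent, the twelve possible ordered base changes collapse into the four classes named in the text.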
Among the candidates, 2,516,066 were represented by multiple reads and were considered to represent true genetic variations with high probability. The remaining 1,335,992 (1,095,119 SNPs and 240,873 indels) were represented by only one read; for these, we estimated the error rate and the minimum number of true variations. Based on quality values, 198,777 sequence errors are expected across all of the datasets. Even if all of these errors are included among these 1,335,992 candidates, a large fraction (~85%) is expected to represent true genetic variations.
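The worst-case reasoning above is simple arithmetic and can be checked with a few lines of Python. This is only a sketch using the counts quoted in the text; the variable names are chosen for this example.

```python
# Counts quoted in the text for candidates supported by a single read.
single_read_snps = 1_095_119
single_read_indels = 240_873
expected_sequence_errors = 198_777

single_read_candidates = single_read_snps + single_read_indels

# Worst case: every expected sequencing error falls among the
# single-read candidates, so the rest are true variations.
true_fraction_worst_case = 1 - expected_sequence_errors / single_read_candidates

print(f"single-read candidates: {single_read_candidates:,}")
print(f"fraction expected to be true (worst case): {true_fraction_worst_case:.1%}")
```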
To estimate the minimum number of true variations represented by single reads, we compared them with 1,179,850 ESTs. Of these ESTs, 73% are derived from animals sampled from Onagawa, Miyagi, Japan, where we sampled the animals for generating the JP dataset. Among 1,095,119 substitutions, 203,286 sites were covered by one or more ESTs, and in 78,862 positions (39%), the same variations were observed in ESTs. This suggests that at least 39% of the candidate variations represented by only one read are true variations. Note that this latter estimation is very conservative and may be much lower than the real value; more variations could be found in common if we had more ESTs. If this error estimation (roughly 39–85%) can be applied to indels as well, genetic variations may account for 2.7–3.3% of nucleotide positions among the five animals.
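The range quoted above combines the multiply supported sites with a lower and an upper estimate of the true fraction among single-read candidates. The sketch below reproduces that calculation under one explicit assumption: the denominator is taken as the 110 Mb euchromatin size given in the abstract, because the exact denominator behind the reported percentages is not stated in this passage. The result should therefore only be expected to land near the reported range. The function name `true_variation_range` is hypothetical.

```python
def true_variation_range(multi_read_sites, single_read_sites,
                         low_true_fraction, high_true_fraction, genome_bp):
    """Return (low, high) fractions of genome positions carrying true variations."""
    low_sites = multi_read_sites + low_true_fraction * single_read_sites
    high_sites = multi_read_sites + high_true_fraction * single_read_sites
    return low_sites / genome_bp, high_sites / genome_bp


# EST comparison counts quoted in the text.
est_covered_sites = 203_286
est_confirmed_sites = 78_862
est_concordance = est_confirmed_sites / est_covered_sites

low, high = true_variation_range(
    multi_read_sites=2_516_066,
    single_read_sites=1_335_992,
    low_true_fraction=est_concordance,  # conservative EST-based estimate
    high_true_fraction=0.85,            # quality-value-based estimate from the text
    genome_bp=110_000_000,              # assumption: 110 Mb euchromatin as denominator
)

print(f"EST concordance: {est_concordance:.1%}")
print(f"estimated share of variable positions: {low:.2%} to {high:.2%}")
```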
Since our method cannot identify long indels and chromosomal rearrangements, this may be an underestimate of the genetic differences between the animals. On the other hand, copy number variations might lead to mis-identification of genetic variations. Nonetheless, the genetic variations we identified provide an overview of short genetic variations in C. intestinalis.
Experimental validation of short genetic variations
To validate the identified genetic variations, we amplified five short genomic regions on five different chromosomes from 11 different animals by PCR and determined their sequences. These animals were randomly chosen among the offspring of 70 animals sampled from the same place where we had collected the animals for generating the JP dataset. Of the 148 SNPs and 20 indels originally identified within these five regions, 120 (71%) were confirmed in one or more animals examined (Additional file 1: Tables S1 and S2). The remaining 29% may include bona fide variations that could be confirmed by sampling more individuals.
Differences between and within the three datasets
A similar calculation indicated that six haploid genomes from three individuals of a single population in Japan included 2,721,870–3,233,291 (2.43–2.88% of the entire genome) sites that differed from the reference genome. Of these, 2,288,273 sites were represented by multiple JP reads. At 357,229 sites, all of the JP reads differed from the reference. Because these 357,229 sites were not polymorphic among the JP reads, it is possible that these sites are homozygous in the three Japanese individuals, and represent genetic variations between the two populations. The average number of pairwise differences between the Japanese individuals in regions covered by two or more JP reads (99.8 Mb) was 1,089,997 and the average number of pairwise differences per site (nucleotide diversity) was 0.0109. These values were close to the SNP sites and frequencies observed in the US1 set. Thus, the nucleotide diversity of C. intestinalis is lower than those of C. savignyi, sea urchin and amphioxus but much higher than those of humans [2–4] and fish [5, 6].
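Nucleotide diversity as used here is the average number of pairwise differences divided by the number of sites compared. The following Python sketch shows the definition on a toy set of fully aligned haplotypes and then repeats the division for the figures quoted above. It assumes complete, equal-length alignments, which is a simplification: the shotgun data in the study have uneven coverage, and the function `nucleotide_diversity` is a hypothetical helper, not the method used by the authors.

```python
from itertools import combinations


def nucleotide_diversity(haplotypes):
    """Average number of pairwise differences per site among aligned sequences."""
    if len(haplotypes) < 2:
        raise ValueError("need at least two sequences")
    length = len(haplotypes[0])
    if length == 0 or any(len(h) != length for h in haplotypes):
        raise ValueError("sequences must be non-empty and of equal length")
    pairs = list(combinations(haplotypes, 2))
    total_differences = sum(
        sum(a != b for a, b in zip(x, y)) for x, y in pairs
    )
    return total_differences / len(pairs) / length


# Hypothetical toy alignment, not data from the study.
toy_pi = nucleotide_diversity(["ACGT", "ACGA", "TCGA"])

# Figures quoted in the text: average pairwise differences over 99.8 Mb.
reported_pi = 1_089_997 / 99_800_000

print(f"toy example: {toy_pi:.4f}")
print(f"from reported counts: {reported_pi:.4f}")
```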
Among 3,233,449 SNPs and 618,609 indels, 813,622 (25.2%) and 133,418 (21.6%) were found in both the JP and the US (US1 and US2) datasets (Figure 1). This suggests that a significant fraction of genetic variations are shared between the two populations and are not minor alleles. If we obtain more sequence data from different individuals in the future, more variations would turn out to be shared between the two populations. At the same time, these data also suggest that C. intestinalis possesses a large gene pool, because we found genetic variations covering ~3.3% of the genome from only the five animals of the two populations. With more sequence data from different individuals, more novel variations would also be found. Indeed, in the above validation experiment, we identified 30 additional variations over 3.6 kb (data not shown), suggesting there are still unknown genetic variations in the gene pool of this animal.
Characterization of the short genetic variations
Among the SNPs found in coding sequences, 34.5% caused non-synonymous amino acid changes in encoded proteins, 0.7% caused nonsense mutations and the remaining 64.8% did not change the amino acid sequence (Figure 3E). On average, we found 5.6 non-synonymous, 0.1 nonsense and 10.4 synonymous SNPs per gene. In the reference genome, 15,254 gene models are predicted. The present study identified SNPs that caused non-synonymous and nonsense alterations in 12,493 (81.9%) and 1,214 (8.0%) genes, respectively. Even when only the US1 dataset is examined, 9,347 (61.3%) and 610 (4.0%) genes showed non-synonymous and nonsense changes, respectively. On the other hand, 87.52% of indels were found in introns, and 2.3%, 8.8% and 1.9% were found in 5’-UTRs, 3’-UTRs and coding regions, respectively (Figure 3C). Thus the density of indels was much lower in coding regions than in introns, 5’-UTRs and 3’-UTRs (Figure 3D).
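The three categories of coding SNPs can be illustrated with a small Python sketch that translates a codon before and after a single-base change. It assumes the standard nuclear genetic code and a hypothetical helper `classify_coding_snp`; it is not the annotation procedure used in the study. In this simple scheme a change that removes a stop codon is reported as non-synonymous.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G (first base slowest).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_coding_snp(ref_codon: str, position: int, alt_base: str) -> str:
    """Classify a single-base change at position 0, 1 or 2 of a codon."""
    ref_codon = ref_codon.upper()
    alt_base = alt_base.upper()
    if len(ref_codon) != 3 or position not in (0, 1, 2):
        raise ValueError("expected a 3-base codon and a position of 0, 1 or 2")
    if alt_base == ref_codon[position]:
        raise ValueError("variant base equals reference base")
    alt_codon = ref_codon[:position] + alt_base + ref_codon[position + 1:]
    ref_aa = CODON_TABLE[ref_codon]
    alt_aa = CODON_TABLE[alt_codon]
    if alt_aa == ref_aa:
        return "synonymous"
    if alt_aa == "*":
        return "nonsense"
    return "non-synonymous"


if __name__ == "__main__":
    print(classify_coding_snp("GAA", 2, "G"))  # GAA (Glu) to GAG (Glu)
    print(classify_coding_snp("GAA", 0, "A"))  # GAA (Glu) to AAA (Lys)
    print(classify_coding_snp("GAA", 0, "T"))  # GAA (Glu) to TAA (stop)
```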
Genetic variations and gene function
From 11,700 high-quality gene models with clear ORFs, we calculated the ratio of the rate of non-synonymous substitutions (dN) to that of synonymous substitutions (dS) to be 0.16. In contrast, the closely related Ciona savignyi genome shows a much lower dN/dS ratio (0.07). The dN/dS ratio in C. intestinalis was more similar to that of zebrafish (dN/dS = 0.14), suggesting relaxed selection pressure compared with C. savignyi.
For functional annotations, we first compared the whole C. intestinalis proteome with the human proteome using the Blastp program, because the human proteome is the most thoroughly curated and annotated and because both C. intestinalis and humans belong to the same phylum, Chordata. Then, gene ontology (GO) terms from the best-hit human protein for each C. intestinalis protein were used in the following analysis.
The average length of proteins associated with these overrepresented GO terms was shorter than that of proteins in the entire genome (236 and 447 amino acids, respectively). However, this difference is not directly relevant; the density of non-synonymous/nonsense substitutions was markedly lower in genes encoding proteins associated with these GO terms than in the entire genome, while synonymous and intronic SNPs were found at almost the same frequency or slightly less frequently (Figure 4B). Thus, genes encoding proteins associated with these overrepresented GO terms have fewer non-synonymous/nonsense substitutions.
We calculated dN/dS ratios of the C. intestinalis genes encoding proteins with human homologs (8,263 genes). Among them, 126 genes showed a dN/dS larger than 1, and therefore they are candidate genes under positive selection pressure. However, we could not identify a shared biological function among them.
Short genetic variation data as a resource for experimental biologists
Because no C. intestinalis inbred or laboratory strains have been established and researchers are using animals from natural populations, the genetic backgrounds of these animals differ from each other and from the reference. Therefore, we integrated the short genetic variations identified in the present study into the Ghost database, a major Ciona database with a large set of genome-wide data [16, 18]; all of the sequence data are available in the public database. Users can browse these variations as a track of the genome browser, each of which is linked to a detailed description. The information will be helpful not only for linkage analysis but also for designing and interpreting a wide range of molecular biology experiments.
When we published the first genome sequence of C. intestinalis in 2002, we were aware that this genome was extremely polymorphic. However, our present data indicate that the C. intestinalis genome is less polymorphic than the genomes of other invertebrate species that have been sampled from natural, non-inbred populations. This may be related to the fact that among these species, C. intestinalis has the smallest and most compact genome, with the exception of Oikopleura, which has a larger population size and an extraordinarily short generation time. In addition, the efficiency of the proofreading system during genome replication and of the genome repair systems might differ among these species.
Our results suggest that C. intestinalis has a huge gene pool, and that researchers using these animals must be aware of the possible effects of genetic variation in their experiments. Indeed, genetic variation might affect reproduction of results obtained from different laboratories, or even from different animals.
At the same time, our results will be useful for annotating the genome, because a previous study showed that intraspecies sequence comparisons could effectively identify candidate tissue-specific enhancers.
Materials and methods
Sequence reads and alignments
The raw shotgun sequence reads were generated in the previous study and can be found in the trace archive of the public database. From these sequences, we extracted regions consisting of a stretch of 100 or more bases, each of which has a high quality value (≥ 25).
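The extraction step can be sketched as a scan for runs of consecutive high-quality bases. The Python example below is a minimal illustration under one reading of the criterion, namely that every base in a retained stretch must have a quality value of at least 25 and the stretch must span at least 100 bases. The function name `high_quality_regions` and the list-of-integers input are assumptions for this example, not the software used in the study.

```python
def high_quality_regions(qualities, min_quality=25, min_length=100):
    """Return half-open (start, end) intervals of runs in which every base
    has quality >= min_quality and the run is at least min_length long."""
    regions = []
    run_start = None
    for i, q in enumerate(qualities):
        if q >= min_quality:
            if run_start is None:
                run_start = i  # a new high-quality run begins here
        else:
            if run_start is not None and i - run_start >= min_length:
                regions.append((run_start, i))
            run_start = None
    # Close a run that extends to the end of the read.
    if run_start is not None and len(qualities) - run_start >= min_length:
        regions.append((run_start, len(qualities)))
    return regions


if __name__ == "__main__":
    # Hypothetical read: 120 good bases, 1 bad, 99 good (too short), 1 bad, 100 good.
    toy_qualities = [30] * 120 + [10] + [40] * 99 + [5] + [25] * 100
    print(high_quality_regions(toy_qualities))
```

The intervals can be used directly to slice the read sequence, for example `read[start:end]`.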
These high-quality sequences were aligned against the reference genome (KH assembly) using the ssaha2 program; the mapping criterion was identity of 90% or more over the entire read length. From the alignments, we identified short genetic variations (Additional file 2: Table S4; Additional file 3: Table S5; Additional file 4: Table S6). The KH gene models were used for identifying gene regions. When multiple transcripts were predicted for a single gene locus, we used the transcript with the longest coding sequence. We estimated dN/dS by using CODEML with the F3×4 codon frequency model from concatenated coding sequences of 11,700 genes, which have clear ORFs beginning with ATG and ending with a stop codon.
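The three selection rules in this paragraph (the 90% identity criterion, the choice of the longest coding sequence per locus, and the requirement for a clear ORF) can be written as small filters. The Python sketch below is illustrative only. All function names are hypothetical, the inputs are simplified (a count of matching bases, and plain CDS strings), and the checks for a length divisible by three and for the absence of internal stop codons are additional assumptions that the text does not state.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def passes_mapping_criterion(matching_bases, read_length, min_identity=0.90):
    """True if identical bases make up at least min_identity of the whole read."""
    if read_length <= 0:
        raise ValueError("read_length must be positive")
    return matching_bases / read_length >= min_identity


def longest_cds_transcript(transcripts):
    """Given {transcript_id: CDS sequence} for one locus, return the id
    of the transcript with the longest coding sequence."""
    if not transcripts:
        raise ValueError("no transcripts given")
    return max(transcripts, key=lambda tid: len(transcripts[tid]))


def has_clear_orf(cds):
    """True if the CDS starts with ATG and ends with a stop codon.
    Assumption: length must be a multiple of 3 with no internal stop codon."""
    cds = cds.upper()
    if len(cds) < 6 or len(cds) % 3 != 0:
        return False
    if not cds.startswith("ATG") or cds[-3:] not in STOP_CODONS:
        return False
    internal_codons = (cds[i:i + 3] for i in range(3, len(cds) - 3, 3))
    return not any(codon in STOP_CODONS for codon in internal_codons)


if __name__ == "__main__":
    # Hypothetical toy inputs, not data from the study.
    print(passes_mapping_criterion(450, 500))
    locus = {"t1": "ATGTAA", "t2": "ATGAAATAA"}
    chosen = longest_cds_transcript(locus)
    print(chosen, has_clear_orf(locus[chosen]))
```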
Experimental validation of short genetic variations
Ciona intestinalis adults were cultivated at the Maizuru Fisheries Research Station of Kyoto University, in Maizuru, which faces the Sea of Japan. These animals are the offspring of 70 animals that were sampled from Onagawa, Miyagi, Japan. Genomic DNA was isolated from the sperm of 11 different individuals.
We amplified genomic fragments by polymerase chain reaction (PCR) with five different sets of primers. The PCR primers used are shown in Additional file 1: Table S7. The amplified sequence fragments were directly sequenced using an ABI3130xl sequencer.
Gene ontology analysis
All of the predicted C. intestinalis proteins were subjected to a BLAST search against the human proteome using UniProtKB (threshold value < 1e-5). The best-hit human proteins were chosen as likely orthologs, and GO annotations for these candidate orthologs were used.
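The best-hit selection and GO transfer can be sketched in Python as follows. The example assumes that the BLAST results are available in the standard 12-column tab-separated format (query, subject, percent identity, alignment length, mismatches, gap opens, query start, query end, subject start, subject end, E-value, bit score); the study does not state which output format was used. The function names, protein identifiers and GO mapping are hypothetical.

```python
def best_hits(tabular_lines, max_evalue=1e-5):
    """Return {query: best subject}, keeping hits with E-value < max_evalue
    and choosing the hit with the highest bit score for each query."""
    best = {}
    for line in tabular_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        evalue, bitscore = float(fields[10]), float(fields[11])
        if evalue >= max_evalue:
            continue
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return {query: subject for query, (subject, _) in best.items()}


def transfer_go_terms(query_to_subject, subject_go):
    """Assign to each query the GO terms annotated on its best-hit subject."""
    return {
        query: set(subject_go.get(subject, ()))
        for query, subject in query_to_subject.items()
    }


if __name__ == "__main__":
    # Hypothetical BLAST rows and GO annotations, not data from the study.
    rows = [
        "ci1\thumA\t60.0\t200\t80\t0\t1\t200\t1\t200\t1e-50\t250",
        "ci1\thumB\t70.0\t200\t60\t0\t1\t200\t1\t200\t1e-80\t350",
        "ci2\thumC\t30.0\t50\t35\t0\t1\t50\t1\t50\t0.01\t30",
    ]
    hits = best_hits(rows)
    print(hits)
    print(transfer_go_terms(hits, {"humB": {"GO:0003677"}}))
```

Proteins without any hit below the threshold, such as `ci2` in the toy input, receive no annotation.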
SNPs: single nucleotide polymorphisms
EST: expressed sequence tag.
We thank Kazuko Hirayama and all the staff members of the Maizuru Fisheries Research Station of Kyoto University for collecting and cultivating Ciona intestinalis under the National Bio-Resource Project (NBRP) of MEXT, Japan. We also thank Chikako Imaizumi for technical assistance. This research was supported by a Grant-in-Aid from the MEXT, Japan, to YS (21671004).
- Dehal P, Satou Y, Campbell RK, Chapman J, Degnan B, De Tomaso A, Davidson B, Di Gregorio A, Gelpke M, Goodstein DM: The draft genome of Ciona intestinalis: insights into chordate and vertebrate origins. Science. 2002, 298 (5601): 2157-2167. 10.1126/science.1080049.
- Wang J, Wang W, Li R, Li Y, Tian G, Goodman L, Fan W, Zhang J, Li J, Guo Y: The diploid genome sequence of an Asian individual. Nature. 2008, 456 (7218): 60-65. 10.1038/nature07484.
- Levy S, Sutton G, Ng PC, Feuk L, Halpern AL, Walenz BP, Axelrod N, Huang J, Kirkness EF, Denisov G: The diploid genome sequence of an individual human. PLoS Biol. 2007, 5 (10): e254. 10.1371/journal.pbio.0050254.
- Wheeler DA, Srinivasan M, Egholm M, Shen Y, Chen L, McGuire A, He W, Chen YJ, Makhijani V, Roth GT: The complete genome of an individual by massively parallel DNA sequencing. Nature. 2008, 452 (7189): 872-876. 10.1038/nature06884.
- Aparicio S, Chapman J, Stupka E, Putnam N, Chia JM, Dehal P, Christoffels A, Rash S, Hoon S, Smit A: Whole-genome shotgun assembly and analysis of the genome of Fugu rubripes. Science. 2002, 297 (5585): 1301-1310. 10.1126/science.1072104.
- Guryev V, Koudijs MJ, Berezikov E, Johnson SL, Plasterk RH, van Eeden FJ, Cuppen E: Genetic variation in the zebrafish. Genome Res. 2006, 16 (4): 491-497. 10.1101/gr.4791006.
- Sodergren E, Weinstock GM, Davidson EH, Cameron RA, Gibbs RA, Angerer RC, Angerer LM, Arnone MI, Burgess DR, Burke RD: The genome of the sea urchin Strongylocentrotus purpuratus. Science. 2006, 314 (5801): 941-952.
- Putnam NH, Butts T, Ferrier DE, Furlong RF, Hellsten U, Kawashima T, Robinson-Rechavi M, Shoguchi E, Terry A, Yu JK: The amphioxus genome and the evolution of the chordate karyotype. Nature. 2008, 453 (7198): 1064-1071. 10.1038/nature06967.
- Silva N, Smith WC: Inverse correlation of population similarity and introduction date for invasive ascidians. PLoS ONE. 2008, 3 (6): e2552. 10.1371/journal.pone.0002552.
- Denoeud F, Henriet S, Mungpakdee S, Aury JM, Da Silva C, Brinkmann H, Mikhaleva J, Olsen LC, Jubin C, Canestro C: Plasticity of animal genome architecture unmasked by rapid evolution of a pelagic tunicate. Science. 2010, 330 (6009): 1381-1385. 10.1126/science.1194167.
- Vinson JP, Jaffe DB, O'Neill K, Karlsson EK, Stange-Thomann N, Anderson S, Mesirov JP, Satoh N, Satou Y, Nusbaum C: Assembly of polymorphic genomes: algorithms and application to Ciona savignyi. Genome Res. 2005, 15 (8): 1127-1135. 10.1101/gr.3722605.
- Small KS, Brudno M, Hill MM, Sidow A: A haplome alignment and reference sequence of the highly polymorphic Ciona savignyi genome. Genome Biol. 2007, 8 (3): R41. 10.1186/gb-2007-8-3-r41.
- Small KS, Brudno M, Hill MM, Sidow A: Extreme genomic variation in a natural population. Proc Natl Acad Sci U S A. 2007, 104 (13): 5698-5703. 10.1073/pnas.0700890104.
- Boffelli D, Weer CV, Weng L, Lewis KD, Shoukry MI, Pachter L, Keys DN, Rubin EM: Intraspecies sequence comparisons for annotating genomes. Genome Res. 2004, 14 (12): 2406-2411. 10.1101/gr.3199704.
- Sanger F, Coulson AR: A rapid method for determining sequences in DNA by primed synthesis with DNA polymerase. J Mol Biol. 1975, 94 (3): 441-448. 10.1016/0022-2836(75)90213-2.
- Satou Y, Mineta K, Ogasawara M, Sasakura Y, Shoguchi E, Ueno K, Yamada L, Matsumoto J, Wasserscheid J, Dewar K: Improved genome assembly and evidence-based global gene model set for the chordate Ciona intestinalis: new insight into intron and operon populations. Genome Biol. 2008, 9 (10): R152. 10.1186/gb-2008-9-10-r152.
- Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J Mol Biol. 1990, 215 (3): 403-410.
- Satou Y, Kawashima T, Shoguchi E, Nakayama A, Satoh N: An integrated database of the ascidian, Ciona intestinalis: towards functional genomics. Zoolog Sci. 2005, 22 (8): 837-843. 10.2108/zsj.22.837.
- Ning Z, Cox AJ, Mullikin JC: SSAHA: a fast search method for large DNA databases. Genome Res. 2001, 11 (10): 1725-1729. 10.1101/gr.194201.
- Yang Z: PAML: a program package for phylogenetic analysis by maximum likelihood. Computer applications in the biosciences : CABIOS. 1997, 13 (5): 555-556.
- UniProtConsortium: Ongoing and future developments at the Universal Protein Resource. Nucleic Acids Res. 2011, 39 (Database issue): D214-219.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.