Copy number variations among silkworms
© Zhao et al.; licensee BioMed Central Ltd. 2014
Received: 31 July 2013
Accepted: 25 March 2014
Published: 31 March 2014
Skip to main content
© Zhao et al.; licensee BioMed Central Ltd. 2014
Received: 31 July 2013
Accepted: 25 March 2014
Published: 31 March 2014
Copy number variations (CNVs), which are important source for genetic and phenotypic variation, have been shown to be associated with disease as well as important QTLs, especially in domesticated animals. However, little is known about the CNVs in silkworm.
In this study, we have constructed the first CNVs map based on genome-wide analysis of CNVs in domesticated silkworm. Using next-generation sequencing as well as quantitative PCR (qPCR), we identified ~319 CNVs in total and almost half of them (~ 49%) were distributed on uncharacterized chromosome. The CNVs covered 10.8 Mb, which is about 2.3% of the entire silkworm genome. Furthermore, approximately 61% of CNVs directly overlapped with SDs in silkworm. The genes in CNVs are mainly related to reproduction, immunity, detoxification and signal recognition, which is consistent with the observations in mammals.
An initial CNVs map for silkworm has been described in this study. And this map provides new information for genetic variations in silkworm. Furthermore, the silkworm CNVs may play important roles in reproduction, immunity, detoxification and signal recognition. This study provided insight into the evolution of the silkworm genome and an invaluable resource for insect genomics research.
Copy number variations (CNVs) are defined as DNA sequences ranging from 1 kb to few Mb that have different numbers of repeats among individuals [1, 2]. Comparing with single nucleotide polymorphisms (SNPs), CNVs represent a higher percentage of genetic variation and have greater effects on a genome [3, 4]. For example, CNVs play roles in determining phenotypic difference among individuals through changing gene structure and dosage, regulating gene expression and function [5–8]. In addition to normal phenotypic variation, CNVs are also related to genetic disease susceptibility [8, 9]. And recently, CNV detection is substantially carried out in domesticated animals and these studies revealed that CNVs are associated with several phenotypic traits. For example, duplication of KIT gene in pigs determines the Dominant white locus ; while in sheep, the coat color is related to the duplication of ASIP. In ridgeback dogs, hair ridge and predisposition to dermoid sinus are caused by duplication of 4 genes (FGF3, FGF4, FGF19 and ORAOV1) ; and in Shar-Pei dogs, the wrinkled skin phenotype and a periodic fever syndrome are caused by upstream duplication of HAS2. Also, partial deletion of ED1 gene in bovine caused anhidrotic ecodermal dysplasia . In avian species, CNV in intron 1 of the SOX5 gene led to the pea-comb phenotype in chicken . Thus, detection of CNVs at a whole-genome level can give a lot of useful information and has been carried out in several domesticated animals, including pigs, sheep, cattle, dogs,horses and chickens [16–28] as well as crops . However, there is no information on CNVs in silkworm.
The domesticated silkworm (Bombyx mori), a model of Lepidoptera insects, has great economic value because of its silk production as well as its value as a good bioreactor . It is widely accepted that B. mori is domesticated from the wild silkworm, Bombyx mandarina, about 5000 years ago . And nowadays, more than 1,000 Bombyx mori inbred and mutant strains are kept all over the world . In 2008, an estimated 432 Mb silkworm genome was published , with 8.5-fold sequence coverage and N50 size of ~3.7 Mb. And 87% of the scaffold sequences anchored to all 28 chromosomes, which can provide us a reliable genome to analyze the CNVs in silkworm. A previous study showed that the copy number of carotenoid-binding protein (CBP), a major determinant of cocoon color, varied greatly among B. mori strains . Thus, the detection of CNVs at a whole-genome level is necessary for understanding phenotypic variations between different silkworms.
As far as we know, comparative genomic hybridization (CGH) and SNP arrays are routinely used for CNV identification [34–37]. However, the power of CNV detection is easily influenced by low probe density. In addition, although a subset of CNVs showed evidence of linkage disequilibrium with flanking SNPs , a significant number of CNVs located in the regions are not well recovered by SNP arrays [39, 40].
With the development of next-generation sequencing (NGS) and complementary analysis program, there are some better approaches to screen CNVs systematically at a whole-genome level. Generally, NGS employed the read depth (RD) methods to analyze data and previous studies indicated that data with the genome coverage greater than 4 fold are sufficient for RD detection of CNVs [25, 41–43]. To date, several methods have exploited sequence data in 1000 Genomes Project Pilot studies to detect CNVs [44, 45]. And several programs are developed to analyze CNVs. These programs included CNAnorm ( http://www.precancer.leeds.ac.uk/), Bayesian information criterion , ReadDepth , CNV-seq , mrsFAST  and so on . Specifically, an R package named readDepth can detect CNVs based on sequence depth and then invoke a circular binary segmentation algorithm to call segment boundaries . This program has high sensitivity and specificity and is appropriate for screening CNVs in duplication and repeat-rich regions . In this study, we resequenced 4 silkworms (2 domesticated silkworms and 2 wild silkworms). Then, we first used readDepth to screen the silkworm CNVs at a genome level and second used CNAnorm to recheck the CNVs, which can result in the high-confidence CNVs. Finally we tried to explore the distribution pattern and potential functions of the CNVs.
Resequencing data of four silkworms
The CNV calls in four silkworms
After filtering by RD and CNAnorm
Average size (kb)
Average size (kb)
Among four silkworms, the domesticated silkworm N4 contained the largest number of CNVs while wild silkworm NanC contained the fewest. As expected, the “uncharacterized chromosome” (ChrUn), sequences that cannot be mapped to the genome, contains most CNVs (~49%), which is consistent with the observation in cattle . However, the CNVs on ChrUn need to be further investigated since ChrUn contigs are shorter and mapping of ChrUn sequence reads is ambiguous. In our study, CNV detection would be leveraged on the reference genome, thus, copy numbers are reported more like relative copies comparing to the reference genome. A well assembled reference as well as the well-annotated duplications in genome would be important to the CNV detection using this method. Therefore, the correct assemble of the contigs on ChrUn as well as annotations of repeats in the genome may help to improve the identification of CNVs. In order to get the accurate information about the CNVs and excluded false positives, clone-ordered-based approaches for sequence assembly and further annotation of repeats are needed in further study. The remaining CNVs are distributed on the silkworm chromosomes 1–27 and there is no CNV on the chromosome 28.
The positions of CNVs were determined independently within each silkworm and we compared them among different silkworms. Generally, we classified the duplicated sequences as shared or specific to an individual based on the predicted absolute copy numbers. The results showed that most of the CNVs were shared among two or more silkworms (Additional file 6). Specifically, the domesticated silkworm N4 had the largest number of unique CNVs while wild silkworm NanC contained the smallest number of unique CNVs (Table 2; Additional file 6). In general, a genome is assumed to be more tolerant to duplications than to deletions [51–53], accordingly, CNV gain should be more than loss. However, we found that silkworm had more CNV losses than gains, which is consistent with other species [16, 17, 19, 23]. This result may be due to biological as well as technical reasons. One of the most important mechanisms which may be responsible for CNV formation, named as non-allelic homologous recombination, was proven to generate more deletions than duplications . On the other hand, the detection method may favor the identification of deletions as reported in several other studies [20, 44, 55]. However, to validate the real status of CNVs, other techniques such as quantitative PCR (qPCR) is necessary.
Generally, it is accepted that SDs provide substrates of gene and genome innovation as well as genome rearrangement. SDs are also hotspots of formation of CNVs. Thus, SDs may arise from ancient CNVs fixed in the population [57, 63–65]. As observed in other animals (dog, cattle, mouse, rat), there is a consistency (~50%-60%) between large CNVs and SDs (Figure 2) [16, 22, 60]. Thus, the association of large CNVs with SDs supports the hypothesis that CNV formation is mainly due to nonallelic homologous recombination (NAHR). This mechanism was proven to generate more deletions than duplications .
There are 208 functional genes resided at these high-confidence CNV loci. And 101 genes of them are duplicated in the silkworm genome. For example, CNV locus on scaffold 944 (scaffold 944: 6581–8724) encodes a HSP70 (heat shock protein 70) protein. In silkworm, a second copy of HSP70 is located on nscaf2801 (nscaf2801: 598000–599981).
The functional genes located in CNVs possess a large spectrum of GO molecular functions (Figure 3) and provide a wonderful resource for validating the hypothesis that phenotypic variation within and among silkworms may be related to CNVs. For example, the carotenoid-binding protein (CBP), a major determinant of cocoon color, was found to have different copy numbers among the domesticated silkworms, ranging from 1 to 20 . In present study, we also found that CBP gene (BGIBMGA009791-TA) is in CNV regions in 3 (XiaF, AK, NanC) of 4 silkworms investigated. This also further validated the efficacy of our CNV detection.
Genes with molecular function falling in binding and catalytic are enriched in the CNVs as well as SDs (Figure 3) (T-test, p < 0.01), which proved that particular gene classes are overrepresented in CNVs. A lot of these genes may very important in the lineage-specific adaptions of the organism to a particular environment. For example, Antimicrobial peptides (AMP) genes, which play important roles in innate immune system in insects , were found to be enriched in silkworm CNVs (6 genes were identified). Furthermore, since silkworm has to digest the secondary products in the mulberry leaves, some enzymes should be evolved to adapt to it . For example, cytochrome P450 enzymes are involved in such biological processes in the silkworm . In this study, we identified 10 genes belonged to P450 gene family. We also identified Carboxylesterase (COE), which involved in xenobiotic detoxification as well as pheromone degradation , in the CNVs regions. Other genes family related with important functions in lineage-specific evolution included Lipoprotein_11, heat shock proteins are also identified in our study (Additional file 9).
We investigated the genes in the regions of domesticated-specific, wild-specific and all-possessed CNVs. The domesticated-specific CNVs contained 24 functional genes, while wild-specific CNVs contained only 17 genes. We also surveyed the functions and expression patterns of these genes. Most of the genes in these CNV regions are related to detoxification, reproduction and immunity since they were expressed in midgut, testis, ovary and homocyte, respectively. In domesticated-specific CNV regions, there is an extra gene cluster which was expressed in silkgland (Additional file 10). However, most members of this gene cluster were poorly annotated in the silkworm database, indicating that the functional information on the genes in CNVs has been very limited to date. This deserves further investigation in future.
We used real time quantitative PCR (qPCR) to validate CNVs in 5 genomic regions as well as 10 genes. Four of five loci (genomic sequences) were validated by this method (Additional file 11). For the exception, the silkworm genome has two copies of Target_r1 (scaffold984:1…11044) based on the BLASTN searches against B. mori. And the qPCR results showed little variation among 4 silkworms (2 domesticated and 2 wild) at this locus. This might be: (1) prediction errors of CNVs, that is, the false positive; (2) polymorphisms such as indels and SNPs that influence binding of the qPCR primers. For four validated regions, we found that there was a big difference in copy number at the locus of Target_r3 between domesticated and wild silkworms. That is, domesticated silkworm contained more copies than wild type at this locus based on the qPCR results. Also, this region belongs to domesticated-specific region. Furthermore, we found that only one gene (BGIBMGA014594-TA) is located in this CNV region. However, this gene was poorly annotated so far. A previous study showed that this gene was specifically and highly expressed in testis, indicated that this gene may play important roles in reproduction . Further study is needed to characterize its function.
The CNVs (86.7%, 12/15) were confirmed to be positive CNVs by qRT-PCR (Figure 5, Addational file 8). It should be emphasized that not all true CNVs could be detected by qPCR, especially some low-copy duplications with less sequence similarities. Thus, 13.3% for false positive rate is a conserved estimate in our CNV analysis.
We have constructed the first CNVs map in silkworm based on next-generation re-sequencing data. A total of ~319 CNVs were identified in the silkworm genome. We presented the frequency, pattern and gene-content of these CNVs. Our results indicated that the genes in CNVs may be involved in specific biological functions such as reproduction, immunity, detoxification and signal recognition. Besides, we identified 80 CNVs that may be individual-specific. Most of genes in these 80 regions were also related to reproduction or detoxification. The data presented in this study provided insight into the evolution of the silkworm genome and an invaluable resource for insect genomics research.
Silkworm genome was obtained from previous studies [33, 72]. We prepared libraries for four silkworms (two wild silkworms named as AK and NanC and two domesticated silkworms named as XiaF and N4). We sequenced them using Illumina (Hiseq2000) according to standard manufacturer protocols. The low-quality (Quality < 20) nucleotides were trimmed by sliding a 5 bp window.
We used the BWA program to align the paired-end reads to the silkworm genome reference , the criteria are the same as to previous study . For the detection of CNVs among four silkworms, we have applied a program called readDepth  using a parameter 0.01 of an FDR rate, which resulted in bins with a size of 1.7 kbp. And readDepth calculates the thresholds for copy number gain and loss for each silkworm (Additional file 13). The readDepth uses a binning procedure to call copy number variants based on sequence depth and then call segment boundaries using a circular binary segmentation algorithm. Our previous results suggested that there are ~1.4% of SDs in the reference genome , which can help us to adjust the data in the program. The GC bias was corrected using LOESS method to fit a regression line to the data [41, 47].
In order to find the high-confident CNVs, we calculated the read depth (RD) of the regions predicted by the readDepth. And we calculate the average read depth for the unique regions of silkworm identified before . We only kept the regions with RD greater than 3 standard deviations from the mean . Then, these regions whose RD differed significantly from the average of genome RD (Chi square test; p < 0.05) were termed as potential CNVs.
Because different algorithms can generate different CNV results , we used CNAnorm ( http://www.bioconductor.org/packages/release/bioc/html/CNAnorm.html) to recheck our CNV regions to reduce the false-positive or false-negative rate. We employed parameters of –readNum 150, −-saveTest, −-saveControl in PERL script of bam2windows.pl (a script in the CNAnorm package). The parameter lambda 7 was used to decrease noise without losing resolution and ploidy (ploidy = (sugg.ploidy(CNN4) + 1)) was used to check the potential CNVs in the genome.
Heatmaps were obtained based on the absolute copy number call generated by readDepth. The gplots R package ( http://cran.r-project.org/web/packages/gplots/index.html) was employed to get the heatmap of the absolute copy number call in four silkworms.
Gene content of B. mori segmental duplications was assessed using the glean consensus gene set ( http://silkworm.genomics.org.cn/) . We obtained a total of 14,623 silkworm peptides from SilkDB. In addition, using Gene Ontology (GO) , we tested the hypothesis that the molecular function, biological process, and pathway terms were under- or overrepresented in CNV regions. Furthermore, we compared the GO results between the genes from SDs and the genes from CNV regions. Pfam  was also used to annotate the function of the genes in CNV regions.
Genomics DNAs were extracted from domesticated and wild silkworms, and stored in Tris-EDTA (TE) buffer at 4°C. The primers used in qPCR are designed using Primer 5.0 and listed in Additional file 14. The principle for copy number quantifying using qPCR was described in previous study . According to previous studies, OR2 was chosen as control because of its highly-conserved sequence and single copy in the silkworm genome [24, 78, 79]. Con_R is a two-copy region in the silkworm genome according to B. mori genome database [71, 72, 80, 81]. We also used this region as control to estimate copy numbers of target regions.
Each PCR reaction was prepared as follows: 10 μl of SYBR-Green PCR master mix, 1 μl of each primer (10 μM), 7 μl of water, and 1 μl of genome template. Quantitative real-time PCR was carried out using the ABI Stepone plus system. The thermocycler program had an initial 95°C denaturation step followed by 40 cycles consisting of a 10-s denaturation at 95°C, a 40-s annealing at 60°C, and a 30-s extension step at 72°C. At the end of each reaction, a disassociation curve was created, which was used to help to detect the presence of primer dimers of other unwanted amplification products that may produce a detectable cycle threshold (Ct) value. Copy number was analyzed according to comparative Ct method. The ∆CT and ∆∆CT were calculated by the formulas ∆CT = CT target – CT control (single copy) and ∆∆CT = ∆CT SD samples -∆CT single copy sample, respectively. The domesticated silkworm JianPZ was taken as a standard for determining gene copy number.
Raw sequence reads have been deposited in the ENA database (The European Bioinformatics Institute) with the accession number PRJEB5458 and can also be downloaded from http://bioinfor.cqu.edu.cn/read_silkworm/.
This work was supported by the Hi-Tech Research and Development (863) Program of China (2013AA102507) and by a grant from National Natural Science Foundation of China (No. 31272363).
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License ( http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited.