Identification of both copy number variation-type and constant-type core elements in a large segmental duplication region of the mouse genome
- Juzoh Umemori†1, 2, 7,
- Akihiro Mori†3,
- Kenji Ichiyanagi4,
- Takeaki Uno5 and
- Tsuyoshi Koide1, 6
© Umemori et al.; licensee BioMed Central Ltd. 2013
Received: 3 December 2012
Accepted: 5 July 2013
Published: 8 July 2013
Copy number variation (CNV), an important source of diversity in genomic structure, is frequently found in clusters called CNV regions (CNVRs). CNVRs are strongly associated with segmental duplications (SDs), but the composition of these complex repetitive structures remains unclear.
We conducted self-comparative-plot analysis of all mouse chromosomes using the high-speed and large-scale-homology search algorithm SHEAP. For eight chromosomes, we identified various types of large SD as tartan-checked patterns within the self-comparative plots. A complex arrangement of diagonal split lines in the self-comparative-plots indicated the presence of large homologous repetitive sequences. We focused on one SD on chromosome 13 (SD13M), and developed SHEPHERD, a stepwise ab initio method, to extract longer repetitive elements and to characterize repetitive structures in this region. Analysis using SHEPHERD showed the existence of 60 core elements, which were expected to be the basic units that form SDs within the repetitive structure of SD13M. The demonstration that sequences homologous to the core elements (>70% homology) covered approximately 90% of the SD13M region indicated that our method can characterize the repetitive structure of SD13M effectively. Core elements were composed largely of fragmented repeats of a previously identified type, such as long interspersed nuclear elements (LINEs), together with partial genic regions. Comparative genome hybridization array analysis showed that whereas 42 core elements were components of CNVR that varied among mouse strains, 8 did not vary among strains (constant type), and the status of the others could not be determined. The CNV-type core elements contained significantly larger proportions of long terminal repeat (LTR) types of retrotransposon than the constant-type core elements, which had no CNV. The higher divergence rates observed in the CNV-type core elements than in the constant type indicate that the CNV-type core elements have a longer evolutionary history than constant-type core elements in SD13M.
Our methodology for the identification of repetitive core sequences simplifies characterization of the structures of large SDs and detailed analysis of CNV. The results of detailed structural and quantitative analyses in this study might help to elucidate the biological role of one of the SDs on chromosome 13.
Keywords: Comparative genome hybridization array; Repetitive element; Retrotransposon; Mouse genome; Homology search
Copy number variation (CNV) of genomic segments is a common phenomenon that affects approximately 12% and 10.7% of the human and mouse genomes, respectively [1–5]. Comprehensive genomic analyses have shown that CNV sequences often overlap and form clusters of variable regions [3, 6, 7]. These regions, known as CNV regions (CNVRs), are associated with variations in gene expression and phenotype [3, 5, 7–15]. Frequently, CNVRs of intermediate size and larger (>10 kbp) are associated with segmental duplications (SDs) in the human and mouse genomes [3, 6, 7, 16]. An SD is defined as a block of highly homologous (>90%) duplicated genomic DNA that, in the human genome, can range from 1 kbp to several hundred thousand bp [13, 15]. In the mouse genome, SDs can be as large as 1 Mbp in size [17, 18]. Many of the large SDs contain repetitive sequences with ambiguous borders and copy numbers that vary among strains. These sequences are called complex CNVRs. A previous study proposed that CNVRs are associated with differences in gene expression among strains, possibly through changes in local chromatin structure within CNVRs. Previous studies identified SD regions through systematic analysis of the mouse genome and characterized CNV in these regions [17, 18]. However, the detailed character of the repeating unit and the structure of the duplication pattern remained to be resolved. To better understand the evolution of SDs and the biological role of CNVRs, the repetitive structure of SDs must be elucidated in more detail. In this study, we aimed to identify repetitive “core elements”, the copy numbers of these elements, and the detailed structure of large SDs in the mouse genome. Core elements were defined as consensus sequences of repetitive sequences and were expected to be the basic units that formed SDs.
We characterized the organization and variation in copy number of core elements in one of the large SDs on chromosome 13 in mice. The strategy implemented in this study involved four steps: (i) self-comparison of the DNA sequences of entire mouse chromosomes (self-comparative-plot analysis) using the high-speed and large-scale-homology search algorithm, Similarity/Homology Efficient Analyze Procedure (SHEAP), to identify candidate SDs, (ii) identification of core elements and description of the repetitive structure of the SD, using the newly developed stepwise ab initio method, blast-based Systematic analysis of HErPlot to Extract Regional Distinction (SHEPHERD), (iii) comparison of the CNV found in the core elements among mouse strains by comparative genome hybridization array (aCGH), and (iv) characterization of core elements that contain CNV (CNV type) and those that do not (constant type).
Detection of segmental duplications by SHEAP
Table 1 Comparison of the proportion of known repeats in SD13M with the values for the entire genome and average values for SDs
Columns: Ratio in SD13M (%); Segmental duplication average (%); Whole genome average (%)
Identification of core elements for SD13M
Extraction of fundamental repetitive sequences from a self-comparative-plot matrix
Among the 16,872 repetitive sequences identified in SD13M, 2,638 were within the range of 3–4.5 kbp. To eliminate redundant and overlapping sequences, we selected one of these 2,638 sequences and removed the other sequences represented by diagonal lines that were located in the same column or in the same row as that sequence in the self-comparative-plot map (Figure 1B). This step reduces the computational load: the pairwise analysis that is most commonly used would require approximately 7 million (2,638 × 2,637) comparisons and would thus take many days to complete, whereas eliminating redundant and overlapping sequences directly on the self-comparative-plot map enables the analysis to be completed within a short period of time. After repeating this process for different sequences until all redundant sequences had been removed, 547 nonredundant and nonoverlapping repetitive sequences remained, which covered approximately 80% of the SD13M region. We defined these sequences as fundamental repetitive sequences.
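The elimination step described above can be sketched as a greedy filter over the diagonal lines of the plot. The following Python sketch is illustrative only and is not the SHEPHERD implementation: the `Hit` representation (half-open row and column intervals in bp) and the function names are assumptions introduced here, and the order in which sequences are selected is simply the input order.

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    """One diagonal line of the self-comparative plot (hypothetical representation)."""
    row_start: int  # start on the vertical axis (bp)
    row_end: int    # end on the vertical axis (bp, exclusive)
    col_start: int  # start on the horizontal axis (bp)
    col_end: int    # end on the horizontal axis (bp, exclusive)


def overlaps(a_start: int, a_end: int, b_start: int, b_end: int) -> bool:
    """Return True if two half-open intervals share at least one position."""
    return a_start < b_end and b_start < a_end


def select_nonredundant(hits: List[Hit]) -> List[Hit]:
    """Keep a hit only if it shares no rows and no columns with a hit already kept."""
    kept: List[Hit] = []
    for hit in hits:
        clash = any(
            overlaps(hit.row_start, hit.row_end, k.row_start, k.row_end)
            or overlaps(hit.col_start, hit.col_end, k.col_start, k.col_end)
            for k in kept
        )
        if not clash:
            kept.append(hit)
    return kept
```

Because the filter compares plot coordinates only, no sequence alignment is needed at this stage, which is the source of the saving described in the text.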
Clustering of fundamental repetitive sequences
We clustered the fundamental repetitive sequences into groups such that sequence similarity was maximal within each group and minimal between groups. The overlap between these fundamental repetitive sequences was tested by pairwise alignment using bl2seq (without the filtering option; -F F) [28, 29]. We counted the number of sequences that overlapped with other sequences for different lengths of overlap (Figure 3C). Of the fundamental repetitive sequences, 92% shared at least 2.7 kbp, and 14% shared 4 kbp, with at least one other sequence. Given that most of the shared sequences had a size equivalent to that of the fundamental repetitive sequences themselves, which are between 3.0 and 4.5 kbp in length, the fundamental repetitive sequences could be classified into a smaller number of groups as follows. We considered that the most representative sequence in each group should have the highest number of matches with other sequences in the group, and that sequences similar to the representative sequence should belong to that group. As a result of this process, we clustered the 547 fundamental repetitive sequences into 59 groups (representative sequences).
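The grouping rule described above (the representative is the sequence with the most matches, and its matches join its group) can be illustrated with a small greedy procedure. This is a sketch under stated assumptions, not the authors' code: the `matches` dictionary is a hypothetical, symmetric summary of the bl2seq results (sequence name mapped to the names it overlaps sufficiently), the overlap threshold used to build it is not specified here, and ties are broken alphabetically for reproducibility.

```python
from typing import Dict, List, Set


def cluster_by_representative(matches: Dict[str, Set[str]]) -> Dict[str, List[str]]:
    """Greedy clustering: map each representative to the members assigned to it."""
    unassigned = set(matches)
    groups: Dict[str, List[str]] = {}
    while unassigned:
        # The representative is the unassigned sequence that matches the most
        # still-unassigned sequences; sorting first makes ties deterministic.
        rep = max(sorted(unassigned), key=lambda s: len(matches[s] & unassigned))
        members = sorted((matches[rep] & unassigned) - {rep})
        groups[rep] = members
        unassigned -= {rep, *members}
    return groups
```

Each sequence ends up in exactly one group, so the number of keys in the returned dictionary corresponds to the number of representative sequences.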
Identification of core elements
Characterization of core elements
To characterize the core elements, we annotated them with RepeatMasker and with BLASTN using the RefSeqGene database. The known repeats and RefSeq sequences that were detected in each core element are listed in Additional file 4. As expected from the results of the self-plot analysis with masked sequences, all of the core elements contained at least a partial sequence of a known repeat, such as a long interspersed nuclear element (LINE), short interspersed nuclear element (SINE), or long terminal repeat (LTR) type of retrotransposon, as well as uncharacterized repeats such as MurSatRep1. The average proportion of known repeats in the core elements was 59.5% (Additional file 5), which indicated that core elements consisted largely of known repeats. Most of these known repeats were fragmented and overlapped with each other. One-third of the core elements (20 out of 60) contained partial sequences of RefSeq genes (Additional file 4). These partial sequences could be divided roughly into three types of reported or predicted genes: members of the zinc finger protein (Zfp) family, members of the vomeronasal 2 receptor (Vmn2r) family, and chromobox homolog 3 (Cbx3). The average proportions of the total lengths of these annotated gene-like sequences that were found in the core elements were 8% for Zfp, 16% for Vmn2r, and 37% for Cbx3 (Additional file 6).
CNV of core elements among mouse strains
The results of the aCGH were confirmed by quantitative PCR analysis using genomic DNA as the template with several sets of primers (Figure 6A and 6B, Additional files 7 and 8; see Methods). The qPCR analyses showed that the relative amount of core element 541 increased additively as the dosage of the MSM allele in a given consomic strain increased (Additional file 8A). Conversely, the relative amount of core element 454 remained almost constant when the dosage of the MSM allele increased in the same consomic strain (Additional file 8B). These results indicate that the copy number of core element 541 is greater in MSM than in B6, whereas the copy number of core element 454 does not differ between B6 and MSM.
Comparison of constant-type and CNV-type core elements
Annotation of known repeats in core elements with large CNV and without CNV
Columns: Expected copy number (footnote c); Known repeats and RefSeq genes (footnote a); Total length in
Core element 042: LINE, Vmn2r, LTR
Core element 108: Cbx3, LTR, SINE
Core element 127: Vmn2r, LINE
Core element 244: LINE, Vmn2r, SINE
Core element 352: LINE, LTR, Simple, SINE
Core element 454
Core element 462
Core element 484: LINE, Simple, SINE
Core element 103.1
Core element 103.2: Zfp, LINE, LTR
Core element 146
Core element 154
Core element 177.1
Core element 182: LINE, LTR, MurSatRep1
Core element 364
Core element 447: LTR, SINE, Simple/Sat
Core element 510
Core element 541
It has been reported that the mammalian genome contains many complex arrays of repetitive sequences in the centromeric and subtelomeric regions, as well as other SD regions in which many repetitive sequences coexist in a complex manner [30, 31]. However, many complex arrays of repetitive sequences, in particular large SD regions, have been neglected during the detailed characterization of genome structure, partly owing to the lack of an appropriate method for the comprehensive analysis of such highly complex structures. In the present study, we conducted a whole-genome search for complex arrays of repetitive regions by the self-comparative-plot method using the SHEAP program. The advantages of SHEAP are: (i) its applicability to massively long sequences (i.e., whole genome sequences of human or mouse), (ii) its applicability to sequences that contain many global repetitive structures, and (iii) its ability to complete the analysis within a reasonable time frame. With respect to the last point, SHEAP can complete the self-comparison of one human or mouse chromosome within 20 minutes when using a conventional personal computer. As a result, in this study, it was possible to visualize remarkably large SD regions, which covered more than 500 kbp and were composed of complex arrays of duplicated sequences in both forward and reverse directions, as square dark patches.
The mouse genome has been systematically searched for regions that contain SDs. All of the large SD regions that were identified in the present study were also reported as SD regions in an earlier study. However, other SD regions that were reported previously, such as those on chromosomes 1, 6, 8, and 17, were not detected as dark square patches in the self-comparative plots of whole chromosomes that were generated by SHEAP. The results indicate a limitation of this approach based on self-comparative plots because dark square patches were not apparent in some SD regions. Nevertheless, they were detected at higher magnification (Additional file 9), and showed different patterns to those of the dark square patches. These results suggest that different types of SD exist in the mouse genome. Indeed, in the self-comparative-plot analysis of sequence similarity among large SD regions (Additional file 10), we found that most of the SDs comprised unique repetitive sequences, although all of the SDs share many known repetitive elements. Furthermore, this observation suggests that interchromosomal nonallelic homologous recombination has occurred rarely among the SDs in the mouse genome, consistent with a previously described finding, and that the SDs have formed and evolved independently.
The present study is the first detailed analysis of repetitive elements in SD13M, which is one of the large SDs of the mouse genome. The results showed that six core elements within SD13M contained the functionally uncharacterized satellite repeat MurSatRep1. The presence of this satellite repeat was characteristic of SD13M because its frequency was greater in SD13M than in SDs overall (Table 1). MurSatRep1 is presumed to be associated with pericentromeric duplications (Repbase database). These results support the contention that core elements might have structural significance, similar to repetitive sequences in the centromeric region [32, 33]. In addition, four core elements contained MMSAT4, which has been reported to be a satellite sequence that encodes zinc finger proteins. The presence of this repeat was also characteristic of SD13M, but its function is unknown. Other core elements contained known repetitive elements such as LINEs, small regions of RefSeq sequences, and LTR sequences. LINEs are known to be enriched in intermediately sized and larger SDs (>10 kbp) and duplicated gene regions, and are thought to facilitate nonallelic homologous recombination [7, 18, 34, 35]. The existence of these repetitive elements in the core elements supports the hypothesis that SD13M was formed by combinations of nonallelic homologous recombination events. Furthermore, regions with abundant transposable elements are thought to be targeted preferentially by other transposition events. The presence of a higher proportion of LTR sequences in CNV-type than in constant-type core elements suggests that retrotransposition of LTRs also promoted nonallelic homologous recombination and caused CNV in SD13M. This model is very similar to the case of centromere expansion in rice, in which retrotransposons and satellite repeats were duplicated by intra-element homologous recombination.
In the present study, we characterized both the structures and the relative quantities of the repetitive elements in a complex SD region on mouse chromosome 13. Although we did not address the functional significance of SDs in this study, their characteristic repetitive structure indicates that they are similar to the functionally important centromeric region [32, 33]. Interestingly, SD13M is included in the region of chromosome 13 that was difficult to substitute from strain B6 to MSM during the course of establishing a consomic strain. The results of structural and quantitative analyses in this study may help to elucidate the biological role of SD13M.
To detect SDs, we conducted self-plot analysis using genome sequence data (Jul. 2007 assembly of the mouse genome, mm9, NCBI Build37). The Y chromosome was excluded from the analysis because 83.0% of the sequence data for this chromosome consisted of uncertain nucleotides. SDs were visualized and detected using SHEAP, an algorithm capable of efficient discovery of similar short substrings. SHEAP can draw a self-comparative plot more rapidly than harplot or BLAST-based programs [37–39]. The regions that contained SDs were detected simply as clusters of dots that formed complex diagonal lines in the images shown in Figure 1. For the rough detection and visualization of SDs in each chromosome, the criteria of the SHEAP analysis were set to assign a dot to a pixel whenever a pair of sequences of 3,000 bp shared more than three 30-bp homologous sequences (≤2 mismatches), and the overall distance between the paired sequences was larger than 300 bp. For the detailed analysis of repetitive elements in SDs, dots were assigned to pairs of 300-bp sequences when they shared at least one 30-bp homologous sequence (≤2 mismatches).
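The dot-assignment criterion can be made concrete with a brute-force sketch. This is not the SHEAP algorithm, which is designed to avoid exhaustive comparison; it only illustrates the rule "two windows receive a dot when they share enough short homologous sequences with at most two mismatches". The function names are hypothetical, and counting non-overlapping tiles of the first window is an assumption about how shared sequences are counted.

```python
def mismatches(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def shared_kmers(win_a: str, win_b: str, k: int = 30, max_mismatch: int = 2) -> int:
    """Count non-overlapping k-mers of win_a found anywhere in win_b within max_mismatch."""
    count = 0
    for i in range(0, len(win_a) - k + 1, k):
        tile = win_a[i:i + k]
        if any(
            mismatches(tile, win_b[j:j + k]) <= max_mismatch
            for j in range(len(win_b) - k + 1)
        ):
            count += 1
    return count


def has_dot(win_a: str, win_b: str, k: int = 30,
            min_shared: int = 4, max_mismatch: int = 2) -> bool:
    """Decide whether the pixel for this pair of windows is filled."""
    return shared_kmers(win_a, win_b, k, max_mismatch) >= min_shared
```

With the defaults, `min_shared=4` corresponds to "more than three" shared 30-bp sequences for 3,000-bp windows; the detailed analysis would correspond to 300-bp windows with `min_shared=1`. The exhaustive scan is far too slow for whole chromosomes and serves only to state the criterion precisely.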
For further detailed analysis of SD13M, we used sequence data from B6 (NCBI Build37). Before conducting further analysis, we checked the assembly data of the BAC contigs and found no apparent errors (data not shown). Although we cannot rule out the possibility of small errors, the overall sequence is reliable. The detection of known repeats and masking of SD13M were conducted with RepeatMasker (downloaded from http://www.repeatmasker.org/) using repeatmaskerlibraries-20090604 (downloaded from http://www.girinst.org/, megablast -p megablast -W 28 -G 0 -E 0 -q -2 -i filenameA -j filenameB). The pairwise alignments of fundamental repetitive sequences were conducted using bl2seq (bl2seq -i filenameA -j filenameB -p blastn -F F). Known repeats were characterized using Repbase (http://www.girinst.org/repbase/). All RefSeq genes in masked core elements were identified with BLASTN (2.2.24+) [40–42] using the RefSeqGene database (Mus musculus, NCBI Transcript Reference Sequences). A MUSCLE analysis was conducted through a web site (http://www.ebi.ac.uk/Tools/msa/muscle/) in March 2011.
Three inbred strains of mouse, BLG2/Ms (BLG2), C57BL/6J (B6), and MSM/Ms (MSM), were maintained in the animal facility at the National Institute of Genetics (NIG), Mishima, Japan. Both BLG2 and MSM were established as inbred strains after 20 generations of brother–sister mating [43, 44]. The BLG2 and MSM strains belong to the musculus subspecies group, whereas B6 belongs to the domesticus subspecies group. All mice were kept in accordance with NIG guidelines, and all procedures were carried out with approval (No. 18–18 and 19–6) from the Committee for Animal Care and Use of the NIG.
Comparative genome hybridization array (aCGH)
To conduct aCGH analysis on the SD13M region, we designed four types of custom tiling array probe. The first and second types of probe covered a region of approximately 6 Mbp that surrounded SD13M (63,267,529–69,226,366; NCBI Build37). Probes of the first type were completely unique within that genomic region, whereas the second type of probe appeared more than twice in the genomic region covered, but did not appear in other genomic regions. When the first and second types of probe were combined, the average interval between them was 46.3 bp. These probes should detect CNV only in SD13M. The third type of probe covered a small area of chromosome 17 (80,000,245–80,099,784, NCBI Build37) that does not contain an SD and was used for normalization. Owing to the fact that the probes were isothermal, the lengths of the probes ranged from 50 to 75 bp. All of the probes were arrayed in triplicate. The total number of probes in an array was 75,000 (25,000 types of probe × 3). As a result of the aCGH analysis, a total of 9,929 types of probe were mapped on 53 core elements. It was not possible to map a sufficient number of probes on the remaining seven core elements because appropriate sequence probes for the aCGH tiling array were not well represented on these elements (number of probes/core element: < 30).
Genomic DNA was purified from the nuclei of kidney cells from B6, MSM, and BLG2 mice, and then purified further with DNeasy (Qiagen). Reference DNA (B6) and test DNA (BLG2 or MSM) samples were labeled differentially with Cy3 and Cy5, respectively, and hybridized competitively to a microarray chip. Labeling and hybridization were carried out by a commercial aCGH service (Nimblegen Systems, Roche). The fluorescence ratio between Cy3 and Cy5 was normalized against the average value for the control probes designed for chromosome 17. Four sets of aCGH analysis were conducted for two strain pairs: BLG2 and B6, and B6 and MSM. Each genomic DNA sample had two biological replicates. Relative CNV values as compared with B6 are described as the log2 values for each probe on SD13M. Given that most of the probes showed a higher copy number in the BLG2 and MSM strains than in B6, the strain that was the source of the sequence information, the difference in the CNV values was unlikely to have been caused by sequence polymorphisms.
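The normalization described above can be written out as a short calculation. The sketch below is one plausible reading rather than the service provider's pipeline: it assumes that each probe is given as a (Cy3 reference, Cy5 test) intensity pair and that the control-probe average is subtracted in log2 space; the function name is hypothetical.

```python
import math
from statistics import mean
from typing import List, Sequence, Tuple

# Each probe is a (cy3_reference, cy5_test) pair of positive intensities.
Probe = Tuple[float, float]


def normalized_log2_ratios(target: Sequence[Probe],
                           control: Sequence[Probe]) -> List[float]:
    """Return log2(test/reference) per target probe, centred on the control probes."""
    # Baseline: average log2 ratio of the control probes (chromosome 17 in the text).
    baseline = mean(math.log2(cy5 / cy3) for cy3, cy5 in control)
    return [math.log2(cy5 / cy3) - baseline for cy3, cy5 in target]
```

Under this convention a value of 0 means the same relative copy number as the reference strain, and a value of 1 means a twofold higher signal in the test strain.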
Quantitative PCR (qPCR) analysis using genomic DNA
Primers for qPCR of the core elements were designed by a web-based service, PRIMER3 (http://frodo.wi.mit.edu/primer3/), using a mispriming library (RODENT_AND_SIMPLE). A single-copy-number gene, parathyroid hormone-related protein (Pthlh, NM_08970), was used to normalize the levels of genomic DNA. The sequences of all the primers used in the present study are listed in Additional file 7. The qPCR on the genomic DNA of B6, BLG2, and MSM mice was conducted using SYBR® Premix Ex Taq™ II (TAKARA) and a Thermal Cycler Dice Real Time System (TAKARA), in accordance with the manufacturer’s instructions. All reactions were carried out with biological triplicates, each with experimental duplicates. Relative comparative threshold cycle (Ct) values were calculated on the basis of the second derivative maximum method using dedicated software (TAKARA TP800). Relative copy numbers of core elements were estimated by comparison with other strains or genotypes on the basis of the Ct values. Genomic DNA was prepared from different versions of consomic strain B6-Chr13AMSM, which contains the entire chromosome 13 of MSM in a B6 genetic background. Mice homozygous for the entire MSM chromosome 13 appeared infrequently in crosses between mice heterozygous for chromosome 13 of MSM and B6. The different versions were homozygous or heterozygous for the MSM allele of the SD13M region (SD13M^MSM/MSM and SD13M^MSM/B6, respectively), or homozygous for the B6 allele (SD13M^B6/B6). By using genomic DNA from these strains, we were able to investigate the relative copy number of core elements by targeting the SD13M region. The genotypes of these consomic mice are shown in Additional file 7.
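A common way to turn Ct values into relative copy numbers is the comparative Ct (2^-ΔΔCt) calculation, shown below as a sketch. The text states only that relative copy numbers were estimated from Ct values normalized to a single-copy gene, so the exact formula, the assumption of roughly 100% amplification efficiency, and the function name are assumptions made for illustration.

```python
def relative_copy_number(ct_target_test: float, ct_single_copy_test: float,
                         ct_target_calib: float, ct_single_copy_calib: float) -> float:
    """Comparative Ct estimate of the target copy number in a test sample
    relative to a calibrator sample (for example, B6)."""
    # Normalize each sample to the single-copy gene (Pthlh in the text).
    delta_test = ct_target_test - ct_single_copy_test
    delta_calib = ct_target_calib - ct_single_copy_calib
    # One cycle earlier corresponds to twice as much template,
    # assuming the amount of product doubles every cycle.
    return 2.0 ** -(delta_test - delta_calib)
```

For example, a core element that crosses the threshold one cycle earlier in the test strain than in the calibrator, with identical single-copy-gene Ct values, would be estimated at twice the copy number.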
Divergence within each group of core elements
The pairwise divergences of the homologous sequences in each group of core elements were calculated by RepeatMasker using custom-made RepeatMasker library files that contained each of the homologous sequences. The divergence of a core element group was represented by the average of these pairwise divergences. Average divergences were calculated for the seven core elements with constant copy number and for the 46 core elements with CNVs. Core elements 177 and 254 were excluded from the analysis because their sequences were partially contained within core elements 541 and 244, respectively.
Programs and statistics
SHEAP is available online (http://research.nii.ac.jp/~uno/codes.htm). All programs, including SHEAP, SHEPHERD, and a pair-comparison program based on BLAST, are available upon request. Free software, R (http://www.r-project.org/), was used for graphics and statistical analysis.
For the analysis of CNV in core elements, the significance of differences in copy number was determined by a simple two-sided t-test. The null hypothesis of no difference in copy number between two strains corresponds to an average of zero for the aCGH values (log2) mapped on each core element. Thus, the t-statistic for core element i was calculated as t(i) = x̄(i) / √(U(i) / n(i)), where n(i) indicates the number of probes, and U(i) and x̄(i) indicate the unbiased estimate of the population variance and the average of the aCGH values mapped on the core element, respectively. The P value for each core element was calculated from the t-statistic using a t-distribution [df = (probe number) – 1] under the null hypothesis. We adjusted the P value for multiple comparisons (Bonferroni, N = 54), and when it was below the threshold for significance (<0.05), the core element was interpreted as having significant CNV.
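The test described above can be sketched as a one-sample, two-sided t-test of the per-probe log2 values against zero, followed by a Bonferroni adjustment. The sketch assumes the standard statistic t = mean / sqrt(variance / n) with the unbiased variance; the function names are hypothetical, and the P value is obtained by numerical integration of the t density so that only the Python standard library is needed (the authors used R).

```python
import math
from statistics import mean, variance
from typing import Dict, Sequence


def t_statistic(values: Sequence[float]) -> float:
    """One-sample t-statistic against zero (requires non-zero variance)."""
    n = len(values)
    return mean(values) / math.sqrt(variance(values) / n)


def t_pdf(x: float, df: int) -> float:
    """Density of Student's t-distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))


def two_sided_p(t: float, df: int, steps: int = 20000) -> float:
    """Two-sided P value via Simpson integration of the density on [0, |t|]."""
    upper = abs(t)
    if upper == 0.0:
        return 1.0
    h = upper / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return max(0.0, 1.0 - 2.0 * total * h / 3)


def significant_cnv(probe_values: Dict[str, Sequence[float]],
                    n_tests: int, alpha: float = 0.05) -> Dict[str, bool]:
    """Flag core elements whose Bonferroni-adjusted P value is below alpha."""
    flags = {}
    for name, values in probe_values.items():
        p = two_sided_p(t_statistic(values), len(values) - 1)
        flags[name] = min(1.0, p * n_tests) < alpha
    return flags
```

Calling `significant_cnv(values_by_element, n_tests=54)` would mirror the Bonferroni setting given in the text, where `values_by_element` is a hypothetical mapping from core element names to their probe log2 values.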
CNV: Copy number variation
LINE: Long interspersed nuclear element
aCGH: Comparative genome hybridization array
LTR: Long terminal repeat
SINE: Short interspersed nuclear element
Zfp: Zinc finger protein
Vmn2r: Vomeronasal 2 receptor
Cbx3: Chromobox homolog 3.
We are grateful to T. Shiroishi, K. Moriwaki, and A. Kiso for the supply of mice and for helpful advice, Y. Ninomiya for statistical advice, and Y. Sato for evolutionary interpretation. We thank all members of the Mouse Genomics Resource Laboratory at the National Institute of Genetics for rearing the mice and for supporting this study. This study was supported financially by the Research Organization of Information and Systems, Transdisciplinary Research Integration Center (JU, TU, TK), KAKENHI (Grant-in-Aid for Scientific Research) from the Ministry of Education, Culture, Sports, Science and Technology of Japan, and the Japan Society for the Promotion of Science (JSPS).
- Sebat J, Lakshmi B, Troge J, Alexander J, Young J, Lundin P, Månér S, Massa H, Walker M, Chi M, Navin N, Lucito R, Healy J, Hicks J, Ye K, Reiner A, Gilliam TC, Trask B, Patterson N, Zetterberg A, Wigler M: Large-scale copy number polymorphism in the human genome. Science. 2004, 305: 525-528. 10.1126/science.1098918.View ArticlePubMedGoogle Scholar
- Freeman JL, Perry GH, Feuk L, Redon R, McCarroll SA, Altshuler DM, Aburatani H, Jones KW, Tyler-Smith C, Hurles ME, Carter NP, Scherer SW, Lee C: Copy number variation: new insights in genome diversity. Genome Res. 2006, 16: 949-961. 10.1101/gr.3677206.View ArticlePubMedGoogle Scholar
- Redon R, Ishikawa S, Fitch KR, Feuk L, Perry GH, Andrews TD, Fiegler H, Shapero MH, Carson AR, Chen W, Cho EK, Dallaire S, Freeman JL, González JR, Gratacòs M, Huang J, Kalaitzopoulos D, Komura D, MacDonald JR, Marshall CR, Mei R, Montgomery L, Nishimura K, Okamura K, Shen F, Somerville MJ, Tchinda J, Valsesia A, Woodwark C, Yang F, et al: Global variation in copy number in the human genome. Nature. 2006, 444: 444-454. 10.1038/nature05329.PubMed CentralView ArticlePubMedGoogle Scholar
- Beckmann JS, Estivill X, Antonarakis SE: Copy number variants and genetic traits: closer to the resolution of phenotypic to genotypic variability. Nat Rev Genet. 2007, 8: 639-646.View ArticlePubMedGoogle Scholar
- Henrichsen CN, Vinckenbosch N, Zöllner S, Chaignat E, Pradervand S, Schütz F, Ruedi M, Kaessmann H, Reymond A: Segmental copy number variation shapes tissue transcriptomes. Nat Genet. 2009, 41: 424-429. 10.1038/ng.345.View ArticlePubMedGoogle Scholar
- Goidts V, Cooper DN, Armengol L, Schempp W, Conroy J, Estivill X, Nowak N, Hameister H, Kehrer-Sawatzki H: Complex patterns of copy number variation at sites of segmental duplications: an important category of structural variation in the human genome. Hum Genet. 2006, 120: 270-284. 10.1007/s00439-006-0217-y.View ArticlePubMedGoogle Scholar
- Cahan P, Li Y, Izumi M, Graubert TA: The impact of copy number variation on local gene expression in mouse hematopoietic stem and progenitor cells. Nat Genet. 2009, 41: 430-437. 10.1038/ng.350.PubMed CentralView ArticlePubMedGoogle Scholar
- Feuk L, Marshall CR, Wintle RF, Scherer SW: Structural variants: changing the landscape of chromosomes and design of disease studies. Hum Mol Genet. 2006, 15 (Spec No 1): R57-R66.View ArticlePubMedGoogle Scholar
- Li J, Yang T, Wang L, Yan H, Zhang Y, Guo Y, Pan F, Zhang Z, Peng Y, Zhou Q, He L, Zhu X, Deng H, Levy S, Papasian CJ, Drees BM, Hamilton JJ, Recker RR, Cheng J, Deng H-W: Whole genome distribution and ethnic differentiation of copy number variation in Caucasian and Asian populations. PLoS One. 2009, 4: e7958-10.1371/journal.pone.0007958.PubMed CentralView ArticlePubMedGoogle Scholar
- Sha B-Y, Yang TL, Zhao LJ, Chen XD, Guo Y, Chen Y, Pan F, Zhang ZX, Dong S-S, Xu XH, Deng HW: Genome-wide association study suggested copy number variation may be associated with body mass index in the Chinese population. J Hum Genet. 2009, 54: 199-202. 10.1038/jhg.2009.10.PubMed CentralView ArticlePubMedGoogle Scholar
- Conrad DF, Pinto D, Redon R, Feuk L, Gokcumen O, Zhang Y, Aerts J, Andrews TD, Barnes C, Campbell P, Fitzgerald T, Hu M, Ihm CH, Kristiansson K, Macarthur DG, Macdonald JR, Onyiah I, Pang AWC, Robson S, Stirrups K, Valsesia A, Walter K, Wei J, Tyler-Smith C, Carter NP, Lee C, Scherer SW, Hurles ME: Origins and functional impact of copy number variation in the human genome. Nature. 2010, 464: 704-712. 10.1038/nature08516.PubMed CentralView ArticlePubMedGoogle Scholar
- Deng FY, Zhao LJ, Pei YF, Sha BY, Liu XG, Yan H, Wang L, Yang TL, Recker RR, Papasian CJ, Deng HW: Genome-wide copy number variation association study suggested VPS13B gene for osteoporosis in Caucasians. Osteoporos Int. 2010, 21: 579-587.View ArticlePubMedGoogle Scholar
- Shaffer L, Theisen: Disorders caused by chromosome abnormalities. Appl Clin Genet. 2010, 3: 159-174.PubMed CentralView ArticlePubMedGoogle Scholar
- Yim S-H, Kim TM, Hu HJ, Kim JH, Kim BJ, Lee JY, Han BG, Shin SH, Jung SH, Chung YJ: Copy number variations in East-Asian population and their evolutionary and functional implications. Hum Mol Genet. 2010, 19: 1001-1008. 10.1093/hmg/ddp564.PubMed CentralView ArticlePubMedGoogle Scholar
- Chaignat E, Yahya-Graison EA, Henrichsen CN, Chrast J, Schütz F, Pradervand S, Reymond A: Copy number variation modifies expression time courses. Genome Res. 2011, 21: 106-113. 10.1101/gr.112748.110.PubMed CentralView ArticlePubMedGoogle Scholar
- Sharp AJ, Locke DP, McGrath SD, Cheng Z, Bailey JA, Vallente RU, Pertz LM, Clark RA, Schwartz S, Segraves R, Oseroff VV, Albertson DG, Pinkel D, Eichler EE: Segmental duplications and copy-number variation in the human genome. Am J Hum Genet. 2005, 77: 78-88. 10.1086/431652.PubMed CentralView ArticlePubMedGoogle Scholar
- Graubert TA, Cahan P, Edwin D, Selzer RR, Richmond TA, Eis PS, Shannon WD, Li X, McLeod HL, Cheverud JM, Ley TJ: A high-resolution map of segmental DNA copy number variation in the mouse genome. PLoS Genet. 2007, 3: e3-10.1371/journal.pgen.0030003.PubMed CentralView ArticlePubMedGoogle Scholar
- She X, Cheng Z, Zöllner S, Church DM, Eichler EE: Mouse segmental duplication and copy number variation. Nat Genet. 2008, 40: 909-914. 10.1038/ng.172.PubMed CentralView ArticlePubMedGoogle Scholar
- Uno T: Multi-sorting algorithm for finding pairs of similar short substrings from large-scale string data. Knowl Inf Syst. 2009, 25: 229-251.View ArticleGoogle Scholar
- Jurka J, Kapitonov VV, Pavlicek A, Klonowski P, Kohany O, Walichiewicz J: Repbase Update, a database of eukaryotic repetitive elements. Cytogenet Genome Res. 2005, 110: 462-467. 10.1159/000084979.View ArticlePubMedGoogle Scholar
- Takada T, Mita A, Maeno A, Sakai T, Shitara H, Kikkawa Y, Moriwaki K, Yonekawa H, Shiroishi T: Mouse inter-subspecific consomic strains for genetic dissection of quantitative complex traits. Genome Res. 2008, 18: 500-508. 10.1101/gr.7175308.
- Bao Z, Eddy SR: Automated de novo identification of repeat sequence families in sequenced genomes. Genome Res. 2002, 12: 1269-1276. 10.1101/gr.88502.
- Pevzner PA, Tang H, Tesler G: De novo repeat classification and fragment assembly. Genome Res. 2004, 14: 1786-1796. 10.1101/gr.2395204.
- Edgar RC, Myers EW: PILER: identification and classification of genomic repeats. Bioinformatics. 2005, 21 (Suppl 1): i152-i158. 10.1093/bioinformatics/bti1003.
- Price AL, Jones NC, Pevzner PA: De novo identification of repeat families in large genomes. Bioinformatics. 2005, 21 (Suppl 1): i351-i358. 10.1093/bioinformatics/bti1018.
- Hou M, Berman P, Hsu C-H, Harris RS: Homolog Miner: looking for homologous genomic groups in whole genomes. Bioinformatics. 2007, 23: 917-925. 10.1093/bioinformatics/btm048.
- Jiang Z, Hubley R, Smit A, Eichler EE: DupMasker: a tool for annotating primate segmental duplications. Genome Res. 2008, 18: 1362-1368. 10.1101/gr.078477.108.
- Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J Mol Biol. 1990, 215: 403-410.
- Tatusova TA, Madden TL: BLAST 2 Sequences, a new tool for comparing protein and nucleotide sequences. FEMS Microbiol Lett. 1999, 174: 247-250. 10.1111/j.1574-6968.1999.tb13575.x.
- Eichler EE, Archidiacono N, Rocchi M: CAGGG repeats and the pericentromeric duplication of the hominoid genome. Genome Res. 1999, 9: 1048-1058. 10.1101/gr.9.11.1048.
- Ambrosini A, Paul S, Hu S, Riethman H: Human subtelomeric duplicon structure and organization. Genome Biol. 2007, 8: R151. 10.1186/gb-2007-8-7-r151.
- Harrington JJ, Van Bokkelen G, Mays RW, Gustashaw K, Willard HF: Formation of de novo centromeres and construction of first-generation human artificial microchromosomes. Nat Genet. 1997, 15: 345-355. 10.1038/ng0497-345.
- Shang W-H, Hori T, Toyoda A, Kato J, Popendorf K, Sakakibara Y, Fujiyama A, Fukagawa T: Chickens possess centromeres with both extended tandem repeats and short non-tandem-repetitive sequences. Genome Res. 2010, 20: 1219-1228. 10.1101/gr.106245.110.
- Stankiewicz P, Lupski JR: Genome architecture, rearrangements and genomic disorders. Trends Genet. 2002, 18: 74-82. 10.1016/S0168-9525(02)02592-1.
- Hancock JM: Gene factories, microfunctionalization and the evolution of gene families. Trends Genet. 2005, 21: 591-595. 10.1016/j.tig.2005.08.008.
- Ma J, Jackson SA: Retrotransposon accumulation and satellite amplification mediated by segmental duplication facilitate centromere expansion in rice. Genome Res. 2006, 16: 251-259.
- Bailey JA, Yavor AM, Massa HF, Trask BJ, Eichler EE: Segmental duplications: organization and impact within the current human genome project assembly. Genome Res. 2001, 11: 1005-1017. 10.1101/gr.GR-1871R.
- Cheung J, Estivill X, Khaja R, MacDonald JR, Lau K, Tsui L-C, Scherer SW: Genome-wide detection of segmental duplications and potential assembly errors in the human genome sequence. Genome Biol. 2003, 4: R25. 10.1186/gb-2003-4-4-r25.
- Schwartz S, Kent WJ, Smit A, Zhang Z, Baertsch R, Hardison RC, Haussler D, Miller W: Human-mouse alignments with BLASTZ. Genome Res. 2003, 13: 103-107. 10.1101/gr.809403.
- Zhang Z, Schwartz S, Wagner L, Miller W: A greedy algorithm for aligning DNA sequences. J Comput Biol. 2000, 7: 203-214. 10.1089/10665270050081478.
- Pruitt KD, Maglott DR: RefSeq and LocusLink: NCBI gene-centered resources. Nucleic Acids Res. 2001, 29: 137-140. 10.1093/nar/29.1.137.
- Waterston RH, Lindblad-Toh K, Birney E, Rogers J, Abril JF, Agarwal P, Agarwala R, Ainscough R, Alexandersson M, An P, Antonarakis SE, Attwood J, Baertsch R, Bailey J, Barlow K, Beck S, Berry E, Birren B, Bloom T, Bork P, Botcherby M, Bray N, Brent MR, Brown DG, Brown SD, Bult C, Burton J, Butler J, Campbell RD, Carninci P, et al: Initial sequencing and comparative analysis of the mouse genome. Nature. 2002, 420: 520-562. 10.1038/nature01262.
- Koide T, Moriwaki K, Ikeda K, Niki H, Shiroishi T: Multi-phenotype behavioral characterization of inbred strains derived from wild stocks of Mus musculus. Mamm Genome. 2000, 11: 664-670. 10.1007/s003350010129.
- Moriwaki K, Miyashita N, Mita A, Gotoh H, Tsuchiya K, Kato H, Mekada K, Noro C, Oota S, Yoshiki A, Obata Y, Yonekawa H, Shiroishi T: Unique inbred strain MSM/Ms established from the Japanese wild mouse. Exp Anim. 2009, 58: 123-134. 10.1538/expanim.58.123.
- Ogasawara M, Imanishi T, Moriwaki K, Gaudieri S, Tsuda H, Hashimoto H, Shiroishi T, Gojobori T, Koide T: Length variation of CAG/CAA triplet repeats in 50 genes among 16 inbred mouse strains. Gene. 2005, 349: 107-119.
- Baust C, Gagnier L, Baillie GJ, Harris MJ, Juriloff DM, Mager DL: Structure and expression of mobile ETnII retroelements and their coding-competent MusD relatives in the mouse. J Virol. 2003, 77: 11448-11458. 10.1128/JVI.77.21.11448-11458.2003.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.