Arabidopsis thaliana population analysis reveals high plasticity of the genomic region spanning MSH2, AT3G18530 and AT3G18535 genes and provides evidence for NAHR-driven recurrent CNV events occurring in this location

Background Intraspecies copy number variations (CNVs), defined as unbalanced structural variations of specific genomic loci, ≥1 kb in size, are present in the genomes of animals and plants. A growing number of examples indicate that CNVs may have functional significance and contribute to phenotypic diversity. In the model plant Arabidopsis thaliana at least several hundred protein-coding genes might display CNV; however, locus-specific genotyping studies in this plant have not been conducted.

Results We analyzed the natural CNVs in the region overlapping the MSH2 gene, which encodes a DNA mismatch repair protein, and the AT3G18530 and AT3G18535 genes, which encode poorly characterized proteins. By applying multiplex ligation-dependent probe amplification and droplet digital PCR we genotyped those genes in 189 A. thaliana accessions. We found that AT3G18530 and AT3G18535 were duplicated (2–14 times) in 20 and deleted in 101 accessions. MSH2 was duplicated in 12 accessions (up to 12-14 copies) but never deleted. In all but one case, the MSH2 duplications were associated with those of AT3G18530 and AT3G18535. Considering the structure of the CNVs, we distinguished 5 genotypes for this region and determined their frequency and geographical distribution. We defined the CNV breakpoints in 35 accessions with AT3G18530 and AT3G18535 deletions and tandem duplications and showed that they were reciprocal events, resulting from non-allelic homologous recombination between 99%-identical sequences flanking these genes. The widespread geographical distribution of the deletions, supported by the SNP and linkage disequilibrium analyses of the genomic sequence, confirmed the recurrent nature of this CNV.

Conclusions We characterized in detail, for the first time, a complex multiallelic CNV in the Arabidopsis genome. The region encompassing the MSH2, AT3G18530 and AT3G18535 genes shows enormous variation of copy numbers among natural ecotypes, being a remarkable example of high Arabidopsis genome plasticity. We provided molecular insight into the mechanism underlying the recurrent nature of AT3G18530-AT3G18535 duplications/deletions. We also performed the first direct comparison of the two leading experimental methods suitable for assessing DNA copy number status. Our comprehensive case study provides foundational information for further analyses of CNV evolution in Arabidopsis and other plants, and their possible use in plant breeding.

Electronic supplementary material The online version of this article (doi:10.1186/s12864-016-3221-1) contains supplementary material, which is available to authorized users.


Figure S1
Distribution of DNA copy number in regions covered by CNV_610 and CNV_611 in 80 natural accessions of Arabidopsis (MPICao2010 set).

Figure S8
The sequence composition of the left and right breakpoints in accessions with "del-2" and "dupl-2" genotypes.

Figure S9
Sequence alignment of CNV breakpoints in accessions with "del-2" genotype.

Figure S10
Sequence alignment of CNV breakpoints in accessions with simple "dupl-2" genotype.

Figure S11
Sequence alignment of CNV breakpoints in accessions with "dupl-2" genotype harboring an extended duplication that also involves the 3' flank of the right LCR.

Figure S12
Optimization of genomic DNA template input for ddPCR.

Figure S13
Optimization of primer annealing temperatures for ddPCR.

Table S2
Sequences of MLPA probes.

Table S3
Gene specific primers used for ddPCR assays.
Fig. S1 Distribution of DNA copy number in regions covered by CNV_610 and CNV_611 in 80 natural accessions of A. thaliana (MPICao2010 set). The source data are from [29]. DNA copy number is presented as relative to the reference genome, Col-0. The division according to the geographic origin of the accessions is identical to that proposed in the original paper.
Fig. S3 Exemplar electropherograms of AthMSH2-MLPA assay results. The data are representative of accessions with: A - no copy number changes detected; B - amplification of MSH2, AT3G18530 and AT3G18535 genes; C - deletion of AT3G18530 and AT3G18535 genes. Normalized peak heights from the Gene Marker v.2.4.0 analysis are presented (note that the scale on electropherogram B is different from A and C). Probes' IDs (see Table 1
The network was constructed with the NeighborNet algorithm, based on bi-allelic SNPs of at least 10% frequency, surrounding the CNV region from both sides (20-kb flanks). Each accession is marked with a symbol shape indicating its CNV pattern and with a color indicating the genetic group it belongs to, according to a recent population study of 1135 Arabidopsis accessions [44]. No clear evolutionary splits between the accessions harboring distinct copy numbers of MSH2, AT3G18530 and AT3G18535 genes could be observed (see main text for details).

Fig. S6
Linkage disequilibrium (LD) at genomic regions surrounding the investigated CNV. The LD plot shows the correlation between pairs of bi-allelic SNPs in the 20-kb DNA regions that flank the MSH2-AT3G18530-AT3G18535 genes, in 154 accessions (including Col-0), as well as the correlation between the SNPs and the CNV patterns. The SNPs are mapped to their physical positions along chromosome 3 (top) with black lines. Solid green arrows indicate genes; the yellow rectangle highlights MSH2, AT3G18530 and AT3G18535 genes. CNV genotypes were incorporated into the analysis as TRUE or FALSE values (i.e. the accession harbors or does not harbor the indicated genotype, respectively; the "dupl-1" genotype was excluded from this analysis). Although LD blocks were detected in the analyzed region on each side of the CNV, we observed no correlation between any CNV pattern and any SNP (R² < 0.3, data highlighted by the red frame).
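The caption does not state how R² was computed. As a minimal sketch, assuming each bi-allelic SNP is coded 0/1 per accession and each CNV genotype is coded TRUE/FALSE as described, the value can be taken as the squared Pearson correlation between the two binary vectors. The function name and the toy data below are hypothetical and are not taken from the study.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two equal-length 0/1 vectors."""
    n = len(x)
    if n == 0 or n != len(y):
        raise ValueError("vectors must be non-empty and of equal length")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    if var_x == 0 or var_y == 0:
        # A monomorphic site or a genotype absent from the panel: undefined.
        return float("nan")
    return (cov * cov) / (var_x * var_y)


# Hypothetical toy panel of 8 accessions (not real data).
snp_alleles = [0, 0, 1, 1, 0, 1, 0, 1]                          # 0 = reference allele
has_genotype = [True, False, True, False, False, True, True, False]

r2 = r_squared(snp_alleles, [int(v) for v in has_genotype])
print(f"R^2 = {r2:.3f}")
```

Coding the CNV genotype as a binary variable lets it be treated like one more bi-allelic marker, so the same statistic serves both the SNP-SNP and the SNP-CNV comparisons.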
Fig. S11 Sequence alignment of CNV breakpoints in accessions with "dupl-2" genotype harboring an extended duplication that also involves the 3' flank of the right LCR. The sequences are aligned along with the LCRs. The left LCR (Chr3:6372413..6373650) is presented with its 3' flanking region (highlighted in purple), while the right LCR (Chr3:6377368..6378605) is presented with both its 5' flanking region (highlighted in green) and its 3' flanking region (highlighted in teal). Positions distinguishing the two LCRs are highlighted by yellow boxes, except for the last differentiating position, which is highlighted in orange. This position is within the microhomology region, which mediated the template switching in the Di-G, Etna-2 and La-0 accessions. The colors of the LCR flanking regions match those in Supplementary Fig. S8.
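The LCR coordinates quoted in this caption can be checked with simple interval arithmetic. The sketch below assumes 1-based inclusive coordinates and a crossover at equivalent positions in the two repeats; under those assumptions the start-to-start offset is the length of the segment lost or gained in a reciprocal NAHR event. The helper names and the short toy sequences are hypothetical.

```python
def interval_length(start, end):
    """Length of a 1-based, inclusive genomic interval."""
    return end - start + 1


def differing_positions(aligned_a, aligned_b):
    """1-based alignment columns at which two equal-length sequences differ."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    return [i for i, (a, b) in enumerate(zip(aligned_a, aligned_b), start=1) if a != b]


# Coordinates as given in the caption (chromosome 3).
LEFT_LCR = (6372413, 6373650)
RIGHT_LCR = (6377368, 6378605)

left_len = interval_length(*LEFT_LCR)      # 1238 bp
right_len = interval_length(*RIGHT_LCR)    # 1238 bp
repeat_offset = RIGHT_LCR[0] - LEFT_LCR[0]  # 4955 bp between equivalent positions

print(left_len, right_len, repeat_offset)

# Toy sequences only, to show how distinguishing positions are located.
print(differing_positions("ACGTAC", "ACCTAA"))  # [3, 6]
```

Positions returned by `differing_positions` play the role of the highlighted columns in the alignment: the breakpoint of a recombinant lies between the last position matching one repeat and the first position matching the other.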
Fig. S12 Optimization of genomic DNA template input for ddPCR. A - Each column represents a single well of ~18,000 droplets generated from a 20-µl reaction mix containing the indicated amount of Col-0 genomic DNA and a single set of primers targeting the DCL1 gene. B - Linearity of the dynamic range of the ddPCR assays presented in (A).
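In ddPCR, the counts of positive and negative droplets in a well are converted to a target concentration through Poisson statistics. The sketch below illustrates that conversion together with an approximate 95% confidence interval. It is not the instrument software's procedure: the droplet volume of 0.85 nl is an assumed nominal value, the interval uses a delta-method normal approximation, and the example counts are hypothetical.

```python
from math import log, sqrt

# Assumed nominal droplet volume in microliters (0.85 nl); instrument-specific.
DROPLET_VOLUME_UL = 0.00085


def copies_per_droplet(positive, total):
    """Mean number of target copies per droplet from the fraction of positives."""
    if total <= 0 or not 0 <= positive < total:
        raise ValueError("need 0 <= positive < total")
    return -log(1 - positive / total)


def concentration_with_ci(positive, total, droplet_volume_ul=DROPLET_VOLUME_UL, z=1.96):
    """Return (estimate, lower, upper) in copies per microliter of reaction."""
    p = positive / total
    lam = copies_per_droplet(positive, total)
    # Delta method: Var(lambda) = Var(p) / (1 - p)^2 = p / (total * (1 - p)).
    se = sqrt(p / (total * (1 - p)))
    lower = max(lam - z * se, 0.0)
    upper = lam + z * se
    return tuple(v / droplet_volume_ul for v in (lam, lower, upper))


# Hypothetical well: 18,000 accepted droplets, 1,200 of them positive.
estimate, lower, upper = concentration_with_ci(1200, 18000)
print(f"{estimate:.1f} copies/µl (95% CI {lower:.1f}-{upper:.1f})")
```

The logarithmic correction matters once a substantial fraction of droplets is positive, because some droplets then carry more than one template molecule. This is one reason why the template input is optimized to keep the assay within its linear range.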
Fig. S13 Optimization of primer annealing temperatures for ddPCR. A - Separation of positive and negative droplets at a range of primer annealing temperatures (56-60 °C). Each column (separated by yellow lines) represents a single well of ~18,000 droplets generated from a 20-µl reaction mix containing 1 ng of Col-0 genomic DNA and a single set of primers targeting the DCL1, HDA15, MSH2, AT3G18530, AT3G18535 or BRC1 gene (as indicated at the bottom of panel B). B - The target concentrations detected in the assays presented in (A). The error bars indicate the Poisson 95% confidence intervals.
Table S2. Sequences of MLPA probes. Each probe consists of two half probes designed to hybridize to adjacent positions of the target genomic DNA. Each half probe consists of a target-specific sequence (red), a universal primer sequence (blue) and an (optional) stuffer sequence (black).