Large genotyping datasets have become commonplace due to efficient, cheap methods for SNP identification. Typical genotyping datasets may have thousands to millions of data points per accession, across tens to thousands of accessions. There is a need for tools to help rapidly explore such datasets, to assess characteristics such as overall differences between accessions and regional anomalies across the genome.
We present GCViT (Genotype Comparison Visualization Tool), for visualizing and exploring large genotyping datasets. GCViT can be used to identify introgressions, conserved or divergent genomic regions, pedigrees, and other features for more detailed exploration. The program can be used online or as a local instance for whole genome visualization of resequencing or SNP array data. The program performs comparisons of variants among user-selected accessions to identify allele differences and similarities between accessions and a user-selected reference, providing visualizations through histogram, heatmap, or haplotype views. The resulting analyses and images can be exported in various formats.
GCViT provides methods for interactively visualizing SNP data on a whole genome scale, and can produce publication-ready figures. It can be used in online or local installations. GCViT enables users to confirm or identify genomic regions of interest associated with particular traits.
As high throughput genotyping costs have dropped, dense genotyping of large germplasm collections has become commonplace. Re-sequencing and SNP-array projects are used to identify sequence variants between multiple lines, and may be used to perform genome wide association studies (GWAS) to find variants that are associated with phenotypes. These studies can produce millions of SNPs. For example, Torkamaneh et al. identified 15 million variants among 1007 accessions of soybean, which has relatively low diversity compared with a crop such as maize. Often these data sets are used for a single GWAS, but they are rich and may be repurposed for other studies. Reuse of these valuable data requires tools for visualization and analysis.
Several tools exist for exploring these data. The command line tool Genotype Query Tools (GQT) and its web form, webGQT, provide a means of indexing and querying VCF files. However, they lack visualization options. Many tools are available for genomic and genotypic data visualization. Some of these tools include: Flapjack, Integrative Genomics Viewer (IGV), Tassel-GBS, JBrowse [8,9,10], Xena, and SNPVersity. These tools provide visualization of selected genomic regions/genes in a single view, but lack a whole genome overview. Tools that provide a whole genome scale visualization in a single view include: CViT (Chromosome Visualization Tool), Synteny Explorer, MizBee, and SNP & Variation Suite (SVS) from Golden Helix®; yet these tools do not automate the comparisons of accessions using SNP data. CViT displays features on “backbones”, including complete genetic and cytogenetic maps, and whole genome views of genomic features. However, although CViT can be integrated into online resources, it is a standalone Perl application that generates static images on predefined comparisons, limiting its utility for interactive exploration.
Although GCViT is able to handle fairly large data sets, data sets of millions of SNPs and/or hundreds of genotypes may need to be subsampled. A utility for accomplishing this, subsample_vcf.pl, is included in the distribution. This script can be used to filter SNPs by quality, and will select representative SNPs within a specified genomic window size.
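The kind of subsampling described above can be illustrated with a minimal Python sketch. It is not the logic of subsample_vcf.pl, whose selection rule is not described here. The `Snp` record, the `subsample` function, and the choice of the highest-quality SNP as the "representative" of each window are all assumptions made for illustration.

```python
from typing import Iterable, NamedTuple


class Snp(NamedTuple):
    chrom: str
    pos: int      # 1-based position, as in VCF
    qual: float   # VCF QUAL value


def subsample(snps: Iterable[Snp], window: int, min_qual: float) -> list[Snp]:
    """Keep at most one SNP per (chromosome, window).

    Assumption: the representative SNP is the one with the highest quality.
    SNPs below min_qual are dropped before a representative is chosen.
    """
    best: dict[tuple[str, int], Snp] = {}
    for snp in snps:
        if snp.qual < min_qual:
            continue
        key = (snp.chrom, (snp.pos - 1) // window)
        if key not in best or snp.qual > best[key].qual:
            best[key] = snp
    return sorted(best.values(), key=lambda s: (s.chrom, s.pos))
```

With a 1000 bp window and a minimum quality of 30, for example, two passing SNPs at positions 100 and 900 on the same chromosome would be reduced to whichever has the higher quality.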
An instance of GCViT can be set up in a Docker container or installed as a Go and Node.js application. Setup consists of editing the service configuration file, preparing data sets and connecting them to GCViT, and configuring the user interface (UI). Instructions for deploying an instance of GCViT are provided in the GitHub repository (https://github.com/LegumeFederation/gcvit).
The GCViT display
Binning and scaling
To handle the display of dense data, the chromosomes are divided up into bins and counts are represented for each bin. The default bin size is 500 kb, but this can be changed in the server-side configuration file and interactively by the user. Bin sizes should be set according to SNP density and the degree of scaling: very high-density genotype data may be suited for smaller bin sizes, but a large genome will require larger bins due to pixel size because data can’t be displayed at a scale less than one pixel.
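The binning idea can be sketched in a few lines of Python. This is an illustration only, not GCViT's implementation (which is written in Go and Node.js). The function names and the `chrom_height_px` parameter are hypothetical, and the one-bin-per-pixel rule is a simplified reading of the pixel constraint described above.

```python
from collections import Counter

DEFAULT_BIN_SIZE = 500_000  # 500 kb, the default bin size stated in the text


def bin_counts(positions, bin_size=DEFAULT_BIN_SIZE):
    """Count SNPs per bin for one chromosome; positions are 1-based."""
    return Counter((pos - 1) // bin_size for pos in positions)


def min_bin_size(chrom_length_bp, chrom_height_px):
    """Smallest bin size (bp) at which each bin covers at least one pixel.

    Assumption: the chromosome is drawn chrom_height_px pixels long and
    bins are laid out end to end along it.
    """
    return -(-chrom_length_bp // chrom_height_px)  # ceiling division
```

Under these assumptions, a 60 Mb chromosome drawn 800 pixels long could not usefully display bins smaller than 75 kb.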
There are 3 different data displays: histogram, heatmap, and haplotype. The histogram view shows SNP counts in each bin, where the size of each bar represents the count proportional to the minimum and maximum values across the entire genome. The heatmap view shows SNP counts within each bin using color ranges that are proportional to the minimum and maximum values across the genome. The haplotype view shows SNP presence/absence within each bin if the count in the bin matches or exceeds a given threshold.
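As a rough sketch of how per-bin counts could be turned into the three displays, the following Python functions scale a count against the genome-wide minimum and maximum and apply a presence threshold. The function names, the 40-pixel maximum bar width, and the white-to-red color ramp are assumptions chosen for illustration; they are not GCViT's actual drawing parameters.

```python
def scale(count, lo, hi):
    """Map a bin count to 0..1 relative to the genome-wide min (lo) and max (hi)."""
    if hi == lo:
        return 0.0
    clamped = min(max(count, lo), hi)  # values outside the range are clamped
    return (clamped - lo) / (hi - lo)


def histogram_bar(count, lo, hi, max_width_px=40):
    """Histogram view: bar length proportional to the scaled count."""
    return round(scale(count, lo, hi) * max_width_px)


def heatmap_color(count, lo, hi):
    """Heatmap view: linear white-to-red ramp, returned as an (r, g, b) tuple."""
    fade = round(255 * (1 - scale(count, lo, hi)))
    return (255, fade, fade)


def haplotype_present(count, threshold):
    """Haplotype view: the bin is drawn if the count matches or exceeds the threshold."""
    return count >= threshold
```

Passing user-chosen values for `lo` and `hi` instead of the genome-wide extremes corresponds to the option of setting minimum and maximum values by hand.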
The user interface (UI) controls are grouped into sections: “Configure View,” where the data set and genotypes are selected; “Display,” where the image and its interactive controls are displayed; and “View Controls,” which contains controls for turning on and off portions of the image. Detailed instructions for the UI are provided in GCViT itself, through the Help button.
Selecting the reference genotype
The first step is to select a data set and reference genotype. Data set availability is established in the configuration, along with file paths and data set name. In addition, availability of a particular data set may be controlled via simple authentication. Comparisons can be made only within a single data set.
Selecting the comparison genotypes
After selecting the data set and reference genotype, one or more comparison genotypes can be selected and each assigned a distinct color. A full color palette is provided to help distinguish the selected genotypes.
Comparisons can be displayed on the left or right side of the chromosome backbones. For each comparison, the user selects a display type (histogram, heatmap, or haplotype), and the type of comparison (alleles are different from the reference, same as the reference, or the total SNP count). Depending on the display type, the user has the option of setting specific minimum or maximum values rather than leaving GCViT to calculate them across the genome (histogram and heatmap), or of setting a threshold value (haplotype).
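The three comparison types can be illustrated with a short Python sketch that tallies, per bin, the SNPs at which a comparison genotype is the same as or different from the reference, along with the total. The treatment of missing calls (skipped) and of phased versus unphased heterozygotes (treated as equal) are assumptions made here, since GCViT's handling of these cases is not described in the text. The function names are hypothetical.

```python
from collections import Counter


def alleles(gt):
    """Parse a VCF GT string such as '0/1' or '1|0'; None if any allele is missing."""
    parts = gt.replace("|", "/").split("/")
    return None if "." in parts else frozenset(parts)


def compare(sites, bin_size=500_000):
    """Tally per-bin comparison counts for one chromosome.

    sites: iterable of (pos, reference_gt, comparison_gt), positions 1-based.
    Returns Counters keyed by bin index for 'same', 'different', and 'total'.
    """
    result = {"same": Counter(), "different": Counter(), "total": Counter()}
    for pos, ref_gt, cmp_gt in sites:
        ref, other = alleles(ref_gt), alleles(cmp_gt)
        if ref is None or other is None:
            continue  # assumption: sites with a missing call are skipped
        bin_index = (pos - 1) // bin_size
        result["total"][bin_index] += 1
        result["same" if ref == other else "different"][bin_index] += 1
    return result
```

The resulting per-bin counts are the values that a histogram, heatmap, or haplotype display would then render.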
In the Configure View Options section, the image can be given a title, the bin size can be changed, the ruler placement can be modified, and the ruler interval (frequency of tick marks and how often coordinate counts are displayed) can be changed.
There are three main buttons: Display, Download, and Help. The Display button generates the image. The image may be larger than the viewport, in which case it can be moved by clicking and dragging the image. The Download button gives the option of downloading the results in SVG or PNG format. The two options differ: the SVG download contains the whole image (which may be larger than what is displayed on the screen), while the PNG download contains only what is currently visible in the viewport. The GFF file that was created and used to draw the visualization can also be downloaded. The Help button provides information about GCViT and instructions for using the interface.
Pop up box
Clicking on a glyph in the image will pop up a box that identifies the bin number, chromosome coordinates, the value for each accession and the total value for the bin. The pop up box can be customized by modifying the CViTjs pop up template. Examples of potential customizations include link-outs to other resources, such as the Germplasm Repository Information Network (GRIN) accession page, or to a genome browser. In our example on SoyBase there are link-outs to the SoyBase GBrowse instance, for exploration of genic features in the bin; and to the Legume Information System “Context Viewer,” which enables examination of synteny among similar genomic regions.
Above the image, a key is displayed with the currently displayed genotypes and their respective colors. This key will update only after the Display button has been pressed to update the view.
On the left side of the image is a toolbox that provides zoom controls and a set of drawing options that permit drawing free-hand lines or rectangles, an eraser, and a color palette. The image can be moved within the viewport by clicking and dragging with the mouse (Fig. 1). Note that the bin size does not change when zooming in or out.
At the bottom of the page, the “View Controls” section permits the user to toggle off and on individual chromosomes and other display elements (Fig. 1).
Online instances of GCViT at the time of writing include soybean (https://soybase.org/gcvit/), common bean (https://gcvit.phaseolus.legumeinfo.org), chickpea (https://gcvit.cicer.legumeinfo.org), and peanut (https://peanutbase.org/germplasm/gcvit/). Data sets available for soybean include: the whole U.S. germplasm collection genotyped with the SoySNP50K array , resequencing of 481 soybean accessions , resequencing of 102 Canadian accessions , the soybean Nested Association Mapping (SoyNAM) parents and progeny [19, 20], 222 Korean accessions genotyped using the Axiom® SoyaSNP array , 4234 Korean accessions using the Axiom® SoyaSNP array , GmHapMap data consisting of 1007 resequenced accessions , and genotyping of 374 U.S. and Brazilian accessions .
Data available for chickpea contain genotype information from 279 chickpea accessions. For common bean, diversity data are available for two diverse collections of Phaseolus vulgaris: the Mesoamerican Diversity Panel (MDP) and the Andean Diversity Panel (ADP). The peanut data set contains the U.S. Peanut Mini Core Collection genotyped using the 58K Affymetrix SNP array, Axiom Arachis.
There are many potential uses for GCViT. Here we describe four use cases.
Use case 1: identify introgressions between two common bean populations
Using GCViT on the Bean CAP diversity panels, the Andean and Mesoamerican populations can be compared to identify introgressions. In Fig. 2, the Andean line ‘Heirloom’ is the reference genotype, which is compared with three other Andean lines: Dolly, Majesty and Bonus; and three Mesoamerican lines: Avalanche, Maverick and Zorro. Regions that were introgressed from the Mesoamerican population can be seen on chromosomes 3, 4 and 9. Although there are differences between Heirloom and the other three Andean lines, there are few differences among the Mesoamerican lines, suggesting that these regions were introgressed from the Mesoamerican population.
Use case 2: inheritance analysis
Using the SoySNP50K data, pedigree relationships were plotted, comparing soybean line Blackhawk (PI 548516) to sibling Hawkeye (PI 548577), and parents Mukden (PI 548391) and Richland (PI 548406) (Fig. 3). In this example, every region with a difference between Blackhawk and its sibling also shows a difference between Blackhawk and one of its parents, indicating that this region was inherited from different parents in each sibling. From this information, it is apparent that most of Gm04, Gm08, and Gm17 of the siblings were inherited from Mukden, while most of Gm05 was inherited from Richland.
Use case 3: identify conserved genomic regions and/or regions of interest
Using the SoySNP50K data, we can identify regions that are conserved between cultivated soybean lines (Glycine max) and their wild relative (Glycine soja). In this example we used soybean cultivar Williams 82 (Wm82) as the reference and compared it to 6 other cultivated soybean lines and 3 wild (G. soja) lines. On chromosomes Gm05 and Gm20 there are regions that show no differences between Wm82 and the other cultivated soybeans, but clear differences between the cultivated soybeans and the wild soybeans (Fig. 4). These could be regions that were selected during domestication.
Use case 4: identify if two soybean accessions labeled the same are indeed the same
It is known that there can be genomic variation between soybean cultivars with the same name due to differential segregation of polymorphic regions during the breeding process . One cultivar where we see this variation is the representative soybean genome, Williams 82 (Wm82). In these two examples we show two different situations where two accessions with the same name are not 100% genetically similar.
In this example we use a soybean accession that was genotyped in two different studies. The two VCF files were merged using BCFtools and a similarity matrix was created using SNPRelate. Accessions present in both data sets were identified, and all but two matched their counterparts. One of the two was PI 424032, which was found to have a similarity score of 0.715. The differences between the two PI 424032 samples were then plotted using GCViT (Fig. 5a). In this example we can see that the PI 424032 samples genotyped in the SoySNP50K and Lee et al. data sets are completely different. It was later confirmed by the author/PI (personal comm. Dr. Soon-Chun Jeong) that the wrong seed was received from the Soybean Germplasm Repository.
Using two different lines of soybean accession Dwight, we are able to identify regions of selection. Using information from the SoySNP50K data set, the similarity score was calculated between Dwight and PI 597386. The similarity score between these two lines is 0.977. Using GCViT we can plot where these two lines differ (Fig. 5b). This information shows that these two lines started out the same, but were then grown out for multiple generations in different labs (personal comm. Dr. Qijian Song).
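To give a sense of what a pairwise similarity score such as 0.715 or 0.977 summarizes, the following Python sketch computes the fraction of SNPs with identical genotype calls between two samples. This is a simplified stand-in and is not necessarily how SNPRelate computes its similarity matrix. The function name and the encoding of calls as alternate-allele counts are assumptions.

```python
def similarity(calls_a, calls_b):
    """Fraction of SNPs with identical genotype calls, over SNPs called in both.

    Each call is an alternate-allele count (0, 1, or 2), or None when missing.
    Assumption: sites with a missing call in either sample are ignored.
    """
    compared = 0
    matches = 0
    for a, b in zip(calls_a, calls_b):
        if a is None or b is None:
            continue
        compared += 1
        matches += (a == b)
    if compared == 0:
        raise ValueError("no SNPs called in both accessions")
    return matches / compared
```

A score like this shows that two samples differ but not where. Locating the differences along the genome is the step that the GCViT plots add.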
GCViT remains in active development. As it is in the process of being adopted for additional organisms and research communities, we are receiving requests for enhancements and new features. These and future requests will be considered for inclusion in subsequent releases.
GCViT provides useful visualization of SNP data on a whole genome scale. This visualization can provide many insights. Images can be downloaded as publication-ready figures.
All data used in on-line versions of GCViT can be found at the Legume Federation Datastore.
CViT: Chromosome Visualization Tool
GCViT: Genotype Comparison Visualization Tool
GRIN: Germplasm Repository Information Network
GWAS: Genome Wide Association Study
LIS: Legume Information System
SoySNP50K: Illumina Infinium BeadChip containing 50,000 SNPs from soybean, used to genotype the full U.S. soybean germplasm collection
VCF: Variant Call Format
Torkamaneh D, Laroche J, Valliyodan B, O'Donoughue L, Cober E, Rajcan I, et al. Soybean haplotype map (GmHapMap): a universal resource for soybean translational and functional genomics. bioRxiv. 2019:534578.
Layer RM, Kindlon N, Karczewski KJ, Quinlan AR, Exome Aggregation Consortium. Efficient genotype compression and analysis of large genetic-variation data sets. Nature Methods. 2016;13(1):63.
Cannon EK, Cannon SB. Chromosome visualization tool: a whole genome viewer. Int J Plant Genomics. 2011;2011:373875.
Bryan C, Guterman G, Ma KL, Lewin H, Larkin D, Kim J, Ma J, Farre M. Synteny explorer: an interactive visualization application for teaching genome evolution. IEEE Trans Vis Comput Graph. 2016;23(1):711–20.
Torkamaneh D, Laroche J, Tardivel A, O'Donoughue L, Cober E, Rajcan I, Belzile F. Comprehensive description of genomewide nucleotide and structural variation in short-season soya bean. Plant Biotechnol J. 2018;16(3):749–59.
Wei W, Mesquita AC, Figueiró AD, Wu X, Manjunatha S, Wickland DP, et al. Genome-wide association mapping of resistance to a Brazilian isolate of Sclerotinia sclerotiorum in soybean genotypes mostly from Brazil. BMC Genomics. 2017;18(1):849.
von Wettberg EJ, Chang PL, Başdemir F, Carrasquila-Garcia N, Korbu LB, Moenga SM, et al. Ecology and genomics of an important crop wild relative as a prelude to agricultural innovation. Nat Commun. 2018;9(1):1–3.
Moghaddam SM, Mamidi S, Osorno JM, Lee R, Brick M, Kelly J, et al. Genome-wide association study identifies candidate loci underlying agronomic traits in a middle American diversity panel of common bean. Plant Genome. 2016;9(3):1–21.
Otyama PI, Wilkey A, Kulkarni R, Assefa T, Chu Y, Clevenger J, et al. Evaluation of linkage disequilibrium, population structure, and genetic diversity in the US peanut mini core collection. BMC Genomics. 2019;20(1):481.
Haun WJ, Hyten DL, Xu WW, Gerhardt DJ, Albert TJ, Richmond T, et al. The composition and origins of genomic variation among individuals of the soybean reference cultivar Williams 82. Plant Physiol. 2011;155(2):645–55.
We thank Nathan Weeks for help in testing, containerization, and deployment of the GCViT package, and all the members of the USDA SoyBase and Legume Database group who have provided suggestions during the development process. This research was supported in part by the US Department of Agriculture, Agricultural Research Service, project 5030-21000-062-00D. USDA is an equal opportunity provider and employer.
This research was supported in part by the NSF project “Federated Plant Database Initiative for the Legumes,” award #1444806, and by the U.S. Department of Agriculture, Agricultural Research Service, project 5030-21000-069-00D. Mention of trade names or commercial products in this publication is solely for the purpose of providing specific information and does not imply recommendation or endorsement by the U.S. Department of Agriculture. USDA is an equal opportunity provider and employer.
Authors and Affiliations
ORISE Fellow, USDA-ARS Corn Insects and Crop Genetics Research Unit, Ames, IA, 50011, USA
Andrew P. Wilkey
USDA-ARS Corn Insects and Crop Genetics Research Unit, Ames, IA, 50011, USA
Anne V. Brown, Steven B. Cannon & Ethalinda K. S. Cannon
APW: design and all code, AVB: design, SBC and EKSC: design and development of original CViT, and design guidance for GCViT. AVB and APW wrote the manuscript. All authors edited and approved the final manuscript.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
Wilkey, A.P., Brown, A.V., Cannon, S.B. et al. GCViT: a method for interactive, genome-wide visualization of resequencing and SNP array data. BMC Genomics 21, 822 (2020). https://doi.org/10.1186/s12864-020-07217-2