Genome decomposition analysis pipeline
Version 1.0 of GDA was used throughout, with default parameters unless otherwise specified. A window size of 5kbp was used throughout as this represents roughly the size of a gene in apicomplexans (e.g. Plasmodium spp.). The GDA v1.0 code was cloned from a private git repository to a Linux server and a Conda environment that includes all software dependencies established using the create_gda_conda_env.py script provided. This installation was used for running the feature extraction, clustering and analysis parts of the pipeline.
The pipeline extracts the values of various sequence features (e.g. GC content) with a sliding window (default size 5kbp) along all sequences in the assembly. The values are stored as separate bedgraph files (one per feature). The pipeline consists of a master script that is written in Nextflow [32]. The rest of the code of the pipeline has been written mostly in Python. The Nextflow script triggers multiple third party software tools that are used to detect genomic features. As an alternative to using the Conda environment, the pipeline and its dependencies are packaged as a Singularity [33] image, thus simplifying its installation in a shared environment.
Using a genome assembly FASTA file as the input, the genomic feature extraction pipeline determines low complexity sequence content using Dustmasker 1.0.0 [34], tandem repeat content using Tandem Repeats Finder 4.09.1 [35], 10 × coverage of simulated reads using WGSIM 1.0 (https://github.com/lh3/wgsim), retrotransposons using LTRharvest and LTRdigest from GenomeTools 1.6.1 [36], inverted repeats using einverted from EMBOSS 6.6.0 [37] and repeat families using either RepeatMasker + RepeatModeler 2.0.1 [38] or Red (05/22/2015) + MeShClust2 2.3.0 [39, 40]. GC%, AT skew, GC skew, and the frequency of CpG dinucleotides, stop codons and telomeric motifs in each window are determined using Python code. If the user does not provide the pipeline with a gene annotation file, the pipeline can annotate genes itself using Augustus 3.3.3 [41], tRNAscan-SE 2.0.6 [42], and Barrnap 0.9 [https://github.com/tseemann/barrnap]. It is possible to provide hints for Augustus using annotation transfer from a GFF3 file of a related genome with Liftoff 1.6.1 [43]. With additional input data, the pipeline can detect ectopic mitochondrial and apicoplast sequences using BLAST 2.10.1 [34], and RNA-Seq read coverage using HISAT2 2.2.1 [44]. If the user provides proteome FASTA files of species that are related to the target species, the pipeline can run OrthoMCL 1.4 [45]. A more detailed description of the variables can be found in Supplementary Table 1. Note that telomeric motifs, stop codons and kmers are not counted if they are broken up by a border between two windows. However, in the OrthoMCL results analysis part (when calculating the values of variables per gene in the window) a gene that is split between two windows is counted as a part of both windows.
The code for the dimensionality reduction and clustering of the data from genomic windows uses the Python UMAP [17] and HDBSCAN [18] libraries. The scaling of variables before running UMAP is done using MinMaxScaler from the scikit-learn package [46].
In the script for optimising the clustering parameters (gda_parameters.py), Silhouette score, Davies-Bouldin index and Calinski-Harabasz score are calculated for each clustering result using scikit-learn. These scores help to find the clustering settings that work the best for separating the genomic windows into distinct clusters.
After determining the optimal settings for n_neighbors and minimal cluster size, the pipeline runs the final clustering. Kolmogorov–Smirnov test is used to determine whether the distribution of values of a variable in a GDA cluster is significantly different from the distribution of the values of the same variable in the rest of the genomic windows. The test is performed using the ks_2samp function from the scipy package [47]. The Fisher test with Benjamini–Hochberg multiple hypothesis testing correction (using scipy.stats [47] and statsmodels.stats.multitest libraries [48] are used to determine if some types of cluster junctions occur with a different frequency than what is expected by chance. For example, this test yields a statistically significant result when windows belonging to a given cluster are located next to windows belonging to the same other cluster significantly more often than expected by chance.
While the clustering and visualisation parts of the GDA pipeline rely on bedgraph files, none of the third party software tools used by GDA produce output files in bedgraph format. We therefore use Python code written for the GDA pipeline to derive bedgraph files from the diverse set of output files produced by the third party tools. In some cases, the output of a software tool is first converted to GFF format and then the GFF file is converted to a bedgraph file. All bedgraph files corresponding to one assembly are merged into a tab-separated table. The code for merging bedgraph files into a table and for downsampling the table has been written in C + + instead of Python, in order to gain execution speed.
In this work, we distinguish four different feature sets: seq requires only the genome sequence as input, gene features are derived from a set of gene annotations (e.g. mRNA, rRNA, tRNA etc. features in a GFF file), rep features derived from running the RepeatModeler repeat classification and analysis tool, orth derived from running the OrthoMCL tool for determining orthologous and paralogous relationships between protein-coding genes. These feature sets are frequently combined, as stated. In this work “full feature set” refers to the combination of these four feature sets, e.g. seq + gene + rep + orth. GDA is capable of generating additional feature sets and any arbitrary genome data tracks can be added to incorporate novel features.
Datasets
Genome sequences and annotation for the following species were downloaded from VEuPathDB release 51 (https://toxodb.org/toxo/app/downloads/release-51/)—Plasmodium falciparum 3D7, P. knowlesi H, P. chabaudi AS, P. vivax P01, Toxoplasma gondii ME49, Babesia bovis T2Bo, B. microti RI, Theileria annulata Ankara, T. parva Muguga and Cryptosporidium parvum Iowa II. Features in the GFF files labelled protein_coding_gene were changed to gene. Eimeria tenella Houghton data was downloaded from ENA (https://www.ebi.ac.uk/ena/browser/view/GCA_905310635.1). For OrthoMCL runs (excluding large genome analysis), all the above species were included.
Analysis of Plasmodium falciparum
The feature extraction module of GDA was initially run using just the sequence as input, producing the following features: at_skew, cag_freq, cpg_percentage, dustmasker_low_complexity_percentage, einverted_inverted_repeat, N_percentage, gc_percentage, gc_skew, kmer_deviation_kmer_size_3, kmer_deviation_kmer_size_4, LTRdigest_protein_match, LTRdigest_LTR_retrotransposon, stop_codon_freq, tandem_repeats_fraction, telomere_freq, wgsim_depth_minimap2. A description of these features is available in Supp. Table 1.
The clustering_params function of GDA was used to determine suitable clustering parameters, with all combinations of n neighbours (n) = {5, 10, 15, 20} and minimum cluster size (c) = {50, 100, 200 500} explored. Parameter values were chosen to minimise the percentage of unclassified windows and maximise the silhouette score. This was achieved with n = 5 and c = 50. Feature extraction was also performed with the addition of gene annotations (seq + gene), resulting in the following additional features: exon_count, gene_average_exon_length, gene_average_intron_length, gene_length, mRNA_annotations, pseudogene_annotations, rRNA_annotations and tRNA_annotations. Clustering parameters were n = 10 and c = 40. To this feature set, repeat identification with RepeatModeler was added (seq + gene + rep), incorporating sum_of_simple_repeats, sum_of_complex_repeats, as well as numerous, specific simple and complex repeat family features. Clustering parameters for this feature set were n = 15, c = 50. The final feature set added features derived from an analysis of orthologues across the Apicomplexan phylum: apicomplexa_ortholog_count, apicomplexa_paralog_count, apicomplexa_protein_conservation_ratio and apicomplexa_species_specific_proteins_ratio (seq + gene + rep + orth). Here, the clustering parameters were chosen as n = 20, c = 50.
We wanted to determine whether cluster 3 (var/rif genes) and 4 (smaller multigene families) regions in the seq + gene + rep + orth run of P. falciparum were more or less well covered by HP1 chromatin modifications in internal regions versus subtelomeres. We defined subtelomeric windows as those within 200kbp of chromosome ends. To test whether there was a difference in HP1 occupancy between subtelomeric and internal multigene family regions, bedgraph files of log2 ratios of HP1 in trophozoites were downloaded from PlasmoDB, originally derived from [49]. We used bedtools intersect to identify genes overlapping windows of each cluster. Boxplots were drawn using the graphics v4.0.2 package in R. Kolmogorov–Smirnov tests, using the stats v4.0.2 package in R, were used to determine statistical significance.
Analysis of P. vivax and P. knowlesi
Full feature sets (seq + gene + rep + orth) were used for P. vivax and P. knowlesi. For P. vivax we chose parameters n = 20, c = 50, for P. knowlesi n = 10, c = 50. Clustering parameters were chosen using the clustering_params function of GDA as for other species.
Analysis of Eimeria tenella
We used parameters n = 10 and c = 100 with the seq feature set, resulting in exclusion of 2.74% of windows and a silhouette score of 0.53. The default CAG repeat feature was excluded because this feature was originally added specifically to help identify repeats in Eimeria spp. Here, we wanted to demonstrate that these repetitive regions could be identified without prior knowledge. We added rep features (n = 5, c = 50, silhouette score = 0.32), then gene features (13.68% unclassified windows and silhouette score = 0.18, n = 10, c = 50), then orth features (6 clusters, n = 10, c = 100, 1.9% windows unclassified).
Homopolymeric Amino Acid Repeats (HAARs) were identified using Python regular expressions, looking for runs of A, S, Q, L and N of at least 7 in predicted protein sequences. There were 13,389 A, 9,404 Q, 5,350 S, 334 L and 6 N repeats.
Analysis of Toxoplasma gondii
Non-chromosomal contigs were removed from the assembly. The seq + gene + rep + orth feature set was used with parameters n = 20, c = 50, resulting in 5 clusters, with no unannotated windows.
Analysis of large genomes
The GDA feature extraction pipeline was run with four genomes of increasing size, with and without RepeatModeler (rep feature set) to show how resource requirements scale. Each was run with orthologue analysis (orth), genome annotation (gene) feature sets as well as NUclear Mitochondrial DNA (NUMT) identification. All jobs were executed on the Wellcome Sanger Institute compute farm with Intel(R) Xeon(R) Gold 6226R CPU @ 2.90 GHz processors.
and up to 16 threads. Genomic windows size was 5 kbp in all runs—which represents 654,762 windows for H. sapiens. Gene annotations were read from existing GFF files from the same origin as the assembly FASTA files (PlasmoDB, NCBI or WormBase ParaSite).
Plasmodium falciparum 3D7 (PlasmoDB release 43) was used with the Pf_M76611 (PlasmoDB) mitochondrial genome reference and reference proteomes P. chabaudi chabaudi AS, P. ovale curtisi GH01, P. gallinaceum 8A, P. malariae UG01, P. berghei ANKA, P. vivax P01 (from PlasmoDB release 52). Caenorhabditis elegans (RefSeq GCF_000002985.6) was used with mitochondrial sequence NC_001328.1 (NCBI) and predicted proteomes GCF_000001215.4 Release 6 (Drosophila melanogaster), GCF_000146045.2 R64 (Saccharomyces cerevisiae) and GCF_000001405.39 GRCh38.p13 (Homo sapiens) from NCBI, GCA_900184235.1 (Oscheius tipulae) and GCA_000469685.2 (Haemonchus contortus) from GenBank and PRJEA36577.WBPS14 (Schistosoma mansoni) from WormBase ParaSite. Schistosoma mansoni (WormBase ParaSite release 14, assembly Smansoni_v7) was used with mitochondrial sequence NC_002545.1 (NCBI) and predicted proteomes PRJDA72781.WBPS14 (Clonorchis sinensis), PRJEB527.WBPS14 (Schistocephalus solidus), PRJEB122.WBPS14 (Echinococcus multilocularis), PRJEA34885.WBPS14 (Schistosoma japonicum), PRJNA179522.WBPS14 (Fasciola hepatica), PRJEB124.WBP from WormBase ParaSite [50]. Homo sapiens (NC_012920.1; NCBI) was run with mitochondrial sequence NC_012920.1 (NCBI) and predicted proteomes GCF_000002035.6_GRCz11 (Danio rerio), GCF_001663975.1 (Xenopus laevis v2), GCF_000001635.27_GRCm39 (Mus musculus) from NCBI.
For clustering analysis of the human genome, features were calculated in 5kbp windows and then merged into 50kbp windows. Tracks for individual repeat families were excluded but sum of simple repeats and sum of complex repeats features were used. Suitable clustering parameters were chosen using the clustering_params tool, with n_neigbors = 50 and minimum cluster size = 500.