
Table 1 Definitions for terms used in the current study

From: Contrasting new and available reference genomes to highlight uncertainties in assemblies and areas for future improvement: an example with monodontid species

Term

Definition

Data types

 Short reads

Accurate sequences of DNA typically 150 bp in length and typically generated on Illumina platforms [14]. Sequences may be paired or unpaired depending on whether both ends of DNA fragments were sequenced

 Linked reads

Short reads, but with molecular barcodes that tag reads from the same DNA fragment, creating “read clouds” that leverage long range information [15]. Sometimes referred to as 10x Linked reads after 10x Genomics, a company that provided this type of sequencing prior to 2020

 Long reads

Sequences produced directly from long fragments of DNA, thus providing long range information in the form of intact reads. Typically generated using PacBio or Nanopore platforms, long reads have historically been more error prone than short reads, though read accuracy continues to improve (e.g. PacBio HiFi reads; [14])

 Chromosome conformation capture

A method used to map spatial organization of chromatin across genomes [16]. A suite of techniques can be used to cross link loci and sequence DNA fragments as paired-end short reads linked by unknown proximity. The higher order structure of sequences (e.g. chromosomes) can be inferred because loci interactions increase with linear proximity on the genome. Data is generated on similar platforms to short read sequences, e.g. Illumina

 Optical genome mapping

A restriction enzyme is applied to highly intact DNA and the lengths and order of fragments are measured. This information is used to guide the order and orientation of assembly fragments by matching patterns in the occurrence of sequence motifs [17, 18]. Note, the data represents mapping information or physical locations of sequence motifs, not sequence data. Bionano is currently the main provider for optical genome mapping services

Reference genome quality

 K-mer

Substrings of length k within DNA sequence data

 Coverage

The number of times, on average, a genomic region or complete genome has been sequenced. Oftentimes synonymous with the depth, or number, of uniquely overlapping reads in a dataset

 Contig

A DNA sequence assembled by overlapping k-mers or reads

 Scaffold

Contigs ordered and oriented into longer sequences, typically with gaps represented as Ns in between contigs [19]

 Contiguity

The degree to which a reference genome is assembled into continuous sequences representing DNA. A genome fragmented into a larger number of smaller sequences is less contiguous

Quantitative parameters

 N50

The length of the shortest sequence in the smallest set of longest sequences that together represent at least 50% of the reference genome. A proxy for contiguity

 L50

The smallest number of sequences that together represent at least 50% of the reference genome. A proxy for contiguity

 Completeness

The proportion of the genomic sequences captured in a reference assembly. This is typically benchmarked using the proportion of observed vs expected single copy orthologues appearing in an assembly (i.e. BUSCO scores; [10])

Qualitative parameters

 Accuracy

A general term to scale the match between an assembly and a hypothetical complete and error-free assembly

 Precision

A general term to scale the replicability of the assembly using similar or alternative methods

 Certainty/uncertainty

A general term to scale the confidence surrounding a genomic sequence or assembly

 Error/mis-join/mis-assembly

An incorrect inference regarding the order and/or orientation of a particular genomic sequence

 Discrepancy

An inconsistency between two reference genomes which could be due to an error or inter- or intraspecific variation

Discrepancies

 Debris

Segments of DNA, typically contigs, not assimilated into higher order scaffolding of chromosome sequences

 Gaps

Runs of Ns, typically 10-100 bases long, that appear between contigs within scaffolds, representing uncertainty between the adjoining sequences

 Translocation

A unique DNA segment appearing on different chromosomes between two assemblies

 Inversion

A unique DNA segment running in opposite directions between two assemblies

 Relocation

Unique DNA segments appearing in a different order between two assemblies

General terms

 Restriction enzyme

A protein that cleaves DNA at sites with a particular sequence, or restriction site

 Orthologous

A DNA segment or gene appearing in separate species and inherited from a common ancestor, typically retaining similar function

 Repetitive element

Patterns of DNA sequences that occur as multiple copies throughout a genome

 Transposable element

DNA sequences, typically genes, that can move location within a genome

 Reference genome

A representation or estimation of the entire genomic sequence of a species or individual

 End-user

Someone seeking to leverage a previously generated reference genome for applied purposes. For example, an end-user might use a reference genome to map sequences and call variant positions in a set of samples
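
Several of the definitions in Table 1 (K-mer, N50, L50, Gaps) can be made concrete with a few lines of code. The following is a minimal Python sketch, not the method used in the study: the function names, the example sequence lengths, and the convention that the 50% threshold is reached when the running total is greater than or equal to half the assembly length are all assumptions made for illustration.

```python
import re


def n50_l50(lengths):
    """Return (N50, L50) for a list of contig or scaffold lengths.

    Assumption: the threshold counts as reached once the running total
    is >= half of the summed length; some tools may differ on ties.
    """
    ordered = sorted(lengths, reverse=True)  # longest sequences first
    half = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half:
            # length is the N50, count is the L50
            return length, count
    raise ValueError("lengths must be non-empty")


def kmers(sequence, k):
    """Return every substring of length k, in order of occurrence."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def gap_runs(scaffold):
    """Return the length of each run of Ns in a scaffold sequence."""
    return [len(m.group()) for m in re.finditer(r"N+", scaffold.upper())]


# Hypothetical assembly of seven sequences totalling 300 bp.
example_lengths = [80, 70, 50, 40, 30, 20, 10]
print(n50_l50(example_lengths))  # (70, 2)
print(kmers("ACGTA", 3))         # ['ACG', 'CGT', 'GTA']
print(gap_runs("ACGT" + "N" * 10 + "ACGT"))  # [10]
```

In the hypothetical example, the two longest sequences (80 and 70 bp) together reach half of the 300 bp total, so N50 is 70 and L50 is 2. A more fragmented assembly of the same total length would give a smaller N50 and a larger L50, which is why both serve as proxies for contiguity.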