Contrasting new and available reference genomes to highlight uncertainties in assemblies and areas for future improvement: an example with monodontid species

Bringloe, Trevor T.; Parent, Geneviève J.

doi:10.1186/s12864-023-09779-3

Table 1 Definitions for terms used in the current study

Term	Definition
Data types
Short reads	Accurate DNA sequences, typically 150 bp in length and generated on Illumina platforms [14]. Sequences may be paired or unpaired depending on whether both ends of DNA fragments were sequenced
Linked reads	Short reads, but with molecular barcodes that tag reads from the same DNA fragment, creating “read clouds” that leverage long-range information [15]. Sometimes referred to as 10x Linked reads after 10x Genomics, a company that provided this type of sequencing prior to 2020
Long reads	Sequences produced directly from long fragments of DNA, thus providing long-range information in the form of intact reads. Typically generated using PacBio or Nanopore platforms, long reads have historically been more error-prone than short reads, though read accuracy continues to improve (e.g. PacBio HiFi reads; [14])
Chromosome conformation capture	A method used to map the spatial organization of chromatin across genomes [16]. A suite of techniques can be used to cross-link loci and sequence DNA fragments as paired-end short reads linked by unknown proximity. The higher order structure of sequences (e.g. chromosomes) can be inferred because interactions between loci increase with linear proximity on the genome. Data are generated on similar platforms to short read sequences, e.g. Illumina
Optical genome mapping	A restriction enzyme is applied to highly intact DNA and the lengths and order of fragments are measured. This information is used to guide the order and orientation of assembly fragments by matching patterns in the occurrence of sequence motifs [17, 18]. Note that the data represent mapping information, or the physical locations of sequence motifs, not sequence data. Bionano is currently the main provider of optical genome mapping services
Reference genome quality
K-mer	Substrings of length k within DNA sequence data
Coverage	The number of times, on average, a genomic region or complete genome has been sequenced. Oftentimes synonymous with the depth, or number, of uniquely overlapping reads in a dataset
Contig	A DNA sequence assembled by overlapping k-mers or reads
Scaffold	Contigs ordered and oriented into longer sequences, typically with gaps represented as Ns in between contigs [19]
Contiguity	The degree to which a reference genome is assembled into continuous sequences representing DNA; a genome fragmented into a larger number of smaller sequences is less contiguous
Quantitative parameters
N50	The minimum sequence length above which 50% of the reference genome is represented. A proxy for contiguity
L50	The minimum number of sequences within which 50% of the reference genome is represented. A proxy for contiguity
Completeness	The proportion of the genomic sequences captured in a reference assembly. This is typically benchmarked using the proportion of observed vs expected single copy orthologues appearing in an assembly (i.e. BUSCO scores; [10])
Qualitative parameters
Accuracy	A general term describing how closely an assembly matches a hypothetical complete and error-free assembly
Precision	A general term describing how replicable an assembly is using similar or alternative methods
Certainty/uncertainty	A general term describing the confidence surrounding a genomic sequence or assembly
Error/mis-join/mis-assembly	An incorrect inference regarding the order and/or orientation of a particular genomic sequence
Discrepancy	An inconsistency between two reference genomes which could be due to an error or inter- or intraspecific variation
Discrepancies
Debris	Segments of DNA, typically contigs, not assimilated into higher order scaffolding of chromosome sequences
Gaps	Runs of Ns, typically 10-100, that appear between contigs within scaffolds, representing uncertainty between the adjoining sequences
Translocation	A unique DNA segment appearing on different chromosomes between two assemblies
Inversion	A unique DNA segment running in opposite directions between two assemblies
Relocation	Unique DNA segments appearing in a different order between two assemblies
General terms
Restriction enzyme	A protein that cleaves DNA at sites with a particular sequence, or restriction site
Orthologous	Describes a DNA segment or gene that appears in separate species and was inherited from a common ancestor, typically retaining similar function
Repetitive element	Patterns of DNA sequences that occur as multiple copies throughout a genome
Transposable element	DNA sequences, typically genes, that can move location within a genome
Reference genome	A representation or estimation of the entire genomic sequence of a species or individual
End-user	Someone seeking to leverage a previously generated reference genome for applied purposes. For example, an end-user might use a reference genome to map sequences and call variant positions in a set of samples
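
Several of the definitions above (k-mer, gaps, N50 and L50) are computational and can be illustrated with a short Python sketch. This is not code from the study: the function names n50_l50, kmer_counts and gap_runs, and the example lengths and sequences, are hypothetical and chosen only for illustration. The sketch assumes N50 and L50 are computed against the total assembly length, and that gaps are encoded as runs of the character N.

```python
import re
from collections import Counter


def n50_l50(lengths):
    """Return (N50, L50) for a list of contig or scaffold lengths.

    Sequences are visited from longest to shortest. N50 is the length of
    the sequence at which the running total first reaches half of the
    assembly length; L50 is how many sequences were needed to get there.
    """
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half:
            return length, count
    raise ValueError("no sequence lengths given")


def kmer_counts(seq, k):
    """Count every substring of length k (k-mer) in a DNA sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def gap_runs(seq):
    """Return (start, length) for each run of Ns in a scaffold sequence."""
    return [(m.start(), len(m.group())) for m in re.finditer(r"N+", seq.upper())]


# Hypothetical assembly of six sequences totalling 290 bp.
print(n50_l50([80, 70, 50, 40, 30, 20]))   # (70, 2)
print(kmer_counts("ACGTACG", 3)["ACG"])    # 2
print(gap_runs("ACGTNNNNNNNNNNACGT"))      # [(4, 10)]
```

In the example, the two longest sequences (80 and 70 bp) together cover 150 of 290 bp, so N50 is 70 and L50 is 2. A more contiguous assembly of the same genome would show a larger N50 and a smaller L50.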

ISSN: 1471-2164

BMC Genomics
