Sampling, library preparation and sequencing
Individuals were collected from Bixa orellana accessions in the display/research garden of the Instituto de Medicina Tradicional de ES SALUD (IMET), (Iquitos, Peru) and the germplasm collection of Bixa orellana in the Facultad de Agronomía - Universidad Nacional de la Amazonía Peruana (UNAP). DNA extraction from leaves was performed using DNeasy Plant Mini Kit from Qiagen® according to the manufacturers specifications. DNA was prepared for sequencing using the Nextera™ DNA Flex Library Prep Kit according to the manufacturer’s specifications. Libraries were loaded onto an Illumina MiSeq using a V2 cartridge with read lengths set for (2 × 250).
Sequence assembly
Following DNA sequencing, data were trimmed to remove adapters and indexes using BCL2fastq. Sequences were then paired and quality trimmed using Usearch11 [8]. Quality trimming parameters trimmed sequences to a median Q score of 25, with minimum length of 100 bp, and sequences below Q25 or 100 bp length discarded. Following read preparation, genome assembly followed the workflow described in [29] where reads are initially mapped to the nearest relative to enrich for chloroplast derived reads, and then denovo assembled to establish initial contigs, followed by subsequent subdivision and reassembly of contigs in an iterative fashion until the set of contigs can be combined to single complete chromosome. Completion of assembly was inferred by identifying inverted repeats and overlap between beginning and ending sequence reflecting assembly of a complete circle.
Paired, filtered reads were mapped to Bixa orellana (NC_041550) to enrich reads for chloroplast specific sequences using Geneious R11 (https://www.geneious.com) Mapped reads were then denovo assembled using SPAdes [30], with kmer sizes 21, 35 and 55. In parallel, all paired, filtered reads were denovo assembled with SPAdes to generate contigs independently of the reference genome to ensure all changes in structure and synteny could be identified. The denovo contigs of both assemblies were merged and assembled in Geneious R11. Assembled contigs were compared to the reference genome and gaps were filled by extracting sequences adjacent to the gaps and mapping the paired filtered reads to those sequences using Bowtie assembler [31] in a reference guided assembly. Contigs extended by Bowtie were then subsequently combined with previous contig sets and again compared to the reference genome. The assembled plastome was tested for the large inverted repeats, and the presence of both repeats was indicative of complete chromosome assembly. Raw sequence data were deposited in NCBI Sequence Read Archive (SRR10320715 and SRR10320716) as part of the FDA GenomeTrakrCP: chloroplast DNA for botanical product identification and for the genesis of the CVM Complete Chloroplast Animal Feed Database ([15], BioProject PRJNA325670).
Sequence annotation and comparative gene content
Sequences were annotated with the Verdant online chloroplast assembly tool [32] (see: http://verdant.iplantcollaborative.org/plastidDB/). The existing reference for Bixa, NC_041550, was added to the Verdant database prior to upload and analysis of our sequences. The resulting gff files were downloaded and used to annotate the assembled Bixa chloroplast genomes (Fig. 2). Each CDS was extracted and translated to ensure correct open reading frames, and tRNA were extracted and assembled with homologous tRNA from the reference Bixa to ensure annotation captured the entire gene. Following evaluation and validation of all annotated genes, annotations were exported with the position, identity and phase of each gene. We then compared the number, synteny and organization of the set of genes relative to other chloroplast genomes within the Malvales (Theobroma and Heritiera (Malvaceae), Vatica (Dipterocarpaceae) and Daphne (Thymelaeceae)). Genes were classified based on the type of gene (CDS, tRNA, rRNA) as well as their organization within the chromosome (Large Single Copy (LSC), Inverted Repeat (IR), Small Single Copy (SSC)). The order of all genes was established and changes in the structural order of genes along the chromosome for all 5 chloroplasts was compared. Gain or loss of genes was denoted, and movement of genes among subunits of the genome (LSC, IR, SSC) based on expansion of the Inverted Repeat was recorded.
Phylogenetic inference
Representative Chloroplast genome sequences for Malvales were downloaded from the GenBank RefSeq library for use in evaluation of the monophylly of the Bixa orellana genomes, as well as for inference of the relationship of Bixaceae to Malvales, and in particular with respect to its status as sister group to either Malvaceae or Thymelaeceae. The two Bixa assembled in this study in addition to the existing Bixa reference genome were included in a group of 37 additional Malvales chloroplast genomes (see Table S1) as well as Arabidopsis thaliana from the sister order Brassicales as an outgroup. The 41 chloroplast genomes were analyzed with the Café (accelerated alignment free sequence analysis [33] kmer distance program to infer genetic distance (using k = 8 in conjunction with the D2 genetic distance metric [34]. The resulting phylip formatted distance matrix was imported in PAUP [35] where Arabidopsis thaliana was set as the outgroup, and a NJ distance tree was computed. The same cadre of species was used with an extracted set of barcoding regions (matK and rbcL) and a 5 gene MLST schema comprised of rpoC1 (1–3116) rbcL (3117–4546) matK (4547–6154).
atpB (6155–7667) accD (7668–10,198) to generate additional phylogenic trees based on universal alignment. NJ distance trees were computed and phylograms for all three trees were visualized in Fig Tree v1.4.4. To share distance values, phylogenetic trees are shared in Supplementary Materials Fig. 1.
Kmer identification from mixtures
Genomes Oryza sativa (DRR228665), Zea mays (SRR10812881), Gossypium hirsutum (SRR10081030), Pseudomonas aeruginosa (DRR215956), Salmonella enterica (SRR11851888), and Escherichia coli (SRR11851885) were downloaded from NCBI. Bacteria were added as a “contaminant” to test potential interference). All sequences were trimmed to 125 kb, and then combined in the following proportions; Oryza sativa (DRR228665) 30%, Zea mays (SRR10812881) 30%, Gossypium hirsutum (SRR10081030) 30%, Pseudomonas aeruginosa (DRR215956) 3.3%, Salmonella enterica (SRR11851888)3.3%, and Escherichia coli (SRR11851885) 3.3%. Mixed data was subsequently combined with different proportions of Bixa orellana plastome data and analyzed using Genome2-ID portal, a metagenomic software for plant taxonomic classification using kmers (DNA4 Technologies, LLC). To evaluate limit of detection, random selection of Bixa at 5, 2.5, 1.25, 0.6, 0.3, 0.15, 0.07, 0.03, 0.01%, were added to the mixture. Genome2-ID algorithms were run on mixed “dilutions” to identify, based on this approach, the point at which there was until no longer statistical support for identification of Bixa from the mixed sample.