Genomic outcomes from diverse SNP panels obtained with different de novo building-loci pipelines in a varied species context: Can pipelines be inuencing biological interpretations?

Background: The irruption of Next-generation sequencing (NGS) and restriction site-associated DNA sequencing (RAD-seq) in the last decade has led to the identication of thousands of molecular markers and their genotyping for rened genomic screening. This approach has been especially useful for non-model organisms with limited genomic resources. Many building-loci pipelines have been developed to obtain robust single nucleotide polymorphism (SNPs) genotyping datasets using a de novo RAD-seq approach, i.e. without reference genomes. Here, the performances of two building-loci pipelines, STACKS 2 and Meyer’s 2b-RAD v2.1 pipeline, were compared using a diverse set of aquatic species representing different genomic and/or population structure scenarios. Two bivalve species (Manila clam and common edible cockle) and three sh species (brown trout, silver catsh and small-spotted catshark) were studied. Four SNP panels were evaluated in each species to test both different building-loci pipelines and criteria for SNP selection. Furthermore, for Manila clam and brown trout, a reference genome approach was used as control. Results: Despite different outcomes were observed between pipelines and species with the diverse SNP calling and ltering steps tested, no remarkable differences were found on genetic diversity and differentiation within species with the SNP panels obtained with a de novo approach. The main differences were found in brown trout between the de novo and reference genome approaches. Genotyped vs missing data mismatches were the main genotyping difference detected between the two building-loci pipelines or between the de novo and reference genome comparisons. Conclusions: Tested building-loci pipelines seem not to have a substantial inuence on population genetics inference. Preliminary trials with bioinformatic pipelines are suggested to evaluate their inuence in population parameters related to the specic goals of the study. enzymes; SNPs: Single nucleotide polymorphisms (SNPs) ; REs: Restriction enzymes; RG: Reference genome SNP panel; RG-STA: shared SNP panel between RG and STA panels; RG-ALT: shared SNP panel between RG and ALT STA: STACKS de novo panel; STA pipeline: STACKS


Background
Next-generation sequencing (NGS) technologies have represented a breakthrough for genomic studies [1] due to the huge reduction of sequencing cost (less than 0.02$ per Mb; [2]) and the development of a broad and versatile range of techniques for different genomic approaches [3]. By harnessing the possibilities of NGS, diverse reduced-representation genome sequencing approaches, useful to identify and genotype thousands of markers for genomic screening, were suggested and quickly became popular [4,5]. One of these approaches is the restriction site-associated DNA sequencing (RAD-seq), currently in a more mature phase, which includes different methods (e.g. ddRAD-seq, ezRAD, 2b-RAD) whose performances have been compared using simulations and real data [6]. RAD-seq methods require speci c library preparation protocols, which exploit the ability of Restriction Enzymes (REs) to cut at speci c genomic targets rendering a collection of fragments representative of a genome fraction to be compared among samples. These collections can be screened to identify and genotype a variable number of single nucleotide polymorphisms (SNPs) depending on the goals of the study for population genomics, linkage mapping or genome wide association studies, among others. The 2b-RAD method here used exploits the properties of IIB REs which produce a collection of short DNA fragments (between 33-36 bp) by cutting at both sides of the recognition site [7]. This method has the advantages of simple library preparation, shortreads to be sequenced (single-end 50 bp) and, as other methods, the number of loci can be adjusted both using REs with different recognition site frequency or by xing nucleotides in the adaptors during library construction (i.e. selective-base ligation) [7,8].
Genomic laboratory protocols have been set up and optimized through years by introducing modi cations on the original RAD-seq methodology to get better results using different laboratory protocols for different scenarios (e.g. samples with low DNA quality, genome size, etc.; see Fig.5 in [8]). Similarly, the bioinformatics pipelines starting from raw data, a critical issue in RAD-seq methodologies, have undergone an important re nement and diversi cation. Nevertheless, there is not a consensus about what is the best strategy for each scenario, despite the increasing number of studies addressed to evaluate the impact of technical and/or bioinformatics protocols [9,10]. In a typical 2b-RAD library, hundreds of millions of reads are generated, and they need to be allocated to each multiplexed individual (dozens to hundreds in the same lane) and to each genomic position or locus in the reference genome (or RAD-tag catalogue). The rationale behind this is stacking raw reads belonging to the same locus, while discerning and separating at the same time the reads belonging to different loci. Results could be improved if a reference genome, belonging to the species itself or to other congeneric species, is available. This would enable to avoid mixing of reads pertaining to paralogous loci. In November 2020, there were reference genomes for 25 bivalve species and subspecies (22 genera) and 583 sh species (338 genera) with different assembly con dence at the NCBI database (https://www.ncbi.nlm.nih.gov/datasets/). Nevertheless, there are about 9,200 species within the 1,260 bivalve genera [11] and 35,672 recognized species within the 5,212 documented sh genera [12]. All in all, less than 0.2% of the genomes of the known eukaryotic species have been sequenced to date [13]. Although full genome sequencing assembly is becoming progressively more robust thanks to the longread sequencing methods and assembling strategies, most of the species will have to wait for long before their genomes are assembled. Therefore, de novo approaches (i.e. stacking reads without a reference genome) will be the only option for many studies, although some initiatives are trying to change this perspective (e.g. Earth Biogenome Project; https://www.earthbiogenome.org/). For this reason, one of the strengths of a RAD-based method is its applicability without a reference genome [14].
There are different bioinformatics pipelines to identify a high number of SNPs and achieve con dent genotypes using a RAD-seq approach. The most popular one is STACKS [15,16] (around 3000 citations, at Google scholar in Oct. 2020), but several other alternatives, including Meyer's 2b-RAD pipeline [7], which was the original building-loci pipeline for 2b-RAD data, have been recently published. Some of these pipelines are able to perform a de novo approach (dDocent [17]), whereas others need a reference genome for alignment (Fast-GBS [18], TASSEL-GBS v2 [19]) or can address both approaches (STACKS, Meyer's 2b-RAD v2.1 pipeline, ipyrad [20]). Several of these alternative pipelines merge and concatenate pre-existing applications, making their design exible and customized according to the data managed and the goals of the study, but also providing upgrading and reliable bug-x (e.g. dDocent, Fast-GBS, Meyer's 2b-RAD v2.1 pipeline). Several factors should be considered for the selection of the bioinformatics pipeline to be used, among which sampling variance (samples availability and number of reads across them), budget (e.g. total read sequencing), which determine the number of samples to be analysed and their coverage, population structure and genome architecture of the species are the most relevant. The genome of each species has its particular size, history (e.g. duplication events), polymorphism, complexity and inter-individual variability, which can hinder the identi cation of stacks of reads (putative RAD loci) and their variants, circumstances that should be considered when choosing the appropriate building-loci pipeline and its parameters.
Studies comparing bioinformatic pipelines and strategies already exist. Some comparisons between de novo and reference-based approaches are available [21,22], and one of them tested the performance of the different strategies used to obtain accurate population genetics inferences [22]. Noticeable differences were observed among bioinformatics pipelines in the number of detected SNPs, sometimes resulting in distinct values for population descriptors and inference [22]. Other studies have evaluated the same software with different species to optimise the selection of bioinformatic parameters (STACKS 1.42, [23]; STACKS 1.44, [10]), making a common advice of doing preliminary trials to optimize the building-loci pipeline selected parameters. Published step-by-step protocols with a single species [14] also exist. A number of SNP calling comparison between STACKS 1.08 and dDocent 1.0 has been carried out using three sh species [17], while Sovic et al. [24] tested a novel pipeline (i.e. AftrRAD) vs STACKS and PYRAD using simulated and species datasets to assess computational e ciency and SNP calling. It is not uncommon to nd large differences in the number of SNPs (e.g. of one order of magnitude) in some building-loci pipelines comparisons [17]. Recently, Wright et al. [25] compared population parameters (e.g. F ST , PCoA) with SNPs obtained from three pipelines (i.e. GATK, SAMtools, STACKS) using two species with reference genome. Results showed remarkable differences in some population parameters (e.g. Hardy Weinberg Equilibrium) across bioinformatic approaches. Considering this information, a main issue that should be clari ed on RAD-seq methodologies is the impact of building-loci pipelines on population genetics parameter estimations and derived conclusions using a de novo approach on different biological scenarios and to some extent, to be compared to a references genome approach.
In this work, two building-loci pipelines for SNP calling and genotyping: i) STACKS 2.0 (http://catchenlab.life.illinois.edu/stacks/) and ii) Meyer's 2b-RAD v2.1 pipeline were tested using a de novo approach on different genomics and population genetics scenarios by using ve aquatic species: (i) Manila clam (Ruditapes philippinarum), (ii) common edible cockle (Cerastoderma edule), (iii) brown trout (Salmo trutta), (iv) silver cat sh (Rhamdia quelen) and (v) small-spotted catshark (Scyliorhinus canicula). A range of population parameters were compared in the ve species applying similar parameter settings for each pipeline (description of pipelines in Methods). The two marine bivalve species from the Order Veneroida show high polymorphism and low population structure [26,27]; the brown trout belongs to the order Salmoniformes, which suffered a speci c genome duplication event [28], and shows one of the highest population structuring among vertebrates [29]; isolated populations from different ecosystems were analysed in the silver cat sh, a freshwater species from the order Siluriformes living in uvial and costal lagoon environments [30]; nally, the small-spotted catshark (order Carcharhiniformes) is a benthic species which populations here used show low genetic differentiation [31]. Despite these species represent different population genetics and evolutionary scenarios, this does not mean that they necessarily comprehend all the models for the manifold scenarios used to check the performance of building-loci pipelines. To date, two of the species used in this study have a reference genome available: Manila clam (assembly size: 1.123 Gb; 19 chromosomes [32]) and brown trout (2.370 Gb; 40 chromosomes [33]). There are different statistics to evaluate the quality of genome assemblies (see Table 2 in [34] such as scaffold N50 and N90 (i.e. the length of the scaffold at which 50% and 90% of the assembly length is covered, respectively). The scaffold N50 is much higher in brown trout than in Manila clam assembly (52,209 Kb and 345 Kb, respectively), but a very high accuracy and completeness (see Supplementary Material [32]). Anyway, for short read data such as for 2bRAD-seq, the contiguity of the genome should not have a big in uence in the results when conservative short-read aligner parameters and strong SNP ltering steps are applied.
Wang and Guo [35] hypothesized that bivalves with 19 chromosomes could have a tetraploid origin, due to its ability to tolerate chromosomal aneuploidies. Furthermore, gene/gene family expansions would be a rather common process in this group, likely more frequent than in other molluscs [36]. Both genomic features could pose a challenge regarding paralogous genes for stacking reads, a common problem when no reference genome is available. In addition, molluscs present the highest genetic polymorphism in animal kingdom [37], which could represent genotyping drawbacks related to the presence of null alleles. All salmonids, including brown trout, have a tetraploid origin in process of diploidization since their origin around 90 Mya. This speci c duplication should be added to the three Whole Genome Duplications (WGDs) events in the line of teleosts from the vertebrate ancestor [38,39], which represent a major issue regarding paralogy. This issue would not be so important in small-spotted catshark and silver cat sh, with two and three older WGD events in their evolutive lines, respectively [40,41]. The most recent, the teleost-speci c 3rd WGD, dated around 300 Mya. We followed a de novo approach for comparison between pipelines in all species, but reference genomes in these two species were also taken as a useful reference to elucidate which build-loci pipeline provides better results with a de novo approach.
For the ve species, population parameters, structure pattern and outlier loci detection were estimated as an essential outcome to evaluate the performance of four SNP panels after different ltering steps. From two building-loci pipelines, STACKS (STA panel onwards) and Meyer's 2b-RAD v2.1 pipeline (ALT panel onwards), and two criteria for SNP selection, common SNPs (i.e. shared between building-loci pipelines, COM panel onwards) and merged SNP (a combination of shared and exclusive SNPs from both buildingloci pipelines, MER panel onwards) panels. When available, the results from reference genome approach was used to compare the results of population parameters evaluated with the de novo approach. In this case three additional SNP panels were obtained: the former with the reference genome approach using STACKS and the two remaining with the shared SNPs (i.e. STACKS de novo and 2b-RAD v2.1). Our main goal was to assess the in uence of the genomic architecture and population structure on the biological conclusions obtained with the different bioinformatics pipelines, and accordingly, to propose methodological recommendations for future studies using a de novo approach.

Results
The number of ltered reads loaded into building-loci pipeline using the reference genome approach was lower than with the de novo approach. A percentage of 22.1% and 25.4% were nally used in Manila clam and in brown trout, respectively. This reduction mostly due to those reads aligning to more than one place (59.6% and 72.6%, respectively) that were ltered out (i.e. -m 1 in Bowtie 1.1.2), the remaining reads failed to align with the mismatch criteria applied (18.3 % and 2.0 %, respectively).
The number of initial SNPs, after the building-loci step with the de novo approach, ranged from 56,074 in brown trout (STA) and 125,823 in silver cat sh (ALT) to 356,389 (STA) and 426,317 (ALT) in common cockle (Tables S1-S5). These gures dropped throughout the successive ltering steps ( Fig. 1) up to nally being retained from 0.2% in Manila clam STA panel to 20.5% in silver cat sh STA panel ( Fig. 2 and Fig. 3). There was a remarkable difference at the initial number of SNPs obtained between STA and ALT pipelines in brown trout and small-spotted catshark, although the outcome after ltering was rather similar (Table 1). No comparison was made at BIAL lter (i.e. SNPs with more than two alleles excluded; Fig. 1), since triallelic SNPs are removed with STACKS by default. The proportion of missing genotypes after applying the minimum coverage ltering step was higher with ALT panel than with STA in almost all species (Fig. 4). Filtering patterns varied among species due to the different weight of each ltering step. For instance, in bivalves, where genetic polymorphism is higher, the SNP retention after the third ltering step (i.e. RAD loci with ≤ 3 SNPs per RAD-locus) was much lower than in sh species ( Fig. 2 and Fig. 3), while in silver cat sh and small-spotted catshark, was due to the minimum allele count (MAC). This was related to the smaller sampling size of those sh species (N= 21 and N= 28, respectively) and the higher frequency of missing data, especially in small-spotted catshark (83% after depth lter in STA and ALT panels). When comparing pipelines, no clear differences on the ltering pattern were observed through the different ltering steps ( Fig. 2 and Fig. 3), except for MAC in brown trout, where more SNPs were pruned in the ALT panel. This could be related to the increment on the missing data after the minimum coverage lter, with higher frequency in ALT (59.3% vs 43.7% for ALT and STA, respectively (Fig. 4). After the third ltering step, in brown trout (N=52) there were signi cantly more missing genotypes per SNP (P < 0.01) at ALT panel on average (28.74 ± 15.24) as compared to STA panel (21.07 ± 17.54). The two species with the lowest median coverage at this step were small-spotted catshark (median = 10x for ALT and STA panels), brown trout (median = 11x and 14x for ALT and STA panels, respectively) and Manila clam (21x for the STA panel).
The nal number of SNPs ranged from 479 (STA) and 956 (ALT) in Manila clam to 21,468 (STA) and 22,481 (ALT) in silver cat sh. These gures were always higher with the ALT pipeline for all species. The number of SNPs in COM panels, those from STA panel shared with ALT panel, ranged from 206 in Manila clam to 17,459 SNPs in silver cat sh, while the percentage of SNPs called in STA that were found in ALT ranged from 23.9% in small-spotted catshark to 81.3% in silver cat sh. The main source of variation when considering the whole COM panels genotype dataset (for instance in silver cat sh N total COM genotypes (366,639)= N samples (21) x N COM SNP (17,459)) was missing data, ranging from 2.6% in common cockle to 14.8% at small-spotted catshark (FigsS1-S4). Roughly speaking, there would be three types of missing calling data with a de novo approach: (i) SNPs not genotyped by a building-loci pipeline due to too low coverage (< 3x with used con guration); ii) SNPs genotyped by a building loci-pipeline but with a coverage lower than 8x (Min coverage 8x ltering step; Fig.1); and (iii) SNPs with enough coverage but ambiguous alternative nucleotide depth (see http://eli-meyer.github.io/2bRAD_utilities/#genotype). The main source of missing genotype differences between pipelines was related to COM SNPs from STA pipeline that passed the minimum coverage lter (Min coverage 8x), but not were genotyped by ALT pipeline due to having a coverage lower than 3x. Both pipelines never genotyped with a coverage lower than 3x with the used con guration. This situation was found in most species, causing from 46.8% of the total missing genotype differences between ALT and STA genotypes in shared SNPs (COM panel) in brown trout to 63.2% in small-spotted catshark. Nevertheless, the main source of missing data differences in Manila clam was genotypes removed by coverage lter (Min coverage 8x) in ALT but not in STA (53.5% of the missing genotype differences). The percentage of the genotyping differences caused by a homozygous-heterozygous differences between pipelines at the same SNP and individual ranged from 0.5% in brown trout to 2.6% in small-spotted catshark with respect to the whole COM genotype panel ( Table 2). The frequency of genotyping differences caused by different homozygotes at the same SNP and individual (e.g. AA for one pipeline and GG for the other) was negligible in almost all cases (from 0% to 0.098%). The number of SNPs obtained with reference genome (RG) approach was always lower than with both de novo building-loci pipelines (see Tables S1 and S3), however, the number of SNPs shared between RG and each of the de novo pipelines was similar (i.e. RG-STA and RG-ALT). The percentage of SNPs obtained with RG approach detected as well in both de novo pipelines was 47.7% (RG-STA/RG) and 37.7% (RG-ALT/RG) in Manila clam and 73.0% (RG-STA/RG) and 80.9% (RG-ALT/RG) in brown trout, unlike the reverse way where the percentages of shared SNPs were lower due to the higher number of SNPs obtained with ALT: 28.8% (RG-STA/STA) and 11.4% (RG-ALT/ALT) for Manila clam with (Table S1) and 46.5% (RG-STA/STA) and 32.9% (RG-ALT/ALT) for brown trout (Table S3). Both de novo building-loci pipelines performed similarly when compared with reference genome (RG) approach genotyping (RG-STA and RG-ALT genotype comparisons; Table 2).
The population parameters evaluated (e.g. diversity levels, global F ST ) were roughly similar when using the different SNP panels in each species, showing higher differences in F IS values ( Table 1). The most notable differences would be found in the F IS values and when comparing de novo and reference genome approaches in brown trout, especially regarding Hardy-Weinberg tests and related population parameters (i.e. F IS , Ho vs He; Table S4). Here, unlike Manila clam, the proportion of SNPs with extremely low F IS (≤ -0.5) greatly differed between both approaches. The structure patterns obtained using STRUCTURE and DAPC analyses (see "Methods" section) were similar between both approaches (Figs. S5-S9). The highest global F ST values among populations were found in brown trout and silver cat sh, as expected, and no different interpretations among panels could be extracted. Some minor discrepancies in the number of suggestive outliers among panels were detected. All suggestive outliers detected showed a positive αvalue suggesting diversifying selection. The complete set of population parameters is provided in Supplementary Information (Tables S1-S5).

Discussion
In the last decade, the binomial NGS / RAD-seq has been the choice for genomic screening in many studies due to the vast number of genetic markers identi ed and genotyped in a single step. In this context, species with low genomic information have been targeted for population genomics and evolutionary studies broadening the opportunities for more re ned approaches regarding conservation genetics and breeding programs. Nevertheless, the effect of genomic architecture, genetic diversity, and population structure into the outcomes of these techniques (number of SNPs, genotyping con dence) in the target species are essential issues to be addressed using both with simulation and real data approaches. These issues are not only important for the wet-lab protocols, but also for the bioinformatics pipelines to be used to analyse the huge amount of data produced. Technical decisions on the reducedrepresentation method and restriction enzymes selection to be applied when constructing libraries are critical. When a reference genome is available the number of potential loci obtained with different restriction enzymes (e.g. using ExtractSites.pl https://github.com/Eli-Meyer/2bRAD_utilities) and the uniqueness of RAD loci should be tested (e.g. using EvalFrags.pl from 2bRAD_utilities). For instance, in Manila clam was predicted that the percentage of the genome constituted by repetitive elements and combined transposable elements could exceed 70% [32]. This genomic information should be considered to make the best technical decisions. Without reference genome, different REs can be tested if the budget allows it (see Box 1 in [8])to improve the percentage of reads to build up con dent loci. In the same way, the different performances of software and bioinfomatic pipelines might depend on the species and its genomics context. These can affect not only the number of markers found, but more importantly, the biological conclusions drawn. A repertoire of bioinformatic publications to manage the large amount of genomic data at different stages (e.g. building-loci pipelines, SNP ltering steps) has been published to serve as guidelines for researchers with limited experience in the eld and advices for bioinformatic "Gordian Knots".
The panels used in this study came from species that differ in their genomic architecture, polymorphism and population structure, and these factors could in uence the results obtained also depending on the different population parameters used (e.g. global F ST , allelic richness). Nevertheless, the results obtained in our study within species using the four de novo panels (i.e. STA, ALT, COM, MER) were roughly similar for all the population parameters evaluated. Accordingly, biological inferences would hardly change. Minor differences were found in the number of suggestive outliers detected in common cockle. In this case, the number of suggestive outliers detected could be related to the total number of SNPs of each panel. But beyond this observation, our results suggest that whatever the pipeline chosen similar results are obtained with a de novo approach. The number of initial and nal SNPs obtained with reference genome was lower than obtained with a de novo approach. This would be related to the high number of input reads removed by duplicate or multiple alignment to the reference genomes due to the short length of 2b-RAD reads before going to the building-loci pipeline. Some population parameters between both approaches showed relevant differences (see F IS values in brown trout; Table S3).
A practical approach to decide between pipelines with different building-loci strategies for handling Genotyping-by-sequencing (GBS) data is to assay trials with a small subset of data and check for their results using a meaningful set of population parameters, previously selected according the objectives of the study. Indeed, using the number of SNPs obtained as the main criterion [17] to decide the best building-loci pipeline to be used is not advisable, since a higher number of SNPs does not necessarily indicate a better stacking and con dent RAD-seq data [10], and consequently, it might have a negative impact on the con dence of results and biological inferences. The initial number of SNPs obtained with STA and ALT pipelines across the different species tested was rather similar except for the brown trout and the small-spotted catshark. These species showed the lowest median coverage, near to the selected threshold coverage lter (8x) hence the differences on the number of putative loci from input data. While STACKS 2, with a de novo approach, starts with individual data demanding a number of identical reads to build a locus (see Methods), Meyer's 2b-RAD v2.1 works with a combined subset of con dent reads from all samples to build a global reference panel to which align every read. In cases with low coverage, a less demanding criterion to build loci can produce large differences in the initial number of SNPs. Nevertheless, through the ltering steps, the SNP number from building-loci pipelines converged in both species and importantly, the population parameters between SNPs panels were similar. Once chosen the pipeline, it would be recommendable to run several trials with different parameters to properly adjust them to the dataset. For instance, -M in STACKS (which de nes the maximum nucleotide differences allowed between intraindividual putative loci) depends on the levels of polymorphism of the species andm in STACKS on the existing coverage [23]. In the same way as for the choice of the building-loci pipeline, it would be advisable to choose parameters taking into account the results from population outcomes, since there is not a unique pipeline suited to every situation, as already indicated [21].
After the building-loci pipeline, it is important to adjust ltering (criteria and order [9]) according to the particular scenario of each species (e.g. sequencing and genotyping errors, duplicated loci [42]). Since the ltering parameters are dataset dependent [43], the ltering criteria should be adjusted accordingly (e.g. the stringency of MAC ltering step is sample size dependent). For instance, the number of SNPs was markedly reduced through ltering steps and the highest difference in the percentage of retained SNPs was found among species. In the study by O'Leary et al. [9] the percentage of retained SNPs ranged from 0 % to 63 % using the same ltering pipeline with four marine sh species. In our study, the three SNP/RAD-locus lter used to avoid inconsistent RAD-loci could not work well for highly polymorphic species or taxa (e.g. bivalves). Furthermore, the POP lter (i.e. 60% call rate per population) could be applied not so stringently since in previous studies qualitative interpretations of population parameters were maintained in most cases [22,25], and sometimes even improved [44]. Notwithstanding, the drawback could be using larger SNP panels for similar information. If well, for some studies it is fundamental to achieve the highest density of SNPs possible (e.g. linkage disequilibrium, outlier detection and gene mining, Genome-wide association study; GWAS). The biggest difference between pipelines was found with the MAC ltering step in brown trout which could be explained by the higher average of missing genotypes per SNP (MAC is sample size dependant) and the lower coverage per RAD-locus (misclassi cation of heterozygotes) in the ALT pipeline. Finally, more ltering steps might be necessary, especially when working without a reference genome (e.g. F IS SNP ltering step when paralogs or null alleles can be a problem to avoid misinterpretations); this is the case of HWE deviations in brown trout caused by potential paralogs whose impact can be reduced using a reference genome.
Attention should be paid to the order of the different ltering steps because this can alter the nal SNP panel. When adjusting the ltering parameters, it would be advisable to consider not exclusively the number of removed SNPs at each step separately, since they could result from the interaction among ltering steps. For instance, the coverage lter determines the increase of missing data which in uences the percentage of SNPs eliminated by MAC and population representation lters, according to the stringency of the coverage threshold used. Furthermore, missing data may be due to a lower coverage than the selected threshold or for not being genotyped by the building-loci software with the genotyping options selected (e.g. previously selected nucleotide frequencies range to genotype in ALT pipeline). We found that the last could be the main source of COM SNPs genotyping differences between both buildingloci pipelines excluding Manila clam. This means that the ALT pipeline genotyping parameter should be improved by choosing appropriate ranges for each species. The objective of any ltering strategy is removing SNPs that are not reliable without losing informative SNPs. Different factors can in uence the ltering criteria, e.g. to achieve the number of SNPs required to meet the research goals. In this sense, a panel made up with markers found by two different pipelines should ensure reliability. It was found that 67% of SNPs from STACKS panel were common with UNEAK panel, using a de novo approach in soybean (Glycine max L.) data [21]. With reference genome the overlap percentages among STACKS and other building-loci pipelines ranged from 76% to 96% [21]. Using a reference genome approach the percentages of shared SNPs between STACKS with SAMtools and GATK ranged from 7.3% to 71.4% [25]. The lowest values could be partially explained because STACKS panel recruited many more SNPs than the other building-loci pipeline. The lowest percentage of COM SNPs taking STA panel as genotyping reference in our study (i.e. 23.9% in small-spotted catshark and 43.0% in Manila clam) were in panels with less than 1000 SNPs. These low values may be explained by a strong ltering effect, on shared SNPs between pipelines. The highest number of COM SNPs were detected when STA panels included the highest number of SNPs, around 74% in brown trout and 81% in silver cat sh. Despite including a lower number of SNPs, the COM panels provided roughly similar results to the larger ones. This suggests that most informative markers are retained downstream, with the advantage of working with a reduced panel that can simplify and speed-up analyses. In the study by Díaz-Arce et al. [10] the possible effect of SNP number on F ST estimation using reduced SNPs subsets was tested and similar values regarding the full panel were obtained. Moreover, estimated genotyping accuracy may be higher with SNPs shared by more than one building-loci pipeline according to Torkamaneh et al. [21]. The impact of genotypic differences between shared SNP panels was low, such as those obtained by Wright et al. [25].
Summarizing, the results obtained suggest that both building-loci pipelines are adequate and provide more con dent results adjusting parameters and SNP ltering steps to the research context. Despite the differences observed in the number of SNPs among de novo approach panels, this seems not to affect dramatically the conclusions, at least in the biological scenarios managed in this study. When there is no reference genome, a COM panel could be interesting in terms of SNP panel consistency with species with high genomic complexity. In a general way for some population parameters, to have less SNPs do not imply loss of biological information and a COM panel could increase data reliability in these cases. In the case of choosing this option differences in genotyping between pipelines should be checked, although in this study genotyping differences between pipelines were infrequent. The main source of genotyping differences in COM SNP panels was missing data and had different sources. On one hand, different genotypes can be obtained due to the different building-loci pipeline parameters to call genotypes (e.g. -alpha at STACKS) and the different alignment strategies (e.g. reads with alternative allele can be stacked into another putative loci). For missing genotype differences, it should be considered that even RAD loci showing high coverage, might have missing data if building-loci pipeline parameters involved in genotyping are not properly set up. Furthermore, small differences in the building-loci pipeline could have more in uence in the number of missing genotypes when working with low coverage loci. Anyway, it would be advisable to use a few intra-library and inter-library sample-replicates to estimate genotyping errors [45] to increase the con dence in our data, especially if the RAD-seq libraries are designed with low estimated coverage per locus (e.g. around 10x), due to the impact of coverage in genotyping error rates [46].

Conclusion
The results here obtained show that selected building-loci pipeline could not have a substantial in uence in the population parameters and their derived biological interpretations. The different results obtained between some de novo and reference genome derived panels could be solved with improved SNP ltering steps. Anyway, our results are not directly transferable to any building-loci pipelines or genomic scenarios. Users should perform their own testing according with their research contexts. One recommendation would be to test building-loci parameters and ltering SNP steps with subset of samples to save hardware resources and computation time. Following this recommendation, a good approximation to the best bioinformatic parameters for the whole sample dataset can be done. This should be done using population parameters consistent with the research goals to make the best decisions. Although the recommendations previously raised are time consuming, they could improve the robustness of results and improve knowledge about the use of bioinformatics tools and datasets.

Samples studied
All datasets used in the present study are own resources obtained from previous research carried out by the authors. Four Manila clam (R. philippinarum) localities, three from the Adriatic Sea (Italy: Chioggia, N= 30; Porto Marghera, N= 30 [47]; and Po river mouth N= 25) and one from the Atlantic Ocean (Spain: Galicia, N= 25), were studied. Four common edible cockle (C. edule) localities from the European Atlantic area (Somme Bay, France, N= 30; Campelo, Spain, N= 30; Miño River mouth, Spain, N= 30; and Ría Formosa, Portugal, N= 30) were used from regions with extractive cockle activities [48]. Three localities of brown trout (Salmo trutta) from Duero River basin in the Iberian Peninsula (Águeda, N= 15; Omaña, N= 20; and Pisuerga, N= 17), two of them representing different mitochondrial pure native lineages (Atlantic and Duero, [49,50]) and one from the hybrid zone (Omaña; [51]), were evaluated. Two localities of silver cat sh (R. quelen), a Neotropical freshwater species distributed from the Northeast of Los Andes to the centre of Argentina and living in uvial and coastal lagoon environments [52], were analysed. These samples came from Sauce Lagoon (N=10) and Uruguay River Basin (N= 11) belonging to two divergent lineages [30]. Finally, two nearby localities without genetic differentiation of small-spotted catshark (S. canicula) from North Sea (N= 13) and Irish Sea (N= 15) were analysed [31]. All information from samples analysed is summarized in supplementary material (Table S6).
Library preparation DNA extraction and 2b-RAD libraries preparation using AlfI IIB RE followed the same protocol except for small-spotted catshark [31] where CspCI IIB RE was used instead. The libraries were sequenced using Illumina sequencing platforms (i.e. HiSeq 1500 for small-spotted catshark, NextSeq500 for the remaining species) following a 50 bp single-end chemistry. For details see [30,31].

Bioinformatic Analysis
Building-loci pipelines: background STACKS 2.0 and Meyer's 2b-RAD v2.1 were the building-loci pipelines chosen for comparing their performances using a de novo approach within the broad genome and population genetics species scenarios selected. Meyer's 2b-RAD v2.1 pipeline and STACKS building-loci pipelines have some similarities on their strategy; roughly, both are based on stacking reads into putative loci by sequence similarity, assuming that each locus correspond to a single place in the species genome. Nevertheless, there are many differences on how loci are built and how the user can control the existing options and genotyping strategies. STACKS, with a de novo approach, works rstly at the individual level demanding a number of identical reads to build a locus (i.e -m parameter in ustacks), while the 2b-RAD v2.1 pipeline works with a combined subset of samples to build a global reference panel to which align every read. For genotyping, STACKS uses a chi square test to call a heterozygote or a homozygote (i.e. -alpha and --gtalpha), whereas nucleotide frequencies based on allele read depth at each position and sample are used for genotyping in the 2b-RAD v2.1 pipeline . Accordingly, huge differences in the raw SNP number were obtained in preliminary analysis in our study. Anyway, we tried to apply the highest number of common parameters in both pipelines to be consistent among comparisons. STACKS 2.0 pipeline can be summarized as follows: (1) Raw sequence reads were demultiplexed and ltered according to different criteria such as quality, uncalled bases and read length (process_radtags); (2) reads from each individual were clustered into putative loci, and polymorphic nucleotide sites identi ed (ustacks); (3) putative loci were grouped across individuals and catalogues of RAD-loci, SNPs and alleles were constructed (cstacks); (4) putative loci from each individual were matched against the catalogue (sstacks); (5) the data were transposed to be oriented by locus, instead of by sample (tsv2bam); (6) all individuals were genotyped at each called SNP (gstacks); and (7) SNPs were nally subjected to population genetics lters (populations) and results written in different output les (e.g. GENEPOP [53], STRUCTURE [54] le formats).
The de novo approach used for the 2b-RAD v2.1 pipeline can be summarized as follows: (1) from a subset of high quality reads (Phred quality scores ≥ 30 at all positions) from all samples, a global de novo reference catalogue was built by clustering these reads with the BuildRef.pl script and cd-hit 4.6.8 [55,56]; (2) every read was aligned against the reference catalogue using a mapping program (in our case Bowtie 1.1.2 [57]); (3) allele frequencies were counted at each position (SAMBasecaller.pl) and genotypes determined from that information (NFGenotyper.pl); and (4) genotypes called across samples were combined into a single genotype matrix with samples as columns and loci as rows (CombineGenotypes.pl). Perl scripts belonging to the last version are publicly available (https://github.com/Eli-Meyer/2brad_utilities/) and earlier versions on request.
A reference genome-based pipeline was used for Manila clam and brown trout, the species with chromosome assembly level reference genomes. In this case, the differences in the pipelines of STACKS 2 and Meyer's 2b-RAD v2.1 are lower. For instance, STACKS 2 reduces the number of modules necessary from six to two when using reference genome and the building of putative loci is conditioned by the step using a short-read aligner. Our goal was not to compare this option between pipelines, but to take as reference the genome-based approach to be compared with the de novo approach as a gold standard within each pipeline.

Building-loci pipelines: analysis
After demultiplexing raw data, several ltering criteria were applied: (1) all reads were trimmed and ltered by the RE recognition site to retain only those sequences of 36 bp (or 32 bp from CII RE catshark) centred on the RE recognition site using our own Perl scripts and Trimmomatic 0.38 [58]; and (2) process_radtags (module belonging to STACKS) was used to remove reads with uncalled bases (-c option). Other parameters were species-speci c (e.g. window sliding size, -w; score limit, -see Table S7A). To check raw and ltered reads quality, FastQC 0.11.7 [59] was used. STACKS input sequences were oriented in the same orientation using our own Perl script to avoid oversplitting. To say, the set of input reads for both building-loci pipelines in each species was always the same.
At the building-loci pipeline step, the parameters considered were: (1) minimum number of identical reads  Tables S7B and S7C).
When a reference genome was available, Bowtie 1.1.2 was used as short read aligner. The number of mismatches allowed between reads and the reference genome using the -v alignment mode was 2 mismatches for brown trout and 3 mismatches for Manila clam, the same as mentioned above. Only reads which aligned to a single site in the reference genome were considered (-m 1). The same parameter values were used in STACKS modules shared between reference-based genome and de novo approaches (i.e. gstacks and populations modules).

SNP ltering steps and creation of datasets
The raw SNP panels of the STA and ALT pipelines were ltered using the same parameters for consistency, retaining only a set of markers and alleles represented across the individuals genotyped. Filters were applied in the same order for each dataset (Fig. 1), as recommended [9]: i) SNPs with more than two alleles were excluded (BIAL lter); ii) a minimum locus coverage of eight reads was chosen (Min coverage lter); iii) RAD-loci with more than three SNPs were excluded for further analyses; iv) SNPs were retained only if the less frequent allele was represented at least three times at the whole species sample (MAC, minimum allele count, lter); v) less than 40% of missing data in each population for a SNP to be retained (POP lter); vi) SNPs were excluded when they did not adjust to Hardy-Weinberg (HW) expectations (P < 0.01) in more than half of the populations analysed (HW lter); and vii) only the rst SNP per RAD-locus was retained when several SNPs were called in the same RAD-locus to avoid redundant information.
According to the aforementioned criteria, four SNP panels were tested and compared: (1) STA and (2) ALT SNP panels were further used to obtain (3) common (COM) and (4) merged (MER) panels. When reference genome was available three additional SNP panels were obtained (i.e. RG, RG-STA, RG-ALT). RAD-loci of these panels from both pipelines were compared to identify shared and private RAD-loci. In order to do this, cd-hit-est was used to cluster similar RAD-loci taken from the two pipeline catalogues with the same threshold of similarity (-c) used in the clustering steps of building-loci pipelines (i.e. two mismatches maximum for sh species and three mismatches maximum for bivalve species).
Furthermore, we used a -g value of 1 when clustering sequences to meet the established similarity threshold and a band alignment width (-b) of 1 to avoid previously separated sequences due to indels at speci c clusters. This procedure rendered COM and MER SNP panels created using a customized Perl script (see Supplementary material). SNPs at shared RAD-loci between STA and ALT panels were selected after the sixth ltering step, since the rst SNP at shared RAD-loci could be different after the full ltering pipeline (Fig. 1). MER panel was nally created by taking the SNPs from "private" ALT and STA pipelines RAD-loci (i.e. those from cd-hit-est clusters with only STA or ALT pipelines RAD-tags) plus the COM SNP panel previously obtained. Again, only the rst SNP per RAD locus was retained to avoid redundant information. The GENEPOP les of all shared SNP panels were compared to quantify their genotyping differences using own Perl script (see Supplementary material, customised_scripts.zip).
Comparison of outputs and population genetics analyses.
The results of the aforementioned pipelines were compared using both quantitative (i.e. number of SNPs) and biological criteria (population genetics data). Filtered GENEPOP les were obtained using customized Perl scripts. These les were transformed for subsequent analyses using the PGDSPIDER 2.1.1.5 software [60]. Firstly, the consequences of ltering over the number of RAD-loci and SNPs were evaluated for each combination of species-pipeline; secondly, common and private RAD-loci/SNPs between the two pipelines were obtained for each species. Finally, biological interpretations from each pipeline/species were compared, including basic population genetics results (i.e. genetic diversity levels and population structure).
Observed and expected heterozygosity (Ho and He, respectively), inbreeding coe cient (F IS , using 1000 bootstrap iterations to estimate their 95% con dence intervals) and allelic richness were calculated per population using R's package diveRsity [61]. Global F ST was calculated with Genepop R package [53].
STRUCTURE software [54], using R package ParallelStructure [62], was used to de ne the most likely steps. Structure results were parsed with Structure Harvester [63], which implements the Evanno's method [64] to detect the most likely number of clusters according to the data. CLUMPAK [65] was used to merge runs with the same K that suggested similar patterns of structuring and to obtain cluster membership plots. As a second approach to detect population structure, a Discriminant Analysis of Principal Components (DAPC) analysis was performed based on genetic data, as implemented in R package adegenet [66,67]. The optimal number of principal components to be used was estimated with the crossvalidation method implemented in the package and from one to three discriminant components were retained according to the amount of population structure variation they explained.  Tables   Table 1 Mean (bold values) Table 2 Genotypic differences between shared SNPs from the different pipelines.
Genotyping differences are presented as the relative frequency of total genotypes. COM panels were obtained with shared SNPs between STA and ALT panels (both de novo approach). RG-STA panels were obtained with shared SNPs between RG and STA (reference genome and de novo approach). RG-ALT panels were obtained with shared SNPs between RG and ALT (reference genome and de novo approach). Hom: Homozygous; MD: Missing data (i.e. missing genotype); Het: Heterozygous.