MHC genotyping of non-model organisms using next-generation sequencing: a new methodology to deal with artefacts and allelic dropout

Background: The Major Histocompatibility Complex (MHC) is the most important genetic marker for studying patterns of adaptive genetic variation that determine pathogen resistance and associated life history decisions. It is used in many different research fields, ranging from human medical and molecular evolutionary studies to functional biodiversity studies. Correct assessment of the individual allelic diversity pattern and the underlying structural sequence variation is the basic requirement for addressing the functional importance of MHC variability. Next-generation sequencing (NGS) technologies are likely to replace traditional genotyping methods to a great extent in the near future, but first empirical studies strongly indicate the need for a rigorous quality control pipeline. Strict approaches for data validation and allele calling, to distinguish true alleles from artefacts, are required.

Results: We developed the analytical methodology and validated a data processing procedure which can be applied to any organism. It allows the separation of true alleles from artefacts and the evaluation of genotyping reliability, which in addition to artefacts considers for the first time the possibility of allelic dropout due to unbalanced amplification efficiencies across alleles. Finally, we developed a method to assess the confidence level per genotype a posteriori, which helps to decide which alleles and individuals should be included in any further downstream analyses. The latter method could also be used for optimizing experimental designs in the future.

Conclusions: Combining our workflow with the study of amplification efficiency offers researchers the chance to evaluate enormous amounts of NGS-generated data in great detail, improving confidence in the downstream analyses and subsequent applications.

To estimate the relative amplification efficiencies of the alleles, we designed two functions.

2.1 Defining the function DensityAmplicon()

The first function is called DensityAmplicon() and is used internally by the second function we will describe afterward.

DensityAmplicon <- function(reads, proba, logOutput) {
  # Compute the likelihood or loglikelihood of an amplicon
  #
  # Args:
  #   reads: the observed read numbers.
  #   proba: the efficiency values for alleles with non-null read numbers.
  #   logOutput: if TRUE return the log-density, if FALSE return the density.
  #
  # Returns: the density probability of an amplicon.
  allele.indexes <- which(reads != 0L)
  return(dmultinom(x = reads[allele.indexes],
                   prob = proba[allele.indexes],
                   log = logOutput))
}

This function provides the likelihood or loglikelihood of an amplicon, given efficiency values for the alleles. The function DensityAmplicon() itself relies on the function dmultinom(), which provides the density function of the multinomial distribution. This latter function is part of R. Importantly, dmultinom() requires a set of probabilities summing to one, which differs from amplification efficiency values: efficiencies are defined at the level of alleles, irrespective of the genotype, and do not necessarily sum to one. Still, because dmultinom() re-scales the values given as probabilities so that they do sum to one, we can provide efficiency values directly to the prob argument of the function.
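As a quick sanity check of this re-scaling behaviour, one can compare the density obtained from raw efficiencies with the density obtained after normalising them by hand (the read counts and efficiencies below are made up for illustration, not taken from the dataset):

```r
# Hypothetical read counts and efficiencies (purely illustrative values)
reads <- c(40, 25, 35)
eff   <- c(1.2, 0.8, 1.0)   # efficiencies: note they do not sum to 1

# dmultinom() normalises prob internally, so both calls agree
d.raw  <- dmultinom(x = reads, prob = eff)
d.norm <- dmultinom(x = reads, prob = eff / sum(eff))
all.equal(d.raw, d.norm)  # TRUE
```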

We can try the function by computing, for instance, the probability of having observed the read numbers associated with our first amplicon, assuming equal efficiency between all alleles. Note that we set the logOutput argument to FALSE here to obtain the result on the probability scale rather than the log scale.

We now want to estimate the efficiency values from our data. To do that, we will use the function optim() from R. Starting from initial values for the efficiencies (defined by the par argument, which we set to equal efficiency values of 1), optim() will propose other efficiency values until the loglikelihood is maximal. Note that the loglikelihood is maximal for the same set of amplification efficiencies that would maximize the likelihood on the non-log scale. Since optim() searches for a minimum by default, we use the option fnscale=-1 to search for the maximum. Also, optim() can use different optimization algorithms (see ?optim for details). We chose the algorithm referred to as method="L-BFGS-B" because it works very well on simulated datasets and is reasonably fast.
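As a rough sketch of what such an optimization could look like, the snippet below maximises the summed log-likelihood of all amplicons over the allele efficiencies. The read-count matrix and the helper LogLikAll are made up for illustration; they are not the objects from the workflow:

```r
# Hypothetical amplicon-by-allele read counts (0 = allele absent)
matrix.data <- rbind(c(40, 25, 35,  0),
                     c( 0, 30, 30, 40),
                     c(50,  0, 20, 30))

# Summed log-likelihood over all amplicons, given candidate efficiencies
LogLikAll <- function(efficiencies, reads.matrix) {
  sum(apply(reads.matrix, 1, function(reads) {
    idx <- which(reads != 0L)
    dmultinom(x = reads[idx], prob = efficiencies[idx], log = TRUE)
  }))
}

# Maximise (fnscale = -1) starting from equal efficiencies of 1,
# with bounded L-BFGS-B as in the text
fit <- optim(par = rep(1, ncol(matrix.data)), fn = LogLikAll,
             reads.matrix = matrix.data, method = "L-BFGS-B",
             lower = 0.1, upper = 6, control = list(fnscale = -1))
fit$par  # estimated (relative) efficiencies
```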

This algorithm takes boundaries for the parameters (defined by the upper and lower arguments). We constrained optim() to look for efficiency estimates between 0.1 and 6, but if your estimates reach those boundaries, you should modify these limits. Let's also measure how long the estimation takes using the function system.time() (look at the column elapsed; the result is expressed in seconds).

To see why only relative efficiencies can be estimated, imagine an amplicon with 4 alleles, 3 having an identical efficiency and 1 having an efficiency twice as high as the others, and then simulate the number of reads of each allele, given a total number of 100 reads, in two conditions: 1) with efficiencies equal to 1, 1, 1 and 2; 2) with efficiencies equal to 0.7, 0.7, 0.7 and 1.4. Simulations are performed using the function rmultinom() from R, which we will reuse later.

As you can see, both simulations provide the same numbers of reads (we used set.seed() to provide the same seed to the random generator used by rmultinom(), so that differences would not be caused merely by randomness). This is not surprising, as we said before that the efficiencies are transformed into probabilities summing to 1, but it shows that optim() cannot estimate the absolute efficiency values, only values relative to one another.

par(mar = c(6, 6, 1, 1), mgp = c(4.5, 1, 0))
plot.new()
plot.window(ylim = c(0, 2.5), xlim = c(1, nb.alleles))
abline(h = 1)
segments(x0 = 1:nb.alleles, y0 = 1, y1 = efficiency.standardised, col = "grey")
abline(h = 2, lty = 2)
points(efficiency.standardised, col = "blue", pch = 20)
axis(2, las = 2, cex = 1.5)
name.alleles <- substring(colnames(matrix.data), first = 2)
axis(1, labels = name.alleles, at = 1:nb.alleles, las = 2, cex = 0.5)
box()
title(ylab = "Standardised efficiency", xlab = "Alleles", cex.lab = 1.5)

[Figure: standardised efficiency of each allele, with reference lines at 1 and 2]

In this section, we will describe how to compute the minimum number of reads necessary to reach a coverage of 99.9% for each genotype, using different sets of assumptions.
The results they present are derived from heavy computations involving the negative multinomial distribution. Here, we instead propose a much simpler and computationally lighter alternative method, which relies on simulations involving the multinomial distribution. Note that our method leads only to an approximation of the exact results derived from the negative multinomial distribution. However, it turns out that in practice the approximation is very good, and its fast computational performance allows one to easily explore more realistic scenarios than those envisaged by Galan et al., which fully justifies our alternative simulation-based approach.
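The core of such a simulation-based approach can be sketched in a few lines: draw read counts from a multinomial distribution and record how often every allele of the genotype receives at least one read. The function name and defaults below are hypothetical, not those of the workflow:

```r
# Sketch: estimate the probability that every allele of a genotype is
# covered by at least one read, given per-allele efficiencies and a
# total read number, by multinomial simulation (assumed names).
CoverageConfidence <- function(efficiencies, total.reads, nb.simu = 1000) {
  draws <- rmultinom(n = nb.simu, size = total.reads, prob = efficiencies)
  # A simulation "covers" the genotype when no allele has zero reads
  mean(colSums(draws == 0) == 0)
}

# Example: 4 alleles, one amplifying twice as well as the others
CoverageConfidence(c(1, 1, 1, 2), total.reads = 50)
```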

We first need a function that will simulate genotypes and return the coverage for a given total number of reads. To find the minimum number of reads required (i.e. T1), the function starts from a T1 value that is too small (1 by default) and then increments this value until the coverage becomes sufficient. The argument minReadNb allows saving computing time by starting the incrementation of T1 at a value higher than 1. If this argument is set too high, however, we programmed a recursive call to the function (i.e. the function calls itself) that will use a smaller initial value for T1 (the number of units removed is set by the argument penalty).
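The search procedure just described can be sketched as follows. This is a simplified stand-in with assumed names (FindT1 and its arguments are illustrative, not the workflow's actual function):

```r
# Sketch of the T1 search: increment the candidate read number until the
# simulated coverage reaches the target; if the starting value already
# satisfies the target, step back recursively (as described in the text).
FindT1 <- function(efficiencies, target = 0.999, minReadNb = 1,
                   penalty = 10, nb.simu = 1000) {
  Coverage <- function(size) {
    draws <- rmultinom(n = nb.simu, size = size, prob = efficiencies)
    mean(colSums(draws == 0) == 0)
  }
  if (minReadNb > 1 && Coverage(minReadNb) >= target) {
    new.start <- max(1, minReadNb - penalty)
    warning("minReadNb is too high; the function will use minReadNb = ",
            new.start)
    return(FindT1(efficiencies, target, new.start, penalty, nb.simu))
  }
  T1 <- minReadNb
  while (Coverage(T1) < target) T1 <- T1 + 1
  T1
}
```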

The same example as above, with a too high value for minReadNb as a starting point:

## Warning: minReadNb is too high; the function will use minReadNb = 40
## Warning: minReadNb is too high; the function will use minReadNb = 30

T1.4.alleles.bis

## [1] 38
Because the result is stochastic, we might want to perform several rounds of simulations (here 100) and take the median value among the T1 values obtained. In order to save computing time, we will set minReadNb a bit below the number we just found. We will also measure how long it takes.

To estimate the coverage confidence associated with a number of reads while taking into account efficiency variation, we can use the same simulation process as the one we used to replicate Galan's T1, this time considering the amplification efficiencies we estimated (rather than an efficiency of 1 for all alleles). We will predict the coverage for the genotypes present in our dataset. Unlike the previous function (SimuGalan()), the function SimuVariableEfficiencies() requires the argument row, which allows the function to know which amplification efficiencies to consider when simulating the read counts.

## Warning: minReadNb is too high; the function will use minReadNb = 65
## Warning: minReadNb is too high; the function will use minReadNb = 65
## Warning: minReadNb is too high; the function will use minReadNb = 65
## Warning: minReadNb is too high; the function will use minReadNb = 65
## Warning: minReadNb is too high; the function will use minReadNb = 65
## Warning: minReadNb is too high; the function will use minReadNb = 65
## Warning: minReadNb is too high; the function will use minReadNb = 65
## Warning: minReadNb is too high; the function will use minReadNb = 65
## Warning: minReadNb is too high; the function will use minReadNb = 65
## Warning: minReadNb is too high; the function will use minReadNb = 65

The previous values of T1 assume that the efficiency of an allele is independent of the genotype in which it is expressed. To relax this assumption, we can assume that the observed read frequencies within a genotype are the best estimates of the amplification efficiencies of the alleles within that genotype.
Calculating T1 under this assumption will also allow us to assess what the minimum total number of reads would have been to identify the same genotypes as we did.

To do so, we first need a function that will perform the resampling.

We then program the function that will provide the coverage confidence for a given genotype and total number of reads:
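A minimal sketch of this resampling idea, using assumed function names (ResampleReads and CoverageResampled are illustrative, not the workflow's functions): the observed read counts of a genotype serve directly as efficiency estimates when simulating new read counts.

```r
# Resample read counts for one genotype, treating its observed read
# counts as the per-allele efficiency estimates (assumed names).
ResampleReads <- function(observed.reads, total.reads) {
  rmultinom(n = 1, size = total.reads, prob = observed.reads)
}

# Coverage confidence: proportion of simulations in which every allele
# of the genotype receives at least one read.
CoverageResampled <- function(observed.reads, total.reads, nb.simu = 1000) {
  covered <- replicate(nb.simu,
                       all(ResampleReads(observed.reads, total.reads) > 0))
  mean(covered)
}
```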