Ethics statement for field work and collecting DNA samples
For the present study, we used 40 unrelated samples of the Pallid Atlantic Forest Rat Delomys sublineatus (Thomas, 1903), a species endemic to the Brazilian coastal rainforest (Mata Atlântica). This rodent is used as an indicator species in conservation studies [24–32]. The species was selected because of its high copy number variation in MHC class II DRB loci and its relevance for further immuno-ecological studies [32–34]. Trapping, animal handling and collection of tissue samples for this project complied with international guidelines, were approved and permitted by the national authority, the Instituto Brasileiro do Meio Ambiente e dos Recursos Naturais Renováveis (IBAMA, permission 11573-2), and conformed to guidelines sanctioned by the American Society of Mammalogists Animal Care and Use Committee [35]. Because our study did not involve any experimentation (e.g. maintenance in captivity, injection of drugs, or surgery), approval from the Ethics Committee of the Institute of Biosciences, University of São Paulo (Comissão de Ética em Uso de Animais Vertebrados em Experimentação, CEA, http://ib.usp.br/etica_animais.htm) was not required.
Traditional approach: molecular cloning followed by Sanger sequencing
Genomic DNA was extracted from ear tissue using the GEN-ial all tissue kit (GEN-IAL GmbH, Troisdorf, Germany) following the manufacturer’s instructions. We examined a 228 bp fragment of MHC class II DRB exon 2 coding for the major part of the functionally important antigen-binding sites of the β1 domain. The targeted fragment was amplified in all individuals (N = 40) using the primer pair JF8-iV (f-primer: 5'-TGGACGAGCAAGACGTTCCTGT-3') and Tub2JF (r-primer: 5'-CGAYCCCGWAGTTGTGTCTGCA-3'). The primers bind to typically highly conserved regions of MHC class II exon 2 across mammals. The forward primer was designed to optimize specificity to Delomys sublineatus. Extensive testing of different primer combinations was carried out at the beginning of the project to design the best primers possible. To minimize misincorporation errors, PCR products were generated with a proofreading polymerase (HotStar HiFidelity polymerase, Qiagen, Hilden, Germany). Two independent PCRs for cloning were performed per individual, and the outcome was considered congruent if the identical number and sequence pattern of variants was detected by subsequent Sanger sequencing.
PCR was conducted in a total reaction volume of 20 μl including 200 ng DNA, 0.75 μM of each primer (Sigma-Aldrich, Steinheim, Germany), 5× HotStar HiFidelity PCR buffer (including dNTPs and MgSO4) and 0.5 units of Taq polymerase. Thermocycling comprised an initial denaturation step at 96°C for 10 min, followed by 33 cycles of 1 min denaturation at 96°C, 1 min annealing at 58°C and 3 min extension at 72°C. A final extension step was performed at 72°C for 15 min. PCR products were purified (QIAquick PCR Purification Kit, Qiagen, Hilden, Germany) and cloned into a pCR®4-TOPO vector using the TOPO TA cloning kit for sequencing (Invitrogen, Karlsruhe, Germany). Initially, up to 90 clones were sequenced per individual to detect the saturation threshold. The number of different MHC alleles detected as a function of the number of sequenced clones per individual indicated that the saturation plateau was reached after 20-25 clones in most individuals (figures not shown). As a conservative approach, 40 recombinant clones per individual were selected and amplified using the vector primers T7 and M13 rev. Cloned PCR products were purified and sequenced directly on an ABI 3130xl automated sequencer using the BigDye Terminator v3.1 Cycle Sequencing Kit (both Applied Biosystems Deutschland GmbH, Darmstadt, Germany).
The criteria used to define a sequence from cloning/Sanger sequencing as a true MHC allele were based on its occurrence in at least two independent PCR reactions derived from the same or different individuals. Allele sequences were named according to the nomenclature rules set by Klein et al. [36].
NGS approach: laboratory procedures for 454 pyrosequencing
In order to facilitate PCR error and bias recognition, all individuals (N = 40) were amplified twice in independent PCRs (referred to as amplicon replicates), i.e. each individual was included twice on the 454 plate using different barcoding tag combinations. For library preparation we used fusion primers consisting of four parts (Additional file 1: Figure S1). At the 5’ end the primers contain an adaptor sequence (forward adaptor A: CGTATCGCCTCCCTCGCGCCA, reverse adaptor B: CTATGCGCCTTGCCAGCCCGC), followed by an internal library key (TCAG). A combination of barcode sequences recommended by Roche, the so-called multiplex identifiers (MIDs, 10 bp long), was used to identify each amplicon replicate. For each direction a set of nine different MIDs, differing from each other in at least three positions, was chosen (f-MID, r-MID). The 9 forward and 9 reverse MIDs yield 81 possible combinations and allowed us to pool and distinguish 80 different samples (two PCR replicates for each of the 40 individuals) in one single 454 region using only 18 different fusion primers. The DRB-specific amplification primers were the same as those used for cloning (JF8-iV, Tub2JF) and formed the last part of the fusion primers (Additional file 1: Figure S1).
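The tag arithmetic above can be sketched in a few lines of Python; the MID labels below are placeholders for illustration, not the actual Roche sequences:

```python
from itertools import product

def mid_combinations(n_forward, n_reverse):
    """Return all forward/reverse MID pairings usable as amplicon tags."""
    # Placeholder labels; the real Roche MIDs are specific 10-bp sequences.
    f_mids = [f"fMID{i}" for i in range(1, n_forward + 1)]
    r_mids = [f"rMID{i}" for i in range(1, n_reverse + 1)]
    return list(product(f_mids, r_mids))

# 9 forward x 9 reverse MIDs give 81 distinct tag pairs, enough for the
# 80 amplicons (2 replicates x 40 individuals) with only 18 fusion primers.
pairs = mid_combinations(9, 9)
```

Because the tags grow multiplicatively while the primer count grows additively, two-sided barcoding keeps the number of (costly) fusion primers small.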
PCR was carried out in 25 μL reaction volumes containing 0.4 μM of each fusion primer, 0.2 mM dNTPs, 2.5 μL FastStart buffer and 1.25 U FastStart HiFi Polymerase (Roche Diagnostics GmbH, Grenzach-Wyhlen, Germany). Reactions comprised an initial denaturation step at 94°C for 2 min, followed by 35 cycles of 30 sec denaturation at 94°C, 30 sec annealing at 58°C and 1 min extension at 72°C, with a final extension at 72°C for 7 min. After amplification the PCR products were purified using the Agencourt AMPure system (Agencourt Bioscience Corporation, Beverly, MA) and then quantified with the Quant-iT PicoGreen dsDNA Assay Kit (Invitrogen Corporation). Subsequently, all amplicons were diluted to 200,000 molecules/μl and pooled.
During emulsion PCR (emPCR), the library was clonally amplified using the GS FLX Titanium SV emPCR Lib-A kit (Roche Diagnostics GmbH). Following immobilization onto DNA capture beads, the bead-bound amplicons were added to the emulsion oil and the amplification reagents. Through shaking on a TissueLyser (Qiagen), each bead was captured within its own microreactor, in which PCR amplification occurred. The emulsion was dispensed into a 96-well plate and the PCR program was run according to the manufacturer's instructions. After amplification, the beads were recovered by emulsion breaking and washed. DNA library beads were enriched using a biotinylated primer A (complementary to adaptor A) bound to streptavidin-coated magnetic beads. The DNA library beads were then separated from the magnetic beads by melting the double-stranded amplification products, leaving a population of bead-bound single-stranded template DNA fragments, to which the sequencing primer was annealed. The library pool was then sequenced in one GS FLX Titanium run on 1/8th of a 70 × 75 PicoTiter plate.
Initial 454 data quality check and reads filtering
454 sequencing images and signals were processed with the Genome Sequencer FLX System Software using the standard amplicon pipeline option. A definition of all terms used is provided in Table 1. To ensure data quality and to select reads for the subsequent data validation procedures, we applied several filtering steps (Figure 1). As the initial step, all reads substantially shorter (<250 bp) than the expected length (~290 bp including MIDs, specific MHC primers and target) were removed. Note that at this step potential pseudogenes with structural abnormalities, which might be of interest to population genetic studies or to comparative genetics, were lost. Subsequently, all remaining reads were sorted based on their forward and reverse MID, discarding those reads with incomplete/incorrect MID sequences that could not be assigned to any individual. From this point onwards all steps were performed separately for each independent amplicon (i.e. each MID combination). Reads were further filtered out when showing an incomplete/incorrect primer region and/or low quality (less than 95% of bases with a Phred quality score > 20). In the last steps we looked for indels, allowing only indels of multiples of 3 bp (corresponding to whole amino acids), and selected all reads with the expected target sequence length. The reads were finally aligned with Muscle v3.8 [37] using a Python script to automate the process. The alignments were imported into the software Geneious Pro v5.5 [38] and visually inspected to identify possible reads with changes in the reading frame. A shift in the reading frame was not accepted because the region analyzed comprises a coding exon. All remaining reads entered the subsequent allele and artefact identification workflow to detect potential sequencing errors and PCR artefacts before the final putative allele calling (Figure 2).
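The successive filters can be condensed into a small Python predicate. The thresholds follow the text; the argument layout (a pre-trimmed target sequence with its per-base Phred scores) is an assumption for illustration, not the authors' actual script:

```python
def passes_filters(raw_read, target_seq, target_quals,
                   min_raw_len=250, expected_target_len=228,
                   min_q=20, min_q_frac=0.95):
    """Sketch of the read filters: length, base quality and in-frame indels.

    raw_read     -- full read (MIDs + primers + target)
    target_seq   -- read after trimming MIDs and primers (assumed done)
    target_quals -- per-base Phred scores for target_seq
    """
    if len(raw_read) < min_raw_len:
        return False                    # substantially shorter than ~290 bp
    good = sum(q > min_q for q in target_quals)
    if good < min_q_frac * len(target_quals):
        return False                    # <95% of bases above Phred 20
    if (len(target_seq) - expected_target_len) % 3 != 0:
        return False                    # indel not a multiple of 3 bp
    return True
```

A read passing this predicate is still aligned and visually inspected afterwards, since a 3 bp indel keeps the frame but may still disrupt the coding sequence.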
Putative MHC alleles and artefacts identification
A workflow consisting of a series of steps (I to III) was developed for assigning the final reads to putative MHC alleles (Figure 2). Each step was performed for all amplicons across individuals before moving to the next step, and step III was repeated until all clusters were classified either as ‘putative artefact’, ‘unclassified variant’, or ‘putative allele’.
The first step of this workflow treats each independent amplicon separately and begins with the assembly of all identical reads into clusters (step I). All reads not assigned to clusters (i.e. singletons) were considered ‘putative artefacts’ and excluded from further analyses, as they are likely the result of PCR or sequencing errors. For the subsequent steps all clusters (from now on also called variants, as they represent a specific sequence) were numbered and organized in hierarchical order based on their frequency (i.e. Cluster1 as the most frequent). The classification of variants at the end of step I was based on an intra-amplicon evaluation and assigned the clusters to three different categories: ‘chimera’, ‘1-2 bp diff’ and ‘>2 bp diff’. This classification was done to facilitate subsequent artefact recognition and rests on two important assumptions: 1. artefactual sequences generated in vitro are less frequent than their source(s) (usually true alleles) within an amplicon, and 2. artefacts should be less frequent within an amplicon than any true allele. Under these assumptions, we could use variant frequencies (i.e. percentages within the amplicon) as the main tool for defining ‘putative alleles’. Accordingly, true MHC alleles were expected to be amongst the most frequent clusters in the dataset, although attention must be paid to amplicons in which the primers performed with sub-optimal efficiency, so that real MHC alleles might appear at lower frequencies.
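Step I amounts to collapsing identical reads and ranking the resulting clusters; a minimal Python sketch (function name ours):

```python
from collections import Counter

def cluster_reads(reads):
    """Collapse identical reads into clusters, discard singletons
    (putative artefacts) and rank variants by decreasing frequency,
    so that Cluster1 is the most frequent variant."""
    counts = Counter(reads)
    return sorted(((seq, n) for seq, n in counts.items() if n > 1),
                  key=lambda kv: kv[1], reverse=True)
```

The returned list directly encodes the hierarchical order used in the text (Cluster1 first), with singletons already removed.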
Chimeras were identified with a dedicated Python script, which simply tested whether the query cluster could be a combination of two different, but more frequent, clusters. Although it can be difficult to differentiate in vitro chimeras from true recombinants, we considered that chimeras should always be less frequent than the putative alleles from which they originated, as stated in assumption 1 (see above).
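The test such a script performs can be sketched as follows, assuming equal-length variants; the breakpoint scan is our reconstruction of "a combination of two different, but more frequent clusters", not the authors' exact code:

```python
def is_putative_chimera(query, more_frequent):
    """Return True if `query` equals the prefix of one more-frequent
    variant joined to the suffix of another at some breakpoint."""
    n = len(query)
    for left in more_frequent:
        for right in more_frequent:
            if left == right or len(left) != n or len(right) != n:
                continue
            if any(query == left[:k] + right[k:] for k in range(1, n)):
                return True
    return False
```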
After sorting out probable chimeras, all remaining clusters (starting with Cluster2) were compared with the more frequent ones in order to identify the most similar variant. The clusters were assigned to two intra-amplicon categories based on this comparison: one or two base pair differences (‘1-2 bp diff’) or more than two base pair differences (‘>2 bp diff’). ‘1-2 bp diff’ variants are likely polymerase errors that were amplified during PCR, whereas ‘>2 bp diff’ variants probably represent more complex kinds of artefacts that are hard to define, such as mosaics of more than two fragments or chimeras combined with polymerase errors (we found examples of both kinds). After finalizing step I for all individuals, all data were organized in a local PostgreSQL database (http://www.postgresql.org/) in order to facilitate the subsequent comparisons and to keep all information organized and promptly available.
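The two intra-amplicon categories reduce to a distance computation against the most similar, more frequent cluster. A sketch assuming aligned, equal-length sequences (function name ours):

```python
def classify_vs_closest(query, more_frequent):
    """Assign a variant to '1-2 bp diff' or '>2 bp diff' according to its
    distance from the most similar, more frequent cluster (equal lengths
    assumed for this sketch; the real data are aligned first)."""
    def n_diffs(a, b):
        return sum(x != y for x, y in zip(a, b))
    dmin = min(n_diffs(query, ref) for ref in more_frequent
               if len(ref) == len(query))
    return "1-2 bp diff" if dmin <= 2 else ">2 bp diff"
```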
The second step (step II) of the workflow (Figure 2) aims to recognize most of the putative artefacts by comparing both MID combinations (i.e. amplicon replicates) of each individual, as well as amplicons from different individuals. It begins by checking, for each intra-amplicon cluster category, whether a given variant also occurs in the second amplicon. The absence of variants labelled ‘1-2 bp diff’ (IIa) or ‘chimera’ (IIc) in the second amplicon classifies them as ‘putative artefacts’. The same is not true for ‘>2 bp diff’ variants, which are classified as ‘putative artefacts’ at this step only if they are not present in any other individual. Finally, whenever a variant is classified in both amplicons as a chimera, it is also classified as a ‘putative artefact’ (IIc).
All the ‘putative artefacts’ classified so far are used in step III to help define ‘putative alleles’ based on their intra-amplicon frequencies. Variants grouped into the categories ‘1-2 bp diff’ (IIIa) and ‘>2 bp diff’ (IIIb) that were present in both amplicon replicates but less frequent than any annotated artefact were labelled ‘unclassified variants’: they are not amongst the most frequent clusters but are unlikely to be artefacts if they appear in both replicates in at least two copies each (i.e. in clusters). Those present in both amplicons and more frequent than all annotated artefacts were considered ‘putative MHC alleles’ (IIIa, IIIb-1). Variants labelled ‘>2 bp diff’ that were not detected in the second amplicon were further checked for presence in other individuals (IIIb-2). If such a variant was considered a ‘putative allele’ or ‘unclassified variant’ in another individual, it was labelled an ‘unclassified variant’; otherwise it was considered a ‘putative artefact’ (IIIb-2). In our analysis, none of the chimeras was assigned to a different category in the amplicon replicate, and therefore all were considered ‘putative artefacts’. We have, however, designed further classification steps for cases where variants do not appear as ‘chimeras’ in both replicates. In such a case, a variant is assumed to be a ‘putative allele’ if it is present in other individuals, and is considered a natural recombinant.
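The replicate-based decision rules of steps II and III can be condensed into one deliberately simplified Python function. The argument names are ours; the real workflow iterates step III across amplicons until all variants are classified, which this single-call sketch omits:

```python
def classify_variant(category, in_replicate, freq, max_artefact_freq,
                     allele_elsewhere):
    """Simplified decision rules of steps II and III.

    category          -- intra-amplicon label from step I
    in_replicate      -- variant also found in the second amplicon replicate
    freq              -- intra-amplicon frequency of the variant
    max_artefact_freq -- highest frequency among annotated artefacts
    allele_elsewhere  -- putative allele/unclassified in another individual
    """
    if category == "chimera":
        return "putative artefact"                  # IIc (all our cases)
    if not in_replicate:
        if category == "1-2 bp diff":
            return "putative artefact"              # IIa
        return ("unclassified variant" if allele_elsewhere
                else "putative artefact")           # IIIb-2
    if freq > max_artefact_freq:
        return "putative allele"                    # IIIa, IIIb-1
    return "unclassified variant"
```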
At the end of step III, all variants are classified as ‘putative alleles’, ‘putative artefacts’ or ‘unclassified variants’ (Figure 2) for each amplicon in which they occur. Variants classified as ‘putative alleles’ and/or ‘unclassified variants’ were further evaluated in order to identify alleles with inconsistent frequencies among amplicons. Variants recognized as ‘putative alleles’ in some individuals but more frequently classified as ‘unclassified variants’, because of low frequencies (compared to artefacts) or absence from one of the individual's amplicons, were identified as ‘low frequency putative alleles’.
Alternative ready-to-use tools
Most of our analysis was done without specifically designed software packages. All data were organized in a PostgreSQL database and analyses were mainly done with either self-coded Python scripts or SQL queries. However, a number of available software packages facilitate following our workflow (Figure 2) without the need for coding. The jMHC software [39] and SESAME (SEquence Sorter & AMplicon Explorer) [40], for example, allow 454 filtered data to be separated by barcode (including 2-sided barcodes) and summarize all information about variants present in one or more amplicons. Alignments for single amplicons, ordered by variant frequency, can be saved in single files. Visualization of alignments for chimera identification may be done with MEGA5 [41], for instance, as well as the construction of pairwise sequence comparison matrices, useful for distinguishing closely related sequences (1-2 bp) from more unrelated ones (>2 bp). All data originating from jMHC can be imported in table format and organized using commonly available spreadsheet software. Any extra information (e.g. from the intra-amplicon and inter-amplicon analyses) may be entered in the same table, which is incrementally extended as one moves through our workflow (Figure 2).
Estimation of allele amplification efficiency
The reliability of genotype coverage depends on allele amplification efficiencies. We derived a maximum likelihood optimization approach to estimate the relative amplification efficiency of each allele. Importantly, this method assumes that each allele can be characterized by a single efficiency value, i.e. that the amplification efficiency of a given allele is 1) independent of the genotype and 2) similar among PCR samples run under identical conditions. The method uses the density function of the multinomial distribution, which provides the probability of an event that would have led to the number of reads observed for a given amplicon and a given set of amplification efficiency values. By multiplying the probabilities associated with each amplicon over the entire dataset (or summing them on a log scale), one obtains the (log-)likelihood of the data given the amplification efficiencies considered. Since the (log-)likelihood is a function of the amplification efficiency of each allele, one can estimate the amplification efficiencies maximizing this function using an optimization algorithm. This provides the amplification efficiency values that are most likely to have resulted in the observed dataset. The method has been implemented in R [42], a free, open-source statistical environment, and the script is provided and detailed in the additional material (Additional file 4: R-Codes).
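The authors' implementation is the R script in Additional file 4. As an illustration of the same idea, a Python sketch using numpy/scipy; the data layout (one dict of allele-index to read count per amplicon) and the function name are our assumptions:

```python
import numpy as np
from scipy.optimize import minimize

def estimate_efficiencies(amplicons, n_alleles):
    """Maximum-likelihood estimate of relative amplification efficiencies.

    amplicons -- list of dicts, each mapping allele index -> read count
                 for the alleles of that amplicon's genotype
    Allele 0 is fixed at efficiency 1 (the estimates are relative), and the
    remaining efficiencies are optimized on a log scale to stay positive.
    """
    def neg_loglik(log_e):
        e = np.exp(np.concatenate(([0.0], log_e)))
        ll = 0.0
        for counts in amplicons:
            idx = np.array(sorted(counts))
            k = np.array([counts[i] for i in idx], dtype=float)
            p = e[idx] / e[idx].sum()    # expected multinomial proportions
            ll += (k * np.log(p)).sum()  # multinomial log-pmf, up to a constant
        return -ll
    res = minimize(neg_loglik, np.zeros(n_alleles - 1), method="Nelder-Mead")
    return np.exp(np.concatenate(([0.0], res.x)))
```

For example, an allele consistently yielding half as many reads as the reference within shared genotypes should come out with an efficiency near 0.5.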
The estimated amplification efficiencies obtained by this process are relative values. Consequently, we then estimated the standardised amplification efficiency for each MHC allele using the amplification efficiency of MHC allele Desu-DRB*001 (Additional file 1: Figure S2, Additional file 2: Table S2) as a reference, i.e. setting its standardised amplification efficiency to one. This allele was chosen as the reference because the MHC forward primer was developed based on the DNA sequence of a longer fragment, obtained originally with a pair of degenerate primers, that corresponds to Desu-DRB*001. The reverse primer used in this study remained degenerate. To obtain standardised amplification efficiency values, we therefore recomputed the amplification efficiency of the other alleles by dividing their relative amplification efficiency by the efficiency of the reference allele. Importantly, this standardization is necessary for identifying potentially duplicated alleles (i.e. those with a standardised efficiency ≥ 2) but plays no role in the study of variation in amplification efficiency between alleles, nor in the coverage analyses described below.
Estimation of the minimum number of reads needed: estimation of Galan's T1 under equal and unequal allele amplification efficiencies
To compare the minimum number of reads needed to reach a certain coverage under the assumption of equal amplification efficiencies between alleles (T1Galan Simul) and taking variation in amplification efficiency into account (T1Simul VarAmplEff), we used a simulation approach based on random drawings in a multinomial distribution. In both conditions, we estimated for each genotype the minimum number of reads so that all alleles were represented by at least two reads in 99.9% of 10,000 simulations. We replicated the estimation of T1 100 times for each genotype and took the median values to generate T1Galan Simul and T1Simul VarAmplEff (Additional file 4: R-Codes).
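The simulation above can be sketched in Python with numpy's multinomial sampler; the function name and the simple incremental search are ours (the authors' R code is in Additional file 4):

```python
import numpy as np

def estimate_t1(efficiencies, n_sim=10_000, min_copies=2,
                coverage=0.999, seed=None):
    """Smallest read number such that, in `coverage` of `n_sim` multinomial
    draws, every allele of the genotype appears at least `min_copies` times.

    efficiencies -- relative amplification efficiencies of the genotype's
                    alleles, normalized here to multinomial probabilities.
    """
    rng = np.random.default_rng(seed)
    p = np.asarray(efficiencies, dtype=float)
    p = p / p.sum()
    n_reads = len(p) * min_copies          # smallest feasible depth
    while True:
        draws = rng.multinomial(n_reads, p, size=n_sim)
        success = (draws >= min_copies).all(axis=1).mean()
        if success >= coverage:
            return n_reads
        n_reads += 1
```

Setting all efficiencies equal reproduces the T1Galan Simul condition, while passing the estimated per-allele efficiencies yields T1Simul VarAmplEff; poorly amplifying alleles inflate the required depth sharply.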
To evaluate our previous assumption that the amplification efficiency of a given allele is independent of the individual and genotype, we allowed each allele to have a different probability in different amplicons. To do so, we used the allele frequencies observed in each amplicon as the expected amplification efficiencies (i.e. the number of reads representing one allele divided by the sum of all reads representing alleles in that amplicon). We simulated the distribution of the number of reads among alleles for each genotype by randomly sampling the observed reads with replacement and estimating T1Resampled as the number of sampled reads required to reach the genotype coverage threshold. For each genotype we performed 1,000 simulations per number of reads, and T1Resampled was estimated as the lowest number of simulated reads that identified all putative alleles in at least 99.9% of the 1,000 simulations. Again, this computation was performed 100 times for each genotype and the median of the T1Resampled values was retained. The entire procedure was performed on both amplicon replicates of each individual. The dedicated R script is provided in Additional file 4: R-Codes.
Finally, we estimated the required number of reads for different values of the minimal amplification efficiency and of the number of alleles. To do so, we simulated numbers of reads for a given number of alleles by assuming that the amplification efficiency of all but one allele is equal to one. For the remaining allele, we set the amplification efficiency to the minimum value investigated and considered all values between 0.01 and 1 for this minimum. T1 was computed 100 times for each combination of number of alleles and minimum amplification efficiency, and 10,000 sets of reads were simulated for each run. The final T1 value retained was again the median of the 100 T1 values computed for each combination.