Genome-wide analysis reveals the extent of EAV-HP integration in domestic chicken

Background: EAV-HP is an ancient retrovirus pre-dating Gallus speciation, which continues to circulate in modern chicken populations, and led to the emergence of avian leukosis virus subgroup J causing significant economic losses to the poultry industry. We mapped EAV-HP integration sites in Ethiopian village chickens, a Silkie, Taiwan Country chicken, red junglefowl Gallus gallus and several inbred experimental lines using whole-genome sequence data.

Results: An average of 75.22 ± 9.52 integration sites per bird were identified, which collectively group into 279 intervals of which 5 % are common to 90 % of the genomes analysed and are suggestive of pre-domestication integration events. More than a third of intervals are specific to individual genomes, supporting active circulation of EAV-HP in modern chickens. Interval density is correlated with chromosome length (P < 2.31 × 10⁻⁶), and 27 % of intervals are located within 5 kb of a transcript. Functional annotation clustering of genes reveals enrichment for immune-related functions (P < 0.05).

Conclusions: Our results illustrate a non-random distribution of EAV-HP in the genome, emphasising the importance it may have played in the adaptation of the species, and provide a platform from which to extend investigations on the co-evolutionary significance of endogenous retroviral genera with their hosts.

Electronic supplementary material: The online version of this article (doi:10.1186/s12864-015-1954-x) contains supplementary material, which is available to authorized users.

the mean depth of coverage was calculated to be 61.3X, with a mean insert size of 204 bp.
To evaluate the effect of sequencing depth on the pipeline, the initial mapping to Galgal4 was down-sampled by 50%. The resulting BAM file was down-sampled by a further 50%, and this step was repeated once more on the output, giving four datasets: the original at 61.3X depth of coverage, followed by 30.65X, 15.32X, and 7.66X. The BAM files were converted to FASTQ files using Picard, which retains only read pairs, and these were used as inputs for the EAV-HP mapping pipeline.
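The coverage series described above follows from repeated halving of the original depth. The short Python sketch below reproduces that arithmetic only; the function name `downsample_series` and its parameters are hypothetical, and the sketch does not perform any actual BAM down-sampling.

```python
def downsample_series(initial_coverage, fraction=0.5, rounds=3):
    """Return the expected depth of coverage after each successive
    down-sampling round, starting with the original depth.

    Hypothetical helper: it only models the arithmetic, assuming each
    round keeps exactly `fraction` of the reads.
    """
    coverages = [initial_coverage]
    for _ in range(rounds):
        coverages.append(coverages[-1] * fraction)
    return coverages


# Original dataset at 61.3X, halved three times.
coverages = downsample_series(61.3)
for depth in coverages:
    print(f"{depth:.2f}X")
```

In practice the realised depth of each down-sampled BAM file will deviate slightly from these expected values, because read sampling is random.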
In addition to testing the effects of coverage, the pipeline filters for mapping quality (MQ) and read count (RC) were also evaluated. The MQ filter is intended to avoid multiple hits arising from repeats interspersed throughout the genome, for instance CR1 elements. The BWA-MEM aligner assigns MQ=0 to alignments with equally high mapping scores; thus, if a read aligns equally well to multiple genomic intervals, e.g. due to the presence of CR1 elements, it receives MQ=0. Filtering out reads with MQ=0 therefore reduces the risk of false positives arising from intervals located within these interspersed repetitive elements. To assess this, the pipeline was run once including intervals containing reads with MQ=0 and once including only intervals with MQ≥20. A further filter within the pipeline, the RC supporting an interval, is correlated with sequencing depth of coverage. To evaluate this filter, three settings were tested: RC=3, RC=10, and RC=0.25 × Xi, in which Xi is the depth of coverage of the bird/line (i) being analysed.
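The two filters can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the `Interval` record, the choice to summarise each interval by a single mapping-quality value, and the use of "greater than or equal to" for the read-count threshold are not taken from the pipeline itself.

```python
from dataclasses import dataclass


@dataclass
class Interval:
    # Hypothetical record for a candidate integration interval.
    chrom: str
    start: int
    end: int
    read_count: int  # number of reads supporting the interval
    mapq: int        # assumed per-interval mapping-quality summary


def scaled_rc_threshold(coverage, factor=0.25):
    """Coverage-scaled read-count threshold, RC = factor * Xi."""
    return factor * coverage


def filter_intervals(intervals, min_mapq, min_reads):
    """Keep intervals meeting both thresholds (>= is an assumption)."""
    return [
        iv for iv in intervals
        if iv.mapq >= min_mapq and iv.read_count >= min_reads
    ]


# Made-up example intervals, not data from the study.
candidates = [
    Interval("chr1", 100, 300, read_count=20, mapq=60),
    Interval("chr1", 5000, 5200, read_count=20, mapq=0),  # repeat-like
    Interval("chr2", 100, 300, read_count=4, mapq=60),    # weak support
]

threshold = scaled_rc_threshold(61.3)  # 0.25 * 61.3X
strict = filter_intervals(candidates, min_mapq=20, min_reads=threshold)
permissive = filter_intervals(candidates, min_mapq=0, min_reads=threshold)
fixed_rc3 = filter_intervals(candidates, min_mapq=20, min_reads=3)
```

Scaling the read-count threshold with coverage keeps the filter comparably stringent across datasets, whereas a fixed threshold such as RC=3 becomes more permissive as depth increases.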
To establish the sensitivity and precision of the analyses, the results were compared to EAV-HP LTR alignments to Galgal4 identified by BLAT (stand-alone version, default parameters) assuming a minimum score of 20 (Supplementary Table S7). Pipeline results were considered true positives (TP) if the interval was within 500 bp of a BLAT alignment; this distance was chosen to accommodate read length and insert size. The false negative rate (FNR) was recorded as the fraction of BLAT alignments that were undetected by the mapping pipeline.
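The comparison against BLAT alignments can be sketched as follows. This is an illustrative Python example rather than the code used in the study: intervals and alignments are represented as hypothetical `(chrom, start, end)` tuples, and distance is taken as the gap between the nearest edges (zero when they overlap), which is an assumption.

```python
def gap(a, b):
    """Distance in bp between two (chrom, start, end) intervals on the
    same chromosome; 0 if they overlap."""
    return max(0, max(a[1], b[1]) - min(a[2], b[2]))


def is_near(a, b, max_dist):
    """True if a and b are on the same chromosome and within max_dist bp."""
    return a[0] == b[0] and gap(a, b) <= max_dist


def classify(intervals, blat_hits, max_dist=500):
    """Split pipeline intervals into true and false positives, and list
    BLAT alignments with no nearby interval (false negatives)."""
    tp = [iv for iv in intervals
          if any(is_near(iv, hit, max_dist) for hit in blat_hits)]
    fp = [iv for iv in intervals
          if not any(is_near(iv, hit, max_dist) for hit in blat_hits)]
    fn = [hit for hit in blat_hits
          if not any(is_near(iv, hit, max_dist) for iv in intervals)]
    return tp, fp, fn


# Made-up coordinates for illustration only.
pipeline_intervals = [("chr1", 1000, 1200), ("chr1", 9000, 9100),
                      ("chr2", 500, 700)]
blat_alignments = [("chr1", 1500, 1800), ("chr2", 5000, 5300),
                   ("chr3", 10, 300)]

tp, fp, fn = classify(pipeline_intervals, blat_alignments)
```

Here only the first interval lies within 500 bp of a BLAT alignment; the other two intervals count as false positives, and the two unmatched BLAT alignments count as false negatives.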

Sensitivity (SENS) was recorded as TP / (TP + number of false negatives). Precision (PREC) was recorded as TP / (TP + number of false positives). The false discovery rate was recorded as 1 − PREC. The results of the different analyses are presented in Table 1.
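These definitions translate directly into code. The Python sketch below is illustrative; the function name `evaluate` and the example counts are hypothetical, and FNR is computed as 1 − SENS on the assumption that each detected BLAT alignment corresponds to exactly one true positive.

```python
def evaluate(tp, fp, fn):
    """Compute SENS, PREC, FDR and FNR from counts of true positives,
    false positives and false negatives."""
    sens = tp / (tp + fn) if (tp + fn) else float("nan")
    prec = tp / (tp + fp) if (tp + fp) else float("nan")
    return {
        "SENS": sens,
        "PREC": prec,
        "FDR": 1 - prec,  # false discovery rate, 1 - PREC
        "FNR": 1 - sens,  # assumes one TP per detected BLAT alignment
    }


# Made-up counts, not results from Table 1.
metrics = evaluate(tp=8, fp=2, fn=2)
```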
Setting the MQ filter to include reads with MQ=0 results in the greatest sensitivity; however, as noted above, this carries the risk that a single integration is associated with multiple genomic intervals. Consequently, the false discovery rate (FDR) increases with increasing depth of coverage whilst the FNR decreases. The most consistent results across all depths of coverage were observed at RC=0.25 × Xi, where sensitivity and precision averaged 59% and 98%, respectively, with the MQ≥20 filter, whilst accepting reads with MQ=0 would result in an average sensitivity of 97% and precision of 95%.