Exome capture from saliva produces high quality genomic and metagenomic data
- Jeffrey M Kidd†1, 2,
- Thomas J Sharpton†3, 4,
- Dean Bobo5,
- Paul J Norman6,
- Alicia R Martin1,
- Meredith L Carpenter1,
- Martin Sikora1,
- Christopher R Gignoux7,
- Neda Nemat-Gorgani6,
- Alexandra Adams1,
- Moraima Guadalupe8,
- Xiaosen Guo9,
- Qiang Feng9,
- Yingrui Li9,
- Xiao Liu9,
- Peter Parham6,
- Eileen G Hoal10,
- Marcus W Feldman11,
- Katherine S Pollard3, 12,
- Jeffrey D Wall12,
- Carlos D Bustamante1 and
- Brenna M Henn1, 5
© Kidd et al.; licensee BioMed Central Ltd. 2014
Received: 18 March 2014
Accepted: 28 March 2014
Published: 4 April 2014
Targeted capture of genomic regions reduces sequencing cost while generating higher coverage by allowing biomedical researchers to focus on specific loci of interest, such as exons. Targeted capture also has the potential to facilitate the generation of genomic data from DNA collected via saliva or buccal cells. DNA samples derived from these cell types tend to have a lower human DNA yield, may be degraded from age and/or have contamination from bacteria or other ambient oral microbiota. However, thousands of samples have been previously collected from these cell types, and saliva collection has the advantage that it is non-invasive and appropriate for a wide variety of research.
We demonstrate successful enrichment and sequencing of 15 South African KhoeSan exomes and 2 full genomes with samples initially derived from saliva. The expanded exome dataset enables us to characterize genetic diversity free from ascertainment bias for multiple KhoeSan populations, including new exome data from six HGDP Namibian San, revealing substantial population structure across the Kalahari Desert region. Additionally, we discover and independently verify thirty-one previously unknown KIR alleles using methods we developed to accurately map and call the highly polymorphic HLA and KIR loci from exome capture data. Finally, we show that exome capture of saliva-derived DNA yields sufficient non-human sequences to characterize oral microbial communities, including detection of bacteria linked to oral disease (e.g. Prevotella melaninogenica). For comparison, two samples were sequenced using standard full genome library preparation without exome capture and we found no systematic bias of metagenomic information between exome-captured and non-captured data.
DNA from human saliva samples, collected and extracted using standard procedures, can be used to successfully sequence high quality human exomes, and metagenomic data can be derived from non-human reads. We find that individuals from the Kalahari carry a higher oral pathogenic microbial load than samples surveyed in the Human Microbiome Project. Additionally, rare variants present in the exomes suggest strong population structure across different KhoeSan populations.
Sampling of saliva or via buccal cell extractions is a widely employed, non-invasive method of collecting human DNA for both biomedical and ancestry experiments. DNA extracted from saliva fluid has been used on single nucleotide polymorphism chip arrays, methylation arrays, targeted resequencing, exome, and whole genome sequencing [1–7]. However, the low total yield of DNA from a single sample and the presence of many non-human DNA fragments make next-generation sequencing of saliva samples impractical for some applications. Targeted enrichment strategies, such as hybridization methods designed to capture the exons of annotated genes (the ‘exome’) prior to sequencing, offer a way to circumvent some of the limitations posed by saliva-derived DNA samples. We demonstrate the successful sequencing of multiple human exomes from saliva-derived samples using commercially available reagents for exome capture.
Exome sequencing and other capture methods permit the high-coverage sequencing of a small portion of the genome. This approach represents a trade-off between depth of coverage and the breadth of the genome that is interrogated, and has the potential to revolutionize genomic medicine [8, 9]. In addition to direct applications to human disease, exome sequencing of a modest number of individuals can reveal important aspects of human evolution [10–12]. The capability to apply these approaches to DNA derived from saliva, which is more easily obtained and less invasive than blood or other tissue collection, will greatly facilitate the detailed examination of genetic variants that may be associated with specific traits or have experienced adaptive evolution [13, 14].
We focus on a unique set of DNA samples from the ≠Khomani KhoeSan of South Africa to illustrate the utility of exome sequencing via saliva. African genetic diversity remains poorly understood, in part because many regions of the continent lack adequate healthcare infrastructure, which can make blood collection impractical. The indigenous KhoeSan peoples of southern Africa are a collection of hunter-gatherer and pastoralist groups who speak “click languages”, classified into three distinct language families. The genetic diversity of these, and related populations, remains under-ascertained. The genome of one Tuu-speaking San (“!Gubi”) has been fully sequenced and found to contain over 700,000 novel polymorphisms. Gronau et al. showed that this San genome was highly divergent among known genomes, even compared to other African individuals. They estimated the population divergence between western African individuals and the San to be about 110,000-130,000 years ago, over twice as old as the divergence between western Africans and Eurasians. Additionally, single nucleotide polymorphism (SNP) array data demonstrated that the ≠Khomani San population had the lowest levels of linkage disequilibrium (LD) of any population surveyed and thus the largest effective population size. However, in order to test hypotheses regarding population sub-structure, natural selection and biomedically relevant variants in Africa, it is essential to have both large sample sizes and genomic data that are unbiased with regard to ascertainment schemes.
Fifteen human saliva samples were selected for exome sequencing. Samples were split into two batches (“Pilot 1” and “Pilot 2”), representing samples enriched using the Agilent SureSelect 50 Mb human All-Exon design and sequenced with the Illumina GAII machine and a replication batch enriched using the Agilent SureSelect 44 Mb human All-Exon design and sequenced using Illumina HiSeq. We included a familial quartet with two daughters (Family 1), an extended pedigree of first cousins and half-siblings (Family 2), and eight purportedly unrelated individuals (Additional file 1: Figure S1). Family 1 displayed complex ancestry from KhoeSan, European and both eastern and western African populations (see ). Family 2 and the unrelated individuals self-reported their ancestry as being from only KhoeSan populations (Nama- or N|u-speakers). We obtained 3-25 μg total DNA from each saliva sample. Each aliquot was processed using the Agilent SureSelectXT library preparation kit followed by enrichment with the SureSelect 44 Mb or SureSelect 50 Mb human All-Exon capture probes. Using standard Illumina post-capture barcodes, libraries were sequenced on either an Illumina GAII or HiSeq machine. Aliquots from two samples (SA1000 and SA1025) were also sequenced without exome capture, using the Illumina TruSeq library preparation kit (SA1000) and the Illumina Nextera library preparation kit (SA1025). The whole genome sequence (WGS) libraries were then sequenced on two lanes of an Illumina HiSeq.
Table 1 Summary statistics for KhoeSan exomes. Columns: % un-mapped reads; % PCR duplicates; % mapped on target; median target coverage; % of variants covered. Summary rows: Pilot 1 mean; Pilot 2 mean.
Two samples (SA006 and SA035) displayed a high percentage of duplicate reads (54% and 78%) (Additional file 1: Figure S2, Table 1). To understand whether SA006 and SA035 had high duplicate rates due to low human DNA input or whether there were other issues with read data, we examined the distribution of mapping quality for all uniquely mapped reads for each sample. These two samples had the lowest numbers of mapped reads and the lowest proportion of reads with mapping qualities ≥30 (35.2% and 68.6%, respectively, Additional file 1: Figure S3). The remaining Pilot 1 samples had higher effective coverage and ~80% of reads with mapping qualities ≥30. This difference is unlikely to be due to divergence from the reference because we observed no systematic differences in mapping quality metrics between the European- and Bantu-admixed Family 1 and the KhoeSan Family 2. Due to lower mapping rates, SA006 and SA035 displayed overall lower mapped coverage than the other samples. However, 90% of target sites were covered at a depth of at least 10x for all individuals except SA006 and SA035 (Additional file 1: Figure S2). The average percent of unmapped reads was higher for saliva-derived exomes compared to six HGDP San samples sequenced using DNA obtained from cell lines (Additional file 1: Table S1). However, the primary difference in sequencing efficiency between saliva- and cell-line-derived DNA results from differences in the mean rate of duplicate reads: Pilot 1, 34.4%; Pilot 2, 12.9%; HGDP, 9.8%. Pilot 1 likely has a higher duplicate rate due to lower DNA quality (see below).
Genotype and variant statistics
Variants were called using the Genome Analysis Tool Kit (GATK) and selected using the Variant Quality Score Recalibration (VQSR) procedure with cutoffs set such that 99% of variants also found in the 1000 Genomes Omni2.5 and HapMap3 SNP training set were retained [22–24]. We identified 82,093 variants, with a transition/transversion ratio of 3.14. On average, within the target regions, each individual had a genotype call at 98% of sites variable in the 15 sample dataset (Table 1). Singleton counts varied from 657 to 3,286 autosomal sites, excluding the two daughters in Family 1 (Table 1). We computed genotype concordance for 12 individuals (sufficient DNA was not available for SA011, SA012, SA051) based on data from the Illumina OmniExpress or 550 K.v2 SNP arrays. Non-reference (NR) concordance, that is, concordance only at heterozygous or non-reference homozygous genotypes, was calculated using GATK [24, 25] and concordance exceeded 98% for all individuals genotyped.
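The non-reference concordance measure described above can be illustrated with a minimal sketch. The genotype encoding (alternate-allele counts, with None for a missing call), the function name, and the convention of counting every site where either call is non-reference are assumptions made for illustration; this is not the GATK implementation.

```python
from typing import Optional, Sequence


def non_reference_concordance(
    seq_calls: Sequence[Optional[int]],
    array_calls: Sequence[Optional[int]],
) -> float:
    """Fraction of matching genotypes among sites where either call is non-reference.

    Genotypes are alternate-allele counts (0 = hom-ref, 1 = het, 2 = hom-alt);
    None marks a missing call. This encoding is an assumption of the sketch.
    """
    compared = 0
    matching = 0
    for seq_gt, array_gt in zip(seq_calls, array_calls):
        if seq_gt is None or array_gt is None:
            continue  # skip sites missing in either dataset
        if seq_gt == 0 and array_gt == 0:
            continue  # agreement at hom-ref sites is excluded from the NR metric
        compared += 1
        if seq_gt == array_gt:
            matching += 1
    if compared == 0:
        raise ValueError("no non-reference sites to compare")
    return matching / compared
```

Excluding sites where both calls are homozygous reference keeps the large number of trivially matching reference genotypes from inflating the concordance estimate.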
Novelty compared to 1000 genomes project
Population differentiation of the KhoeSan
HLA and KIR
Table: HLA and KIR validation. Column headers: Standard genotyping (excluding SA006 & SA035); SNPs vs HG19; Concordance rate (%). Rows: KIR (13 genes); HLA class 1 A; HLA class 1 B; HLA class 2 C.
HLA and KIR validation for SA006 and SA035
Although exome capture proved an efficient method of sequencing primarily human DNA, each sample also contained more than a million unmapped reads (Table 1). We hypothesized that these unmapped reads might represent non-human DNA carried through the saliva extraction. Although we obtained useful results, with high concordance to SNP genotyping arrays, such microbial contamination may contribute to lower effective coverage levels. We therefore subjected these unmapped reads to an independent quality control procedure and used a fragment recruitment approach described by Rusch et al. to identify homologs of non-human reference genomes among a combined pool of 24,139,131 high-quality unmapped reads (Figure 1). To estimate the number of species that are detected, we applied a recruitment threshold based on the 95% average nucleotide identity threshold that is commonly used to define microbial species.
Table: KhoeSan saliva microbiome abundance by read threshold (recoverable column header: 75% read coverage).
Some of the abundant KhoeSan saliva microbiota are known contributors to oral disease. For example, Prevotella melaninogenica (recruits 5.9% of unmapped reads after correcting for genome length) is associated with rapidly progressing periodontitis lesions. Similarly, Streptococcus parasanguinis (6.3%) is a primary colonizer of human teeth and contributes to dental plaque formation. Granulicatella elegans (2.7%), an oral commensal associated with infective endocarditis, is also found in high abundance among the KhoeSan. We also specifically ascertained the presence of several biomedically important organisms, some of which may exist at relatively low abundance. For example, the Porphyromonas gingivalis genome, which represents organisms implicated in periodontal disease and has been linked to rheumatoid arthritis and heart disease, recruits a relatively large fraction of reads from all individuals (1.68%). Conversely, the Campylobacter rectus genome, which is also associated with periodontitis, recruits a relatively small fraction of reads (0.24%). Only 8 reads (2.3 × 10⁻⁴% of genome length-corrected recruitments) were recruited with high fidelity to the genome of Mycobacterium tuberculosis, the causative agent of tuberculosis, a disease that is common in the Northern Cape region of South Africa. These reads map with equally high fidelity to the genomes of other Actinobacteria, suggesting that they may be homologs of ancient and highly conserved Actinobacteria sequences and are not necessarily representatives of the M. tuberculosis genome. Robust detection of M. tuberculosis from saliva-derived exome capture sequence data requires additional experimentation and validation.
North American versus South African oral microbiomes
We then compared the diversity of the KhoeSan oral microbiome to the diversity observed in a recent and extensive survey of healthy North Americans in the Human Microbiome Project (N = 294). This prior HMP work was conducted through analysis of small subunit ribosomal RNA (i.e., 16S rRNA) gene sequences that were taxonomically annotated to the genus level. We used these sequences to calculate genus-level, genome length-normalized relative abundances for each North American microbiome. We used the taxonomy associated with each genome in our fragment recruitment database to calculate genus-level, length-normalized relative abundances for each KhoeSan microbiome. Comparing each population’s median relative abundance for each genus, we find that most taxa exist at similar abundance levels in the two populations (Spearman’s rho = 0.91, p-value < 2.2e-16). However, there are five genera that are present in relatively high abundance (Bonferroni-corrected Wilcoxon rank sum test p < 0.01) in the KhoeSan and effectively undetected among the North Americans given the level of discovery in the HMP (Figure 6C): Rothia, Granulicatella, Haemophilus, Eubacterium, and Filifactor. Most notable among these is Rothia, which is the third most abundant genus in the KhoeSan and contains Rothia mucilaginosa, a known oral opportunistic pathogen that has been linked to systemic diseases [42, 43].
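As an illustration of the length-normalized relative abundances used in this comparison, the following minimal sketch divides the number of reads recruited to each taxon by a genome length and rescales the results to sum to one. The exact normalization applied in the analysis is not specified in the text, so the formula, the function name, and the example inputs are assumptions.

```python
from typing import Dict


def length_normalized_abundance(
    read_counts: Dict[str, int],
    genome_lengths: Dict[str, int],
) -> Dict[str, float]:
    """Divide recruited read counts by genome length, then scale to sum to 1."""
    # Reads per base: longer genomes recruit more reads at equal cell abundance.
    density = {
        taxon: read_counts[taxon] / genome_lengths[taxon]
        for taxon in read_counts
    }
    total = sum(density.values())
    if total == 0:
        raise ValueError("no reads recruited to any genome")
    return {taxon: value / total for taxon, value in density.items()}


# Hypothetical input: two taxa recruit the same number of reads,
# but taxon_b has a genome four times longer.
abundance = length_normalized_abundance(
    {"taxon_a": 100, "taxon_b": 100},
    {"taxon_a": 1_000_000, "taxon_b": 4_000_000},
)
```

In this hypothetical case the taxon with the shorter genome receives the larger relative abundance, because the same read count is spread over fewer bases.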
The extremely high genetic diversity in the KhoeSan, estimated from genome-wide SNP arrays and the “Bushman” genome, has renewed interest in understanding the population history of southern Africans [2, 15, 26, 27]. Comparatively few genomic sequences are publicly available (6 individuals total) from the KhoeSan, and ascertainment bias on many of the standard SNP arrays may strongly skew estimates of genetic diversity in these populations. We have generated 15 exomes and 2 genomes from the South African ≠Khomani San greatly expanding the number of genomic sequences available. Estimates of genetic diversity from these South African individuals are comparable to genetic diversity from the Yoruba from Nigeria or Luhya from Kenya (Figure 3). While we do not find a higher number of private SNPs in the KhoeSan, this may be biased due to endogamy among the ≠Khomani San and differences in coverage or SNP calling/imputation pipelines between 1000 Genomes and our procedure (Figure 1). Heterozygosity and singleton identification remain highly sensitive to coverage and calling pipelines thus making direct cross-study comparisons difficult. However, for common SNPs, we show that the KhoeSan strongly differentiate from all other human populations in structure analyses; the KhoeSan and Europeans fall at opposite ends of the 1st principal component, while western and eastern Africans fall at intermediate points on this axis. Furthermore, we find substantial sub-structure among the South African and Namibian KhoeSan, despite recent gene flow from Bantu-speaking groups and Europeans into the ≠Khomani, !Kung and Tuu populations.
Two of our samples, SA006 and SA035, had demonstrably lower mapping quality and coverage. We consider three possibilities for these characteristics. First, it is difficult to identify the proportion of human DNA versus microbial or other non-human DNA in a saliva aliquot. If these two samples had by chance a lower volume of human DNA input for the exome capture reaction, then there would be fewer opportunities for human DNA to bind to the specific probes and the library would likely result in a higher number of duplicate read pairs. SA006 and SA035 do display an increased duplicate rate (54% and 78%, respectively), but SA008 also displays a high duplicate rate with minimal effect on mapping quality. Additionally, poorer mapping quality might be expected if the microbial reads map to the human genome, perhaps due to near sequence identity between some portion of the human and microbial genomes.
A second possibility is that the total amount and quality of the human DNA input initially may have been sufficient, but the presence of non-human substances such as residual tobacco or bacterial DNA may have acted as inhibitors, preventing normal binding to human probes. Third, the DNA in these two samples may have been more degraded than the other six Pilot 1 samples. However, although we do observe an increase in substitutions at the start of the reads for SA006 and SA035, we find no evidence of an ancient DNA degradation pattern in the post-capture sequence data. While the listed possibilities appear unlikely, it is possible that other patterns of degradation, not yet reported in the literature, occur in relatively young DNA extractions.
Oral microbiome from exome sequencing
Approximately 5.1% of the sequence data generated did not map to the human genome. Using a phylogenetically diverse set of reference genomes and a fragment recruitment approach, we identified those unmapped reads that are homologs of regions in non-human genomes. We find that most of the reads map to genomes of well-described commensal microorganisms of the human mouth, suggesting that this sequencing platform produces relevant information about the human oral microbiome. We also find that analysis of exome-capture metagenomes produces microbiome diversity estimates consistent with those obtained from non-exome-capture metagenomes, indicating that this platform can be used to reliably quantify microbiome diversity and abundance. We note that other capture technologies or probe designs may result in fewer off-target reads, and a corresponding reduction in the ability to analyze the microbiome [45, 46]. Additionally, different saliva collection kits or the use of pre-collection mouth washes may affect the yield of microbial-derived sequences.
The large fraction of non-human sequences that do not map to our reference genomes are likely low quality and degraded sequences or are reads from organisms that are outside of the bounds of the phylogenetic diversity sampled in our reference database, such as viral genomes. The size of this fraction may be exacerbated by the relatively conservative alignment thresholds applied during our analysis. Our ability to detect oral commensals indicates that this human exome sequencing platform provides the added benefit of being able to assay biogeographic patterns of oral microbiome diversity. Given that many of the non-human reads can be mapped with high stringency to genomes of known pathogens, we hypothesize that this sequencing platform may be useful as a diagnostic tool for the detection of disease and that the data obtained may be used for inferring cryptic phenotypes of the sampled individuals (e.g., periodontitis status). Future studies that focus on the sensitivity and specificity of pathogen detection will be required to test this hypothesis.
As a cautionary note, one genome that recruits a substantial number of reads (9.4% of total reads) is Beggiatoa sp. PS. Beggiatoa have been found in sulphur springs, sewage contaminated water, and hydrothermal vents; to date, no one has described the presence of Beggiatoa in the human mouth. We found that the Beggiatoa-recruited reads map to short, unassembled contigs that exhibit significant similarity to clone libraries of the human genome. Thus, we suspect that our detection of Beggiatoa is the result of low quality human reads that fail to align to the human genome reference sequence but do align to regions of the Beggiatoa genome. This observation highlights the importance of considering the effect of human genome contamination when using fragment recruitment to study the human microbiome.
KhoeSan microbiome diversity
Understanding KhoeSan microbiome diversity and structure provides insight into the co-evolution of the human microbiome, given the ancient divergence of KhoeSan from other African populations. It additionally clarifies the effect of lifestyle on microbiome composition, as most studies focus on individuals living contemporary Western lifestyles. Similar to studies conducted in Western populations [48, 49], we find that the KhoeSan salivary microbiome is dominated by a small number of taxa, with the Firmicutes or Proteobacteria predominating, and exhibits high diversity within and between individuals. These observations suggest that the overall structure of the KhoeSan salivary microbiome is similar to that found in Western individuals.
However, when evaluating differences in the relative abundance of genera associated with the KhoeSan and a population of healthy Americans, we identified several abundant taxa in the KhoeSan that were at very low abundance or undetected among the Americans. These differences in microbiome structure may be due to differences in (1) the evolutionary history of the populations, (2) demographics, or (3) host environment or lifestyle, including diet and access to health care. Given that we find many known pathogens among the most abundant members of the KhoeSan microbiome and that many of the differentially detected genera contain known oral pathogens (e.g., Rothia, Granulicatella, Filifactor), we speculate that the relatively limited access to dental care, antibiotics and/or absence of water fluoridation among the KhoeSan is driving most of the observed differences between populations. However, the biology of several of the differentially abundant genera is not well understood, especially in the context of the commensal oral microbiome (e.g., Mobiluncus), or is principally limited to the pathogenic members of the genus; such genera may contain species that played an important role in the coevolution between the KhoeSan and their salivary microbiome. This may include pathogenic organisms, such as Aggregatibacter actinomycetemcomitans, the causative agent of adolescent periodontal disease, which is common in those of African descent and a member of a relatively abundant genus in the KhoeSan.
Further study of the microbiomes associated with the KhoeSan and other diverse human populations (e.g., ), the microbiomic differences between these populations (e.g., [52, 53]), especially across a variety of host physiological conditions, and the biology of commensal microbiota that are underrepresented in Western populations is needed to comprehensively differentiate the sources of variations observed between populations and to understand the coevolution between humans and their microbiome.
We have demonstrated the ability to obtain high quality exome sequence data from saliva-derived human DNA. We show that even samples with low human DNA presence can be successfully captured using exome in-solution target probes. Additionally, after examining some of the most diverse human loci, we find that exon-capture is able to enrich and facilitate high-resolution analysis of highly polymorphic HLA and KIR genes from DNA extracted from human saliva. We also demonstrated that exon-captured DNA sequencing of saliva reveals insight into the structure and diversity of the oral microbiome.
Sampling of the ≠Khomani KhoeSan in Upington, South Africa and neighboring villages occurred in 2006. Institutional Review Board (IRB) approval was obtained from Stanford University. Individuals who were still living in 2011 were re-consented under a modified protocol (IRB approval from Stanford University and Stellenbosch University, South Africa). ≠Khomani N|u-speaking individuals, local community leaders, traditional leaders, non-profit organizations and a legal counselor were all consulted regarding the aims of this research, prior to collection of DNA. All individuals consented orally to participation, with a second, local native speaker witnessing, and were re-consented with written consent. DNA via saliva (Oragene® kits) and ethnographic information regarding self-identified ancestry (N|u, Nama, or ‘Coloured’), language and parental place of birth were collected for all participants.
Library preparation and exome enrichment were performed as described in the Agilent SureSelectXT Target Enrichment System for Illumina Paired-End Sequencing Library (Version 1.1.1, January 2011). First, purified DNA from saliva samples was concentrated to a volume compatible with the library preparation protocol. Three micrograms of concentrated genomic DNA were fragmented to a median size of 200 bp using the Covaris-S2 instrument with the following settings: duty cycle 10%, intensity 5, cycles per burst 200, and mode frequency sweeping for 180 s at 4°C. The fragmentation efficiency was evaluated on the Agilent Bioanalyzer using DNA1000 chips. After end-repair and A-tailing, sequencing adapters were ligated onto the DNA fragments, followed by size-selection using SPRI beads (Agencourt AmPure XP) and PCR amplification. The amplification product was purified with SPRI beads and the quantity and quality were assessed using the Bioanalyzer DNA1000 chip. Five hundred nanograms of the adapter-ligated DNA library were concentrated to 3.4 μl, mixed with hybridization buffer and DNA blocker mix, and added to the SureSelect 50 Mb All-Exon capture probe library. The mixture was incubated for 24 hours at 65°C in a thermal cycler. The hybridization mixture was added to streptavidin-coated M-280 Dynabeads (Invitrogen) and incubated for 30 min at room temperature, with mixing. The beads were washed with 500 μl SureSelect wash buffer #1 for 15 min at room temperature, and three times with 500 μl SureSelect wash buffer #2 for 10 min at 65°C. DNA was eluted with 50 μl SureSelect elution buffer for 10 min at room temperature and neutralized with 50 μl of SureSelect neutralization buffer. The captured product was purified with SPRI beads and amplified by PCR. The quality and concentration of the sequencing libraries were verified by the Bioanalyzer High Sensitivity DNA kit (Agilent).
Indexed samples were pooled in an equimolar ratio and sequenced on the Illumina HiSeq2000 according to standard protocols. A similar procedure was followed for the Pilot 2 samples with the SureSelect 44 Mb All-Exon capture probe library.
Read mapping and SNP calling
Illumina sequencing reads were mapped to the human genome reference sequence (GRCh37) following a standard pipeline informed by the best practices described by the 1000 Genomes project [24, 54] (Figure 1). Pilot 1 reads were trimmed to be 75 bp in length; Pilot 2 reads were 101 bp in length. Reads were mapped and paired using bwa version 0.6.2. Unmapped reads were identified at this stage and processed via the metagenomic pipeline. Duplicate read pairs were identified using Picard (http://picard.sourceforge.net/). Base qualities were empirically recalibrated and indel realignment was performed jointly across all samples using the Genome Analysis Tool Kit (GATK) v1.6. BAM files containing only uniquely mapped reads with duplicates removed were analyzed by the program SAMStat. The fraction of reads on target was determined using snpEff.
Sequencing reads from the samples described in Schuster et al. were obtained from the short read archive and remapped to the GRCh37 assembly. The exome capture data from Schuster et al. consisted of single-end sequences obtained with 454 pyrosequencing technology. Reads were mapped using the bwasw option in bwa version 0.5.9. Processing was performed as described above, with the exception of omitting the ‘homopolymer’ recalibration covariate and skipping the indel realignment step, which is not supported for 454 reads.
Read substitution bias
For Pilot 1, rates of nucleotide substitutions at each position along the reads were determined by comparing the mapped reads to their aligned human genome reference sequence. We analyzed the first 1 million reads mapped to chr1 for each sample, using only reads without any alignment indels or clipping (with a CIGAR string of ‘75M’ in the BAM file). For each read, we retrieved the corresponding aligned reference sequence using its mapped chromosomal position in the BAM file. The rates for each nucleotide substitution type were then calculated as the ratio of the total number of observed changes of that type and the total number of reads, for each position along the reads. Because reads mapping to the reverse strand of the reference are reverse complemented in the BAM files, we performed the analysis separately for forward and reverse strand mapping reads. Reverse mapping reads therefore show the complementary substitution patterns at the 3′ end to the forward mapping reads at the 5′ end.
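The per-position rate calculation described above can be sketched as follows. This is a minimal illustration that assumes the read and reference sequences have already been extracted as equal-length strings (for example, from reads with a gap-free, unclipped alignment); the function name, the input format, and the decision to ignore positions containing N are assumptions, not details given in the text. Forward- and reverse-strand reads would be passed in separately.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def substitution_rates(
    aligned_pairs: Iterable[Tuple[str, str]],
    read_length: int,
) -> Dict[str, List[float]]:
    """Per-position substitution rates, keyed by 'ref>read' type (e.g. 'C>T').

    Each rate is the number of observed changes of that type at a position
    divided by the total number of reads analyzed.
    """
    counts: Dict[str, List[int]] = defaultdict(lambda: [0] * read_length)
    n_reads = 0
    for read_seq, ref_seq in aligned_pairs:
        if len(read_seq) != read_length or len(ref_seq) != read_length:
            continue  # keep only full-length, gap-free alignments
        n_reads += 1
        for pos, (read_base, ref_base) in enumerate(zip(read_seq, ref_seq)):
            if read_base != ref_base and "N" not in (read_base, ref_base):
                counts[f"{ref_base}>{read_base}"][pos] += 1
    if n_reads == 0:
        raise ValueError("no usable reads")
    return {sub: [c / n_reads for c in per_pos] for sub, per_pos in counts.items()}


# Hypothetical toy input: (read sequence, aligned reference sequence).
pairs = [("TCGT", "CCGT"), ("ACGT", "ACGT"), ("TCGA", "CCGT"), ("ACG", "ACG")]
rates = substitution_rates(pairs, read_length=4)
```

In the toy input, two of the three full-length reads carry a C-to-T change at the first position, so `rates["C>T"][0]` is two thirds; the three-base pair is skipped because it is not full length.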
To perform principal component analysis we used SNP genotypes for individuals from several populations and the EIGENSOFT software. We used 11 KhoeSan individuals from our dataset (excluding SA011 and SA012 from Family 1 and SA052 and SA054 from Family 2), 4 Namibian KhoeSan individuals from Schuster et al., 6 Namibian San (Ju|’hoansi) from the Human Genome Diversity Project (Martin et al, in prep. SRP036155), and 13 individuals from each of the ASW, GBR, LWK, and YRI populations from the 1000 Genomes Project. Closely related individuals were excluded from all datasets. Sample ‘ABT’ was excluded from Schuster et al.’s dataset since it clustered with the Bantu-speaking populations in their analyses. Individuals selected from the 1000 Genomes Project all had more than 20x coverage for at least 70% of exome targets. To account for differences in coverage and target regions, variants included in this analysis had genotype information for at least 95% of the individuals for a given analysis. VCFtools was used to count the number of shared and private SNPs between populations.
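The 95% genotype-completeness criterion can be expressed as a simple call-rate filter. The sketch below is illustrative only: the data structure (a mapping from a variant identifier to a list of genotype calls, with None for a missing call) and the function name are assumptions and do not reflect the tools used in the analysis.

```python
from typing import Dict, List, Optional


def filter_by_call_rate(
    variants: Dict[str, List[Optional[int]]],
    min_call_rate: float = 0.95,
) -> Dict[str, List[Optional[int]]]:
    """Keep variants genotyped in at least min_call_rate of individuals."""
    kept = {}
    for variant_id, genotypes in variants.items():
        if not genotypes:
            continue  # no individuals, nothing to evaluate
        called = sum(gt is not None for gt in genotypes)
        if called / len(genotypes) >= min_call_rate:
            kept[variant_id] = genotypes
    return kept


# Hypothetical example: v1 is called in 96 of 100 individuals, v2 in 90 of 100.
example = {
    "v1": [0] * 96 + [None] * 4,
    "v2": [1] * 90 + [None] * 10,
}
retained = filter_by_call_rate(example)
```

Only `v1` is retained here, since `v2` falls below the 95% threshold.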
To analyze the whole-exome data, all read-pairs that mapped within hg19 coordinates, chr6:28702021-33392022, chr19:55228188-55383188 and chr19_gl000209_random, were extracted using SAMtools 0.1.18 and split into separate fastq files for each individual. Read-pairs having more than five bases of quality score ≤3 were removed (FASTX Toolkit 0.0.13 [http://hannonlab.cshl.edu/fastx_toolkit/]). The analysis pipeline was designed to detect all known and any novel HLA class I and KIR SNP variants. Using Bowtie (version 0.12.7), read-pairs were harvested by mapping with low-stringency to a given HLA or KIR gene (positive filter). To ensure specificity, pairs that mapped to any homologous gene or pseudogene were removed (negative filter). The remaining reads were then aligned to a final reference sequence and the SNP variants ascertained using SAMtools/bcf. Data used to generate filters and reference sequences was obtained from the ImmunoPolymorphism Database and a set of fully-sequenced KIR haplotypes [61–63]. To accommodate the high divergence of HLA exons 2 and 3, the final alignments were made to reference sequences matching individual HLA-A, -B and -C genotypes. HLA-A, -B and -C reference alleles were determined using bead-based sequence specific oligonucleotide probe hybridization and were described in . The “-phase” function of SAMtools was used to attribute phase for local alignments where possible due to the close proximity of exons and/or presence of highly heterozygous sequence (e.g. exons 2 and 3 of HLA class I). Post-filtered read depth was used to determine presence or absence of the variable-content KIR genes. The KIR genes present and their alleles were determined for comparison of eight of the individuals using pyrosequencing methods as previously described. Individual SNP genotypes were confirmed visually from independent alignments of the filtered reads, which were created using MIRA 3 [65, 66].
All newly-discovered variants were confirmed for sequence and phase using standard Sanger sequencing plus one or more of pyrosequencing, DNA cloning or segregation in families.
We searched for genetic signatures of non-human organisms by adopting the fragment recruitment approach outlined by Rusch et al. (Figure 1). Starting from the set of reads that did not map to the human genome, we first used prinseq to trim reads and to remove exact duplicates and low-quality reads (i.e., reads that meet any of the following conditions: mean quality score less than 25, length less than 50 bp, presence of ambiguous bases). We then compared the remaining high-quality unmapped reads to 1,285 genomes (Additional file 2: Table S4) obtained from the Joint Genome Institute’s Integrated Microbial Genomes (IMG) database. In the case of species that have multiple genome-sequenced individuals, we randomly selected a single individual genome to represent the species group. Each read was aligned to each genome using blast (blastall -p blastn -z 16300000000 -e 0.01 -m 8) and the resulting alignment summary statistics were used to infer each read’s taxonomy. We explored several classification thresholds, including alignment e-value, alignment percent identity, and the ratio between the alignment length and the read length (i.e., coverage). We adopted several levels of threshold stringency to recruit reads to genomes for the purposes of inferring taxonomic diversity. Our thresholds were similar to those used in Rusch et al., with modifications to account for the short length of our sequences.
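The read-level quality criteria can be expressed as a small filter. This is a sketch of the stated rules only, not of prinseq itself; it assumes reads are held in memory as (identifier, sequence, Phred scores) tuples, treats any base other than A, C, G or T as ambiguous, and defines exact duplicates by identical sequence.

```python
from typing import Iterable, Iterator, List, Set, Tuple

Read = Tuple[str, str, List[int]]  # (read id, sequence, per-base Phred scores)


def is_high_quality(seq: str, quals: List[int],
                    min_mean_q: float = 25.0, min_len: int = 50) -> bool:
    """Apply the three rejection rules: short, ambiguous, or low mean quality."""
    if len(seq) < min_len or not quals:
        return False
    if any(base not in "ACGT" for base in seq.upper()):
        return False
    return sum(quals) / len(quals) >= min_mean_q


def filter_reads(reads: Iterable[Read]) -> Iterator[Read]:
    """Yield high-quality reads, keeping the first copy of each sequence."""
    seen: Set[str] = set()
    for read_id, seq, quals in reads:
        if not is_high_quality(seq, quals):
            continue
        if seq in seen:  # exact duplicate of an earlier read
            continue
        seen.add(seq)
        yield read_id, seq, quals


# Made-up reads for illustration.
good = ("r1", "ACGT" * 15, [30] * 60)
dup = ("r2", "ACGT" * 15, [35] * 60)
short = ("r3", "ACGT" * 10, [40] * 40)
ambiguous = ("r4", "ACGN" * 15, [40] * 60)
low_q = ("r5", "TTGCA" * 12, [20] * 60)
kept_ids = [r[0] for r in filter_reads([good, dup, short, ambiguous, low_q])]
```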
In the lenient case (i.e., distant homology), a read was recruited to a genome if the two sequences shared a local alignment having at least 50% sequence identity. Using these parameters we identified 5,060,454 unmapped sequences (20.9% of total unmapped reads) that exhibit significant similarity to the collection of reference genomes. In the stringent case (i.e., recent homology), a read was recruited to a genome if the alignment covered at least 75% of the read and the sequences had at least 80% identity. Applying these thresholds, we found that 16.8% of the reads (N = 4,064,899) could be recruited to non-human genomes.
To conduct species-level binning, we applied the aforementioned coverage thresholds, but required that the read and target genome share at least 95% identity. In all cases of classification, we applied an e-value threshold of 10⁻³. We inferred a read’s taxonomy by transferring the taxonomic annotation of the genome sequence that produced the best alignment score while also passing the classification thresholds. If a read could not be placed into a species group based on the reference genomes, it was discarded from the subsequent diversity analyses. The IMG taxonomic annotations associated with the reference database genomes were used to assign species-level binned reads into genera and phyla.
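The three recruitment levels can be sketched as a filter over BLAST tabular (`-m 8`) output followed by a best-hit selection. The sketch makes several assumptions that the text does not spell out: the bit score is taken as the "best alignment score", the thresholds are inclusive, coverage is the alignment length divided by the read length, and read lengths are supplied separately. All names are illustrative.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class Hit:
    read_id: str
    genome_id: str
    pct_identity: float
    aln_length: int
    evalue: float
    bit_score: float


def parse_m8_line(line: str) -> Hit:
    """Parse one line of 12-column BLAST tabular output."""
    f = line.rstrip("\n").split("\t")
    return Hit(f[0], f[1], float(f[2]), int(f[3]), float(f[10]), float(f[11]))


# level -> (minimum percent identity, minimum fraction of the read aligned)
THRESHOLDS = {"lenient": (50.0, 0.0), "stringent": (80.0, 0.75),
              "species": (95.0, 0.75)}
MAX_EVALUE = 1e-3


def passes(hit: Hit, read_length: int, level: str) -> bool:
    min_identity, min_coverage = THRESHOLDS[level]
    coverage = hit.aln_length / read_length
    return (hit.evalue <= MAX_EVALUE
            and hit.pct_identity >= min_identity
            and coverage >= min_coverage)


def best_genome(hits: Iterable[Hit], read_lengths: Dict[str, int],
                level: str) -> Dict[str, str]:
    """Map each read to the genome of its best-scoring passing hit."""
    best: Dict[str, Hit] = {}
    for hit in hits:
        if not passes(hit, read_lengths[hit.read_id], level):
            continue
        current = best.get(hit.read_id)
        if current is None or hit.bit_score > current.bit_score:
            best[hit.read_id] = hit
    return {read_id: h.genome_id for read_id, h in best.items()}


# Made-up alignments for two 100 bp reads.
lines = [
    "r1\tgenomeA\t96.00\t90\t4\t0\t1\t90\t100\t189\t1e-30\t150",
    "r1\tgenomeB\t99.00\t60\t1\t0\t1\t60\t5\t64\t1e-20\t110",
    "r2\tgenomeC\t85.00\t80\t12\t0\t1\t80\t7\t86\t1e-10\t90",
]
hits = [parse_m8_line(l) for l in lines]
lengths = {"r1": 100, "r2": 100}
species_bins = best_genome(hits, lengths, "species")
stringent_bins = best_genome(hits, lengths, "stringent")
```

Reads absent from the returned mapping at the species level correspond to those discarded from the diversity analyses.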
To quantify genus-level saliva microbiome abundances among healthy Americans, we downloaded high-quality, taxonomically annotated V35 16S rRNA Roche amplicon sequences associated with 294 saliva samples from the Human Microbiome Project (HMP) Data Analysis and Coordination Center (http://www.hmpdacc.org/). A prior study used the Ribosomal Database Project classifier (v2.2) with the default 032010 training set and taxonomy to annotate these sequences. Genus-level taxonomic assignments were extracted for each sequence having a bootstrap statistic greater than 80%.
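The bootstrap cut-off and the resulting genus-level abundances can be sketched as follows. The input layout (genus name paired with a bootstrap value expressed as a fraction) and the example values are assumptions for illustration; they are not HMP data.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def genus_relative_abundance(
    assignments: Iterable[Tuple[str, float]],
    min_bootstrap: float = 0.80,
) -> Dict[str, float]:
    """Relative abundance per genus, using only confident assignments."""
    # Strictly greater than the cut-off, following "greater than 80%".
    counts = Counter(genus for genus, boot in assignments if boot > min_bootstrap)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {genus: n / total for genus, n in counts.items()}


# Made-up classifier output for one sample: (genus, bootstrap support).
calls = [
    ("Prevotella", 0.95),
    ("Streptococcus", 0.99),
    ("Prevotella", 0.85),
    ("Rothia", 0.80),      # not above the cut-off, so excluded
    ("Neisseria", 0.50),   # low support, excluded
]
abundance = genus_relative_abundance(calls)
```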
Availability of supporting data
VCF files are available at http://ecoevo.stonybrook.edu/hennlab/data/. Raw read data can be downloaded from the Short Read Archive (SRP038015 for saliva-derived exomes and genomes, and SRP036155 for HGDP San exomes). SNP variants have been deposited in dbSNP (SS974432427-SS974514519). Novel KIR alleles have been deposited in GenBank and assigned Immuno Polymorphism Database nomenclature as follows:
JX523651 (3DL3*057), GQ924778 (3DL3*037), GQ924779 (3DL3*038), GQ924781 (3DL3*040), HM235773 (3DL3*041), JX523631 (2DL2*012), JX523638 (2DL5B*00803), JX523639 (2DL5B*018), JX523640 (2DS3*007), JX523642 (2DS5*012), HM358896 (2DS5*0502), JX523648 (2DP1*00103), JX523646 (2DP1*00202), JX523647 (2DP1*011), JX523644 (2DP1*012), JX523645 (2DP1*013), JX523643 (2DP1*014), JX523630 (2DL1*026 N), GU323355 (2DL1*022), JX523652 (3DP1*011), JX523655 (3DP1*012), JX523653 (3DP1*013), JX523654 (3DP1*014), JX523634 (2DL4*024), JX523637 (2DL4*027), GQ890695 (3DL1*070), GQ890697 (3DL1*071), GU323347 (3DL2*052), GU323348 (3DL2*053), GU323349 (3DL2*054), JX523649 (3DL2*063)
We extend our gratitude to Blanca Herrera for assistance with genome sequencing. We thank Joanna Mountain, Julie Granka, Marlo Möller and Cedric Werely for their help with sample collection. Finally, we express our appreciation to the ≠Khomani San community for participation in our research projects. B.M.H. and C.D.B. are supported by NIH grant R01HG003229. C.R.G. is supported by the UCSF Dissertation Year Fellowship and NIH grants T32GM007175 and T32HG000044. T.J.S. and K.S.P. are supported by NSF grant DMS-1069303, the San Simeon Fund, and institutional funding from Gladstone Institutes. P.J.N., N.N-G. and P.P. were supported by NIH grant AI17892. J.D.W. was supported by NIH grant R01HG400409. J.M.K. was supported by NIH grant 1DP5OD009154. A.R.M. was supported by NIH training grant GM007790.
- Liu J, Morgan M, Hutchison K, Calhoun VD: A study of the influence of sex on genome wide methylation. PLoS One. 2010, 5 (4): e10028-10.1371/journal.pone.0010028.
- Henn BM, Gignoux CR, Jobin M, Granka JM, Macpherson JM, Kidd JM, Rodríguez-Botigué L, Ramachandran S, Hon L, Brisbin A, Lin AA, Underhill PA, Comas D, Kidd KK, Norman PJ, Parham P, Bustamante CD, Mountain JL, Feldman MW: Hunter-gatherer genomic diversity suggests a southern African origin for modern humans. Proc Natl Acad Sci U S A. 2011, 108 (13): 5154-5162. 10.1073/pnas.1017511108.
- Kurek KC, Luks VL, Ayturk UM, Alomari AI, Fishman SJ, Spencer SA, Mulliken JB, Bowen ME, Yamamoto GL, Kozakewich HP, Warman ML: Somatic mosaic activating mutations in PIK3CA cause CLOVES syndrome. Am J Hum Genet. 2012, 90 (6): 1108-1115. 10.1016/j.ajhg.2012.05.006.
- Deng X: SeqGene: a comprehensive software solution for mining exome- and transcriptome- sequencing data. BMC Bioinforma. 2011, 12: 267-10.1186/1471-2105-12-267.
- Shearer AE, Hildebrand MS, Smith RJ: Solution-based targeted genomic enrichment for precious DNA samples. BMC Biotechnol. 2012, 12: 20-10.1186/1472-6750-12-20.
- Kitzman JO, Snyder MW, Ventura M, Lewis AP, Qiu R, Simmons LE, Gammill HS, Rubens CE, Santillan DA, Murray JC, Tabor HK, Bamshad MJ, Eichler EE, Shendure J: Noninvasive whole-genome sequencing of a human fetus. Sci Transl Med. 2012, 4 (137): 76.
- Patel ZH, Kottyan LC, Lazaro S, Williams MS, Ledbetter DH, Tromp H, Rupert A, Kohram M, Wagner M, Husami A, Qian Y, Valencia CA, Zhang K, Hostetter MK, Harley JB, Kaufman KM: The struggle to find reliable results in exome sequencing data: filtering out Mendelian errors. Front Genet. 2014, 5: 16.
- Teer JK, Mullikin JC: Exome sequencing: the sweet spot before whole genomes. Hum Mol Genet. 2010, 19 (R2): R145-R151. 10.1093/hmg/ddq333.
- Bamshad MJ, Ng SB, Bigham AW, Tabor HK, Emond MJ, Nickerson DA, Shendure J: Exome sequencing as a tool for Mendelian disease gene discovery. Nat Rev Genet. 2011, 12 (11): 745-755. 10.1038/nrg3031.
- Bustamante CD, Fledel-Alon A, Williamson S, Nielsen R, Hubisz MT, Glanowski S, Tanenbaum DM, White TJ, Sninsky JJ, Hernandez RD, Civello D, Adams MD, Cargill M, Clark AG: Natural selection on protein-coding genes in the human genome. Nature. 2005, 437 (7062): 1153-1157. 10.1038/nature04240.
- Tennessen JA, Madeoy J, Akey JM: Signatures of positive selection apparent in a small sample of human exomes. Genome Res. 2010, 20 (10): 1327-1334. 10.1101/gr.106161.110.
- Yi X, Liang Y, Huerta-Sanchez E, Jin X, Cuo ZX, Pool JE, Xu X, Jiang H, Vinckenbosch N, Korneliussen TS, Zheng H, Liu T, He W, Li K, Luo R, Nie X, Wu H, Zhao M, Cao H, Zou J, Shan Y, Li S, Yang Q, Asan, Ni P, Tian G, Xu J, Liu X, Jiang T, Wu R, et al: Sequencing of 50 human exomes reveals adaptation to high altitude. Science. 2010, 329 (5987): 75-78. 10.1126/science.1190371.
- Rylander-Rudqvist T, Håkansson N, Tybring G, Wolk A: Quality and quantity of saliva DNA obtained from the self-administrated oragene method–a pilot study on the cohort of Swedish men. Cancer Epidemiol Biomarkers Prev. 2006, 15 (9): 1742-1745. 10.1158/1055-9965.EPI-05-0706.
- Hansen TV, Simonsen MK, Nielsen FC, Hundrup YA: Collection of blood, saliva, and buccal cell samples in a pilot study on the Danish nurse cohort: comparison of the response rate and quality of genomic DNA. Cancer Epidemiol Biomarkers Prev. 2007, 16 (10): 2072-2076. 10.1158/1055-9965.EPI-07-0611.
- Schuster SC, Miller W, Ratan A, Tomsho LP, Giardine B, Kasson LR, Harris RS, Petersen DC, Zhao F, Qi J, Alkan C, Kidd JM, Sun Y, Drautz DI, Bouffard P, Muzny DM, Reid JG, Nazareth LV, Wang Q, Burhans R, Riemer C, Wittekindt NE, Moorjani P, Tindall EA, Danko CG, Teo WS, Buboltz AM, Zhang Z, Ma Q, Oosthuysen A, et al: Complete Khoisan and Bantu genomes from southern Africa. Nature. 2010, 463 (7283): 943-947. 10.1038/nature08795.
- Gronau I, Hubisz MJ, Gulko B, Danko CG, Siepel A: Bayesian inference of ancient human demography from individual genome sequences. Nat Genet. 2011, 43 (10): 1031-1034. 10.1038/ng.937.
- Asan, Xu Y, Jiang H, Tyler-Smith C, Xue Y, Jiang T, Wang J, Wu M, Liu X, Tian G, Wang J, Wang J, Yang H, Zhang X: Comprehensive comparison of three commercial human whole-exome capture platforms. Genome Biol. 2011, 12 (9): R95-10.1186/gb-2011-12-9-r95.
- Clark MJ, Chen R, Lam HY, Karczewski KJ, Chen R, Euskirchen G, Butte AJ, Snyder M: Performance comparison of exome DNA sequencing technologies. Nat Biotechnol. 2011, 29 (10): 908-914. 10.1038/nbt.1975.
- Briggs AW, Stenzel U, Johnson PL, Green RE, Kelso J, Prüfer K, Meyer M, Krause J, Ronan MT, Lachmann M, Pääbo S: Patterns of damage in genomic DNA sequences from a Neandertal. Proc Natl Acad Sci U S A. 2007, 104 (37): 14616-14621. 10.1073/pnas.0704665104.
- Stoneking M, Krause J: Learning about human population history from ancient and modern genomes. Nat Rev Genet. 2011, 12 (9): 603-614. 10.1038/nrg3029.
- Ginolhac A, Rasmussen M, Gilbert MT, Willerslev E, Orlando L: mapDamage: testing for damage patterns in ancient DNA sequences. Bioinformatics. 2011, 27 (15): 2153-2155. 10.1093/bioinformatics/btr347.
- Altshuler DM, Gibbs RA, Peltonen L, Altshuler DM, Gibbs RA, Peltonen L, Dermitzakis E, Schaffner SF, Yu F, Peltonen L, Dermitzakis E, Bonnen PE, Altshuler DM, Gibbs RA, de Bakker PI, Deloukas P, Gabriel SB, Gwilliam R, Hunt S, Inouye M, Jia X, Palotie A, Parkin M, Whittaker P, Yu F, Chang K, Hawes A, Lewis LR, Ren Y, International HapMap 3 Consortium, et al: Integrating common and rare genetic variation in diverse human populations. Nature. 2010, 467 (7311): 52-58. 10.1038/nature09298.
- Abecasis GR, Auton A, Brooks LD, DePristo MA, Durbin RM, Handsaker RE, Kang HM, Marth GT, McVean GA, 1000 Genomes Project Consortium: An integrated map of genetic variation from 1,092 human genomes. Nature. 2012, 491 (7422): 56-65. 10.1038/nature11632.
- DePristo MA, Banks E, Poplin R, Garimella KV, Maguire JR, Hartl C, Philippakis AA, del Angel G, Rivas MA, Hanna M, McKenna A, Fennell TJ, Kernytsky AM, Sivachenko AY, Cibulskis K, Gabriel SB, Altshuler D, Daly MJ: A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet. 2011, 43 (5): 491-498. 10.1038/ng.806.
- McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, Garimella K, Altshuler D, Gabriel S, Daly M, DePristo MA: The genome analysis toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 2010, 20 (9): 1297-1303. 10.1101/gr.107524.110.
- Schlebusch CM, Skoglund P, Sjödin P, Gattepaille LM, Hernandez D, Jay F, Li S, De Jongh M, Singleton A, Blum MG, Soodyall H, Jakobsson M: Genomic variation in seven Khoe-San groups reveals adaptation and complex African history. Science. 2012, 338 (6105): 374-379. 10.1126/science.1227721.
- Pickrell JK, Patterson N, Barbieri C, Berthold F, Gerlach L, Güldemann T, Kure B, Mpoloka SW, Nakagawa H, Naumann C, Lipson M, Loh PR, Lachance J, Mountain J, Bustamante CD, Berger B, Tishkoff SA, Henn BM, Stoneking M, Reich D, Pakendorf B: The genetic prehistory of southern Africa. Nat Commun. 2012, 3: 1143.
- Patterson N, Price AL, Reich D: Population structure and eigenanalysis. PLoS Genet. 2006, 2 (12): e190-10.1371/journal.pgen.0020190.
- Parham P: MHC class I molecules and KIRs in human history, health and survival. Nat Rev Immunol. 2005, 5 (3): 201-214. 10.1038/nri1570.
- Parham P, Norman PJ, Abi-Rached L, Hilton HG, Guethlein LA: Review: immunogenetics of human placentation. Placenta. 2012, 33 (Suppl): S71-S80.
- Rusch DB, Halpern AL, Sutton G, Heidelberg KB, Williamson S, Yooseph S, Wu D, Eisen JA, Hoffman JM, Remington K, Beeson K, Tran B, Smith H, Baden-Tillson H, Stewart C, Thorpe J, Freeman J, Andrews-Pfannkoch C, Venter JE, Li K, Kravitz S, Heidelberg JF, Utterback T, Rogers YH, Falcón LI, Souza V, Bonilla-Rosso G, Eguiarte LE, Karl DM, Sathyendranath S, et al: The Sorcerer II Global Ocean Sampling expedition: Northwest Atlantic through eastern tropical Pacific. PLoS Biol. 2007, 5 (3): e77-10.1371/journal.pbio.0050077.
- Konstantinidis KT, Ramette A, Tiedje JM: The bacterial species definition in the genomic era. Philos Trans R Soc Lond B Biol Sci. 2006, 361 (1475): 1929-1940. 10.1098/rstb.2006.1920.
- Yanagisawa M, Kuriyama T, Williams DW, Nakagawa K, Karasawa T: Proteinase activity of Prevotella species associated with oral purulent infection. Curr Microbiol. 2006, 52 (5): 375-378. 10.1007/s00284-005-0261-1.
- Peng Z, Fives-Taylor P, Ruiz T, Zhou M, Sun B, Chen Q, Wu H: Identification of critical residues in Gap3 of Streptococcus parasanguinis involved in Fap1 glycosylation, fimbrial formation and in vitro adhesion. BMC Microbiol. 2008, 8: 52-10.1186/1471-2180-8-52.
- Ohara-Nemoto Y, Kishi K, Satho M, Tajika S, Sasaki M, Namioka A, Kimura S: Infective endocarditis caused by Granulicatella elegans originating in the oral cavity. J Clin Microbiol. 2005, 43 (3): 1405-1407. 10.1128/JCM.43.3.1405-1407.2005.
- Gibson FC, Hong C, Chou HH, Yumoto H, Chen J, Lien E, Wong J, Genco CA: Innate immune recognition of invasive bacteria accelerates atherosclerosis in apolipoprotein E-deficient mice. Circulation. 2004, 109 (22): 2801-2806. 10.1161/01.CIR.0000129769.17895.F0.
- Zeituni AE, Carrion J, Cutler CW: Porphyromonas gingivalis-dendritic cell interactions: consequences for coronary artery disease. J Oral Microbiol. 2010, 2: 5782.
- Ihara H, Miura T, Kato T, Ishihara K, Nakagawa T, Yamada S, Okuda K: Detection of Campylobacter rectus in periodontitis sites by monoclonal antibodies. J Periodontal Res. 2003, 38 (1): 64-72. 10.1034/j.1600-0765.2003.01627.x.
- Chimusa ER, Zaitlen N, Daya M, Möller M, van Helden PD, Mulder NJ, Price AL, Hoal EG: Genome-wide association study of ancestry-specific TB risk in the South African Coloured population. Hum Mol Genet. 2013, 23: 796-10.1093/hmg/ddt462.
- Koren O, Knights D, Gonzalez A, Waldron L, Segata N, Knight R, Huttenhower C, Ley RE: A guide to enterotypes across the human body: meta-analysis of microbial community structures in human microbiome datasets. PLoS Comput Biol. 2013, 9 (1): e1002863-10.1371/journal.pcbi.1002863.
- Human Microbiome Project Consortium: A framework for human microbiome research. Nature. 2012, 486 (7402): 215-221. 10.1038/nature11209.
- Hodzic E, Snyder S: A case of peritonitis due to Rothia mucilaginosa. Perit Dial Int. 2010, 30 (3): 379-380. 10.3747/pdi.2009.00146.
- Pinsky RL, Piscitelli V, Patterson JE: Endocarditis caused by relatively penicillin-resistant Stomatococcus mucilaginosus. J Clin Microbiol. 1989, 27 (1): 215-216.
- Liu Y, Li J: Short regions of sequence identity between the genomes of bacteria and human. Curr Microbiol. 2011, 62 (3): 770-776. 10.1007/s00284-010-9783-2.
- Bodi K, Perera AG, Adams PS, Bintzler D, Dewar K, Grove DS, Kieleczawa J, Lyons RH, Neubert TA, Noll AC, Singh S, Steen R, Zianni M: Comparison of commercially available target enrichment methods for next-generation sequencing. J Biomol Tech. 2013, 24 (2): 73-86.
- Parla JS, Iossifov I, Grabill I, Spector MS, Kramer M, McCombie WR: A comparative analysis of exome capture. Genome Biol. 2011, 12 (9): R97-10.1186/gb-2011-12-9-r97.
- Larkin JM, Strohl WR: Beggiatoa, Thiothrix, and Thioploca. Annu Rev Microbiol. 1983, 37: 341-367. 10.1146/annurev.mi.37.100183.002013.
- Lazarevic V, Whiteson K, Hernandez D, François P, Schrenzel J: Study of inter- and intra-individual variations in the salivary microbiota. BMC Genomics. 2010, 11: 523-10.1186/1471-2164-11-523.
- Consortium HMP: Structure, function and diversity of the healthy human microbiome. Nature. 2012, 486 (7402): 207-214. 10.1038/nature11234.
- Henderson B, Ward JM, Ready D: Aggregatibacter (Actinobacillus) actinomycetemcomitans: a triple A* periodontopathogen?. Periodontol 2000. 2010, 54 (1): 78-105. 10.1111/j.1600-0757.2009.00331.x.
- Nasidze I, Li J, Schroeder R, Creasey JL, Li M, Stoneking M: High diversity of the saliva microbiome in Batwa Pygmies. PLoS One. 2011, 6 (8): e23352-10.1371/journal.pone.0023352.
- Nasidze I, Li J, Quinque D, Tang K, Stoneking M: Global diversity in the human salivary microbiome. Genome Res. 2009, 19 (4): 636-643. 10.1101/gr.084616.108.
- Yatsunenko T, Rey FE, Manary MJ, Trehan I, Dominguez-Bello MG, Contreras M, Magris M, Hidalgo G, Baldassano RN, Anokhin AP, Heath AC, Warner B, Reeder J, Kuczynski J, Caporaso JG, Lozupone CA, Lauber C, Clemente JC, Knights D, Knight R, Gordon JI: Human gut microbiome viewed across age and geography. Nature. 2012, 486 (7402): 222-227.
- Consortium GP: A map of human genome variation from population-scale sequencing. Nature. 2010, 467 (7319): 1061-1073. 10.1038/nature09534.
- Li H, Durbin R: Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics. 2009, 25 (14): 1754-1760. 10.1093/bioinformatics/btp324.
- Lassmann T, Hayashizaki Y, Daub CO: SAMStat: monitoring biases in next generation sequencing data. Bioinformatics. 2011, 27 (1): 130-131. 10.1093/bioinformatics/btq614.
- Cann HM, de Toma C, Cazes L, Legrand MF, Morel V, Piouffre L, Bodmer J, Bodmer WF, Bonne-Tamir B, Cambon-Thomsen A, Chen Z, Chu J, Carcassi C, Contu L, Du R, Excoffier L, Ferrara GB, Friedlaender JS, Groot H, Gurwitz D, Jenkins T, Herrera RJ, Huang X, Kidd J, Kidd KK, Langaney A, Lin AA, Mehdi SQ, Parham P, Piazza A: A human genome diversity cell line panel. Science. 2002, 296 (5566): 261-262.
- Danecek P, Auton A, Abecasis G, Albers CA, Banks E, DePristo MA, Handsaker RE, Lunter G, Marth GT, Sherry ST, McVean G, Durbin R, 1000 Genomes Project Analysis Group: The variant call format and VCFtools. Bioinformatics. 2011, 27 (15): 2156-2158. 10.1093/bioinformatics/btr330.
- Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R, 1000 Genome Project Data Processing Subgroup: The sequence alignment/map format and SAMtools. Bioinformatics. 2009, 25 (16): 2078-2079. 10.1093/bioinformatics/btp352.
- Langmead B, Trapnell C, Pop M, Salzberg SL: Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 2009, 10 (3): R25-10.1186/gb-2009-10-3-r25.
- Wilson MJ, Torkar M, Haude A, Milne S, Jones T, Sheer D, Beck S, Trowsdale J: Plasticity in the organization and sequences of human KIR/ILT gene families. Proc Natl Acad Sci U S A. 2000, 97 (9): 4778-4783. 10.1073/pnas.080588597.
- Pyo CW, Guethlein LA, Vu Q, Wang R, Abi-Rached L, Norman PJ, Marsh SG, Miller JS, Parham P, Geraghty DE: Different patterns of evolution in the centromeric and telomeric regions of group A and B haplotypes of the human killer cell Ig-like receptor locus. PLoS One. 2010, 5 (12): e15115-10.1371/journal.pone.0015115.
- Robinson J, Mistry K, McWilliam H, Lopez R, Marsh SG: IPD–the immuno polymorphism database. Nucleic Acids Res. 2010, 38 (Database issue): D863-D869.
- Norman PJ, Hollenbach JA, Nemat-Gorgani N, Guethlein LA, Hilton HG, Pando MJ, Koram KA, Riley EM, Abi-Rached L, Parham P: Co-evolution of human leukocyte antigen (HLA) class I ligands with killer-cell immunoglobulin-like receptors (KIR) in a genetically diverse population of sub-Saharan Africans. PLoS Genet. 2013, 9 (10): e1003938-10.1371/journal.pgen.1003938.
- Chevreux B, Pfisterer T, Drescher B, Driesel AJ, Müller WE, Wetter T, Suhai S: Using the miraEST assembler for reliable and automated mRNA transcript assembly and SNP detection in sequenced ESTs. Genome Res. 2004, 14 (6): 1147-1159. 10.1101/gr.1917404.
- Staden R, Beal KF, Bonfield JK: The Staden package, 1998. Methods Mol Biol. 2000, 132: 115-130.
- Schmieder R, Edwards R: Quality control and preprocessing of metagenomic datasets. Bioinformatics. 2011, 27 (6): 863-864. 10.1093/bioinformatics/btr026.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.