Graph mining for next generation sequencing: leveraging the assembly graph for biological insights
© Warnke-Sommer and Ali. 2016
Received: 9 January 2016
Accepted: 22 April 2016
Published: 6 May 2016
The assembly of Next Generation Sequencing (NGS) reads remains a challenging task. This is especially true for the assembly of metagenomics data that originate from environmental samples potentially containing hundreds to thousands of unique species. The principle objective of current assembly tools is to assemble NGS reads into contiguous stretches of sequence called contigs while maximizing for both accuracy and contig length. The end goal of this process is to produce longer contigs with the major focus being on assembly only. Sequence read assembly is an aggregative process, during which read overlap relationship information is lost as reads are merged into longer sequences or contigs. The assembly graph is information rich and capable of capturing the genomic architecture of an input read data set. We have developed a novel hybrid graph in which nodes represent sequence regions at different levels of granularity. This model, utilized in the assembly and analysis pipeline Focus, presents a concise yet feature rich view of a given input data set, allowing for the extraction of biologically relevant graph structures for graph mining purposes.
Focus was used to create hybrid graphs to model metagenomics data sets obtained from the gut microbiomes of five individuals with Crohn’s disease and eight healthy individuals. Repetitive and mobile genetic elements are found to be associated with hybrid graph structure. Using graph mining techniques, a comparative study of the Crohn’s disease and healthy data sets was conducted with focus on antibiotics resistance genes associated with transposase genes. Results demonstrated significant differences in the phylogenetic distribution of categories of antibiotics resistance genes in the healthy and diseased patients. Focus was also evaluated as a pure assembly tool and produced excellent results when compared against the Meta-velvet, Omega, and UD-IDBA assemblers.
Mining the hybrid graph can reveal biological phenomena captured by its structure. We demonstrate the advantages of considering assembly graphs as data-mining support in addition to their role as frameworks for assembly.
KeywordsNext generation sequence assembly Graph mining Metagenomics Crohn’s disease
Next Generation Sequencing (NGS) technologies have made it possible to directly sequence environmental samples to detect and analyze the components of biological communities. As a growing number of exciting discoveries are being made in this field of metagenomics, it is becoming increasingly clear that we are intricately connected to and influenced by the host of microorganisms known as the human microbiome. The importance of the human microbiome has been recognized, so much so that it has been referred to as the forgotten organ of the human body . The commensal and pathogenic microorganisms populating the human body have been found to play major roles in metabolism [2, 3], immune system maturation and modulation [4, 5], and even in the development of various types of cancers [6, 7].
In metagenomics studies, NGS machines produce short DNA sequences called reads which are randomly sampled at a very high coverage from environmental DNA. These reads are extremely short in comparison to the bulk DNA amount in environmental samples. Illumina technologies currently are capable of producing NGS reads anywhere from 125 bps to 300 bps in length with output capabilities ranging from 25 million to 3 billion reads per run . Reads produced by the 454 technologies are up to 1000 bps in length with data set sizes reaching 1 million reads . More recently introduced technologies such as PacBio are able to produce reads at longer read lengths exceeding 10 k bps in length .
The short length of current NGS reads makes it difficult to extract any useful information from any read individually. Therefore, multiple analytical approaches have been developed to organize, aggregate, and analyze short read sequences. The high coverage of next generation sequencing technologies means that the reads in a data set will be sampled from a biological sample such that many of them overlap. These overlap relationships can be used to order the reads into a representation of the original sequence region. For the purpose of facilitating downstream analysis, many applications require that read sets are assembled into longer stretches of sequence called contigs. Assembly tools often rely on the mathematical structure called a graph to organize and model the short sequencing reads. Two graph theoretic approaches are typically followed in assembly, overlap graph based approaches and de Bruijn graph based approaches . In the overlap graph based approaches, each read is mapped to a unique node in the overlap graph. If two reads overlap, then their corresponding nodes will be connected by an edge. An ordering of the reads is found by traversing the nodes in the overlap graph. The second graph-based approach relies on the de Bruijn graph as its graph theoretical foundation. In this approach each read is broken into all possible k-mers. The k-mers become edges in the de Bruijn graph. For each k-mer, its left and right k-1-mers become nodes in the de Bruijn graph. In this approach, a read ordering is found by traversing the edges in the de Bruijn graph.
Numerous different assembly tools have expanded on these two graph theoretic foundations introduced in the previous paragraph. The assemblers IDBA  and SPAdes  build and integrate de Bruijn graphs for multiple values of k. When k is too small this often results in many branches in the de Bruijn graph; however, when it is too large this results in gaps in the de Bruijn graph . These iterative de Bruijn graph approaches mitigate this problem by taking advantage of all values of k, resulting in longer produced contigs [12, 13]. Similarly, other assemblers such as SGA  have made use of the string graph, which simplifies the overlap graph by eliminating redundant edges [15, 16].
Assemblers optimized for single genome assembly are unlikely able to handle the complexities of metagenomics data sets. Metagenomics-specific assembly tools have been developed to address some of the challenges of metagenomics assembly including the presence of conserved and repetitive sequence regions, which introduce branching paths and tangles within the assembly graph. Assembly tools typically extend contigs along a maximal non-branching path in the assembly graph. A branch point in a path forces an assembler to either terminate contig extension or to select a branch with which to continue extension - the branch selected may or may not be correct, introducing error into the assembled contig. The metagenomics assemblers Omega and UD-IDBA analyze the read and kmer coverage differences between paths that compose branch points in the assembly graph for the purpose of resolving them [17, 18]. The assembler MAP integrates mate-pair information into the assembly graph to resolve these branch points . Machine learning has been used by the assembler MetaVelvet-SL to distinguish chimeric nodes from non-chimeric nodes .
The assembly of short reads is an aggregative process during which the global and local read relationship and therefore global and local genome architecture information is lost as reads are merged into flat contigs. In contrast, assembly graphs are information rich models that can capture features of the global architecture of the input genomic sequence  and have been mentioned in passing to be capable of capturing biological features such as conserved regions, rRNA operons, and horizontally transferred sequences . However, there have been a very limited number of studies demonstrating the assembly graph’s power as an information rich data-mining support especially in metagenomics. Instead, the primary goal of most assembly tools is to improve the assembly process to produce longer and more complete assemblies. In this research, an expanded assembly graph, which is called the hybrid graph, is shown to be an excellent data-mining support that can be used to extract structural signatures associated with biological features and make novel biological discoveries.
We have developed a novel hybrid graph model that represents different regions of sequence data at different levels of granularity . This hybrid graph model forms the foundation of the assembly and analysis pipeline called Focus. The model is constructed by creating a set of graphs produced by successive graph coarsening initialized on the original overlap graph. Nodes are integrated from different levels of the graph set into a hybrid graph to create a concise yet feature rich view of the input data set. Repeats and conserved intergenomic regions are reduced within the hybrid graph, while global architecture is preserved. Local read overlap relationships are maintained in earlier levels of the coarsened graph set.
Demonstrate that repetitive and transposable elements are associated with node characteristics. To facilitate efficient extraction of meaningful graph structures in this study, each node in the hybrid graph is assigned a Shannon’s index score to reflect the local diversity of the various species or sequence regions in which the sequence represented by the given node is conserved or repeated. The Shannon’s index captures the number of edges incident to a given node as well as their evenness or how equally their weights are distributed. In the hybrid graph, each node represents a contiguous sequence region. Edges represent overlap relationships that this sequence region has with other contiguous sequence regions. An edge between two given nodes is weighted according to the summation of the read overlap lengths between reads composing the first sequence region with the reads composing the other sequence region. If a sequence region is repeated multiple times in a single genome or is present in multiple species, it might follow that its representative node in the assembly graph will have multiple in and out edges representing different sequence regions or species. In contrast, a node that is part of a single path in the assembly graph might be representative of a unique genomic region.Bacterial transposons are mobile DNA segments that can independently replicate and insert themselves within the same chromosome or plasmid or into a different chromosome or plasmid . They have been implicated in the horizontal transfer of genes between different bacterial species. Transposase and integrase sequences are often a part of transposable elements and are commonly involved in their transfer. The simplest of bacterial transposons is the insertion sequence (IS) element shown in Fig. 1a, which is composed of two inverted repeats flanking genes necessary for transposition. The rDNA operon is a prevalent large repeat class in microbial genomes, ranging from 1–15 copies per genome . In this manuscript, it is shown that nodes assigned with transposase/integrase genes and rRNA operon DNA had a greater proportion of high Shannon’s index scores in comparison to nodes assigned with other gene categories from the SEED subsystems (q = 2.44 × 10−04; paired Wilcoxon tests).
Identify and characterize the phylogenetic distribution of antibiotic resistance gene classes associated with transposase/integrase sequences in healthy individuals and individuals with Crohn’s disease. The human microbiome has been described as a reservoir for antibiotic resistance genes [25, 26] and as a hot spot for horizontal gene transfer  between bacterial taxa. Antibiotic resistance genes are often found in bacterial composite transposons, which are composed of two IS elements flanking a protein coding sequence region as shown in Fig. 1b , allowing their rapid spread between bacterial groups. Crohn’s disease is a chronic disorder where the gastrointestinal tract is inflamed . Horizontal gene transfer has been suggested to be increased between pathogenic and commensal bacteria in inflamed gastrointestinal systems . Furthermore, this population is more likely to be treated with antibiotic regimens for secondary complications such as bacterial overgrowth and abscesses . Antibiotic use has been shown to increase antibiotic resistance in those to whom it is prescribed . Due to these issues, it is important to understand the antibiotic resistance genes present in populations affected by Crohn’s disease.
In a novel graph-mining approach, the structure of the hybrid graph is used to identify transposase/integrases sequences that might be located in multiple sequence regions (i.e. repeated in the same genome or distributed across multiple species) according to their assigned Shannon’s index score. Local graph exploration of the neighborhood surrounding these transposase and integrase sequences reveal associated antibiotic resistance genes. Clustering transposase sequences based upon their phylogenetic distribution obtained from the hybrid graph revealed several differences between the Crohn’s disease and healthy data set. Most transposase genes in the healthy data sets were clustered into a large Bacteroides group significantly enriched for tetracycline, macrolide-lincosamide-streptogramin B, and beta-lactamase antibiotic resistance genes. Transposase genes in the Crohn’s disease data sets were more diverse across phylogenetic groups including an Enterococcus cluster significantly enriched for aminoglycoside, macrolide, and streptogramin antibiotics resistance genes. This approach reveals clusters of genera for which transposase associated antibiotic classes are enriched and may provide insight into candidate bacterial groups in which horizontal gene transfer has occurred.
Perform a competitive assembly evaluation of the assembler against other well-known assembly tools. In addition to being a data-mining support, Focus is a strong assembly algorithm. In this study, a subset of the metagenomics data sets is assembled with Focus in a comparative study against the Omega , IDBA-UD , and MetaVelvet  assemblers. These assemblers were chosen for the comparison because they were metagenomics-specific assemblers. Furthermore, two of the assemblers, IDBA-UD and MetaVelvet, were based on the de Bruijn graph approach for assembly. One of these assemblers, IDBA-UD, is based on an iterative de Bruijn graph approach. The final assembler, Omega, is an overlap graph based assembler. These assemblers represent a wide range of graph-based approaches to which we compared Focus.
Results demonstrate the knowledge that can be obtained from structural features of the assembly graph. Nodes annotated with several genetic features that are distributed across multiple species or are often present in multiple copies (rRNA) have a significantly greater proportion of high Shannon’s index scores than other nodes in the hybrid graph. This reflects a greater number of unique sequences that overlap with the genomic regions of these particular nodes. Graph mining is also useful for comparative studies, allowing for the identification of distinct differences in composition of transposase associated antibiotic resistance genes in the Crohn’s disease and healthy data sets. The ability of the hybrid graph to reveal multiple genera that a given transposase sequence is present within may provide insights into the flow of horizontal gene transfer and antibiotic resistance gene spread in metagenomics samples. Graph mining is a powerful method of next generation sequencing data analysis in addition to assembly and read mapping methods.
Read preprocessor: The Focus preprocessor generates reverse complements of the input read data set and splits the reads into subsets for parallel read alignment. The preprocessor also provides options for fixed-length and quality based read trimming.
Pairwise read alignment: In the read alignment step of algorithm, Focus performs pairwise comparison of the read subsets generated by the preprocessor to search for potential alignments. Any overlap alignments found in the pairwise read alignment stage are used to create the initial overlap graph.
Multilevel graph set: The next step of the algorithm is the construction of the multilevel graph set. In this step, Focus uses heavy edge matching and node merging to create a set of graphs G0, G1 … Gnrepresenting increasingly coarser levels of information granularity.
Hybrid graph: In the fourth step, Focus backtracks through the multilevel graph set starting with the most reduced graph Gn to select nodes that have been determined to be the best representatives of their corresponding read clusters by local assembly analysis. These representative nodes are used to construct a hybrid graph set G’ n, G’ n-1 … G’ 0 where G’ 0 contains all of the representative nodes selected and integrated from the multilevel set Gn, Gn-1 … G0 . We call G’ 0 the hybrid graph.
Hybrid graph trimming: The hybrid graph G’0 is processed with a graph-filtering algorithm to remove transitive edges and nodes whose corresponding read clusters assemble into contigs that are contained in or are identical to other contigs represented in the hybrid graph. The final trimmed hybrid graph G’ 0 provides a concise but highly accurate and feature rich representation of the structure of the read data set .
Read preprocessor and input format
Focus accepts both fasta and fastq formatted reads. Focus requires the user to specify the number of subsets to divide the read file into for parallel read alignment. Once Focus receives the input reads and specified number of subsets, it generates the reverse complements of the input reads. The preprocessor also includes both fix-length and quality based read trimming. While generating the reverse complements, the preprocessor will first trim the 5’ and 3’ ends of each read and the corresponding 3’ and 5’ ends of its generated reverse complement with fixed lengths l 1 and l 2 respectively that have been provided by the user. We provide this option so that the user can remove any known adapters or tags, which may or may not be the same length, present on the 5’ or 3’ end of the reads. After fixed length trimming is completed on a read, the preprocessor will then apply quality based trimming to its 3’ end and to the corresponding 5’ end of its generated reverse complement. Given a user-provided window length of w and minimum average quality value q, the preprocessor will slide the window from the 3’ end to the 5’ end of the read until the average quality value of the window is greater than q. The read will be trimmed from the right endpoint of the sliding window to its 3’end. Following read trimming, the input reads and their generated reverse complements are divided evenly into the specified number of subsets. The reads in the subsets are then concatenated and indexed by a succinct dictionary structure . In this structure, each nucleotide and corresponding quality value are compressed into a single byte. The read subsets are now ready for processing by the parallel read aligner.
Parallel pairwise read alignment
The read aligner processes pairs of read subsets at a time. One of the read subsets Rq is designated as the query subset and the other Rr is designated as the reference read set. The reference read subset Rr is indexed by a suffix array  to facilitate the search for short seed matches shared between reads. Each read in Rq is visited sequentially and is scanned with a window of size k at step size w specified by the user to generate k-mer seeds. These seeds are used to query the reference read data set for exact matches. These exact matches are used to seed a banded Needleman alignment between the query read and reference reads. If an overlap relationship meeting user criteria for identity and length is found, the query read and reference read ids, overlap length, and overlap identity are recorded for the construction of the initial overlap graph. This process can be conducted in parallel, with different pairs of read subsets being sent to multiple processors for independent read alignment.
Multilevel graph Set
After the matching process is completed on G0, nodes that are a part of the matching are merged to their selected partners to form super nodes in the graph G1. Nodes that were unmatched in G0 are also mapped to new nodes in G1. Edges that were selected during the matching process are removed in G1 since their endpoints are merged into a single super node. Any parallel edges in G1 are combined into a single edge and their edge weights are added together. As follows, each edge in the multilevel graph set will represent the summed weight of the inter-cluster edges of the clusters in G0 represented by the endpoints of that edge. Heavy edge matching and node merging is applied on G1 to produce G2. This process continues until the ratio of nodes matched to graph size falls beneath a user threshold, producing a multilevel set of graphs G0, G1 … Gn. The graph Gn is used to relabel the nodes in G0 to form a new overlap graph Gfinal: any nodes co-occurring in a cluster represented by a super node in G1, G2 … Gn will be consecutively labeled in Gfinal. This allows the nodes in Gfinal belonging to a cluster represented by a super node in G1, G2 … Gn to be loaded into memory concurrently by the algorithm for processing.
After the graph coarsening process is completed, the algorithm will have produced a graph set G0, G1 … Gn representing the original read data set at different levels of information. However, not all sequence regions will be best represented at all graph levels. A node from a reduced graph found later in the multilevel graph set might be sufficient for representing a simple unique genomic region. In contrast, more complex genomic structures might be better represented by the more detailed graphs earlier in the graph set. For example in Fig. 3a, a branch point in original overlap graph is over reduced in graph G3 in the multilevel graph set. However, this branch point is captured at more granular graph levels. To address this issue, best representative super nodes are selected and integrated from multiple graph levels to create a new hybrid graph that is a highly concise yet accurate representation of the input data set. This section describes how a hybrid graph set G’ 0, G’ 1 … G’ n is constructed from the multilevel graph set G0, G1 … Gn. The algorithm creates the hybrid graph set by selecting best representative super nodes from the original multilevel graph set beginning with Gn and iterating to G0. A best representative super node is defined as a node selected from the most reduced graph level as possible whose corresponding cluster of reads assemble into a single contiguous contig. If a read cluster does not assemble into a single contig, it might not be well represented by its current graph level. Backtracking to earlier graph levels may provide better node representatives of the reads in that cluster. To select the best representatives, the algorithm first iterates through Gn. For each super node in Gn, its corresponding cluster subgraph in Gfinal is loaded into memory. Focus employs graph-cleaning techniques first introduced by  and used commonly by many assembly tools. Short dead-end branches that are shorter than a user provided threshold are removed from the subgraph. Small bubbles in the graph, which are two distinct paths in the graph that have the same beginning and ending nodes, are also removed by eliminating the least weighted path. The subgraph is then transitively reduced following the approach in . If the resulting graph is a single path representing a contiguous contig, then the super node is selected as the best representative of that read cluster. The read cluster is assembled into a contig and recorded on file. After the iteration through the nodes of Gn is complete, all selected best representatives are mapped to nodes in G’ n. Nodes that were not selected as best representatives are also mapped to nodes in G’ n. After the best representative selection on G’ n is complete, the algorithm begins the super node iteration and assembly evaluation process on Gn-1. If a node in Gn-1 is a component node of a merged super node in Gn that was previously selected by the assembly algorithm as a best representative, it will not be evaluated or included in G’n-1 since its parent was already chosen as the best possible representative. The graph G’ n-1 is created from all of the best representatives selected from Gn and Gn-1 as well as from the nodes that were not selected in Gn-1. Contigs assembled from the best representatives in Gn-1 are recorded to file. The graph G’ n-2 will be composed of best representative nodes selected from Gn, Gn-1, Gn-2 and the nodes that were not selected in Gn-2. This process is continued for Gn-3 … G0. The final graph G’ 0 will contain all best representatives selected from Gn, Gn-1 … G0. We call this graph the hybrid graph since it is the integration of all graph information levels. As in the multilevel graph set, each edge in the hybrid graph set represents the summed total of the edge weights of the inter-cluster edges of the two clusters in Gfinal corresponding to the endpoints of that edge. Please see Fig. 3b for an example and  for more algorithmic details regarding the construction of the multilevel graph set and hybrid graph set.
Hybrid graph filter
Hybrid graph data-mining
The hybrid graph is used for mining and extraction of biologically significant features, since it provides the most concise, yet accurate structural view of the read data set that could be obtained from integrating the multilevel graph set. In this graph, the degree of a given node can provide much information about the characteristics of the sequence region from which its corresponding reads were derived. If a node has a single pair of in and out edges, it is possible that this node is from a uniquely represented genomic region. In contrast, if a node has several in and out edges, this might indicate that the node represents a sequence region that is repeated throughout a genome or is shared between multiple species. The number of edges incident to a node might reflect the number of diverse sequences that its corresponding genomic region is present within.
The second aim of this paper is to extract transposase genes that are present in multiple sequence regions and to identify which genera they are distributed in. Antibiotic resistance genes associated with these transposase sequences are also mined from the hybrid graph. In this section, it is discussed how, for each given transposase gene present in multiple sequence regions, the hybrid graph can be used to identify which genera the transposase gene is distributed in. This section also discusses how the hybrid graph can be used to obtain transposase associated antibiotic resistance genes through local exploration in the hybrid graph. Observe that in Fig. 6a, the distribution of genera that a transposase sequence is shared across can be obtained by taxonomically classifying the sequences of the adjacent node neighbors of its corresponding node in the hybrid graph. Similarly, any antibiotics resistance genes that are associated with a given transposase sequence can be found by exploring the graph locally around its corresponding node. Figure 6b provides an example of how local graph exploration can reveal antibiotics resistance genes associated with transposase sequences.
Demonstrate that repetitive and transposable elements are associated with node characteristics
Identify and characterize the phylogenetic distribution of antibiotic resistance gene classes associated with transposase/integrase sequences in healthy individuals and individuals with Crohn’s disease.
Perform a competitive assembly evaluation of the assembler against other well-known assembly tools.
The results are divided into four sections. First, a general overview of the data sets and an analysis of the distribution of genera present in the Crohn’s disease and healthy data sets are provided. This study was conducted to evaluate the characteristics of the data sets in the context of previous research. Statistically significant differences were found in the relative abundances of prevalent genera in the Crohn’s and healthy data sets.
Second, aim 1 is addressed. For aim 1, it is shown that a greater proportion of nodes annotated with repetitive and mobile elements are assigned high Shannon’s index scores compared to nodes annotated with other gene categories. First an analysis and discussion regarding the distribution of Shannon’s index scores across the nodes of the hybrid graphs of the thirteen data sets is presented. This is followed by briefly exploring the characteristics of features associated with nodes with high Shannon’s index scores. The most common blastx hits to the NCBI blast database  for extremely high scoring nodes (the two highest scoring nodes for each data set) were to transposases and integrases (33.3 % of all predicted genes). We then used gene and rRNA operon predictions, SEED subsystems , and ACLAME library  to examine biological features associated with the remaining graph nodes. Nodes assigned with transposase/integrase genes and rRNA operon DNA had a greater proportion of high Shannon’s index scores in comparison to nodes assigned with other gene categories from the SEED subsystems (q = 2.44 × 10-04; paired Wilcoxon tests).
Third, addressing aim 2, a comparative study of antibiotics resistance genes associated with transposase/integrase sequences present in multiple sequence regions in the Crohn’s and healthy data sets was conducted. In aim 1, it is demonstrated that a greater proportion of nodes annotated with mobile genetic elements and rDNA operons had high Shannon’s index scores compared to nodes annotated with other gene categories. For aim 2, transposase/integrase sequences found on nodes with high Shannon’s index scores are analyzed since they are likely to be present in multiple sequence regions. We identify all high degree nodes with Shannon’s index scores greater than one that had hits to transposases, identify which genera their corresponding contig sequences are present in, cluster the transposases according to their phylogenetic distribution, and determine if sequence regions associated with the transposases in the resulting clusters are enriched for antibiotic resistance genes. The transposase nodes in the Crohn’s data sets clustered into twenty sets and the nodes in the healthy data set clustered into ten sets. For each of these clustered sets, predicted genes in associated contigs were extracted and DIAMOND  was used to align the predicted genes to the CARD database of antibacterial resistance genes . Fisher’s exact test with FDR corrected p-values was applied to determine if any clusters were enriched with classes of antibiotics. Several of the transposase clusters generated in the Crohn’s disease and healthy control data sets were enriched with various classes of antibiotic resistance genes. This comparative study provides insight into the differences in the distribution and species composition of resistance genes in healthy individuals and Crohn’s patients, whose disease is associated with gut microbiome perturbation  and is often treated with antibiotic regimens for secondary complications such as bacterial overgrowth and abscesses .
Finally the results are concluded by a competitive assembly evaluation of Focus against metagenomics assemblers, IDBA-UD, Omega, and MetaVelvet.
Data set characteristics
SRR49544 SRR497943 SRR497952
SRR497643 SRR497648 SRR497650
SRR497646 SRR497657 SRR504939
SRR497946 SRR497948 SRR497949
SRR497645 SRR497652 SRR497654
SRR063543 SRR063544 SRR063545
SRR063587 SRR063588 SRR063589
SRR063539 SRR063548 SRR063549
SRR063553 SRR063554 SRR063555
SRR063583 SRR063584 SRR063585
Repetitive and transposable elements are associated with node characteristics
Shannon’s index score distribution and functional gene categories
The SEED  is an organizational database system that provides five levels of hierarchical gene functional categorization with the first level being the most general level of classification. The FigFams, which form the leaves of this hierarchy, are sets of proteins that share the same function and are similar at the sequence level. FragGeneScan  was used to predict genes in contigs for all data sets. We downloaded the SEED protein database and used DIAMOND, which was chosen because of its scalability to large data sets and similar degree of sensitivity as BLASTX, to align the predicted genes to the SEED FigFams at a 40 % identity threshold. The SEED subsystems database was used to assign each gene to a level 1 functional categorization if possible. Most of the predicted genes were located on contigs whose corresponding nodes fell into the .6 - .7 score range as well, as shown by Fig. 8. However, there are many outlier genes that have a much greater Shannon’s index score, indicating that they might be found on contigs whose nodes represent repetitive sequence or sequence that is shared between two or more species. In the following section, we first provide a brief characterization of the most extreme outlier genes, showing that many of these genes are transposase and integrases. We then demonstrate that nodes annotated with repetitive and mobile genetic elements have a greater proportion of high Shannon index scores compared to nodes annotated with other gene categories.
Characterization of biological features on outlier nodes
Sequence features found on nodes with the highest Shannon’s index scores
Shannon's Index Score
Transporter, RelB/DinJ, Transposase
5e-15, 5e-32, 4e-32
Phosphatase, Histidine phosphotransferase
Transposase, IS4 family
Major facilitator transporter
0.00e + 00
DEAD/DEAH box helicase
30S ribosomal protein S12
Tetratricopeptide repeat protein
ATP-dependent DNA helicase RecQ
Putative transposase, Major Facilitator Superfamily protein, Glycosyltransferase, Group 1 family protein
5e-12, 1e-24, 8e-13
Transposase family protein, DNA polymerase IV
Selection of a threshold for high Shannon’s index scores
In the previous section we examined the biological features on a small subset of nodes with extreme Shannon’s index scores. Next, we demonstrate that nodes annotated with repetitive and mobile elements have a greater proportion of high Shannon’s index scores. However, minimum threshold for a Shannon’s index score to be considered high must be defined. Recall that for a given node with n edges, the maximum Shannon’s index score that can be assigned to that node is ln(n). An appropriate threshold will exclude nodes that possess a single entering and exiting edge as these nodes might be more likely to be part of unique genomic region. The minimum threshold that would eliminate these nodes is ln(2) ≈ .69 as this is the maximum Shannon’s index score that could be assigned to a node with two edges. However, a node could possess two evenly weighted edges and a third spurious edge that has a small edge weight, pushing this node past the minimum threshold. Thus the minimum threshold is raised to ln(3), which is the maximum score a node with three evenly weighted edges could be assigned. For the sake of simplicity ln(3) ≈ 1.1 is rounded to one.
Characterization of biological features on high scoring nodes
In this section, we demonstrate that nodes annotated with repetitive and mobile genetic elements have a greater proportion of high Shannon’s index scores. To achieve this, we compare the proportion of nodes assigned high Shannon index scores for each of the SEED functional categories to the proportion of nodes assigned high Shannon index scores for rRNA operon and transposase/integrase sequences.
The Meta-RNA  software tool was used to predict rDNA operon sequences in all of our contig sets. Meta-RNA was chosen because of its ability to detect rRNA sequences in fragmented metagenomics data. To further investigate the distribution of transposase and integrase sequences across nodes, the protein sequences of all transposases and integrases were downloaded from the ACLAME database. We used DIAMOND to align the predicted genes to the transposase and integrase protein sequences from both the ACLAME library and SEED FigFams at a 40 % identity threshold. For each read set in the Crohn’s disease and healthy control data sets, the proportion of nodes with Shannon’s index scores greater than one for each of the SEED functional categories, the rRNA operon sequences, and the transposase and integrase sequences was determined. The paired Wilcoxon test was applied to compare the high scoring node proportions for each SEED functional category pooled from the Crohn’s and healthy data sets against the pooled rRNA operon high scoring node proportions followed by the pooled transposase and integrase sequence high scoring node proportions. The paired Wilcoxon tests with FDR correction showed that both the transposases and integrases and rRNA operons had a significantly higher proportion of nodes with Shannon’s index scores greater than one than the SEED functional categories (q = 2.44 × 10−04; Additional file 2).
Mining and characterization of transposase associated antibiotics resistance genes
In the transposase clusters generated from the Crohn’s disease data sets, there were several clusters that were enriched for antibiotic resistance gene classes. In particular, there was an Enterococcus phylogenetic transposase cluster that was not found in the healthy control data set, shown in Fig. 10a (c). The node set obtained from the 5-neighborhood of all of the transposase associated nodes in the Enterococcus cluster was enriched with aminoglycoside, macrolide, and streptogramin resistance gene classes. The aminoglycoside resistance gene class was enriched at the .001 significance level and represented hits to the intrinsic Enterococcus Faecium aac(6')-Ii gene. The macrolide and streptogramin classes were enriched at the .01 level of significance and represented hits to the intrinsic Enterococcus Faecium msrC gene. A single hit to the tetracycline resistance gene class was most similar to the tet(L) gene and aligned to a Enterococcus plasmid. Two clusters whose transposase-associated nodes had many neighbors with contigs classified to Lactobacillus were also significantly enriched with antibacterial resistance gene classes, shown in Fig. 10a (d,e). The tetracycline class hits were most similar to tet(W) genes found in Bifidobacterium, Lactobacillus, and Streptococcus. The streptogramin class hits were to the vat(E) gene found in Enterococcus Faecium and some Lactobacillus plasmids. The Ruminococcus group shown in Fig. 10a (i) was enriched with tetracycline resistance genes with hits to tet(O) and tet(W).
In the healthy control data sets, resistance genes were most prevalent in transposase clusters associated with Bacteroides and Prevotella. The Bacteroides cluster, Fig. 10b (h) was also the largest cluster in the group. Figure 10b (d, e, h, i, j) were all enriched resistance genes from the beta-lactam, lincosamide, macrolide, streptogramin, and tetracycline resistance gene classes. The enrichments for the lincosamide, macrolide, streptogramin resistance gene classes were due to hits to the ermG and ermF macrolide-lincosamide-streptogramin B resistance proteins. The ermB, ermF, ermG, and ermS genes are common sources of resistance in Bacteroidales strains found in the intestine . The enrichments for the beta-lactam class of resistance genes were due to hits to class A beta-lactamases which are found in strains of Bacteroides and Prevotella. Tetracycline class enrichments were from hits to the tet(Q) resistance gene, also found in Bacteroides and Prevotella. The transposase cluster associated with Bacteroides and Alistipes, Fig. 10b (b), was enriched for class A beta-lactamase and tet(Q) resistance genes. Antibiotics resistance gene hits with gene descriptions can be found in Additional file 5.
Comparative assembly results
Number of Contigs ≥500 bps
(%) reads mapped to contigs
Contigs with taxonomic assignment
Contigs with unknown assignemnt
The Focus and IDBA-UD assemblers performed the best on these data sets in terms of N50 length, NG50 length, and percentage of reads that were successfully mapped back to their assemblies. Read mapping was conducted with BWA. Omega had the largest N50 length; however, the percentage of reads that were successfully mapped back to contigs was very low. This indicates that Omega only assembled a very small fraction of the input data set. The size of each of the Omega assemblies was so small that the NG50 statistic could not be calculated for any of the data sets using the estimated total genome length. MetaVelvet had a smaller N50 statistic and a lower percentage of mapped reads indicating that a low percentage of the input data set was assembled into small fragmented contigs. Focus and IDBA-UD had similar N50 statistics and percentage of reads successfully mapped back to contigs. For three of the data sets, Focus had a slightly larger N50 length than IDBA. Focus had larger NG50 lengths than IDBA for two of the data sets. Chimeric contigs are contigs in which at least 25 % of the reads do not map to the genus to which the contig was assigned. Results in Table 3 show that each assembler had a very small fraction of detectable chimeric contigs. A large number of the contigs produced by each assembler could not be assigned to any genus. These results demonstrate that Focus is capable of producing assembly results that are competitive with and exceed existing tools.
We have developed a novel graph mining and assembly algorithm that is capable of extracting useful biological information and producing high quality assembly results. Our algorithm captures genome structural information using a hybrid graph. The initial overlap graph is incrementally reduced using heavy edge matching and node merging to create a graph spectrum, G0, G1, … Gn that represents a read data set at multiple levels of information. To provide the most accurate yet succinct representation of the input data set, nodes from each graph level are selected as best representatives of their corresponding read clusters and combined into a single hybrid graph G’ 0. Each node in this graph represents either a unique region, repetitive element, or region conserved between multiple species. We assigned a Shannon’s index score to each node to numerically describe the number of incident edges and the evenness of their weights. We show that repetitive elements, in particular rRNA operons and transposase genes, are associated with higher Shannon’s index scores. We then extract transposase genes whose corresponding nodes had high Shannon’s index scores in five read data sets obtained from the gut microbiome of individuals with Crohn’s disease and eight read data sets obtained from the gut microbiome of healthy controls. We clustered the resulting transposase genes into groups determined by the distribution of genera that the contigs obtained from the adjacent neighbors of their corresponding nodes were classified too. We then test for enrichment of antibiotic resistance genes in the 5-neighborhood of the nodes in each transposase cluster. Distinct differences were apparent in the Crohn’s disease and control data set clustering results. An enterococcal transposase cluster that was enriched with various antibacterial resistance gene classes was present in the Crohn’s disease clustering results while being absent from the healthy control clustering results. Enterococcus species are often overrepresented in Crohn’s disease data sets. Other sources of antibacterial resistance genes were from Lactobacillus associated transposase clusters. Origins of antibiotic resistance in healthy individuals were heavily biased towards Bacteroidales species. The distribution of the number of transposases was relatively even across the Crohn’s disease clusters, while in the healthy disease data sets most transposases were found in a Bacteroides associated cluster.
This paper highlights the ability of the assembly graph to be a powerful data-mining support that can capture meaningful biological information and patterns in its structural features. Our graph theoretic model is concise yet feature rich, allowing for the efficient detection of biologically meaningful graph structures. We foresee the expansion of our model and the development of novel domain-specific graph mining techniques for other next generation sequencing applications. For example, in cancer research genomic rearrangements, copy number variations, and fusion genes are prevalent . These biological features are likely to be reflected in the structure of the assembly graph for input data sets. We are also exploring further applications of our model for metagenomics data, such as graph-based read filtering of target species from metagenomics samples.
In conclusion, we have developed a powerful graph theoretic model that is capable of capturing key biological information. We applied our model on five gut microbiome read data sets from patients with Crohn’s disease and eight gut microbiome data sets from healthy individuals. Focus produced excellent assembly results in an assembly comparison against the IDBA-UD, MetaVelvet, and Omega metagenomics assemblers. Graph mining revealed graph structural characteristics associated with biological features including rRNA operons and transposase sequences. A comparative study between the Crohn’s disease and healthy data sets revealed considerable differences in the phylogenetic distribution of conserved transposase sequences and associated antibiotics resistance genes. Previously the assembly graph has predominantly been used as a scaffold for the assembly process. In this study, we demonstrate that there is rich structural information contained within the overlap graph that can be extracted to make novel biological discoveries.
Availability of data and materials
The data sets supporting the results found in this research article can be found in the SRA database under the accession numbers: SRR49544, SRR497943, SRR497952, SRR497643, SRR497648, SRR497650, SRR504939, SRR497646, SRR497657, SRR504939, SRR497946, SRR497948, SRR497949, SRR497645, SRR497652, SRR497654, SRR063543, SRR063544, SRR063545, SRR063587, SRR063588, SRR063589, SRR063903, SRR063904, SRR061730, SRR061731, SRR063539, SRR063548, SRR063549, SRR063553, SRR063554, SRR063555, SRR063583, SRR063584, and SRR063585.
Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
- O’Hara AM, Shanahan F. The gut flora as a forgotten organ. EMBO Rep. 2006;7:688–93.View ArticlePubMedPubMed CentralGoogle Scholar
- Karlsson FH, Tremaroli V, Nookaew I, Bergström G, Behre CJ, Fagerberg B, Nielsen J, Bäckhed F. Gut metagenome in European women with normal, impaired and diabetic glucose control. Nature. 2013;498:99–103.Google Scholar
- Turnbaugh PJ, Ley RE, Mahowald MA, Magrini V, Mardis ER, Gordon JI. An obesity-associated gut microbiome with increased capacity for energy harvest. Nature. 2006;444:1027–131.View ArticlePubMedGoogle Scholar
- Chung H, Pamp SJ, Hill JA, Surana NK, Edelman SM, Troy EB, Reading NC, Villablanca EJ, Wang S, Mora JR et al. Gut immune maturation depends on colonization with a host-specific microbiota. Cell. 2012;149:1578–93.Google Scholar
- Ivanov II, Littman DR. Modulation of immune homeostasis by commensal bacteria. Curr Opin Microbiol. 2011;14:106–14.View ArticlePubMedPubMed CentralGoogle Scholar
- Schwabe RF, Jobin C. The microbiome and cancer. Nat Rev Cancer. 2013;13:800–12.View ArticlePubMedPubMed CentralGoogle Scholar
- Ahn J, Sinha R, Pei Z, Dominianni C, Wu J, Shi J, Goedert JJ, Hayes RB, Yang L. Human gut microbiome and risk of colorectal cancer. J Natl Cancer Inst. 2013;105(24):1907–11. djt300.Google Scholar
- Illumina. [http://systems.illumina.com/systems/sequencing.html]
- 454 Sequencing. [http://454.com/products/index.asp]
- PacBio. [http://www.pacb.com]
- Paszkiewicz K, Studholme DJ. De novo assembly of short sequence reads. Brief Bioinform. 2010;11(5):457–72. bbq020.View ArticlePubMedGoogle Scholar
- Peng Y, Leung HC, Yiu S-M, Chin FY. IDBA–a practical iterative de Bruijn graph de novo assembler. In Research in Computational Molecular Biology. Berlin Heidelberg: Springer; 2010. p. 426–40.Google Scholar
- Bankevich A, Nurk S, Antipov D, Gurevich AA, Dvorkin M, Kulikov AS, et al. SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. J Comput Biol. 2012;19:455–77.View ArticlePubMedPubMed CentralGoogle Scholar
- Simpson JT, Durbin R. Efficient construction of an assembly string graph using the FM-index. Bioinformatics. 2010;26:i367–73.View ArticlePubMedPubMed CentralGoogle Scholar
- Nagarajan N, Pop M. Sequence assembly demystified. Nat Rev Genet. 2013;14:157–67.View ArticlePubMedGoogle Scholar
- Myers EW. The fragment assembly string graph. Bioinformatics. 2005;21:ii79–85.PubMedGoogle Scholar
- Haider B, Ahn T-H, Bushnell B, Chai J, Copeland A, Pan C. Omega: an Overlap-graph de novo Assembler for Metagenomics. Bioinformatics. 2014;30(19):2717–22.View ArticlePubMedGoogle Scholar
- Peng Y, Leung HC, Yiu S-M, Chin FY. IDBA-UD: a de novo assembler for single-cell and metagenomic sequencing data with highly uneven depth. Bioinformatics. 2012;28:1420–8.View ArticlePubMedGoogle Scholar
- Lai B, Ding R, Li Y, Duan L, Zhu H. A de novo metagenomic assembly program for shotgun DNA reads. Bioinformatics. 2012;28:1455–62.View ArticlePubMedGoogle Scholar
- Afiahayati, Sato K, Sakakibara Y. etaVelvet-SL: an extension of the Velvet assembler to a de novo metagenomic assembler utilizing supervised learning. DNA Res. 2014: 22(1):69-77.Google Scholar
- Namiki T, Hachiya T, Tanaka H, Sakakibara Y. MetaVelvet: an extension of Velvet assembler to de novo metagenome assembly from short sequence reads. Nucleic Acids Res. 2012;40:e155.View ArticlePubMedPubMed CentralGoogle Scholar
- Warnke J, Ali H. Focus: a new multilayer graph model for short read analysis and extraction of biologically relevant features. In Proceedings of the 5th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics. New York, NY, USA: ACM; 2014. p. 489-98.Google Scholar
- Mahillon J, Chandler M. Insertion sequences. Microbiol Mol Biol Rev. 1998;62:725–74.PubMedPubMed CentralGoogle Scholar
- Klappenbach JA, Saxman PR, Cole JR, Schmidt TM. rrndb: the Ribosomal RNA Operon Copy Number Database. Nucl Acids Res. 2001;29:181–4.View ArticlePubMedPubMed CentralGoogle Scholar
- Sommer MO, Dantas G, Church GM. Functional characterization of the antibiotic resistance reservoir in the human microflora. Science. 2009;325:1128–31.View ArticlePubMedPubMed CentralGoogle Scholar
- Sommer MO, Church GM, Dantas G. The human microbiome harbors a diverse reservoir of antibiotic resistance genes. Virulence. 2010;1:299–303.View ArticlePubMedGoogle Scholar
- Kurokawa K, Itoh T, Kuwahara T, Oshima K, Toh H, Toyoda A, et al. Comparative Metagenomics Revealed Commonly Enriched Gene Sets in Human Gut Microbiomes. DNA Res. 2007;14:169–81.View ArticlePubMedPubMed CentralGoogle Scholar
- Huddleston JR. Horizontal gene transfer in the human gastrointestinal tract: Potential spread of antibiotic resistance genes. Infect Drug Resist. 2014;7:167–76.View ArticlePubMedPubMed CentralGoogle Scholar
- Ogura Y, Bonen DK, Inohara N, Nicolae DL, Chen FF, Ramos R, et al. A frameshift mutation in NOD2 associated with susceptibility to Crohn’s disease. Nature. 2001;411:603–6.View ArticlePubMedGoogle Scholar
- Stecher B, Denzler R, Maier L, Bernet F, Sanders MJ, Pickard DJ, et al. Gut inflammation can boost horizontal gene transfer between pathogenic and commensal Enterobacteriaceae. Proc Natl Acad Sci. 2012;109:1269–74.View ArticlePubMedPubMed CentralGoogle Scholar
- Bermejo F, Garrido E, Chaparro M, Gordillo J, Mañosa M, Algaba A, López-Sanromán A, Gisbert JP, García-Planella E, Guerra I et al. Efficacy of different therapeutic options for spontaneous abdominal abscesses in Crohn’s disease: are antibiotics enough? Inflamm Bowel Dis. 2012;18:1509–14.Google Scholar
- Costelloe C, Metcalfe C, Lovering A, Mant D, Hay AD. Effect of antibiotic prescribing in primary care on antimicrobial resistance in individual patients: systematic review and meta-analysis. BMJ. 2010;340:c2096.View ArticlePubMedGoogle Scholar
- Vigna S. Broadword implementation of rank/select queries. Experimental Algorithms [Internet]. Springer; 2008. p. 154–68. Available from: http://link.springer.com/chapter/10.1007/978-3-540-68552-4_12. [cited 2016 Mar 23]
- Larsson NJ, Sadakane K. Faster suffix sorting. Theor Comput Sci. 2007;387:258–72.View ArticleGoogle Scholar
- Warnke J, Ali HH. An efficient overlap graph coarsening approach for modeling short reads. Bioinformatics and Biomedicine Workshops (BIBMW), 2012 IEEE International Conference on: 4-7 October 2012. 2012. p. 704–11. doi: 10.1109/BIBMW.2012.6470223.
- Karypis G, Kumar V. A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM J Sci Comput. 1998;20:359–92.View ArticleGoogle Scholar
- Zerbino DR, Birney E. Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 2008;18:821–9.View ArticlePubMedPubMed CentralGoogle Scholar
- Chao A, Shen T-J. Nonparametric estimation of Shannon’s index of diversity when there are unseen species in sample. Environ Ecol Stat. 2003;10:429–43.View ArticleGoogle Scholar
- Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997;25:3389–402.Google Scholar
- Overbeek R, Begley T, Butler RM, Choudhuri JV, Chuang H-Y, Cohoon M, de Crécy-Lagard V, Diaz N, Disz T, Edwards R et al. The subsystems approach to genome annotation and its use in the project to annotate 1000 genomes. Nucleic Acids Res. 2005;33:5691–702.Google Scholar
- Leplae R, Hebrant A, Wodak SJ, Toussaint A. ACLAME: A CLAssification of Mobile genetic Elements. Nucleic Acids Res. 2004;32(Database issue):D45–9.View ArticlePubMedPubMed CentralGoogle Scholar
- Buchfink B, Xie C, Huson DH. Fast and sensitive protein alignment using DIAMOND. Nat Methods. 2015;12:59–60.View ArticlePubMedGoogle Scholar
- McArthur AG, Waglechner N, Nizam F, Yan A, Azad MA, Baylay AJ, Bhullar K, Canova MJ, Pascale GD, Ejim L, Kalan L, King AM, Koteva K, Morar M, Mulvey MR, O’Brien JS, Pawlowski AC, Piddock LJV, Spanogiannopoulos P, Sutherland AD, Tang I, Taylor PL, Thaker M, Wang W, Yan M, Yu T, Wright GD. The Comprehensive Antibiotic Resistance Database. Antimicrob Agents Chemother. 2013;57:3348–57.Google Scholar
- Morgan XC, Tickle TL, Sokol H, Gevers D, Devaney KL, Ward DV, Reyes JA, Shah SA, LeLeiko N, Snapper SB et al. Dysfunction of the intestinal microbiome in inflammatory bowel disease and treatment. Genome Biol. 2012;13:R79.Google Scholar
- NCBI SRA [http://www.ncbi.nlm.nih.gov/sra]
- Li H, Durbin R. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics. 2009;25:1754–60.View ArticlePubMedPubMed CentralGoogle Scholar
- Consortium HMJRS et al. A catalog of reference genomes from the human microbiome. Science. 2010;328:994–9.View ArticleGoogle Scholar
- Nagalingam NA, Lynch SV. Role of the microbiota in inflammatory bowel diseases. Inflamm Bowel Dis. 2012;18:968–84.View ArticlePubMedGoogle Scholar
- Willing BP, Dicksved J, Halfvarson J, Andersson AF, Lucio M, Zheng Z, Järnerot G, Tysk C, Jansson JK, Engstrand L. A pyrosequencing study in twins shows that gastrointestinal microbial profiles vary with inflammatory bowel disease phenotypes. Gastroenterology. 2010;139:1844–54.Google Scholar
- Mondot S, Kang S, Furet J-P, Aguirre de Cárcer D, McSweeney C, Morrison M, Marteau P, Dore J, Leclerc M. Highlighting new phylogenetic specificities of Crohn’s disease microbiota. Inflamm Bowel Dis. 2011;17:185–92.Google Scholar
- Gevers D, Kugathasan S, Denson LA, Vázquez-Baeza Y, Van Treuren W, Ren B, et al. The Treatment-Naive Microbiome in New-Onset Crohn’s Disease. Cell Host & Microbe. 2014;15:382–92.View ArticleGoogle Scholar
- Cho I, Blaser MJ. The human microbiome: at the interface of health and disease. Nat Rev Genet. 2012;13:260–70.PubMedPubMed CentralGoogle Scholar
- Rho M, Tang H, Ye Y. FragGeneScan: predicting genes in short and error-prone reads. Nucleic Acids Res. 2010;38(20):e191.View ArticlePubMedPubMed CentralGoogle Scholar
- Huang Y, Gilna P, Li W. Identification of ribosomal RNA genes in metagenomic fragments. Bioinformatics. 2009;25:1338–40.View ArticlePubMedPubMed CentralGoogle Scholar
- Nakano V, do Nascimento e Silva A, Merino VRC, Wexler HM, Avila-Campos MJ. Antimicrobial resistance and prevalence of resistance genes in intestinal Bacteroidales strains. Clinics (Sao Paulo). 2011;66:543–7.View ArticleGoogle Scholar
- Earl D, Bradnam K, John JS, Darling A, Lin D, Fass J, et al. Assemblathon 1: a competitive assessment of de novo short read assembly methods. Genome Res. 2011;21:2224–41.View ArticlePubMedPubMed CentralGoogle Scholar
- Meyerson M, Gabriel S, Getz G. Advances in understanding cancer genomes through second-generation sequencing. Nat Rev Genet. 2010;11:685–96.View ArticlePubMedGoogle Scholar