Comprehensive whole genome sequence analyses yields novel genetic and structural insights for Intellectual Disability

Background Intellectual Disability (ID) is among the most common global disorders, yet etiology is unknown in ~30% of patients despite clinical assessment. Whole genome sequencing (WGS) is able to interrogate the entire genome, providing potential to diagnose idiopathic patients. Methods We conducted WGS on eight children with idiopathic ID and brain structural defects, and their normal parents; carrying out an extensive data analyses, using standard and discovery approaches. Results We verified de novo pathogenic single nucleotide variants (SNV) in ARID1B c.1595delG and PHF6 c.820C > T, potentially causative de novo two base indels in SQSTM1 c.115_116delinsTA and UPF1 c.1576_1577delinsA, and de novo SNVs in CACNB3 c.1289G > A, and SPRY4 c.508 T > A, of uncertain significance. We report results from a large secondary control study of 2081 exomes probing the pathogenicity of the above genes. We analyzed structural variation by four different algorithms including de novo genome assembly. We confirmed a likely contributory 165 kb de novo heterozygous 1q43 microdeletion missed by clinical microarray. The de novo assembly resulted in unmasking hidden genome instability that was missed by standard re-alignment based algorithms. We also interrogated regulatory sequence variation for known and hypothesized ID genes and present useful strategies for WGS data analyses for non-coding variation. Conclusion This study provides an extensive analysis of WGS in the context of ID, providing genetic and structural insights into ID and yielding diagnoses. Electronic supplementary material The online version of this article (doi:10.1186/s12864-017-3671-0) contains supplementary material, which is available to authorized users.


Background
Intellectual Disability (ID) affects 1-3% of the global population. A significant proportion of ID is caused by genetic defects, yet despite extensive testing including by clinical chromosomal microarray (CMA),~30% of cases remain idiopathic [1].
Genome-wide sequencing can identify previously unknown genes causative for ID. Whole exome sequencing (WES) is limited by poor ability or inability to detect non-coding and structural variation, and capturing less than 100% of the exome [2]. In contrast, whole genome sequencing (WGS) offers a comprehensive screen of a variety of DNA variation types. Current evidence suggests WGS is able to detect coding variants in 42% of cases missed by WES [2].
We report comprehensive WGS analyses for eight patients with ID and brain malformations, whose family history suggested a de novo mutation. Despite a diagnostic odyssey, including genome-wide clinical and research CMA, they were idiopathic. WGS was conducted on trios composed of the affected child and both unaffected parents (average 34X coverage), and data was analyzed using both alignment and assembly approaches to detect all possible causative genetic changes-single nucleotide variants (SNVs and indels), copy number variants (CNVs) and structural variants (SVs) (Fig. 1). We validated our findings using WES data from an independent positive control cohort of 2081 patients with ID and other neurocognitive phenotypes, and a negative control WGS cohort of 2535 normal subjects. In addition we probed molecular themes indicated by our discovery cohort findings in the positive control cohort, leveraging its large size. We also conducted a screen for de novo variants in possible regulatory sequences of known and hypothesized pathogenic genes.

Subjects
Patients were enrolled from the British Columbia Children's and Women's Hospital Provincial Medical Genetics Program after obtaining informed consent. This study is approved by the British Columbia Children's and Women's hospital research ethics boards. All patients presented with ID (moderate to severe) and brain morphological defects detected by MRI or CT scan. Patients had no family history of ID, and all were products of normal pregnancies with no reported teratogenic exposures as ascertained by clinical assessment by board certified Medical Genetics specialists at the recruiting facility. Saliva samples were collected and DNA extracted using DNA Genotek® collection kits, reagents and protocols from child, father and mother.

Methods
WGS methods, variant calling protocols, verification methods, and secondary control study methods including bootstrap analysis, are summarized below and detailed in Additional file 1. Briefly; DNA was extracted using DNA-Genotek® extraction kits. Paired-end WGS libraries were prepared using Illumina's PCR-free protocol (TruSeq DNA Sample prep kit -Illumina Catalogue Number FC-121-1002). Sequencing was by IlluminaHiSeq 2500 platform (v3 chemistry) generating 100 bp paired-end reads, using three lanes per sample (34X average coverage across all samples). Alignment and variant calling was by Canada's Michael Smith Genome Science Center standard pipelines (Additional file 1, reference genome -hg19).
Variants were identified and filtered as follows, briefly; putative SNVs were identified using SAMtools mpileup version 0.1.17 run on each sample separately. Relatedness was tested for each trio by comparing SNP concordance between child, mother and father using vcftools-0.1.14 [3] (Additional file 2: Table S1). De novo variants were selected by intersecting the child's SNVs with that of each parent, and selecting variants only present in the child and not in either parent. For variants in the coding region, we selected de novo missense, nonsense and splicing variants, i.e., functional variants. We next selected rare variants by excluding alleles with minor allele frequency >1% in dbSNPv135 (excluding disease associated variants), Exome Variant Server, Exome Aggregation Consortium (ExAC), and in-house databases of >7430 exomes, and >3000 genomes (at Canada's Michael Smith Genome Sciences Center and the British Columbia Children's Hospital Research Center, available via open-source access [4]). We then used pathway enrichment analyses to prioritize de novo rare variants; selecting SNVs in genes enriched in pathways involved in brain development and function conducted using QIAGENs In-genuity® Pathway Analysis (IPA), DAVID (https://davidd.ncifcrf.gov/ 6.7) and Panther (http://pantherdb.org/). For those variants passing the pathway enrichment screen, pathogenicity predictions and conservation scores were annotated using SIFT [5], PhyloP [6], PolyPhen [7], Muta-tionTaster [8] and CADD [9] scores. These steps yielded de novo, functional, rare variants, that are highly conserved and predicted to be damaging and in biologically relevant pathways. In addition to the above prioritization, the rare functional variants were subsequently also screened under a series of additional genetic models (e.g. compound heterozygous, de novo heterozygous, homozygous recessive, hemizygous recessive), and manually checked for alignment quality with Integrated Genomic Viewer (IGV, https://www.broadinstitute.org/software/igv). SNVs that were highly conserved and were predicted to be damaging by at least one pathogenicity prediction software, were selected for verification by Sanger sequencing in the child, mother and father.
CNV analyses was conducted using FREEC [10], CNAseq [11], DELLY [12] and ABySS [13]. The first three algorithms align reads to the reference genome while ABySS uses de novo assembly to reconstruct the patient's genome. SV analyses was conducted using only DELLY and ABySS. First, de novo CNVs/SVs were identified by comparing the child's data to that of either parent (Additional file 1). De novo CNVs from each algorithm were filtered by manual assessment of local read configuration on IGV, and genuine ones were prioritized based on functional relevance of the included/CNV-affected genes. SVs, i.e., translocations and inversions, were filtered by either IGV read visualization and then by using QC metrics specific to each algorithm; QC metrics generated by the program were used for DELLY, and checking of BLAT scores for breakpoint-junction contigs and number of supporting reads were used for ABySS. Candidate CNVs/SVs that were selected from the above filtering were verified using an independent method as detailed below.
All de novo variants, i.e., SNVs, CNVs and SVs, were verified by Sanger sequencing of PCR-captured amplicons of the affected sequence, either bearing the SNV or spanning the breakpoint junction (in the case of CNVs and SVs) in the trio, with forward and reverse primers (Additional file 3: Table S2). All verified candidate SNVs were subjected to genotype-phenotype correlations assessment as per the guidelines of the American College of Medical Genetics (ACMG) [14].
Secondary control study -WES data from the UK10K project [15] for 2081 patients with neurofunctional phenotypes (available clinical data for the projects that comprise this cohort is found in Additional file 4: Table S3), and WGS data from 2535 normal individuals from the 1000 Genomes project [16]; a publicly available repository of variation in healthy individuals, was obtained. 'Possibly damaging SNVs' (PDSs), were extracted from these datasets (as detailed in Additional file 1 and Additional file 5: Figure S2), and a gene-wise PDS burden for all genes in the human genome was determined in both the positive and negative control cohorts. Subsequently the gene-wise PDS burden only in our candidate genes was compared between the positive and negative control cohorts. We further bootstrapped the positive control cohort to determine if the PDS burden in our six candidate genes could be due to random sampling. Finally, we tested to see what functional pathways genes with PDS in the positive control cohort were involved in, and conducted a Kyoto Encyclopedia of Genes and Genomes (KEGG [17]) pathway enrichment analyses, testing which of the total 57 functional pathways from KEGG were most enriched for genes bearing PDS in this large dataset.
Regulatory region variation -For our regulatory region analysis we selected 'high confidence' de novo SNVs defined as having a mapping quality > 30 and read depth ≥ 10 and ≤ 100, and 'high confidence' de novo CNVs defined as those that were detected by two or more CNV detection algorithms. We then intersected both the de novo high confidence SNVs and CNVs with six non-coding sequence annotation datasets. Results from the above, i.e., de novo high confidence SNVs and CNVs with involvement in putative regulatory regions, were then intersected with candidate gene lists and appropriate flanking sequence (Additional file 1) to determine their possible association to a candidate known or hypothesized ID gene.

De novo SNVs identified by objective molecular pathway-based filtration
Genes with functional rare de novo SNVs were screened using three pathway analyses programs (IPA, DAVID and Panther) in order to refine candidates involved in brain development and function; IPA returned 17 candidate genes, DAVID returned 23, and Panther returned 9. A total of 23 unique genes involved in brain development and function were yielded by the combined analyses (i.e., found by at least one of the programs). From these, highly conserved and predicted damaging SNVs (11 SNVs in 11 genes in six patients) were Sanger tested, and six SNVs in six genes in five children were confirmed as heterozygous de novo (Table 1) A526N). The latter two were found in a single patient while the rest each appeared in a separate patient. As best practice, we also screened our de novo rare functional variants for location within published known [2] and candidate ID genes [18], however no new findings were yielded. Except ARID1B and PHF6, the other genes are novel for ID. Table 1 provides variant classification as per the ACMG variant interpretation guidelines [14] (Additional file 6: Table S4 for detailed classification of variants) and our interpretation of their causative effect. Brief genotype-phenotype correlations are given below; This single base deletion in exon 2 of the known ID gene ARID1B causes a frame-shift leading to predicted loss of function (LoF, Additional file 5: Figure S1). Our patient presents with ID, autism, absence of corpus callosum, absence of speech, feeding difficulties and failure to thrive ( Table 1). Haploinsufficiency of ARID1B was reported to cause corpus callosum abnormalities, ID, speech impairment and autism [19], suggesting the ARID1B LoF is causative and sufficient in this case.  [22] and reported for a female specific form of BFLS [23]. Roles for PHF6 are reported in the chromatin remodeling SWI/SNF complex [24], and in the NuRD epigenetic regulatory complex where it acts as a possible regulator for the latter in neurogenesis [25]. RNAi knock down of PHF6 profoundly impairs neuronal migration in vivo [26], thus leading to formation of white matter heterotopias. In keeping with this, this patient reports pachygyria, which results from abnormal migration of neurons in the developing brain. She also presents with an unusual asymmetrical growth phenotype that was reported in the one patient with the female specific BFLS [23]. These data indicate the variant is a good candidate in this case.

SPRY4 c.508 T > A (p.C170S) in Patient 59
SPRY4 encodes a specific inhibitor of the mitogen-activated protein kinase family. Spry4 is expressed in the mouse developing brain [27], and is essential for the normal morphogenesis and cytoarchitecture of the cerebellum [28]. Morphogenic changes in axon growth have been shown when the protein is down regulated both in vivo and in vitro [29]. In zebrafish, spry4 is a principal regulator of mid-brain development [30], and mediates hindbrain patterning [31]. These data support the notion that the SPRY4 missense variant may contribute to the brain morphological phenotype in this patient. Spry4 expression plays a role in Xenopus limb bud development [32], of note as our patient reports short and crowded toes.
CACNB3 encodes a regulatory subunit of a voltagedependent calcium channel (VDCC). Mice lacking Cacnb3 presented visual impairment [33], high pain threshold [34], and behavioral phenotypes [34], all of which features are seen in this patient. Mutations in other members of VDCC subunit encoding genes are known to cause neurological disease, including epilepsy [35] present in our patient. This variant is found in eight of 60,165 individuals in the ExAC database, where its non-absence disqualifies likely pathogenicity as per ACMG criteria, despite being de novo and deleterious by multiple lines of computational evidence. Neither does it meet criteria to be a benign variant, and therefore is of uncertain significance.
SQSTM1c.115_116delinsTA (p.A39*), UPF1 c.1576_1577delinsAA (p.A526N) and a 1q43(1:243282457-243447771, hg19) deletion CNV in Patient 51 The patient is severely affected, with significant ID and several major congenital anomalies ( Table 1). The heterozygous indel formed by two adjacent SNVs in SQSTM1 causes a stop-gain. SQSTM1 encodes p62, a regulatory factor in Nuclear Factor kappa-B (NF-kB) signaling, NF-E2-related factor 2 (NRF2) activation, ubiquitin-mediated authophagy, and transcription [36]. The SNV is located in the PB1domain, mutations of which cause Paget Disease of Bone (PDB) and Frontotemporal Dementia and/or Amyotrophic Lateral Sclerosis (FTLD/ALS) [MIM:607485,612069] [36]; both neurodegenerative conditions that include morphological brain changes. The adjacent SNVs in UPF1,  Age at examination is given in 5 year intervals in order to protect patient anonymity b Patient 51 also bears a de novo likely contributory CNV as detailed in the text, in addition to the SNVs given here together cause a likely pathogenic missense amino acid change (Table 1 and Additional file 5: Figure S1). UPF1 has an essential role in nonsense-mediated mRNA decay [37]. Interestingly, UPF1 has been shown to remarkably reduce ALS-associated neuronal toxicity in vitro [38] and to protect against motor dysfunction and forelimb paralysis in a rat model for ALS [39]. It is plausible haploinsufficiency of SQSTM1 may have caused neurofunctional defects, which the haploinsufficiency of UPF1 may have exacerbated. In this regard, it is notable that at 19 years of age, patient 51 presents significant motor deficits, being wheelchair bound, indicative of a possible early onset of ALS. While scoliosis and hearing loss, both among the presentation for PDS is already seen in her. These data support the notion that the SNVs in both genes maybe contributory toward her presentation. We further verified a de novo~165 kb heterozygous deletion that spans CEP170 [MIM:613023] in whole and SDCCAG8 [MIM:613524] in part (Fig. 2a and c) in this patient. CEP170 encodes a component of the centrosome [40]. SDCCAG8 is also involved in centrosome function [41], DNA damage response signaling [42] and neuronal migration [41]. Both genes are suggested as candidates for corpus callosum abnormalities via 1q43 microdeletion [43], however this has been contested [44] (Fig. 2c). Our patient presents partial phenotypic overlap with microdeletion 1q34 index cases. The demonstrated roles for SDCCAG8 in DNA-mismatch repair, and for both genes in cell cycle progression, supports the notion this CNV may be contributory. Notably, the haploinsufficiency of a DNA-mismatch repair gene could lead to the high mutation burden detected in this child (above SNVs, and vide section 'Genome Assembly Indels'). We also confirmed at least one maternally inherited balanced translocation (vide section on CNV/SVs), which is unlikely to be contributory. We investigated the candidacy of the above verified genes by assessment for incidence of damaging mutation in large positive and negative cohorts with comparable NGS data. We looked for 'potentially damaging SNVs' (PDSs) (Additional file 5: Figure S2 gives an example per patient PDS mutation burden) in our candidate genes, from WES of 2081 patients with neurodevelopmental and neurocognitive phenotypes from the UK10K cohort [15] (Additional file 4: Table S3 and Additional file 5: Figure S3) and compared that to incidence in WGS from 2535 healthy people from the 1000 Genomes project [16].
We first screened for the exact variant detected in our discovery cohort, and did not find any case of an exact match. We then conducted a gene-wise PDS screen and observed that incidence for PDS in ARID1B, SPRY4, CACNB3, SQSTM1 and UPF1 were significantly enriched in the positive versus negative control cohorts (Fig. 3a). There was no significance for PHF6; however, the two PDS found in 4616 people was insufficient for meaningful statistical assessment. The extremely high PDS burden in the positive control cohort for SQSTM1 and UPF1 is noteworthy, as these genes have previously not been reported for ID to our knowledge, and further, the indels in both are found in the same patient in our cohort.
While we do not have access to clinical data to conduct a classical genotype-phenotype correlation between cases in the positive control cohort and our patients who have the same gene affected, the large number of such cases in the positive control cohort also impedes such a study within the scope of this work. We therefore assessed if our findings could be due to random chance effect, by bootstrapping the UK10K cohort for PDS in six randomly selected genes each, a thousand times. We found from the bootstrap analysis that the mean and median gene-wise variant frequency for our six candidate genes was greater than that of the corresponding distribution, indicating that our findings were not likely due to chance (Fig. 3b & c). These data are consistent with an association of at least five of our candidate genes with neurodevelopmental abnormalities.
Novel candidate genes converge unto the ubiquitin proteasome pathway, which also bears significant mutation burden in 2081 positive control WES samples We investigated molecular links between our pathogenic and candidate genes; focused IPA and STRING pathway analyses revealed that all six connected to the ubiquitin a b c Fig. 3 Validation study. a showing incidence for potentially damaging SNVs (PDSs) in both the positive control (UK10K) and negative control (1000G) control cohorts. * denotes statistical signficance (at p < 0.05, Fisher's exact test) b and c Results of bootstrap analyses for PDSs in 6 randomly selected genes. Red vertical bar shows the mean and median result for PDSs in our 6 candidate genes proteasome degradation pathway (UPP) (Additional file 5: Figure S4) which has important roles in the structural development and function of the brain [45,46]. We assessed the relative importance of this pathway and found the UPP was among significantly enriched pathways for PDS when compared with all KEGG pathway categories (n = 55) (Additional file 7: Table S5), in the UK10K patient cohort (p = 0.031), substantiating the importance of the UPP pathway in brain development.

Mendelian inheritance and N of 1 analyses provides additional candidate variants
In addition to our in-silico refinement and test for candidate de novo SNVs above, we also conducted a classical series of N of 1 studies for these eight patients; manually assessing the possible candidacy of variants selected by all possible Mendelian inheritance patterns (Additional file 8:

Extensive copy number variant (CNV) and structural variant (SV) analyses identifies likely causative CNV missed by clinical CMA, and balanced benign translocation
We conducted both alignment-based (FREEC, CNAseq, DELLY) and de novo assembly-based (ABySS) CNV/SV analyses. CNVs, i.e., duplications (gains) and deletions (losses) were identified by all four platforms, while SVs, i.e. translocations and inversions, were identified by DELLY and ABySS (Table 2). An average of 58 de novo gain CNVs and 128 de novo loss CNVs across all eight patients were detected. However, only 46 CNVs were called by over one platform, and none were called by more than two (Fig. 4), with the majority of each algorithm's findings being unique. We carried out extensive visual in silico curation for all CNVs, and selected three to verify of which, only the previously discussed 1q43 loss CNV, Sanger verified as de novo-it was detected by FREEC and CNAseq, and is clearly visible on IGV (Fig. 2a). Breakpoint junction sequence reveals a complex architecture (Fig. 2b). Similar to our CNV results, SV results from DELLY and ABySS were divergent ( Table 2). Only one translocation (between chromosome 19 and 1) in patient 41, was called by both, and there was no concordance among inversions. Upon extensive manual in silico curation we selected 10 translocations and 1 inversion to verify (Additional file 9: Table S7), but none verified as de novo. Sanger verification for these lesions was challenging as breakpoints mapped to repeat-masked regions, nevertheless one translocation verified as maternally inherited; a chromosome X-2 (92696685:225020555, hg19) translocation not causing any gene-disruption, in patient 51. The breakpoint junction shows a single base addition (Fig. 6a).

Genome assembly yields small insertions/deletions (indels) missed by genome re-alignment
In contrast to the re-alignment based algorithms, ABySS [13] identified over 700 potential de novo indels (maximum size 100 bp), via genome assembly. Forty three indels were refined as likely true positives with a functional importance, due to having at least seven spanning reads, and producing a protein coding change; the majority being in patient 51. For consistency, we conducted a pathway analyses for the indel-bearing genes, and a manual curation, in order to select candidates for verification as we had done for our SNVs. This resulted in 14 indels that were Sanger tested (Additional file 1); however one was false positive, five were inherited, and eight did not pass PCR quality checks (Additional file 9: Table S7), indicating location to repeated DNA sequence, thus hampering any ability to amplify the region for Sanger sequencing.

Gene regulatory region variation identified in known and hypothesized ID genes
We investigated gene regulatory sequence variation which we term 'de novo variants in possible regulatory regions' (DVPRRs). We filtered the DVPRR for potentially pathogenic changes using two approaches: by screening for involvement in known ID genes, and on the basis of our hypothesized involvement of the UPP.
An average~30,000 de novo SNVs were found across our eight patients in the non-coding genome (Fig. 5a). Of these, an average 2909 located to transcription factor binding sites, an average 514 to putative gene promoters, an average 191 of those located to promoters were also located to transcription factor binding site regions, an average 210 located to regions annotated as enhancers by the FANTOM consortium [50], an average 263 belonged to 5′ or 3′ UTR regions and an average 58 located to highly conserved ultra-sensitive regions [51] we considered these to be DVPRR and therefore there were an average 3763 DVPRR across all eight patients (Fig. 5a). We then intersected DVPRRs with 995 genes known to cause developmental delay ('DDD genes') [52] in a disease gene screen approach, and with the total 137 genes of the UPP (KEGG), − as our candidate genes converged upon the UPP -in a hypothesis-driven approach. As a final step for enhancers and ultra-sensitive regions, we further selected DVPRR where it, and the candidate gene (DDD genes or UPP genes), were located within the same topological domain [53], postulating that their physical proximity would imply that the regulatory region in question did in fact impact the targeted gene. In summary we found an average of 56 and 11 DVPRR per patient in our gene-screen and hypothesis driven approach respectively, by these filtrations combined (Additional file 10: Table S8) (Fig. 5a). We also interrogated high-confidence CNVs in the same manner, but only found association to SDCCAG8, a known ID gene present in the previously discussed 1q43 microdeletion ( Fig. 5b and Additional file 10: Table S8).

Occurrence of de novo SNVs in non-coding RNAs (ncRNA)
We found an average of 241 high confidence de novo SNVs that located to sequence annotated as ncRNA across all eight patients. A majority of these (average 195) fall within introns while an average 39 are exonic, an average 0.25 are predicted in splice junction sequence and average 5 and 2 are located to 3′ and 5′ UTR respectively.

Selection of candidate SNVs: comparison of strategies
An effective strategy is essential to select causative SNVs from NGS data. Standard filtration approaches (e.g., variant quality, mapping quality, minimum read depth, and functional variants that are not common polymorphisms) yield potential de novo variants that then must be careful sifted for likely true candidates.
In keeping with others [54], we found an average of 6 +/− 2 candidate unverified de novo SNVs (Additional file 8: Table S6), and it was necessary to implement an effective prioritization approach for verification. Discovery WGS and WES studies published to date have used a large sample size [2], detailed pedigree information [55], or well characterized rare syndromes [56] as study cohorts, leveraging the power of numbers, inheritance pattern, and phenotypic commonality, respectively, as filtration strategies. In as much as we did not have a large cohort, all of our cases were sporadic, and none had a recognized dysmorphic syndrome, we refined SNVs objectively, by selecting genes known to be involved in brain development pathways. We reasoned that this systematic approach would reduce subjective bias inherent in an N of 1 genotype-phenotype correlation, and thereby identified potential candidates. However a subjective screen for SNVs yielded the likely damaging variant in SCN3A, which was not stratified by our objective approach -highlighting the limitation of pathway analyses programs that depend on available  Table S8) gene-functional annotations. Notably SCN3A was selected by a team of biochemical geneticists specifically with respect to the epilepsy presented by the child. Thus a subjective approach may also miss results detected from objective screening, as exemplified in this case, where the two analyses were done by independent members and each did not report the result of the other.
Interpreting detected variants; discovery study findings further inform genetic complexity for ID Variable expressivity and reduced penetrance are well known in the pathogenicity of ID, and it is increasingly recognized that a single mutation in a single gene may only rarely explain the full phenotypic spectrum [1]. Our results provide further indications of such complex heritability; in patient 51, the 1q43 deletion, and SNVs in SQSTM1 and UPF1 may act in concert to produce the complex and severe phenotype in this patient. While in patient 58, we have identified both compound heterozygous variants in the known ID gene AP4E1 that act in a recessive model, as well as de novo variant in a novel gene SPRY4, which has important functions in brain development. De novo mutation is recognized to play an important role particularly in the pathogenicity of ID [57], and it is difficult to determine to what extent each of these variants, if at all, contributes to disease burden in this patient. The same is true for patient 45 in whom de novo variants for two novel genes, CACNB3 and SCN3A were identified. We note that patient 51 who bears the most complex genotype, is the most severely affected in our cohort, and in this case, clinical severity does co-relate with number and complexity of genomic alterations, suggesting that gradation of clinical severity may provide useful toward assessing the contribution of genomic alterations.
It is recognized that genes responsible for ID converge onto common networks [1,58]. The candidate genes we identified converge onto the UPP, which is critically involved in neurodegenerative disease [45] and has important roles in neurodevelopmental disorders [45,46]. This observation is consistent with the notion that they may be good candidates, and exemplifies the usefulness of probing molecular links among novel findings.

Large secondary positive WES cohort analysis supports novel findings
Novel SNV findings from NGS studies require rigorous additional studies to support proof of pathogenicity [14]. In our case, several of our novel candidates cause missense variation, whose effect is difficult to model, as opposed to clear loss of function mutations which are amenable to functional studies in model organisms. Conversely we were unable to conduct traditional genotype-phenotype correlations studies as none of our patients had a recognizable syndrome to match with other patients. Therefore, our approach of using a large secondary positive control cohort, despite the phenotypic spectrums not matching our cases precisely, gave us sufficient ability to test the predicted causality of our candidate genes and was the best strategy available. We were hampered by the lack of an optimal comparison negative control cohort. We used WGS data from the 1000 genomes project, which we recognize is primarily comprised of low coverage samples whose phenotypic spectrum is poorly characterized (thus yielding likely false negative data or conversely identifying variants in 'normal' individuals who are in fact affected), yet the similarity of sample size between the two groups allowed us to explore the PDS distribution for these genes reasonably, providing a useful contributory analysis toward assessing their likely pathogenicity. Finally this large cohort enabled us to further probe the convergence of our candidate genes upon UPP, by assessing its contribution versus other biological pathways.
WGS is able to detect structural variants below the threshold of clinical CMA, and enables mechanistic insights into CNV formation By using WGS instead of WES, we were able to detect a CNV below clinical CMA resolution, isolate it's breakpoints, and uncover a possible complex genomic landscape in one patient. We wanted to conduct a comprehensive screen for CNVs and other structural variants to maximize sensitivity. Therefore we used four approaches that are fundamentally different; CNAseq and FREEC are sequence based copy-number estimators that use categorically different algorithmic approaches for background correction. DELLY is an alignment based assembler, whilst ABySS is a de-novo genome assembler. Since each algorithm was optimized differently, it therefore yielded different results. For example, CNAseq executes read-depth based binning, and hence aggregates results at telomeres and centromeres where a larger number of reads re-align due to pervasive repeat sequence (Additional file 5: Figure S5). The verified CNV we detected was only identified by DELLY and FREEC, but missed by the other algorithms. Therefore, we caution against using only one CNV detection algorithm as this would reduce sensitivity. The breakpoint junction sequence in the case of the confirmed 1q43 microdeletion is consistent with the notion that it could be caused by chromotripsis, a mechanism only recently reported in the constitutional genome [59], further demonstrating the utility of WGS data.
WGS enabled a de novo genome assembly that unmasked hidden genome complexity ABySS de novo assembly identified a translocation missed by DELLY, and also detected a higher than usual number of putative indels in patient 51, who was found to have a remarkably unstable genome masked by standard genome re-alignment based analysis (Fig. 6b). However, we experienced difficulty confirming these events via Sanger sequencing, which was due, in part, to the high degree of repeated sequence at breakpoint junctions. Genome assembly is able to call events in repetitive sequence better than alignment based algorithms [13], though conversely such events are harder to independently verify. We are among the first to use de novo assembly to interrogate patients with ID, and our findings suggest variation located to repeat enriched sequence is currently under-ascertained in the constitutional genome.
WGS is able to interrogate regulatory genomic sequence Meaningful interpretation of SNVs within regulatory sequence is hampered by the sparsity of annotations for the non-coding genome. We implemented two different filtering strategies in order to identify non-coding SNVs that could have a functional impact, and also used topological domain data to further refine good candidates. Though we were able to reduce the number of candidate DVPRR from an average >3700 to dozens in the case of our gene-screen approach and a handful in the case of our hypothesis driven approach, nonetheless without further focused studies, meaningful interpretations are precluded. In contrast, assessing the impact of CNVbased DVPRR is theoretically less challenging, as it is more straightforward to predict functional outcome for a complete loss or gain of a possible regulatory sequence. In summary, though clinically relevant conclusions for DVPRR will require a case-by-case analysis and extensive follow-up functional studies, nevertheless we note it is possible to stratify DVPRR in the context of known causative genes for ID using WGS.

WGS versus WES
WGS yields a comprehensive screen of the genome as, in addition to coding variation, it includes ability to investigate structural variation at a fine scale as discussed above, and also variation in possible gene regulatory sequence as well as 'non-coding genes' such as ncRNAs for which there is a paucity of information in the context of neurodevelopmental disease. While we show strategic stratification for DVPRR can yield results potentially relevant to ID causation, much less is possible for annotation of SNVs within ncRNA sequence, of which we detect an average 241 across our samples. Nevertheless, initial screens such as ours, importantly generate exploratory information for non-coding sequence variation possible only by WGS.
We note that all the SNVs we identified as involved in disease would have been possible to detect by WES. However, WGS yields a more complete view of possible pathogenic variation in each child. This is exemplified in the case of patient 51, for whom had only WES been performed, while the SQSTM1 and UPF1 SNVs would likely have been detected, the 1q43 microdeletion would not have been identified. In the case of this patient, it is unclear what the gene-effect size for each variant is. Conversely, in the case of patient 43, for whom we detected the SNV in ARID1B, we are more certain of the penetrance of this variant due to the normal results for other causative variation in their genome (i.e., that they do not have any CNVs or SVs) from our WGS data analyses. These data argue in favor of WGS over WES for clinical use.