Systematic identification of phenotypically enriched loci using a patient network of genomic disorders

Background Network medicine is a promising new discipline that combines systems biology approaches and network science to understand the complexity of pathological phenotypes. Given the growing availability of personalized genomic and phenotypic profiles, network models offer a robust integrative framework for the analysis of "omics" data, allowing the characterization of the molecular aetiology of pathological processes underpinning genetic diseases. Methods Here we make use of patient genomic data to exploit different network-based analyses to study genetic and phenotypic relationships between individuals. For this method, we analyzed a dataset of structural variants and phenotypes for 6,564 patients from the DECIPHER database, which encompasses one of the most comprehensive collections of pathogenic Copy Number Variations (CNVs) and their associated ontology-controlled phenotypes. We developed a computational strategy that identifies clusters of patients in a synthetic patient network according to their genetic overlap and phenotype enrichments. Results Many of these clusters of patients represent new genotype-phenotype associations, suggesting the identification of newly discovered phenotypically enriched loci (indicative of potential novel syndromes) that are currently absent from reference genomic disorder databases such as ClinVar, OMIM or DECIPHER itself. Conclusions We provide a high-resolution map of pathogenic phenotypes associated with their respective significant genomic regions and a new powerful tool for diagnosis of currently uncharacterized mutations leading to deleterious phenotypes and syndromes. Electronic supplementary material The online version of this article (doi:10.1186/s12864-016-2569-6) contains supplementary material, which is available to authorized users.


Background
Genomic Structural Variations are one of the main sources of human genome variation. Copy Number Variations (CNVs) naturally occur in the genome of healthy individuals [1,2], some of them leading to disease [3]. CNVs consist of thousands to millions of bp deletions, duplications, insertions or inversions, recurrent in the population either by inheritance or spontaneous occurrence (de novo) [4]. Although the discovery of CNVs was relatively recent, a plethora of genetic association studies have been carried out to understand their evolutionary [5], functional [6] and phenotypic effects [4]. It has been estimated that two genomes can differ approximately about 0.4 % due to CNVs [7] and that these variations have a considerable impact on human health. Several known chromosome imbalances causing complex genomic disorders have been characterized by different medical conditions such as developmental [8,9], neuropsychiatric [10][11][12], cancer [13], autoimmune diseases [14] and idiopathic learning disability [15]. However, recent genome wide association studies suggest that the lack of data for individual's medical records is an important limitation to fully understand the genetic basis for many genomic disorders [16,17]. Initiatives such as the Personal Genomes Project (PGP) [18], Genomics England (http://www.genomicsengland.co.uk/) and the Precision Medicine program [19] aim to provide descriptive records and associated genomic data accessible for research. These datasets, however, are still unavailable or pose different challenges when looking into genetic association studies: e.g., lack of sizable data (e.g., PGP) or too restrictive access (e.g., Genomics England). These shortcomings may encourage genetic association studies to oversimplify complex phenotypic profiles of individuals, focusing on the most representative clinical features [20]. This makes it more difficult to characterize pathophysiological associations between clinical features observed in studied individuals [20]. New systematic and standardized methods are thus required that make use of limited accessible clinical genotype and phenotype profiling datasets to enhance our understanding of the genetic impact of CNVs on human health [21]. The present work uses individual clinical and genetic information stored in the DE-CIPHER Database [21], a database of sub-microscopic chromosome abnormalities (deletions and duplications) observed in clinic with a potential pathogenic association. Data currently stored by DECIPHER add up to more than 45,000 patients (march 2015), of which more than 10,000 have given consent to share their medical data [22] under an ethically regulated data access protocol. We focus our study on a subset of these data of 9,186 unbalanced CNVs from 6,564 patients that included a heterogeneous set of pathophenotypes, including developmental delay, intellectual disability and congenital malformations. Network analyses has been used in previous studies to characterize affected pathways by CVNs in cancer [23]. Here we applied network medicine approaches, phenotypic enrichment analyses and genetic association studies to build patient networks to explore the similarities between reported genetic microvariations (CNVs) and pathological phenotypes. We represented patients as nodes connected with edges to other patients whose CNVs overlap. Our resulting networks allowed the systematic identification of genetically related clusters of patients by finding cliques [24,25]. A phenotypic enrichment analysis of patient clusters was performed to identify overrepresented phenotypes for each cluster. We named Phenotypically Enriched Locus (PEL) an affected genomic location showing significant associations with phenotypes. Significant genotype-phenotype associations were retrieved through the comparison of patients (cases) and healthy (controls) datasets, using a case-control association analysis. The combined use of these methods allowed us to build a high-resolution genotype-phenotype map that identifies a) already known, b) potentially novel genomic disorders and c) the additive phenotypic effects found in some proximal structural variations.

Case and control datasets Cases
Rare CNVs (frequency of <1 %) from patients with low prevalent genomic disorders were downloaded from DE-CIPHER database (08/05/2014; http://decipher.sanger.ac.uk/) through its Data Access Agreement. This dataset contains genotype-phenotype annotations of consented DECIPHER patients, including chromosome locations, type of structural variant (gain or loss), mode of inheritance (de novo, inherited from unaffected parent, inherited from affected parent and unknown) and clinical phenotypes observed by expert physicians. When available, patients in DECIPHER are assigned phenotypes from the Human Phenotype Ontology (HPO), a standard controlled vocabulary of pathological terms [26]. Patients not annotated with HPO phenotypes were removed from our study. To reduce heterogeneity among collected patient data from DECIPHER, we only selected CNVs originated from array CGH technology, which corresponds to the majority of the database's genotypic data. A final dataset of 6,564 patients with 9,186 CNVs presenting 1,860 non-redundant HPO terms was chosen for this study (Additional file 1: Table S1). Access to DE-CIPHER genomic coordinates of chromosomal microdeletions, microduplications and associated phenotypes were obtained through a Data Access Agreement. All data shared by the DECIPHER database have signed a consent form obtained by the submitting clinician. Those who carried out the original analysis and collection of the data bear no responsibility for the further analysis or interpretation of it by the Recipient or its Registered Users.

Controls
CNVs from healthy individuals were retrieved from the Database of Genomic Variants (DGV, http://dgv.tcag.ca/) [27], which provides a curated collection of human structural variations in control data from multiples studies. DGV offers information about CNVs of individual samples such as chromosome locations, type of structural variation (gain or loss) and reference (PubMed ID) of the study and the platform used in the analysis. The control structural variants dataset ("GRCh37_hg19_var-iants_2014-10-16.txt") was downloaded from DGV. This dataset combines CNV data from diverse studies. Using DGV as the control dataset has the caveat that it does not distinguish unrelated from related samples (i.e., the same patient CNV retrieved from different studies). Although in practice this overrepresentation of the same patient may seldom happen, it may still overestimate the number of so-called independent CNVs, affecting our final results. This overestimation of the frequency of CNVs in controls drove us to make a stricter assessment of the statistical significance of our predicted pathogenic CNVs. The types of effects this inflation of non-pathogenic CNVs may cause include an increase of the number of false negatives (i.e. true pathogenic CNVs that overlap with an overestimated number of control CNVs) and a reduction of the number of false positives (i.e. false pathogenic CNVs overlapping with an over-estimated number of control CNVs). Therefore, we have considered CNVs from DGV only as a quantitative control for preventing misclassifications of CNVs as pathogenic.
Building the genotype-based patient network We designed a workflow to systematically identify all the existing genotype-phenotype associations in the case dataset ( Fig. 1). First, the overlap between patient CNVs belonging to the same class (either gains or loses) was computed using the GRCh37/hg19 reference genome. For the purposes of this study, we assumed that two patient CNVs overlap if at least they share one common base pair. The resulting genetic relationships were used to build the network, where nodes are patients and edges represent the overlap between patient CNVs (Fig. 1).

Clustering of patients using cliques
Finding all the k-cliques associated with each patient provides all complete graphs from the resulting genotypebased patient network. These cliques correspond to sets of variable numbers of nodes where all are connected to all by edges [24]. To find all the cliques associated with each node from the patient network, we used the algorithm of the function "cliques_containing_node" available in the Python package named NetworkX. The minimum size of cliques was limited to three patients (k nodes ≥ 3) but no limitation was applied to maximum size clique detection. We then merged into one clique all those containing identical sets of patients with the aim of getting a unique list of cliques resulting from the patient network. This list of unique cliques is of high interest for our approach because it allows the systematic identification of the whole set of patients sharing similar genotypes by mining directly the clusters of the network. Taking into account that CNV lengths can be very variable across the case population, a large patient CNV can overlap with other patient CNVs at different genomic regions. These complex interactions in the patient network imply that some cliques might not necessarily represent a cluster of patients where all their CNV overlap. Thus, we selected only those cliques that were fully represented by patients with mutations on the same genomic region. The resulting cliques were used as the list of clusters of patients to be used for downstream analyses, i.e., phenotype enrichment analysis.

Phenotype enrichment analysis
The Human Phenotype Ontology (HPO) was used as a relational graph to identify common phenotypes among all the clique patients. The hierarchical organization of HPO terms (phenotypes) by parent-child relationships allows the detection of phenotype enrichments when their annotations co-occur at the same ontological level. We used this relational graph to detect the common phenotypes in a given cluster -or clique-of patients. To systematically assess the phenotype significance in each To carry out this test, we used the number of individuals per clique as the sample size, the number of patients in the samples presenting a phenotype as the observed cases, and the total number of patients in DECIPHER database presenting the phenotype as the population size. We selected clique-phenotype enrichment associations by applying three different thresholds: 1) P < 0.05 from hypergeometric test, 2) counting at least three patients annotated with the enriched phenotype, and 3) if at least 50 % of the patients in the clique are annotated with the enriched phenotype. Once this selection process ended, we found that many of these cliques were enriched with HPO terms that are closely related in the ontology (i.e. parent-child relationship), producing some redundancy that does not add information. In those cases, redundancies were removed by selecting the most significant (lowest P-values) HPO terms as the representative ones.

Characterizing phenotypically enriched loci (PELs)
We defined a phenotypically Enriched Locus (PEL) as the minimal common intersection among all the CNVs of patients in every clique that is significantly enriched with phenotypes ( Fig. 1). We studied PELs' incidence in patients (cases) by comparing them to a healthy population (control). Their statistical significance was assessed using a Fisher's exact test from a contingency table. This table consisted of a) the number of patients in a PEL associated with an enriched phenotype versus the total number of observed cases with that particular phenotype, and b) the number of healthy individuals -or samples from DGV dataset-with structural variants overlapping to this PEL versus the rest of observed controls (i.e., healthy population). We checked overlaps between PELs and individual control CNVs that overlapped at least 1 bp. After applying the Fisher's exact test, the P-values were adjusted using Benjamini & Hochberg and only those PEL sites with P < 0.05 were considered. This procedure allowed us to calculate the statistical significance of associations between enriched phenotypes (HPO-term) and a PEL compared to frequency of CNVs from the healthy population on the same locus. Finally, the penetrance of enriched phenotypes for each locus was calculated as the proportion of individuals showing the enriched phenotype -casesover the healthy population -control-, by using a similar approach to the one recently published by Cooper et al. [8,28].

Randomization analysis on case and control datasets
Five randomization analyses were designed to test different null hypotheses: (i) Arbitrarily selected CNVs from the control dataset without replacement and it was used to test if the frequency of detected PELs is lower than from a case population (DECIPHER) when using CNVs from a healthy population (DGV). This randomization analysis was named "random patient CNVs from DGV".
(ii) The second type of randomized case dataset was generated from arbitrary genomic regions while keeping the CNV length distribution and chromosome frequencies from the case dataset and it was named "random patient CNV location". This randomized dataset was used to test if the frequency of detected PELs is lower when individual case CNVs are randomly distributed across the genome compared to real patient CNVs from DECIPHER. (iii) A similar approach as mentioned above was used to generate the third type of randomized dataset but using the control dataset (DGV) instead of the case dataset. This randomization analysis, named "random control CNV location", was used to test if the frequency of PELs is lower when individual control CNVs are randomly distributed across the genome compared to real CNVs from DGV. (iv) The fourth type of randomization analysis was carried out by randomly shuffling the patient-CNVs relations (named as "rewiring patient-CNV") to test if the frequency of PELs is lower when using arbitrary phenotype-genotype relationships.
(v) Finally, randomized case datasets were built using arbitrary phenotype descriptions of patients while keeping the phenotype frequency, to ensure that the representativeness of phenotypes from the real data is preserved. This randomization analysis was used to test that the frequency of detected PELs is lower using arbitrary phenotype descriptions for patients. We carried out one thousand randomization experiments for each randomized dataset and counted the number of PELs as well as the significances derived from the phenotypic enrichment analysis (P-values < 0.05, hypergeometric test) and genetic association study (P-values < 0.05, Fisher's exact test).

Results and discussion
Phenotypic and genotypic features of patient population The subset of 6,564 patients from the DECIPHER database used in this study includes the CNVs and clinical features (i.e., HPO phenotypic terms) observed by expert physicians in these patients. Table 1 summarizes the data analyzed for case (patients) and control (healthy population) datasets. The distribution of different phenotypes (HPO terms) associated with patients ( Fig. 2a) showed that almost half of patients were annotated with just one HPO term, while the remaining cases showed more complex phenotypic profiles with two or more associated terms. The distributions of de novo and inherited patients were explored based on the complexity of their phenotypic profiles (Fig. 2b). It is observed that the de novo CNVs show a significant (P < 2.2E-16, Mann-Whitney U test) bias toward more complex -or diverse-phenotype profiles than the inherited group (Fig. 2b). The distribution of CNV lengths in patients is biased toward higher lengths as compared with those of control CNVs, something that should be expected if clinicians remove the nonpathological CNVs (Fig. 2c). Within the observed patient dataset, those including de novo CNVs showed the highest average length compared to the inherited set (Fig. 2d). These results suggest a positive relationship between CNV length and the complexity of annotated patient phenotypes. This is not a surprising observation, since larger CNVs affect more genes in the genome, producing an additive effect to observed clinical features.

Analysis of phenotypically enriched loci (PELs)
We built a patient network, consisting of 6,324 nodes (patients) connected by 89,526 interactions based on the genetic overlapping between patient CNVs, and we calculated some topological parameters ( Table 2). The resulting network showed low density, which means that the portion of potential interactions is low compared to the actual interactions in the network, and a high average clustering coefficient, which measures how nodes (patients) tend to cluster together. In addition, we also observed other properties such as a heterogeneous degree distribution, a small average shortest path length, and a high average clustering coefficient of network nodes, available in Additional file 2: Figure S1. These network properties suggest that the patient network appeared to show general features of most large realworld networks in contrast to random networks. From the patient network, we proceed to study PELs; i.e., significantly enriched genomic loci with phenotypes in patient clusters. We designed network-based and enrichment analyses to find genetically and phenotypically related clusters of patients (cliques; see Methods and Fig. 1). In total, 1,042 locus-phenotype associations between 487 PELs and 195 enriched phenotypes (HPO terms) were generated. We performed a genome-wide study of CNVs, using as control a dataset of healthy population, to evaluate the significance of genotype-phenotype associations in PELs. A Fisher's exact test (see Methods) related to previous works was applied [8]. However, our experiment defined genetic associations to exploit patient network relationships, evaluating each locus independently instead of using sliding windows as previous works. In addition, redundant and uninformative phenotypes were also removed according to their parent-child relationships (see Methods). Using this systematic approach, we reported 387 specific locus-phenotype associations between 336 PELs and 115 different phenotypes (HPO terms; Additional file 3: Table S2). Almost 70 % (336 of 487) PELs were To assess whether these loci are potentially pathogenic and that our results are not due to chance, we did several randomization analyses with the aim of comparing real and random results. Five different types of randomization analyses were designed using randomized case and control datasets to test if the frequency of detected PELs is lower than real cases (Fig. 3a): (i) we generated random datasets of mutations in patients from random sets of CNVs that were selected from the control dataset (DGV), we used random locations for (ii) patient CNVs and (iii) control CNVs by selecting random genomic regions while keeping CNV length distributions and chromosome frequencies, (iv) the rewiring of the patient-CNV relations, and, finally, (v) the rewiring of phenotype descriptions of patients conserving the phenotype frequencies (see Methods).
We found that the number of PELs identified by using the real data (336) was substantially higher compared to that resulted from the different randomization experiments (Fig. 3a). In addition, the significances (P-values < 0.05, Fisher's exact test) derived from the genetic association study are also higher in real than in randomized datasets (Fig. 3b). The small differences with respect the control dataset with random CNV locations suggest that there is a portion of CNVs in the control population (DGV) that are randomly distributed across the genome, something that might be expected in natural genetics populations (Fig. 3b). Overall these results reveal the existence of a fraction of PELs in DECIPHER that are consistently pathogenic, where both the number of resulting PELs and the median significance of Fisher's exact test are higher when using real data compared to random datasets ( Fig. 3a and b, respectively).
We then studied which annotations from diverse biomedical ontologies are associated with these loci using GREAT [29]. It was found that these regions are significantly enriched for human phenotypes (Fig. 3c), reinforcing the probable clinical implication of mutations affecting these genomic regions. In addition, we also found that these PELs are enriched for cis-regulatory domains involved in biosynthetic processes, regulatory elements and embryonic morphogenesis (Fig. 3d). The experimental and functional characterization of these genomic regions might improve our current understanding of the molecular basis of these genomic disorders.

Pathogenicity of PELs
With the aim to validate the resulting phenotypegenotype associations, we searched how many pathogenic PELs match with known genomic disorders in ClinVar [30]. For this we selected 2,243 pathogenic or likely pathogenic CNVs associated with any OMIM phenotype and other 75 genomic regions described as DECIPHER syndromes. We then studied if our method retrieves genomic syndromes from ClinVar or DE-CIPHER. First, we looked for those PELs overlapping known syndrome from both databases (Additional file 4: Table S3 and Additional file 5: Table S4 for ClinVar and DECIPHER respectively) and having the same type of mutation as the described for syndromes (i.e. deletions or duplications). The number of syndromes was determined and real results were compared versus random results ( Fig. 4a and b, for ClinVar and DECIPHER respectively). From the real datasets, we counted a total of 93 and 15 syndromes overlapping PELs from ClinVar and DECIPHER respectively. These numbers are higher than the ones obtained from the randomization experiments ( Fig. 4a and B), with the exception of those using control CNVs with random locations across the genome. The distributions of the randomizations were similar in ClinVar and DECIPHER but with considerable differences in the number of syndromes ( Fig. 4a and b). Although a higher number of known syndromes could be expected, it should be taken into account that DE-CIPHER includes several cohorts of patients with rare genomic disorders that have not been well characterized. This means that some cohorts of patients that have been already diagnosed for well-characterized syndromes have probably not been sent to the DECIPHER database. To study how the length of PELs could be affecting our approach, we compared their length distributions across the different subset of PELs (Fig. 4c). The average length of PELs overlapping known syndromes is slightly shorter than those classified as potential novel syndromes, and the length of raw CNV from DECIPHER are considerably longer (Fig. 4c). Subsequently, we compared the length of PELs and the number of patient CNVs and control CNVs overlapping these PELs ( Fig. 4d and e, for patients and controls, respectively). We observed that the frequency of patients overlapping a PEL is independent to their length (Fig. 4a). This effect could be also explained by the specific cohorts of patient CNVs that are collected in DECIPHER. However, it is observable that the frequency of controls that overlap PELs, despite being very low, increases with PEL length (Fig. 4b). This observation agrees with the random distribution of control CNVs across the genome. Overall, these results evidence that our approach is robust at finding phenotypically enriched loci (PELs) from a heterogeneous population of patients of different genomic disorders.
We also built a patient network from the genotype and phenotype data of individuals related to pathogenic PELs, revealing clusters of patients that correspond to cliques or sets of them. The resulting network represents a map of the most relevant genotype-phenotype associations that we found in the DECIPHER dataset (Fig. 5a). From ClinVar information, we identified patient CNVs with or without an overlap to known genomic disorders (grey and red nodes in Fig. 5a, respectively). A detailed exploration of these clusters of patients revealed that 164 (~50 %) of the pathogenic PELs (see previous section) overlapped pathogenic CNVs in ClinVar, indicating that PELs are potentially related to known genomic disorders (Table 3 and Additional file 5: Table S4). For instance, in Fig. 5b, the PEL associated with the 8p23.1 deletion coincides with the same genomic location as the genetic variants related to pulmonic stenosis (MIM 265500) in ClinVar. In this particular case, 15 out of 21 patients with deletions in this locus (Fig. 5b and PEL 22 from Table 3) were annotated with "Malformation of the heart and great vessels" (HP:0002564, P-value of the enrichment 8.3E-10), which is the primary cause of pulmonic stenosis. In addition, there was no healthy individual from the control dataset showing a deletion in this locus, suggesting a high penetrance of this phenotype associated to this locus ( Table 3).
Another example is retinoblastoma (HP:0009919, Pvalue of the enrichment 6.7E-16 and 3.7E-15 for PEL 1 and 2 respectively; Additional file 3: Table S2) where 6 out of the 7 cases from the patient dataset belong to the same PEL, consisting on deletions in 13q14.2 (chr13:48,544,437-50,206,474, see Fig. 5b). It has been documented that structural variations in this locus are associated with the 13q14 deletion syndrome in which the most representative clinical feature is retinoblastoma (MIM 180200) [31,32]. However, deletions in this locus   Table S2), suggesting a reduced penetrance for the retinoblastoma phenotype [33] where other factors might be influencing this medical condition. These results indicate that our method is able to identify and prioritize structural variants that are strongly associated with pathological phenotypes. In addition, several clusters of patients associated with pathogenic PELs that were found not to be apparently associated with known genomic syndromes but significantly enriched for highly specific clinical features such as ectrodactyly, malformations in the heart, defects in atrial septum, and anophthalmia (Table 4). More than 50 % (172 out of 336) of the pathogenic PELs do not overlap with any known genomic disorder in ClinVar so they can be candidates for novel syndromic loci. For instance, we detected a cluster of patients showing a severe medical condition that is known as split hand (HP:0001171) with duplications in 17p13.3 (Fig. 5b). The PEL associated with this cluster (PEL 52, P-value of 1.1E-13 for Fisher's exact test in Additional file 3: Table S2) shows a very high penetrance for this phenotype, but its patients display a broad spectrum of specific clinical outcomes that are associated with this medical condition. The phenotype "abnormality of the hand" (HP:0001155) was the most enriched HPO term (P-value of the enrichment 2.7E-07 for PEL 52 in Additional file 3: Table S2) associated with this PEL (Table 4). A priori this cluster of genetically and phenotypically related patients could be considered a novel genomic disorder. Indeed, after reviewing the available clinical literature we found evidence of syndromic presence in micro-duplications spanning this locus, related to a previous familiar study with a similar phenotype [34]. We distinguished seven broad domains of phenotypic abnormalities through the examination of the phenotypic relationships between patients from PELs (Additional file 3: Table S2): abnormality of the ocular region, abnormality of the limb bone morphology, abnormality of the skull, abnormality of the face, abnormality of the cerebrum, abnormality of the cardiovascular system  [8,28] and growth delay. Our results show that this approach provides a new tool for the characterization and the study of phenotype-genotype relationships in a systematic genome-wide manner. For instance, it is possible to characterize the pleotropic effects of pathogenic CNVs or to study mutations on different mutated genomic regions that are related to similar phenotypes.

Additive phenotypic effects of pathogenic CNVs
We observed that the length of CNVs is correlated to complex phenotypic profiles of DECIPHER patients, as shown in Fig. 2a. This complexity is here defined as the number of distinct clinical features that have been observed by a physician in a patient. Thus, it was explored if the length of significant PELs is associated with complex pathogenicity or adds more phenotypes according to the number of different genomic regions that are affected. To illustrate this effect, we analyzed the phenotypic relationships between significant PELs that are in close genomic regions. For instance, deletions in 10q25.13 (PEL 149) and 10q26.13 (PEL 239) are related to different phenotypes such as abnormalities of the cardiovascular system and the genitourinary system respectively (Fig. 6a). Most cases with deletions in Fig. 6 Illustrative examples of additive phenotypic effects of PELs 10q25.13 (5 of 7 cases) are associated with malformations of the heart and great vessels, denoting a very specific clinical feature. In addition, cases with deletions in 10q26.13 are related to defects in the genitourinary system (PEL 239 in Fig. 6a). The patient B14 (Fig. 6a) shows both phenotypes and has a deletion that overlaps both genomic loci (PEL 149 and PEL 239, Fig. 6a). This example illustrates an additive effect, accumulating specific clinical features according to the extension of structural variants with respect to the genome of reference. This effect is also noticeable for more complex genetic relationships among loci of patient CNVs associated with significant PELs as those represented in Fig. 6b. In this case, three different clusters (cliques) of highly interconnected patients were detected, indicating that some individuals are included in more than one cluster or PEL. These different PELs were found to be associated with abnormalities of the ocular region, aplasia/hypoplasia of the cerebrum and abnormalities of the skull (PEL 254, 211 and 462, respectively, Fig. 6b). All patients overlapping these regions from significant PELs show the phenotype if they have the structural variation, except for patient S15 who apparently does not have signs of hypoplasia of the cerebrum. Different PELs associated with the same phenotype (HPO terms) were found located in contiguous or even the same genomic region. In some other cases, distinct PELs were essentially the same clusters of patients except with variations in one or two individuals (they should be considered one PEL). Thus, despite the precise identification of genomic coordinates of individual CNVs being a technological limitation, the wide adoption of next generation sequencing methods by clinical studies may solve the current shortcomings in the array-based CNV data used for this analysis.

Conclusions
This work presents a combined analysis of networkbased approaches, phenotype enrichment and genetic association studies for patient CNVs in the DECIPHER database. A set of methods was developed to identify clusters of patients that are genetically and phenotypically related. The newly developed methods used here have potential usefulness for a wide range of applications, such as prediction of unknown syndromes, characterization of candidate pathogenic structural variants and the identification likely associated phenotypes with a specific locus. This procedure could be improved using more specific clinical features of the patients, so physicians should be encouraged to submit detailed phenotype data. This work evidences the need for advancement in consolidated standards and public repositories for genomic and medical records in genomic and personalized medicine.