Genes associated with macrocephaly have been claimed to play a role in shaping the modern human head [4, 5]. In humans, several NSD1 variants (amino acid substitutions, nonsense and frameshift mutations) cause Sotos syndrome, a disorder of overgrowth and macrocephaly. First, we located these variations in exons 19 (SET and Post-SET domains), 22 and 20 [6, 7] (Fig. 1). In other primates, these exons show neither nonsynonymous variations nor disorders like Sotos syndrome, and we suggest that they were under selection in modern humans. Second, we investigated NSD1 selection by comparing human NSD1 with that of close and distant primates, measuring percentage of similarity, Ka/Ks ratios, nucleotide diversity and signatures of episodic selection. We found no Ka/Ks ratios greater than 1 in our within-species comparisons, but we observed high nucleotide diversity and detected sites under episodic selection in exons 5, 23, 10 and 6, and in the human, galago, tarsier, macaque/colobus, macaque/colobus/tarsier, macaque/colobus/tarsier/galago and tarsier/lemur/colobus branches. Finally, we categorized the sequence and structure conservation of NSD1 domains among distant primates into three groups and identified a progressive development in PWWP1.
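As an illustration of the nucleotide diversity measure mentioned above, the following minimal sketch computes the unweighted average proportion of pairwise differences across a set of aligned sequences. It is not the pipeline used in this study. The function name `nucleotide_diversity` and the rule of skipping gapped or ambiguous columns are assumptions made for the example.

```python
from itertools import combinations


def nucleotide_diversity(aligned_seqs):
    """Average pairwise proportion of differing sites across aligned sequences.

    Illustrative only: columns containing a gap or an ambiguous base in
    either sequence of a pair are skipped for that pair (an assumption).
    """
    if len(aligned_seqs) < 2:
        raise ValueError("need at least two sequences")
    length = len(aligned_seqs[0])
    if any(len(seq) != length for seq in aligned_seqs):
        raise ValueError("sequences must be aligned to the same length")

    total = 0.0
    pairs = 0
    for seq_a, seq_b in combinations(aligned_seqs, 2):
        # Keep only columns where both sequences carry an unambiguous base.
        compared = [
            (a, b)
            for a, b in zip(seq_a.upper(), seq_b.upper())
            if a in "ACGT" and b in "ACGT"
        ]
        if not compared:
            continue
        differences = sum(1 for a, b in compared if a != b)
        total += differences / len(compared)
        pairs += 1

    if pairs == 0:
        raise ValueError("no comparable sites between any pair of sequences")
    return total / pairs
```

Applied per exon, a value computed this way can be compared across exons of different lengths because each pairwise term is already a proportion of compared sites.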
NSD1 mutations in human and functional domains
NSD1 has 23 exons and three isoforms with 16 domains, and encodes a histone methyltransferase that deposits H3K36me2 (Fig. 2). H3K36me2 recruits DNMT3A, allowing the maintenance of DNA methylation in intergenic regions by colocalizing at noncoding regions of euchromatin. The NSD1 gene is expressed in the brain, kidney, skeletal muscle, thymus, and peripheral blood leukocytes. Ninety percent of Sotos syndrome cases are caused by NSD1 mutations. The most common mutation types are missense, nonsense and frameshift deletions, which result in overgrowth with distinctive facial characteristics. In humans, most variations are located in exons 10 to 22, which encode the NLS3, PHD2, PHD3, PHD4, PWWP2, AWS, SET, and Post-SET domains (Table 1, Fig. 1, Fig. 2). The PHD and SET domains function in histone methylation and as ubiquitin E3 ligases. NLS domains play essential roles in nuclear translocation, and the PWWP region is involved in DNA methylation, DNA repair and regulation of transcription [10]. Similarly to NSD1, ASPM (abnormal spindle-like, microcephaly-associated; MCPH5) has undergone positive selection in selected exons throughout the primate lineage leading to humans and is currently associated with microcephaly [18]. Microcephalin (MCPH1) shows a strong signature of positive selection in specific exons, primarily in the lineage leading from the ancestral primates to the great apes [19]. Both CDK5RAP2 (CDK5 regulatory-subunit-associated protein 2; MCPH3) and CENPJ (centromeric protein J; MCPH6) show higher rates of non-synonymous substitutions in primates than in rodents, and CDK5RAP2 shows especially high rates in the human and chimpanzee terminal lineages [20].
Methyltransferases vary in evolution
One function of NSD1 is binding upstream of the bone morphogenetic protein 4 promoter, which increases H3K36 methylation and promotes bone morphogenetic protein 4 transcription [21]. NSD1 is a methyltransferase that regulates the epigenome during development, and we suggest it was implicated in the development of the human brain. We found high similarity between primates, which we attribute to the conserved histone methyltransferase activity, and we consider that variations alter functionality (Table 1). The high nucleotide diversity between primate exons is comparable with that of diversifying genes like polyubiquitin. Exons 10 (NLS3), 5, 9, 11 and 23 had more non-synonymous than synonymous substitutions between species. Mutations in exons 10 and 5 are detected in Sotos syndrome; however, few have been reported in exons 11, 9 or 23 (Fig. 2). We observed more Sotos-associated variants in exon 22 than in exon 23. Exon 23 encodes PHD5 and PHD6, which might be necessary for the proper function of NSD1, and the loss of this exon due to variants in exon 22 is pathogenic. Variants within exon 23 could retain the functionality of PHD5 and PHD6.
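The comparison of non-synonymous and synonymous differences between species can be sketched as follows. This is a simplified illustration, not the method used in this study. It only classifies each differing codon in a pairwise, in-frame alignment by whether the encoded amino acid changes. A proper Ka/Ks estimate additionally normalizes by the number of synonymous and non-synonymous sites and corrects for multiple substitutions (for example, the Nei-Gojobori method). The names `CODON_TABLE` and `count_codon_differences` are assumptions for the example.

```python
# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    first + second + third: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, first in enumerate(BASES)
    for j, second in enumerate(BASES)
    for k, third in enumerate(BASES)
}


def count_codon_differences(seq_a, seq_b):
    """Return (synonymous, non_synonymous) counts of differing codons.

    Both inputs must be in-frame coding sequences aligned codon by codon.
    Codons containing gaps or ambiguous bases are skipped (an assumption).
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same aligned length")
    if len(seq_a) % 3 != 0:
        raise ValueError("aligned length must be a multiple of three")

    synonymous = 0
    non_synonymous = 0
    for pos in range(0, len(seq_a), 3):
        codon_a = seq_a[pos:pos + 3].upper()
        codon_b = seq_b[pos:pos + 3].upper()
        if codon_a == codon_b:
            continue
        if codon_a not in CODON_TABLE or codon_b not in CODON_TABLE:
            continue  # gap or ambiguous base in this codon
        if CODON_TABLE[codon_a] == CODON_TABLE[codon_b]:
            synonymous += 1
        else:
            non_synonymous += 1
    return synonymous, non_synonymous
```

Counting whole codons rather than individual substitutions keeps the sketch short, at the cost of treating a codon with two changes as a single difference.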
NSD1 is a member of the NSD family of SET domain-containing histone methyltransferases (NSD1, NSD2 and NSD3). The NSD family members have specific mono- and dimethylase activities for H3K36 and carry out nonredundant roles during development, and their aberrant expression is associated with multiple diseases [22]. During mouse development, NSD1 is expressed in the telencephalic region of the brain and in the spinal cord [23]. After birth, NSD1 expression is predominantly neuronal within the cerebral cortex, with a smaller proportion in astrocytes and oligodendrocytes [24]. In humans, NSD1 variations are associated with overgrowth syndromes, with macrocephaly, and with the evolution of modern human brains and skull shape. NSD2 haploinsufficiency is associated with Wolf-Hirschhorn syndrome, characterized by heart defects and severe mental and growth retardation [25]. SETD1A, a methyltransferase like NSD1, indirectly regulates neurogenesis through WNT/β-CATENIN signaling and carries variations that are limited to modern humans and absent in Neanderthals and Denisovans [5]. Mixed lineage leukemia protein-1 (MLL1), a member of the SET1 family of H3K4 methyltransferases, is highly conserved from yeast to humans. GLI3 and NFIX have been associated with evolution in the human lineage and are currently associated with disease, and our project is the first to describe NSD1 in this context. Hypermethylation of NFIX in anatomically modern humans influenced the balance between lower and upper projection of the face compared to other species [26]. In humans, NFIX mutations are associated with impaired speech capabilities and with Marshall–Smith and Malan syndromes [26]. GLI3 regulators were found to show signatures of positive selection, are unique to the modern human lineage and possibly contributed to the evolution of human brain development. Selection may act to increase the frequency of de novo beneficial mutations [27]. The phenotypic spectrum of GLI3 mutations includes autosomal dominant Greig cephalopolysyndactyly syndrome and Pallister–Hall syndrome [28].
Nucleotide variations in this complex have been reported to alter its function and are correlated with different evolutionary lineages [29]. We suggest that chromatin modifiers like NSD were under relaxed selection towards brain development during modern human evolution; today, however, that relaxed selection is associated with brain growth and facial disorders.
NSD1 family related genes and other genes are associated with evolution and disease in modern humans
Episodic selection is a process in which codons experience purifying selection with bursts of strong positive selection within certain lineages. The specific codons experience positive selection, followed by purifying selection that maintains the variant, and this pattern plays a role in adaptive evolution [30]. We identified sites under episodic selection in exon 5 (NLS1 and NLS2 domains) and exon 23 (PHD5 and PHD6 domains) that could modify the function, as seen in other SET1 proteins (Fig. 3, Table 1). We detected episodic selection in the human and macaque branches, which supports the evidence of selection in humans and in no archaic hominid. Modern human brains and skull shape differ from those of other hominids as a result of nucleotide variations in regulatory regions active during early cortical development [4, 5]. CASC5, required for kinetochore-microtubule attachment, is associated with higher gray matter volume [31]. Higher expression of PTEN elevates Beta-Catenin signaling, which controls correct neuron positioning, dendritic development, and synapse formation [32]. The transcription factor TCF3 represses Wnt-Beta-Catenin signaling and neuronal differentiation, increasing the neural stem cell population during neocortical development [33]. The NFIX and NSD1 genes are associated with macrocephaly and Sotos syndrome and were described as important in the shaping of the modern human head [34, 35]. This variation also had side effects, such as neurodevelopmental disorders affecting brain growth and facial features [4]. Several genes enriched in modern humans are disease-relevant, such as CHD8 and CPEB4 in autism spectrum disorder, HTT in Huntington’s disease and FOXP2 in language impairment [5]. Likewise, NSD1 is associated with Sotos and Weaver syndromes, but only through specific exons. We did not infer the divergence time of NSD1 variation in primates; however, there is evidence of episodic selection during human brain evolution. Humans have heavier brains than other primates (1400 g vs 395–490 g) [36].
Human brain mass increased during the divergence of Australopithecines [37]. The neocortex enlarged in the archaic hominin lineage after the divergence from chimpanzees (6–7 million years ago). The cranial lobe size differs between anatomically modern humans and Neanderthals, which indicates that unique neocortical regions evolved in humans [38]. We obtained the frequency of NSD1 variants in human populations from the Genome Aggregation Database (gnomAD) to identify whether a specific group within humans is subject to selection. Nine variants had a frequency higher than 0.05 (5 synonymous, 2 intronic and 2 missense); all were classified as benign and none was specific to a population (Supplementary Table 7).
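The frequency filter described above can be sketched in a few lines. This is a hedged illustration, not the query that was run against gnomAD. The `Variant` record, its field names and the two helper functions are hypothetical, and the caller must supply the variant records, for example from a table exported by hand.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    """Hypothetical minimal record for one variant (field names are assumed)."""
    variant_id: str
    consequence: str          # e.g. "synonymous", "missense", "intron"
    allele_frequency: float   # overall allele frequency, between 0 and 1


def common_variants(variants, threshold=0.05):
    """Keep variants whose allele frequency is strictly above the threshold."""
    return [v for v in variants if v.allele_frequency > threshold]


def tally_by_consequence(variants):
    """Count variants per consequence class."""
    counts = {}
    for variant in variants:
        counts[variant.consequence] = counts.get(variant.consequence, 0) + 1
    return counts
```

Running `tally_by_consequence(common_variants(records))` on the full set of NSD1 records would give the kind of breakdown by consequence class reported in the text. Repeating the filter per population would show whether any variant is common in only one group.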
Primate protein structure analysis prediction and evolutionary relations
The whole-protein dendrogram showed that the tarsier C. syrichta and the mouse are closely related; however, the rest of the tree did not follow the conventional primate phylogeny, so we divided the sequence into superdomains (Fig. 3). Most of the species have a similar SD1 and SD2 structure, except for O. garnetti and C. syrichta (Fig. 5, Table 2). SD1 and SD2 phylogenetic analysis confirmed that humans diverged from the rest of the non-hominoid primates. SD1 analysis placed Homo sapiens as an outgroup to the other primates, suggesting a uniqueness in this species. SD1 consists of the PWWP domain, named after its central Pro-Trp-Trp-Pro core, which functions as a transcription factor [39, 40]. SD2 contains the PWWP2, PHD 1–4, AWS, SET and Post-SET domains. We found that the functional domains inside the structure (PHD 1–4, PWWP2, AWS, and INHLOOP domains), which mainly act in nuclear signaling, DNA binding and interactions with other proteins, are highly conserved. The plant homeodomain (PHD) is a zinc finger motif found in nuclear proteins involved in epigenetics and chromatin-regulated transcription. PHD functions as a protein-protein interaction domain and cooperates with BROMO domains for nucleosome binding in vivo [41, 42, 43]. The inhibitory loop (INHLOOP) is an amino acid sequence found between SET and Post-SET that normally inhibits NSD1 function and is associated with abnormal expression of genes in cancer [44, 45]. The interaction between the AWS and SET domains regulates gene expression by methylating lysine residues in proteins such as histones [46]. SET and Post-SET domain sequences vary outside the functional amino acids in orangutan and tarsier. The SET structure includes turns and loops, and Post-SET is a cysteine-rich sequence necessary for SET domain catalysis [47, 48, 49]. Other regions not included in the superdomains (NLS1, NLS2, NLS3) were highly conserved (Fig. 5, Table 2). NLS functions as a nuclear signal [10].
We found that exon 10 has the highest number of non-synonymous variants and the highest number of nonsense variants in humans. Exons 9 and 11 do not encode a functional domain; variation is therefore tolerated in primates and is not as pathogenic in humans as other mutations. Episodic selection occurs preferentially in exons 5 (NLS1 and NLS2) and 23 (PHD5 and PHD6), which are the longest exons. Most variants in humans are located in exons 5 and 23; however, after normalizing the number of variations to the length of the exon, the most common variants were in exons 19, 22, 20 and 18 (AWS, SET and Post-SET). One scenario could be that in other primates the function of the NLS, a nuclear domain, is allowed to differ more than that of the SET domains. In contrast, human SET domains vary, resulting in brain change. Another scenario could be that NLS variants in humans are damaging and affected individuals die in utero, so these variants are underrepresented.
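The length normalization described above amounts to dividing each exon's variant count by its length before ranking. The sketch below shows that step under stated assumptions: the function names `variants_per_kb` and `rank_exons` are hypothetical, the rate is expressed per 1000 bp for readability, and the real counts and exon lengths must be supplied by the caller.

```python
def variants_per_kb(variant_counts, exon_lengths_bp):
    """Return variants per 1000 bp for each exon.

    variant_counts: mapping of exon label -> number of variants
    exon_lengths_bp: mapping of exon label -> exon length in base pairs
    """
    rates = {}
    for exon, count in variant_counts.items():
        length = exon_lengths_bp[exon]
        if length <= 0:
            raise ValueError(f"exon {exon!r} has a non-positive length")
        rates[exon] = 1000 * count / length
    return rates


def rank_exons(rates):
    """Exon labels ordered from the highest to the lowest variant density."""
    return sorted(rates, key=rates.get, reverse=True)
```

Ranking by density rather than by raw count is what lets a short exon with few variants outrank a long exon with many, which is the pattern the text reports for the SET-region exons relative to exons 5 and 23.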
We analyzed the differences in structure and amino acid sequence and defined three groups. The first group, with highly conserved sequence and structure, comprises the PHD4, PWWP2, AWS, and INHLOOP domains (Table 2, Fig. 5). The second group has a conserved structure but differs in amino acid sequence and includes the NLS1, NLS2, NLS3, PHD1, PHD2, PHD3, SET and Post-SET domains (Table 2, Fig. 5). The third group includes the PWWP1 and PHD5 domains, which are the least conserved in sequence and structure, especially in the prosimian species O. garnetti and C. syrichta (Table 2, Fig. 5). We suggest that amino acids change through exposure to biochemical niches that differ between species depending on their gene expression, cellular activity, and quantum dynamics. The developmental progression of PWWP1 from O. garnetti and C. syrichta to H. sapiens, and the lack of PHD5 in O. garnetti, suggest a novel function in the upper clades of primates. The conserved regions must remain unchanged because of their important functions, such as catalytic performance and the molecular signaling involved in chromatin remodeling, signaling, and protein interaction [22, 50]. The PWWP1 domain is found in transcription factors and in proteins involved in nuclear regulation. This domain is involved in protein-protein interaction and DNA binding/recognition, controls the function of NSD1 and plays a central role in cellular growth and differentiation of neural crest cells. Modifications in PWWP alter the regulation and specialization of various nuclear processes. The progressive changes in this domain describe the molecular specialization of the chromatin regulation process during primate evolution [39, 40].