Phylogeny-wide conservation and change in developmental expression, cell-type specificity and functional domains of the transcriptional regulators of social amoebas

Background Dictyostelid social amoebas self-organize into fruiting bodies, consisting of spores and up to four supporting cell types in the phenotypically most complex taxon group 4. High quality genomes and stage- and cell-type specific transcriptomes are available for representative species of each of the four taxon groups. To understand how evolution of gene regulation in Dictyostelia contributed to evolution of phenotypic complexity, we analysed conservation and change in abundance, functional domain architecture and developmental regulation of their transcription factors (TFs).
Results We detected 440 sequence-specific TFs across 33 families, of which 68% were upregulated in multicellular development and about half were conserved throughout Dictyostelia. Prespore cells expressed two times more TFs than prestalk cells, but stalk cells expressed more TFs than spores, suggesting that gene expression events that define spores occur earlier than those that define stalk cells. Changes in TF developmental expression, but not in TF abundance or functional domains, occurred more frequently between group 4 and groups 1–3 than between the more distant branches formed by groups 1 + 2 and 3 + 4.
Conclusions Phenotypic innovation is correlated with changes in TF regulation, rather than functional domain- or TF acquisition. The function of only 34 TFs is known. Of 12 TFs essential for cell differentiation, 9 are expressed in the cell type for which they are required. The information acquired here on conserved cell type specificity of 120 additional TFs can effectively guide further functional analysis, while observed evolutionary change in TF developmental expression may highlight how genotypic change caused phenotypic innovation.


AATF and ARID/BRIGHT transcription factors
Apoptosis Antagonising Transcription Factor (AATF) genes were retrieved from Dictyostelid genomes with the Interpro IPR012617 identifier and by BlastP search with Ddis AATF. ARID/BRIGHT genes were similarly retrieved using the Interpro identifier IPR001606 and by BlastP search. The sequences corresponding to functional domains were aligned using Clustal Omega with five iterations (Sievers and Higgins 2014) and a phylogeny was constructed by Bayesian analysis (Ronquist and Huelsenbeck 2003) and decorated with the functional domain architecture of the proteins using SMART (Schultz et al. 1998). Gene IDs and locus tags are colour coded to reflect the taxon group of the host species, as indicated in the Amoebozoa phylogeny. Ddis sequences show names for known genes, which are framed in red when the biological function of the gene in Ddis is known. Clades of orthologous genes or other groupings are annotated with relative transcript levels at specific developmental stages or in specific cell types, shown as heat maps that represent fraction of maximum transcript number for the developmental profiles and fraction of the summed number for the cell types. The normalized transcript numbers were retrieved from published RNA sequencing experiments (Gloeckner et al. 2016; Kin et al. 2018; Parikh et al. 2010b) and from unpublished experiments for Dlac spore, stalk and vegetative cells, and are all listed in Supplemental_Table_S1.xlsx. Washed-out heat maps have maximum read counts equal to or below 10.
AATF, also known as Che-1, is a transcription factor that directly interacts with subunit 11 of human RNA polymerase II and represses the suppression of growth by the Rb retinoblastoma protein (Fanciulli et al. 2000). A single ortholog was identified in each of the tested Dictyostelid genomes and the gene is most highly expressed during growth.
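The heat-map scaling described in the legend above can be illustrated with a minimal Python sketch. The function names, the input layout and the example read counts are hypothetical and are not taken from Supplemental_Table_S1.xlsx; applying the read count threshold of 10 to both kinds of profile in the same way is likewise an assumption.

```python
from typing import Dict, List, Tuple

# Legend: heat maps are washed out when the maximum read count is <= 10.
WASHED_OUT_MAX_READS = 10


def developmental_profile(counts: List[float]) -> Tuple[List[float], bool]:
    """Scale a developmental time course to fraction of its maximum.

    Returns the scaled values and a flag marking a washed-out heat map.
    """
    peak = max(counts)
    washed_out = peak <= WASHED_OUT_MAX_READS
    if peak == 0:
        # No reads at any stage: avoid division by zero.
        return [0.0] * len(counts), washed_out
    return [c / peak for c in counts], washed_out


def cell_type_profile(counts: Dict[str, float]) -> Tuple[Dict[str, float], bool]:
    """Scale cell-type read counts to fraction of their sum.

    Returns the scaled values and a flag marking a washed-out heat map.
    """
    total = sum(counts.values())
    washed_out = max(counts.values()) <= WASHED_OUT_MAX_READS
    if total == 0:
        return {name: 0.0 for name in counts}, washed_out
    return {name: value / total for name, value in counts.items()}, washed_out


if __name__ == "__main__":
    # Hypothetical normalized read counts, for illustration only.
    print(developmental_profile([5, 20, 10]))
    print(cell_type_profile({"prespore": 30, "prestalk": 10}))
```

Under this scaling, developmental profiles peak at 1 for every gene, whereas cell-type fractions sum to 1, so the two kinds of heat map are not directly comparable in absolute terms.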
The ARID (AT-rich interaction domain) is a DNA-binding domain, also known as BRIGHT from the mouse B-cell regulator of IgH transcription (Herrscher et al. 1995). The ARID domain consists of six α-helices and a β-hairpin (Iwahara et al. 2002), and proteins that contain the ARID domain regulate gene expression throughout animal development and participate in chromatin remodelling (Wilsker et al. 2005). Ddis has two proteins with the ARID domain: one, RbbB, with an additional JmjC transcription factor domain and the other with a SAP30-Sin3 binding domain. The SAP30-Sin3 complex is associated with histone deacetylases and acts as a corepressor, capable of gene silencing (Grzenda et al. 2009). Both Ddis proteins are conserved throughout Dictyostelia and, except for Ddis, the tested species also have a third ARID domain containing protein.
Sequences with AT-hook domains were retrieved from Dictyostelid genomes with the Interpro identifier (IPR017956). The sequences corresponding to the AT-hook domains and 20 AA of flanking sequence were aligned and a phylogeny was constructed using RAxML. Many tree nodes are unresolved because, due to its small size (11 or 12 AA), the AT-hook domain contains insufficient phylogenetic signal. For orthologous sets that appeared incomplete, missing proteins were retrieved by BlastP search using one of the orthologs as query. The hits often lacked an AT-hook domain recognized by SMART, although most of the underlying sequence was usually present. Minitrees were prepared for the full sets of orthologs, which were grafted onto the main tree to replace the incomplete sets. The final tree was annotated with protein domain architectures and transcription profiles as described in the legend to figure S1.
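Extracting a short domain together with flanking sequence, as described above for the AT-hook alignment, can be sketched as follows. This is an illustration only: the function name and the 1-based inclusive coordinate convention are assumptions, and the sketch takes "20 AA of flanking sequence" to mean up to 20 residues on each side of the domain, which the legend does not state explicitly.

```python
def domain_with_flanks(sequence: str, start: int, end: int, flank: int = 20) -> str:
    """Return a domain plus up to `flank` residues on either side.

    `start` and `end` are 1-based, inclusive domain coordinates (assumed
    convention). Flanks are truncated at the protein termini.
    """
    if not (1 <= start <= end <= len(sequence)):
        raise ValueError("domain coordinates fall outside the sequence")
    left = max(0, start - 1 - flank)          # 0-based slice start
    right = min(len(sequence), end + flank)   # 0-based slice end (exclusive)
    return sequence[left:right]


if __name__ == "__main__":
    # Hypothetical 60-residue protein with a 12-residue domain at 25-36.
    protein = "ACDEFGHIKLMNPQRSTVWY" * 3
    segment = domain_with_flanks(protein, 25, 36)
    print(len(segment), segment)
```

Including the flanks adds alignable positions around an 11 or 12 AA motif, although, as noted above, the phylogenetic signal remains limited.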
The AT-hook is a small DNA-binding motif that preferentially binds to the minor groove of A/T rich DNA. AT-hooks are present in mammalian HMGI/Y nuclear proteins that participate in inducible transcription and other interactions with chromatin (Reeves and Beckerbauer 2001), in plant DNA binding proteins (Nieto-Sotelo et al. 1994) and in hBRG1, a component of the SWI/SNF chromatin remodeling complex (Singh et al. 2006). Dictyostelia have a total of about 56 AT-hook proteins, among which are several chromatin remodelling proteins, such as hdaC, snf2A, swi3 and chdB. In addition, cbfA, a jmjC type TF, contains AT-hook homologous sequence at its C-terminus, which is identical to the GRP domain that was previously shown to mediate cbfA binding to A/T rich regions in the C-module of DRE retrotransposons (Horn et al. 1999). No biological roles are known for the remaining AT-hook proteins.

Figure S3. Basic Leucine Zipper Transcription Factors (bZIPs)
Sequences with BRLZ domains were retrieved from Dictyostelid genomes with the Interpro BRLZ domain identifier (IPR004827) and by BlastP and tblastn queries with the sequences of the 19 Ddis bZIP factor proteins. The sequences corresponding to the BRLZ domains were aligned and a phylogeny was constructed and annotated with protein domain architectures and transcription profiles as described in the legend to figure S1.
The eukaryotic basic-leucine zipper (bZIP or BRLZ) transcription factors are able to form both homo- and heterodimers (Hai and Curran 1991). bZIPs contain a basic region mediating sequence-specific DNA-binding followed by a leucine zipper region required for dimerisation. The dimerisation specificity of the leucine zipper allows for combinatorial interactions that can alter DNA binding and thus transcriptional regulation. Of the 19 bZIPs in Ddis, six have been functionally analysed. DimA and DimB function in DIF-1 signal transduction (Huang et al. 2006), BzpN regulates cell-density regulated proliferation (Phillips et al. 2011) and BzpF regulates spore maturation and stability. BzpF is also implicated as a cAMP response element-binding protein (CREB) by transcriptional network analysis, and binds to the canonical cAMP response element in vitro (Parikh et al. 2010a). The bzpG and bzpR genes have also been disrupted, but the mutants did not show any observable phenotype (Parikh et al. 2010a).
We find orthologs in Dfas, Ppal, Dlac and Dpur for 13 of the Ddis bZIPs. For all but three, bzpC, bzpN and bzpP, orthologs were present in all five Dictyostelid genomes. Several bZIPs (bzpM, bzpK, bzpI, bzpO, bzpT and bzpD) are only present in Ddis, while Dfas has an additional bzpl paralog and Ppal two additional dimA paralogs.
Sequences containing the C2H2_ZnF domain were identified from Dictyostelid genomes by the IPR013087 Interpro identifier and by BLAST queries with C2H2 domain sequences. The sequences corresponding to the 23 AA C2H2 domain were aligned and a pilot tree was constructed using RAxML, which was used to subdivide sequences into 3 sets of related proteins. Individual trees were inferred from these sets using RAxML or MrBayes. For orthologous sets that appeared incomplete, missing proteins were retrieved by BlastP search using one of the orthologs as query. Minitrees were prepared for the full sets of orthologs, which were grafted onto the main tree to replace the incomplete sets. The final tree was annotated with protein domain architectures and transcription profiles as described in the legend to figure S1.
Query with the C2HC5_ZnF Interpro identifier IPR009349, the CXC Interpro identifier IPR033467 and Blast searches revealed only a single well conserved gene for each in Dictyostelia.
The C2H2 (Cys2His2) zinc finger was first identified in TFIIIA (Miller et al. 1985) and is common to eukaryote transcription factors. However, C2H2 Zn finger proteins can also bind to RNA or to proteins, either alone, or in addition to DNA binding. The C2H2 Zn finger consists of two β-strands and an α-helix, which is stabilised by the binding of a zinc ion, with two conserved cysteine residues at one end of the β sheet and with two conserved histidine residues at the α-helix C-terminus. It associates with the major groove of DNA and commonly acts in tandem repeats in sequence-specific DNA binding (Iuchi 2001).
Dictyostelia contain 103 different genes with C2H2 domains and most of them are conserved in all 4 taxon groups. Only two genes, tacA (set 2) and gxcT (set 1), were functionally analysed in Ddis. TacA is a transcription factor that translocates to the nucleus in a Ca2+ and calcineurin dependent manner. TacA silencing results in delayed development and formation of smaller fruiting bodies with only partially ascended spore heads (Thewes et al. 2012). The nucleotide exchange factor, gxcT, is involved in stabilizing PIP3 production in response to chemoattractants and hence in controlling stable spatial sensing during chemotaxis (Wang et al. 2013). This gene is clearly not a transcription factor, highlighting that some of the Ddis C2H2 proteins are likely not directly involved in gene regulation.
The C2HC5 zinc finger was previously detected in the thyroid receptor interacting protein 4, which becomes a strong transcriptional activator when fused to the yeast LexA repressor (Lee et al. 1995). Ddis has only a single well conserved C2HC5 containing protein, which has not been functionally analysed.

Figure S5. CBF/NF-Y/Archaeal Histone transcription factors
Sequences with NF-YA CCAAT and NF-YB/NF-YC CCAAT binding domains were identified with Interpro identifiers IPR001289 and IPR003958, respectively. Genomes were queried further by BlastP and tblastn with the retrieved sequences. Annotated trees were constructed as described in Figure S1.
The CCAAT-binding factor (CBF)/Nuclear transcription factor Y (NF-Y)/archaeal histone factor comprises three subunits, NF-YA, NF-YB, and NF-YC, which specifically bind to CCAAT sequences in promoter regions. The NF-YB and NF-YC subunits dimerize through their histone-fold motifs, and associate with NF-YA to form the trimeric complex (Mantovani 1999). YBL1 and YCL1 share sequence similarity to NF-YB and NF-YC, respectively, but the YBL1-YCL1 dimer does not bind to NF-YA or interact with CCAAT sequence (Bolognese et al. 2000). The NF-YB/C-like proteins Dr1 and Drap1 associate to repress transcription by preventing formation of the preinitiation complex (Mermelstein et al. 1996).
The Ddis genome contains one nfyA homolog, which is conserved in all taxon groups, and 8 nfyB/C homologs, among which nfyB, nfyC, ybl1, drap1 and dr1 are all well conserved throughout Dictyostelia. There is no annotated ycl1, but sequence similarity suggests that DDB_G0272740 is a ycl1 ortholog. Additionally, DDB_G0268506 is an nfyB/C type gene that is well conserved in Dictyostelia. None of the Dictyostelium NFY type proteins have been functionally analysed.
Crtf and cudA are transcription factors that were first identified in Dictyostelia. They have no previously annotated functional domains or Interpro identifiers. We retrieved homologs and orthologs of these TFs by BLASTp searches with Ddis crtf and cudA sequences. Because it is unknown whether these TFs are also present outside of Dictyostelia, we also performed BLASTp queries of Genbank and Amoebozoan genomes. Both crtf and cudA-like proteins were found in other Amoebozoa, but not in other eukaryotes. Phylogenetic trees were constructed using MrBayes from unambiguously aligned regions of protein sequence. Gene IDs and locus tags are colour coded to reflect host species and gene expression profiles were added as described for Figure S1.
Crtf is a transcription factor with a zinc finger-like motif that binds to the cAMP receptor 1 (cAR1) promoter and is essential for cAR1 expression and spore maturation (Mu et al. 2001). There is one crtf homolog in Ddis and both proteins are conserved throughout Dictyostelia. Crtf-like proteins were also detected in solitary Amoebozoa.
CudA is a transcription factor that is expressed in both the prespore region and the tip of slugs (Fukuzawa et al. 1997; Yamada et al. 2008). Tip-specific expression is required for initiation of fruiting body formation, while prespore-specific expression is essential for expression of some prespore genes. SpaA is essential for spore maturation and expression of PKA regulated prespore and spore genes (Yamada et al. 2018). Dictyostelids have 5 conserved cudA-like genes, of which one is duplicated in group 4. Several cudA genes also have orthologs in solitary Amoebozoa.
The sequences containing an E2F/DP, ENY2, FAR1 or Gal4-like domain were retrieved from Dictyostelia by the Interpro identifiers IPR003316, IPR018783, IPR004330 or IPR001138, respectively, and by performing BlastP searches. Sequences were aligned for Bayesian phylogenetic inference and trees were annotated as described for Figure S1.
E2F (adenovirus early gene 2 promoter-binding factor) and DP (dimerization partner) both display an N-terminal winged-helix fold for DNA-binding and a C-terminal dimerization domain. E2F proteins can form homodimers, but heterodimerization with DP increases DNA-binding efficiency (Zheng et al. 1999). E2F promotes the G1/S transition by activating transcription of cell cycle control genes and is itself negatively regulated by the retinoblastoma protein (Wu et al. 1995). The Ddis genome contains five proteins with the E2F TDP domain, including e2f and tfdp2. Both genes are conserved throughout Dictyostelia, with a second tfdp2-related gene in Ddis. Their expression increases after starvation, particularly in prespore cells. In addition, there is a small cluster of e2f-like genes unique to taxon groups 1 and 2 and a set of 2 genes unique to Ddis.
The enhancer of yellow 2 (EnY2) acts as a transcription factor in Drosophila (Georgieva et al. 2001), but is also known as SUS1 in yeast, where it is involved in mRNA export coupled transcription by interacting with the SAGA and TREX2 complexes (Pascual-García et al. 2008). Dictyostelia have a single eny2 gene that is conserved in all taxon groups. FAR1 is a plant transcription factor, related to mutator-like transposases, which is required for responses to far red light (Lin et al. 2008). Although the FAR1 domain was detected only in Dlac below threshold levels, its sequence is also well conserved in the other Dictyostelia.
The yeast Gal4 transcription factor contains a Zn(II)2C6 type binuclear cluster with six cysteines that interact with two zinc atoms (Pan and Coleman 1990). Zinc cluster proteins can bind to CGG triplets or to direct, inverted or everted CGG repeats as monomers, homodimers and heterodimers, and are considered to be unique to fungi (MacPherson et al. 2006). Dictyostelia contain 3 deeply conserved genes with a GAL4 domain, and one gene, suvA, unique to Ddis. The Drosophila homolog of this gene, which does not have a GAL4 domain, is involved in heterochromatin mediated gene silencing (Schotta et al. 2003).
GATA-ZnF domain containing proteins were identified in Dictyostelid genomes using the InterPro identifier, IPR000679, and by BlastP search. After sequence alignment, a phylogenetic tree was inferred using RAxML. The two main branches of the tree are shown separately as set 1 and set 2.
The GATA-type zinc finger domain is found in many transcription factors and its name is derived from the target DNA sequence (T/A)GATA(A/G) (Yamamoto et al. 1990). The zinc ion is coordinated by four cysteines in a core that consists of two antiparallel beta sheets, an alpha helix and a long loop that connects with the carboxyl-terminal tail (Omichinski et al. 1993). The Ddis genome has 24 genes with a GATA ZnF domain, with about 13 genes conserved across Dictyostelia and 9 resulting from Ddis specific amplification from a single gene. Dpur and, to a lesser extent, Ppal and Dfas also show species- or group-specific gene amplifications.
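As an illustration of the (T/A)GATA(A/G) consensus named above, the following minimal Python sketch scans both strands of a DNA sequence for matching sites. It is not part of the analysis reported here; the function names and the example sequence are hypothetical, and a consensus match does not by itself imply binding by a GATA factor.

```python
import re
from typing import List, Tuple

# (T/A)GATA(A/G) as a regular expression; the lookahead also reports
# overlapping matches.
GATA_SITE = re.compile(r"(?=([TA]GATA[AG]))")
SITE_LENGTH = 6
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the reverse complement of an upper-case DNA sequence."""
    return dna.translate(_COMPLEMENT)[::-1]


def find_gata_sites(dna: str) -> List[Tuple[int, str, str]]:
    """Return (0-based start on the given strand, strand, matched site)."""
    dna = dna.upper()
    hits = [(m.start(), "+", m.group(1)) for m in GATA_SITE.finditer(dna)]
    for m in GATA_SITE.finditer(reverse_complement(dna)):
        # Convert the position on the reverse strand to forward coordinates.
        start = len(dna) - (m.start() + SITE_LENGTH)
        hits.append((start, "-", m.group(1)))
    return sorted(hits)


if __name__ == "__main__":
    # Hypothetical promoter fragment, for illustration only.
    for hit in find_gata_sites("CCTGATAAGGCTATCTCC"):
        print(hit)
```

Scanning the reverse complement is needed because the consensus is not palindromic, so a site on the opposite strand reads differently on the strand that was supplied.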
One of the expanded Ddis genes is comH, which is required for sporulation and development beyond the tight aggregate stage in a non-cell autonomous manner (Kibler et al. 2003). Three of the conserved genes were functionally analysed. Mutants defective in gtaC show defective aggregation and lack cAMP pulse induced gene expression. In wild-type, the cAMP pulses cause out-of-phase transport of GtaC to and from the nucleus (Cai et al. 2014; Santhanam et al. 2015). gtaC- cells also form fragile slugs and fruiting bodies that fail to remain erect. The latter phenotype is also found in mutants defective in DIF-1 signalling. DIF-1 was found to induce both gtaC expression and its translocation to the nucleus (Keller and Thompson 2008). GtaG is expressed in prestalk and stalk cells and gtaG null mutants cannot form fruiting bodies. This defect is rescued by the stalk inducing factor c-di-GMP, suggesting that GtaG is (indirectly) required for c-di-GMP synthesis (Katoh-Kurasawa et al. 2016). StkA is required for terminal spore differentiation and in stkA mutants the prespore cells differentiate into vacuolated stalk cells (Chang et al. 1996).
The G-box binding factor gbfA was identified in Ddis as a protein binding to GC rich regions in the cAMP-inducible gene cprA (Hjorth et al. 1990; Schnitzler et al. 1994) and was shown to be essential for cAMP induction of cprA and other cAMP-inducible genes. GbfA contains two putative zinc fingers and its expression is itself upregulated by cAMP (Hjorth et al. 1989; Brown and Firtel 2001). We searched for gbfA homologs in Amoebozoa using BlastP with gbfA as bait and identified orthologs of gbfA and another gbf-like protein in all Dictyostelia, as well as three and two more homologs in Ddis and Dpur, respectively. The putative zinc fingers with their two CxxC motifs were recognized in most proteins by SMART as RPT (repeat) 1 domains, but were also present when no RPT1 domain was recognized, as shown in the alignment below.
A gbfA ortholog was also detected in the Amoebozoan Physarum polycephalum.

GCFC, HLH, HMG and HSF transcription factors
Sequences with GCFC domains were retrieved with Interpro identifiers IPR012890 and IPR022783. HLH, HMG and HSF proteins were retrieved with identifiers IPR011598, IPR009071 and IPR000232, respectively. Dictyostelid genomes were further probed by BlastP using GCFC, HLH, HMG and HSF sequences. Phylogenetic trees were inferred and annotated as for figure S1.
The GC-rich sequence DNA-binding factor (GCFC) domain was originally identified in a transcriptional repressor (Kageyama and Pastan 1989), but is also present in a protein that interacts with the Pax7 transcription factor (Diao et al. 2012) and in septin and tuftelin interacting (stip) proteins that function in RNA splicing (Wen et al. 2005). Dictyostelia contain two conserved GCFC domain proteins; one is homologous to stip (Yu et al. 2011) and the other is of unknown function.
The basic helix-loop-helix (HLH) domain is the DNA binding region of a large family of transcriptional regulators (Jones 2004). Ddis has only a single gene, lsrA, with a Tcf25 repressor region that contains an HLH domain (Cai et al. 2006). LsrA defective mutants were identified as "losers", because they become overrepresented in the prestalk population when mixed with wild-type cells (Parkinson et al. 2011). LsrA is conserved across Dictyostelia, while non-group 4 species have from 1 to 3 other HLH transcription factors.
High mobility group (HMG) domains are present in many transcription factors and other DNA binding proteins involved in replication and repair, and also mediate protein-protein interactions (Stros et al. 2007). Their role in Dictyostelids is as yet unknown.
Heat shock factor (HSF) activates transcription of heat shock genes in response to increased temperature, which induces HSF trimerization and binding to heat shock element (HSE) sequences in promoter regions (Clos et al. 1990). Dictyostelia contain a single well-conserved heat shock factor, which is expressed throughout development.

Figure S11. Homeo domain transcription factors
Sequences with a homeodomain were retrieved from Dictyostelid genomes with the Interpro identifier IPR001356 and by BlastP search with identified proteins. Phylogenetic trees were inferred and annotated as described in figure S1.
The homeo domain or homeo box (HOX) transcription factors contain a helix-turn-helix structure for binding to DNA. They owe their name to the fact that they cause homeotic transformations when mutated in animals, i.e. replacement of one body part with another, but they also have many other functions across phyla (Bürglin and Affolter 2016). The Ddis genome contains 14 HOX genes, of which 10 are conserved across Dictyostelia. Four genes were functionally analysed. WarA, which also harbours ankyrin repeats, regulates cell type proportioning, most likely by repressing pstO cells (Han and Firtel 1998). Additional knockout of hbx2 in warA null cells slightly enhances the phenotype, and hbx2 may therefore potentiate the function of warA. Overexpression of hbx4 repressed expression of the Ca2+-dependent adhesion protein cadA (Kim et al. 2011). An hbx9 knockout shows slow cell proliferation and delayed initiation of development as well as reduced cadA expression and overproduction of pstA cells (Mishra et al. 2017).
Sequences with jmjC domains were retrieved from Dictyostelid genomes with the Interpro jmjC domain identifier (IPR003347) and by BlastP and tblastn queries with the sequences of the 13 Ddis jmjC proteins. Phylogenetic trees were inferred and annotated as described for figure S1.
Jumonji C (JmjC) domains (IPR003347) are found in transcription factors covering a species range from bacteria to humans (Clissold and Ponting 2001). The JmjC domain contains a cupin fold with a putative Zn2+ binding region that was shown to participate in histone demethylation by hydroxylation (Trewick et al. 2005), which classifies this group of transcription factors as chromatin remodeling proteins. The JmjC domain is often found in combination with an N-terminally located JmjN domain and acts in this configuration as a transcriptional repressor (Takeuchi et al. 2006).
The Ddis genome contains 13 genes encoding proteins with JmjC domains and 9 of those are conserved across the four taxon groups. Only one of the Ddis JmjC proteins has been examined in more detail. CbfA is a transcriptional activator that binds to the regulatory C-module of the retrotransposon TRE5-A and to an A/T-enriched motif in the promoter of adenylate cyclase A. CbfA is essential for basal but not cAMP-pulse induced expression of ACA (Winckler et al. 2004; Siol et al. 2006).
Proteins with the lambda repressor-like DNA-binding domain, MADS box domain and MIZ domain were identified in Dictyostelid genomes using the Interpro identifiers IPR010982, IPR002100 and IPR004181, respectively, and by Blast searches. Phylogenetic trees were inferred and annotated as in figure S1.
The Lambda-cro/C1 type repressors contain a DNA binding helix-turn-helix (HTH) domain and associate with DNA as dimers (Ohlendorf et al. 1998). Dictyostelid genomes contain only one conserved lambda-repressor type HTH protein. The two Ddis proteins result from a recent partial duplication of chromosome 2 in AX3 derived strains. Expression of these proteins is upregulated by inhibition of protein synthesis induced by cycloheximide (Singleton et al. 1988).
The MADS-box transcription factors are widely used in different eukaryote phyla. Proteins belonging to the MADS family function as dimers. The primary DNA-binding element is formed by an anti-parallel coiled coil of two amphipathic alpha-helices, one from each subunit (Messenguy and Dubois 2003). This family of proteins often interacts with other transcription factors or accessory factors, expanding their range of target genes. Dictyostelia have four deeply conserved MADS-box proteins, three similar to serum response factors (SRF) and the other to myocyte enhancer factor 2 (Mef2). Ppal has two additional MADS-box proteins. Ddis srfA is required for spore coat formation and organization of the actin cytoskeleton during spore formation (Escalante et al. 2004). Cells lacking srfB are defective in cytoskeleton related functions, chemotaxis to cAMP and early gene expression (Galardi-Castilla et al. 2008). Cells defective in mef2A show reduced growth and impaired prespore and spore differentiation (Galardi-Castilla et al. 2013).
The Msx-Interacting Zinc finger (MIZ) interacts with homeobox protein Msx2 to increase its DNA binding activity (Wu et al. 1997). A MIZ domain is also present in PIAS, an inhibitor of activated STAT transcription factors (Palvimo 2007).
Sequences with Myb domains were retrieved from Dictyostelid genomes with the Interpro Myb domain identifier (IPR017930), which also recovers proteins with the related SANT domain. To complete incomplete orthologous sets, Dictyostelid proteomes or genomes were queried further by BlastP or tblastn with the sequences of the Ddis mybs. The sequences corresponding to the myb/SANT domains were aligned and a pilot phylogenetic tree was constructed, which showed subdivision into two large branches. Two new trees were constructed, one from each branch, shown here as sets 1 and 2, and annotated as in figure S1.
The myb domain is named after the retroviral oncogene v-myb (myeloblastosis) and its cellular counterpart c-myb, which both encode DNA-binding proteins (Klempnauer and Sippel 1987; Prouse and Campbell 2012). The highly similar SANT domain is usually part of chromatin remodelling proteins and is considered to interact with histone tails, rather than DNA itself (Aasland et al. 1996; Boyer et al. 2004). Both domains consist of tandem repeats of three alpha-helices that are arranged in a helix-turn-helix motif.
The Ddis genome contains 27 proteins that are annotated as mybs A to Z and mybAA, of which mybs A-C, E, G-J, L-Q, S, U, Y and Z are conserved in all taxon groups. We detected three additional mybs, AB, AC and AD, that are conserved throughout Dictyostelia. Mybs D and F arose through a group-4 specific duplication of mybE; mybX is specific to group 4, but is related to the clade of Swi3 chromatin remodeling proteins. mybR and mybT are Ddis specific proteins, while mybV only has a counterpart in Dpur and Ppal. mybW is not clearly affiliated with any other myb. MybZ is twice duplicated in Ppal, while Ppal, Dfas, Dlac and Dpur each have a few unique mybs. The different mybs can have from 1 to 5 myb/SANT domains and a variety of other domains that mostly function in either DNA binding or chromatin remodeling. The conserved ada2, chdB, isw, swi2 and swi3 proteins are all homologs of well-established eukaryote chromatin remodeling proteins (Clapier and Cairns 2009), while cdc5l is a component of the eukaryote spliceosome (Burns et al. 1999). Bdp1 is a subunit of TFIIIB, which is required for transcription of tRNA by RNA polIII (Ishiguro et al. 2002).
Developmental roles were previously assigned to the Ddis ChdB and mybs B, C and E. MybB is (in addition to CbfA) required for basal expression of adenylate cyclase A (Otsuka and Van Haastert 1998). MybC null mutants show a non-cell autonomous defect in the switch from slug migration to fruiting body formation (Guo et al. 1999). MybE mediates induction of the prestalk gene ecmA by DIF-1 (Fukuzawa et al. 2006), and is also required for ecmB expression in the lower cup that supports the spore head (Tsujioka et al. 2007). Remarkably, many genes are only DIF-1 inducible in the absence of mybE (Yamada et al. 2010). Despite this, similar to DIF-1-less mutants, mybE null mutants form long, weak slugs and fruiting bodies without the basal disc (Saito et al. 2008). The chdB-null mutants show delayed mound formation during early development and multiple mis-expressed genes (Platt et al. 2013).

Figure S15. NDT80 and NF-X1 DNA binding proteins
Sequences containing NDT80, NFX1 or Psq-type HTH domains were identified with the Interpro identifiers IPR024061, IPR000967 or IPR007889, respectively, and by Blast searches. Phylogenetic trees were inferred and annotated as in figure S1.
The NDT80 DNA-binding domain consists of a core β-sandwich with additional β-sheets and an α-helix (Montano et al. 2002). Transcription factors with this domain are found in Amoebozoa and Opisthokonta. In fungi, NDT80 family proteins are required for regulation of meiosis or nutritional responses (Winter 2012), whereas the metazoan protein, myelin gene regulatory factor (MRF), is a key regulator of CNS myelination (Emery et al. 2009). Interestingly, a human protein, MYRF, and a Ddis homologue, mrfA, are membrane-tethered transcription factors (Li et al. 2013; Senoo et al. 2013). They are released from the endoplasmic reticulum through auto-cleavage by an intramolecular chaperone domain that has similarity to the chaperone of the bacteriophage protein endosialidase. In Ddis, the released fragment of mrfA that contains the DNA binding domain translocates to the nucleus, where it acts as an activator of the prestalk gene ecmA (Senoo et al. 2012; Senoo et al. 2013). Dictyostelia have three deeply conserved genes with an NDT80 DNA binding domain, including mrfA. All three genes have transmembrane domains and the chaperone domain, and are conserved across all 4 taxon groups.
The NF-X1 zinc finger, with a unique pattern of cysteines and histidines, is present in species ranging from plants to humans. The zinc finger domain binds to an X-box motif in the promoter of human major histocompatibility complex class II genes, acting as a transcriptional repressor (Song et al. 1994). In Arabidopsis, two NF-X1 proteins antagonistically control expression of stress-related genes (Lisso et al. 2006). Dictyostelia have one deeply conserved gene with tandem NF-X1 repeats, while Dfas and Ppal have two additional NF-X1 containing genes. None of the genes have been functionally characterized.
Pipsqueak (psq) is a helix-turn-helix type transcription factor with many developmental roles in Drosophila, where it binds to GAGAG consensus motifs in target genes (Lehmann et al. 1998). In Dictyostelia, only Ppal has a number of psq genes with high sequence similarity in the psq motif, which is however not always detected as such. In four Ppal genes, the psq domain is combined with a DDE_1 domain. DDE domains generally encode endonucleases required for efficient DNA transposition (Nesmelova and Hackett 2010). The Ppal psq genes are mostly poorly expressed in growth and development, but strongly upregulated in late encystation.

Figure S16. STAT, TF2, TMF-1 and WRKY transcriptional regulators.
Sequences containing STAT, TMF-1 or WRKY domains were identified by the Interpro identifiers IPR015347, IPR022092/IPR022091, or IPR003657, respectively, and by Blast queries. Phylogenetic trees were inferred and annotated as in figure S1.
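The identifier-based part of this retrieval step can be illustrated with a minimal sketch. It assumes that each protein has already been annotated with a set of InterPro accessions (for example from a domain scan); the protein names, the placeholder accession IPR999999 and the function assign_families are hypothetical and only serve the illustration. The Blast queries that complemented this search are not covered.

```python
# InterPro identifiers used for each family, as listed in the text.
FAMILY_INTERPRO = {
    "STAT": {"IPR015347"},
    "TMF-1": {"IPR022092", "IPR022091"},
    "WRKY": {"IPR003657"},
}


def assign_families(annotations):
    """Group proteins by family from their InterPro accessions.

    annotations: dict mapping a protein ID to an iterable of InterPro
    accessions (hypothetical input format, assumed for this sketch).
    A protein is reported for a family if it carries at least one of
    that family's identifiers.
    """
    hits = {family: [] for family in FAMILY_INTERPRO}
    for protein_id, accessions in annotations.items():
        found = set(accessions)
        for family, ids in FAMILY_INTERPRO.items():
            if found & ids:
                hits[family].append(protein_id)
    return hits


# Hypothetical example input; IPR999999 is a placeholder for any other domain.
example = {
    "protA": ["IPR015347", "IPR999999"],
    "protB": ["IPR022092"],
    "protC": ["IPR003657", "IPR003657"],
    "protD": ["IPR999999"],
}

families = assign_families(example)
```

Proteins without any of the listed identifiers (protD in the example) are left unassigned and would only be recovered by the sequence similarity searches.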
STATs (Signal Transducer and Activator of Transcription) are widely used metazoan transcription factors with roles in development, proliferation and immune response. In the canonical metazoan pathway, STATs are phosphorylated by the tyrosine kinase JAK, which causes their dimerization and accumulation in the nucleus, where they bind to target genes (Darnell 1997). Ddis has four STATs, STATa to STATd. STATa translocates to the nucleus in response to cAMP binding to cell surface cAMP receptors (Araki et al. 1998). It is required for efficient chemotaxis and inhibits expression of the stalk gene ecmB in the prestalk region of slugs. STATa null mutants also show prolonged slug migration, but eventually form abnormal fruiting structures with few, if any, stalk cells (Mohanty et al. 1999). STATb null mutants gradually disappear when co-cultured with wild-type cells, but are otherwise normal (Zhukovskaya et al. 2004). STATc translocates to the nucleus both in response to hyperosmotic stress and to stimulation with DIF-1 (Araki et al. 2003). STATc null mutants show a minor growth defect, a 1-2 h acceleration of early development and prolonged slug migration (Fukuzawa et al. 2001). No roles for STATd have been uncovered. All four STATs are conserved throughout Dictyostelia, inclusive of their domain architecture.
TF2 was identified in Ddis as a protein binding to the 5'C box of the glycogen phosphorylase 2 promoter (Warner and Rutherford 2000). Its short-chain dehydrogenase domain is commonly found in NAD- or NADP-dependent oxidoreductases (Jornvall et al. 1995).
TMF-1 (TATA element modulatory factor) functions both as a Golgi protein involved in membrane trafficking (Fridmann-Sirkis et al. 2004;Yamane et al. 2007) and a nuclear protein that competes with TATA binding protein (TBP) for binding to some promoters with the RNA polymerase II TATA box (Garcia et al. 1992). Dictyostelia contain a single well conserved TMF-1 type protein of unknown function.
WRKY transcription factors contain a conserved WRKYGQK motif and C2-H2 or C2-H-C zinc-finger-like motifs and bind to TTGAC(C/T) promoter elements. They are very abundant in plants, where they have diverse biological functions (Bakshi and Oelmuller 2014), but are also found in Giardia. WRKY proteins are classified by the number of WRKY domains and features of the zinc finger. WRKY proteins with two WRKY domains belong to group I, whereas those with one WRKY domain belong to groups II or III. Groups I and II share the C2H2 zinc finger motif, while group III has a C2HC motif. Dictyostelids have a conserved group I WRKY. The two single-domain Dpur proteins are likely part of the same (mis-annotated) gene. Dfas has two additional group II type WRKY proteins.
In addition to sequence-specific transcriptional regulators, eukaryotic genes require one of three general transcription factor complexes to initiate transcription. These complexes incorporate the TATA-box binding protein (TBP), one of the three RNA polymerases (RNA pol. I, II or III) and up to 14 other proteins (Cooper and Hausman 2016). Protein-coding genes are transcribed by RNA pol. II, large ribosomal RNAs are transcribed by RNA pol. I and small ribosomal RNAs and transfer RNAs are transcribed by RNA pol. III.
Formation of the RNA pol. II transcription complex initiates with the binding of TBP and TBP-associated factors (TAFs) to the TATAA sequence that resides 25-30 nt upstream of the transcription start site. This complex, called TFIID, then sequentially binds TFIIB, RNA pol. II and TFIIF, TFIIE and TFIIH. Several other factors can additionally be recruited to this large complex. For RNA pol. I mediated transcription, TBP associates with other proteins to form SL1, which binds together with the transcription factor UBF to promoters of the large ribosomal RNAs and then recruits RNA pol. I to initiate transcription. RNA pol. III transcription of transfer RNAs initiates by binding of TFIIIC downstream of the transcription start site, followed by TFIIIB and RNA pol. III. Transcription of the 5S rRNA additionally requires TFIIIA (Cooper and Hausman 2016).
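As a compact summary, the components of the three initiation complexes described above can be written as a small lookup. This is only a sketch: the components are listed in the order in which the text mentions them, which is an assumption about how to read the description rather than a statement about the underlying biochemistry, and the names ASSEMBLY_ORDER, TEMPLATE_SPECIFIC and required_components are hypothetical.

```python
# Components of each initiation complex, in the order of mention in the text.
# TFIID (TBP plus TAFs) and SL1 (TBP plus other proteins) are entered as
# described there; the relative order within "X and Y" pairs is assumed.
ASSEMBLY_ORDER = {
    "RNA pol. II": ["TBP", "TAFs", "TFIIB", "RNA pol. II", "TFIIF", "TFIIE", "TFIIH"],
    "RNA pol. I": ["SL1", "UBF", "RNA pol. I"],
    "RNA pol. III": ["TFIIIC", "TFIIIB", "RNA pol. III"],
}

# Factors needed only for particular templates. Their position in the
# assembly order is not stated in the text, so they are appended at the end.
TEMPLATE_SPECIFIC = {
    ("RNA pol. III", "5S rRNA"): ["TFIIIA"],
}


def required_components(polymerase, template=None):
    """Return the components needed for initiation by the given polymerase."""
    components = list(ASSEMBLY_ORDER[polymerase])
    components += TEMPLATE_SPECIFIC.get((polymerase, template), [])
    return components


trna_components = required_components("RNA pol. III", "tRNA")
rrna_5s_components = required_components("RNA pol. III", "5S rRNA")
```

A lookup of this kind makes it straightforward to check, per polymerase, which of the listed components have an annotated homologue in a given genome.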
Most of the proteins required for RNA pol. II and pol. III mediated transcription were annotated to the Ddis genome in Dictybase (Basu et al. 2015), and BlastP searches showed them to be conserved across Dictyostelia. In addition, searches with metazoan sequences revealed the presence of TAFs 4, 8, 11 and 14 across Dictyostelia; only TAF3 was not detected. Neither of the RNA pol. I associated TFs could be found in Dictyostelia.
Figure S18. Phylogeny-wide change in general transcription factors
Summary data on orthology and conservation of functional domains, developmental regulation and cell type specificity of components of the general transcription initiation complexes II and III. See the legend to Figure 2 for explanation of the colour coding of feature states. Prepared from analyses presented in Figure S17 and summarized in Supplemental_Table_S2. Gene names in parentheses are not yet annotated in Dictybase (Basu et al. 2015).