Genome-wide analysis of human putative transcriptional target genes reveals significant functional enrichments

Functional enrichments of putative transcriptional target genes have been utilized to understand the functions of transcription factors and cascades in a cell. To investigate their features, transcriptional target genes were predicted using open chromatin regions of human immune and ES cells, as well as known transcription factor binding sequences. Gene Ontology annotations showed four times larger numbers of functional enrichments in putative transcriptional target genes than gene expression information alone in the cell types. More than two times larger numbers of functional enrichments in putative target genes was observed using forward-reverse orientation of CTCF-binding sites than without them. These analyses would be useful to find genomic features involved in chromatin interaction and improve the prediction of transcriptional target genes.


Introduction
More than 400 types of cells have been found in the human body. protein α (C/EBPα) play a critical role in the expression of myeloid-specific genes and the generation of monocytes and macrophages [1,2]. The transcription factor GATA-3 is essential ３ for early T cell development and the differentiation of naive CD4 + T cells into Th2 effector cells [3]. E2A, EBF1, PAX5, and Ikaros are among the most important transcription factors that control early development in mice, thereby conditioning homeostatic B cell lymphopoiesis [4].
We previously examined the differentiation of monocytes and macrophages in mice, and discovered that the transcription factor IRF8 was essential for cellular differentiation [5]. An analysis of transcription factor-binding sites (TFBS) revealed that IRF8 regulated the expression of KLF4 through the IRF8 transcriptional cascade. Functional enrichment analyses revealed that the target genes of IRF8 showed functional enrichment for antigen presentation, whereas those of KLF4 showed functional enrichments for phagocytosis and locomotion. These results suggested that the transcriptional cascades of IRF8 and KLF4 included different functional modules of target genes.
Functional enrichments of transcriptional cascades of IRF8 and KLF4 appeared to be related to the cellular functions of monocytes and macrophages. Although several transcription factors were expressed in monocytes and macrophages, the number of these transcriptional target genes that resulted in functional enrichments remains unknown. Whether transcriptional target genes in other human cells showed functional enrichments also remain unclear. If the transcriptional target genes showed significant functional enrichment, analyzing transcriptional target genes and cascades would be useful in identifying genes involved in a specific cellular function. Using the budding yeast, previous studies examined the functional enrichments on a genome-scale genetic interaction map using the GeneMANIA algorithm [6,7]. Using bacterial systems, the analyses of functional enrichments of predicted regulatory networks were performed using Gene Ontology annotations [8]. Various databases of functional annotations of genes and pathways exist. Analysis of functional enrichments is expected to be useful for understanding the association of genes involved in similar functions and same pathways, and for predicting unknown gene functions such as non-protein-coding RNAs. In addition, the extent of enhancer region contribution to functional enrichments of transcriptional target genes remains unknown.

４
In this study, transcriptional target genes were predicted using public databases of open chromatin regions of human monocytes, naive CD4 + T and CD20 + B cells, and known transcription factor binding sequences. Functional enrichment analyses of putative transcriptional target genes were conducted using 10 different annotation databases of functional annotations and pathways. ChIP-seq data were used [9][10][11][12][13][14][15]. Position weight matrices of transcription factor binding sequences were transformed into TRANSFAC matrices and then into MEME matrices using in-house Perl scripts and transfac2meme in MEME suite [16]. Transcription factor binding ５ sequences of transcription factors derived from vertebrates were used for further analyses.

Searches for transcription factor binding sequences from open chromatin regions
Searches were conducted for transcription factor binding sequences from the central 50-bp regions of each narrow peak using FIMO with p-value threshold of 10 −5 [17].

Prediction of transcriptional target genes
Target genes of a transcription factor were assigned when its TFBS was found in DNase-DGF narrow peaks in promoter or enhancer regions of genes. Promoter and enhancer regions were defined as follows: promoter regions were those that were within distances of ±5 kb from transcriptional start sites (TSS). Enhancer regions were defined as per the following four criteria, which are similar or same as those defined in a previous study [18]: (1) the basal plus extension association rule assigns a basal regulatory domain to each gene regardless of other nearby genes. The domain is then extended to the basal regulatory domain of the nearest upstream and downstream genes, and includes a 5 kb + 5 kb basal region and an extension up to 300 kb or the midpoint between the TSS of the gene and that of the nearest gene upstream and downstream; (2) 5 kb + 1 kb basal region and an extension up to 1 Mb; (3) the two nearest genes association rule, which extends the regulatory domain to the TSS of the nearest upstream and downstream genes without the limitation of extension length; and (4) the single nearest gene association rule, which extends the regulatory domain to the midpoint between the TSS of the gene and that of the nearest gene upstream and downstream without the limitation of extension length. Definition of criteria (1) was used in our previous study [5]. Definitions of criteria (2), (3), and (4) were the same as those in Figure 3 of the previous study [18], however, ６ definitions of criteria (3) and (4) did not have the limitation of extension length in this study.
The genomic positions of genes were identified using knownGene.txt.gz in UCSC bioinformatics sites [19]. The website knownCanonical.txt.gz was also utilized for choosing representative transcripts among various alternate forms for assigning promoter and enhancer regions of the genes. From the list of transcription factor binding sequences and transcriptional target genes, redundant transcription factor binding sequences were removed by comparing the target genes of a transcription factor binding sequence and its corresponding transcription factor; if identical, one of the transcription factor binding sequences was used. When the number of transcriptional target genes predicted from a transcription factor binding sequence was less than five, the transcription factor binding sequence was omitted.

Gene expression analyses
For gene expression data, RNA-seq reads mapped onto human hg19 genome sequences were obtained, including ENCODE long RNA-seq reads with poly-A of monocytes CD14 + cells, CD20 + B cells, and H1-hESC (GSM984609, GSM981256, GSE26284, and GSM958733), and UCSF-UBC human reference epigenome mapping project RNA-seq reads with poly-A of naive CD4 + T cells (GSM669617). Two replicates were present for monocytes CD14 + cells, CD20 + B cells, and H1-hESC and a single one for CD4 + T cells. RPKMs of the RNA-seq data were calculated using RSeQC [20]. For monocytes, Blueprint RNA-seq RPKM data (GSE58310, GSE58310_GeneExpression.csv.gz, Monocytes_Day0_RPMI) was also used [21]. Based on RPKM, UCSC transcripts with expression levels among top 30% of all the transcripts were selected in each cell type.

Functional enrichment analyses
The functional enrichments of target genes of a TFBS and its corresponding transcription factor were examined using GO-Elite v1.2.5 with p-value threshold at 1, and after GO-Elite analyses a false discovery rate (FDR) test was performed with q-value threshold at 10 -3 to ７ correct for multiple comparisons of thousands of groups of transcriptional target genes in each cell type and condition [22]. For examining functional enrichments of high or low expressed genes independent of transcriptional target genes, the p-value threshold was set to 0.01 or 0.05 to confirm that the results were not significantly changed. UCSC gene IDs were transformed To investigate whether the normalized numbers of functional enrichments of transcriptional target genes correlate with the prediction of target genes, a part of target genes were changed with randomly selected genes with high expression level (top 30% expression level), and functional enrichments of the target genes were examined. First, 5%, 10%, 20%, 40%, and 60% of target genes were changed with randomly selected genes with high expression level in monocytes, CD4 + T cells, and CD20 + B cells. Second, as another randomization of target genes, the same number of 5%, 10%, 20%, 40%, and 60% of target genes were selected randomly from highly expressed genes, then added them to the original target genes, and functional enrichments of the target genes were examined. All analyses were repeated three times to estimate standard errors (Figure 2A and B, and Supplementary  [25]. Combined promoter-enhancer regions (enhancer definition 4) were shortened at the genomic locations of CTCF-binding sites that were the closest to a transcriptional start site, and transcriptional target genes were predicted from the shortened enhancer regions using TFBS. Furthermore, combined promoter-enhancer regions (enhancer definition 4) were shortened at the genomic locations of forward-reverse orientation of CTCF-binding sites. When forward or reverse orientation of CTCF-binding sites were continuously located in genome sequences several times, the most external forward-reverse orientation of CTCF-binding sites were selected.

Prediction of transcriptional target genes
To examine functional enrichments of transcriptional target genes in a genome scale, Transcriptional target genes were predicted in human monocytes, CD4 + T cells, and CD20 + B cells. Searches for known transcription factor binding sequences, which were collected from factor binding sites and as some transcription factors will recognize multiple distinctly different sequence motifs, transcription factor binding sequences that targeted the same genes were recognized as redundant, and one of the sequences was used [26] (see Methods). In total, 3,337 transcription factor binding sequences in human monocytes, 3,652 in CD4 + T cells, and 3,187 in CD20 + B cells were identified with their target genes, which were selected from highly expressed genes in a cell.
The total numbers of unique highly expressed target genes of transcription factor binding sequences (top 30% expression level) were 4,481, 7,558, and 4,753 in monocytes, CD4 + T cells, and CD20 + B cells respectively using promoter regions. The mean target genes of a transcription factor were 124, 164, and 144 in monocytes, CD4 + T cells, and CD20 + B cells, respectively, with the corresponding medians being 24, 33, and 24, respectively. With regard to the genomic localizations of TFBS, 51%, 65%, and 61% of TFBS were located within promoter regions (±5 kb of TSS) of target genes in monocytes, CD4 + T cells, and CD20 + B cells, respectively (according to enhancer definition 1, see Methods).

Functional enrichments of putative transcriptional target genes
Functional enrichments of the putative target genes were examined. The distribution of functional enrichments in transcriptional target genes was predicted using genome sequences of promoter regions in the three cell types ( Figure 1 and Table 1, see Methods). Furthermore, the effect of transcriptional target genes including randomly selected genes on functional enrichments was investigated using DNase-DGF data of monocytes, CD4 + T cells and CD20 + B cells and ChIP-seq data of H1-hESC ( Figure 2A  Ontology annotations in monocytes, CD4 + T cells, and CD20 + B cells, respectively (Table 1).
Further, the examination of functional enrichments of highly expressed target genes (top 30% expression level) in target genes revealed 1,271, 1,654, and 1,192 unique Gene Ontology annotations in monocytes, CD4 + T cells, and CD20 + B cells, respectively i.e., These numbers were four times larger than functional enrichments identified by gene expression information alone, suggesting that transcriptional target genes were frequently associated with similar functions or pathways (Supplementary Table S3  CD4 + T cells, and CD20 + B cells are shown in Table 2 and Supplementary Table S5. Functional enrichments were found to be related to cellular functions, e.g., interferon signaling, GMCSF (Granulocyte-macrophage colony-stimulating factor, a kind of cytokine)-mediated signaling events, antigen processing-cross presentation in monocytes; TCR (T-cell receptor) signaling in naive CD4 + T cells, IL-12 (Interleukin-12, a kind of cytokine)-mediated signaling events, and downstream signaling in naive CD8 + T cells in CD4 + T cells; interferon alpha/beta signaling, IL8-and CXCR2 (Chemokine receptor type 2, a kind of cytokine)-mediated signaling events, and BCR (B cell antigen receptor) signaling pathway in CD20 + B cells. WikiPathways also revealed that functional enrichments were associated with cellular functions (Supplementary Table S6).

Effect of enhancer regions on functional enrichments
To understand the effect of enhancer regions on the functional enrichments of target genes, the definition of enhancer regions was modified according to four criteria ( Figure 3A and see Methods) [18], and functional enrichments were investigated.
According to the definition of enhancer (1), the means of target genes were 177, 217, and 175 in monocytes, CD4 + T cells, and CD20 + B cells, respectively, whereas the corresponding medians were 55, 58, and 37, respectively (Supplementary Table S7). The numbers of functional enrichments of Pathway Commons annotations using promoter regions were 1,005, 1,806, and 821 in monocytes, CD4 + T cells, and CD20 + B cells, respectively (Supplementary 3.32-fold increases, respectively, in the three cell types (enhancer definition 1, Table 3).
The normalized numbers of the functional enrichments of transcriptional target genes

１２
showed enhancer definition (4) as the highest number, followed by enhancer definition (1) and (2) in the three cell types. Although enhancer definition (3) was the longest among the four criteria, it showed the lowest number of functional enrichments in the three cell types ( Figure   3A and Table 3). ChIP-seq data of 19 TF in H1-hESC (Human embryonic stem cells) also showed the same or similar tendencies (Supplementary Table S9

Effect of CTCF-binding sites on functional enrichments
CTCF have the activity of insulators to block the interaction between enhancers and promoters [27]. Recent studies identified a correlation between the orientation of CTCF-binding sites and chromatin loops ( Figure 3B) [28].  Table S12). The normalized numbers of were significantly increased between

Comparison of expression levels of putative transcriptional target genes
To examine the relationship between functional enrichments and expression levels of

Discussion
Genome-wide functional enrichments of putative target genes of human transcription factors were investigated. This is the first report to show human genome-wide functional enrichment analyses of putative transcriptional target genes using various functional annotation databases. Functional enrichments common to the three cell types included immunological terms representing their common features; in addition, immunological terms related to each cell １６ type were also found. This result suggested that predicted transcriptional target genes included cell-specific and common target genes, which were related to cellular functions.
In     orientation exist in >60% neighboring TAD boundaries (Figure S4D), suggesting that the boundary reverse-forward CBS pairs play an important role in the formation of most of TADs. For example, there is a CBS pair in the reverse-forward orientation in a Chr12 genomic region of H1-hESC cells, located at or very close to each of the six TAD boundaries (boundaries 1-6), except for boundary 5, which has only one closely located CBS in the forward orientation ( Figure S4E). These data, taken together, strongly suggest that directional binding of CTCF to boundary CBS pairs in the reverse-forward orientations causes opposite topological looping and thus appears to function as insulators.

2015) (Figures S2B and S2C
CRISPR cell lines D3, D7, and of 38 clones screened) in which nal CBS4 (3 0 HS1) was del ure S2B), chromatin-looping in between CBS3 (5 0 HS5) in th orientation and the boundary CBS5 in the reverse ori domain1 persisted, although its interaction with (3 0 HS1) region was abolished ( Figures S5A and S5 pected, the interactions between CBS6/7 and C domain2 were unchanged ( Figure S5C). Strikingly, h the CBS4 (3 0 HS1) and CBS5 double-knockout CRISP C2, C4, and C14 (out of 49 clones screened) (Figure S chromatin-looping interactions between CBS3 (5 0 HS5 ward orientation of domain1 and CBS8/9 in the rever tion of the neighboring domain2 were observed, sugg these two domains merge as a single domain in C lines with CBS4/5 double knockout ( Figure S5B). when CBS8 was used as an anchor, this reverse-orie in domain2 establishes new long-range chromatin-loo actions with CBS1-3 in the forward orientation of dom CBS4/5 double-deletion CRISPR cell lines (Figure conclude that cross-domain interactions can be e after deletion of CBSs up to the boundary of topo mains, but not after deletion of the internal CBS in th locus.
To further test the functional significance of this or of CBSs, we again performed CRISPR/cas9-media fragment editing in the HEK293T cells and screened 19

CTCF-mediated DNA Looping
Forward Orientation of CTCF-binding Sites orientation exist in >60% neighboring TAD boundaries (Figure S4D), suggesting that the boundary reverse-forward CBS pairs play an important role in the formation of most of TADs. For example, there is a CBS pair in the reverse-forward orientation in a Chr12 genomic region of H1-hESC cells, located at or very close to each of the six TAD boundaries (boundaries 1-6), except for boundary 5, which has only one closely located CBS in the forward orientation ( Figure S4E). These data, taken together, strongly suggest that directional binding of CTCF to boundary CBS pairs in the reverse-forward orientations causes opposite topological looping and thus appears to function as insulators.

2015) (Figures S2B and S2
CRISPR cell lines D3, D7, an of 38 clones screened) in whic nal CBS4 (3 0 HS1) was del ure S2B), chromatin-looping i between CBS3 (5 0 HS5) in th orientation and the boundary CBS5 in the reverse or domain1 persisted, although its interaction with (3 0 HS1) region was abolished ( Figures S5A and S5 pected, the interactions between CBS6/7 and domain2 were unchanged ( Figure S5C). Strikingly, h the CBS4 (3 0 HS1) and CBS5 double-knockout CRISP C2, C4, and C14 (out of 49 clones screened) (Figure S chromatin-looping interactions between CBS3 (5 0 HS5 ward orientation of domain1 and CBS8/9 in the rever tion of the neighboring domain2 were observed, sugg these two domains merge as a single domain in C lines with CBS4/5 double knockout ( Figure S5B) when CBS8 was used as an anchor, this reverse-ori in domain2 establishes new long-range chromatin-loo actions with CBS1-3 in the forward orientation of dom CBS4/5 double-deletion CRISPR cell lines (Figure conclude that cross-domain interactions can be e after deletion of CBSs up to the boundary of topo mains, but not after deletion of the internal CBS in th locus.
To further test the functional significance of this o of CBSs, we again performed CRISPR/cas9-medi fragment editing in the HEK293T cells and screened 1