Inducing the fibroblast cells to neuroepithelial stem cells
We used a combination of three stem cell TFs, i.e., SOX2, OCT4 and KLF4, to convert fibroblast cells of Macaca mulatta to NESCs [20] (see “Materials and methods” section). We collected fibroblast cells at the third generation before conversion (named as RF-P3), and cells on day 5 (as RF-iN-d5), day 11 (as RF-iN-d11) after initiation of the conversion protocol. These three cell lines represent the original fibroblast cells and those at early stages when fibroblast cells converting to neuroepithelial stem cells. The RF-iN-d11 cells were further passaged and cultured on 5 mg/ml laminin (Gibco) coated plates in iNESC-M culture media. We then collected epithelial cell colonies at early passages 6 (as CHIR-P6) and later passages 23 (as CHIR-P23). After that, two ideal single-cell formatted colonies (named as C8-P7 and A9-P7, respectively) were collected. On and after CHIR-P6, the procedure for producing NESCs was regarded as being finished. Thus, CHIR-P6 and later cells were regarded as NESCs [20]. The total RNAs of these 7 obtained cell lines (with two replicates for each cell line) were sequenced with Illumina HiSeq 2000 sequencer and we obtained 11 to 49 million 2×100 bp pair-end reads from these RNA-Seq profiles.
The gene expression patterns in the reprogramming of fibroblast cells
We used the Cufflinks (v2.2.1) pipeline [23] to align the obtained RNA-Seq profiles to the genome of rhesus monkey and to quantify the abundance of genes at different stages of the reprogramming procedure. We obtained 40,705 genes after analyzing these 14 RNA-Seq profiles. These genes were filtered to keep 1627 genes with at least 20 FPKM in at least one of the 7 average values of two replicate and at least a standard deviation value of at least 20 FPKM in the 7 average values of two replicate (as listed in Additional file 1: Supplementary Table S1). These 1627 genes were clustered with the Self Organizing Map algorithm in the GeneCluster 2 package [24]. The obtained results were shown in Fig. 1a. The expressions of genes in Cluster G0 in Fig. 1a had a slight reduction at the third time point and then increased till the second last time point. On the contrary, the genes in Clusters G6 and G10 had increased expressions at the third and fourth time points, respectively. The expressions of genes in Cluster G5 gradually decreased in the whole reprogramming procedure.
We next analyzed the enriched GO terms of genes in the gene clusters and the significantly enriched (corrected P<0.05, Hypergeometric tests corrected with the Benjamini and Hochberg method [25]) GO terms of Clusters G0 to G15 were listed in Additional file 1: Table S2 to S17, respectively. The most significant GO terms of G0, G10, G6, and G5 are also shown in Fig. 1c to f, respectively.
The genes in G0 are upregulated after the fourth time point when the NESCs were established. As shown in Fig. 1c, two GO terms, i.e., mitotic cell cycle process and cell cycle, are enriched in G0, which is consistent with the unlimited proliferation property of stem cells. Several other enriched GO terms related to binding, such as protein binding, double stranded DNA-binding, intracellular organelle, organelle, cell, and chromatin binding, are also related to the proliferation.
The genes in G10 show a transient upregulation at the fourth time point, i.e., exactly when the NESCs were established. As shown in Fig. 1d, this cluster includes some GO terms related to endoplasmic reticulum (ER), RNA binding, poly(A) RNA binding, extracellular region, which might be related to the translations and transitions of proteins to extracellular regions. These results suggest that transitions of proteins to extracellular regions, potentially for communications between cells, are important when the NESCs were being established.
G6 is an important cluster and includes several key TFs, such as POU5F1 (also known as OCT4) and KLF4 (see Fig. 1b). This cluster has many enriched GO terms related to extracellular parts of cells (Fig. 1e), consistent with the physiological conversion of cell lines. Furthermore, binding and enzyme regulator activity, and regulation protein metabolic process are also enriched in G6, which is in accordance with the need to change the status of the cells. G6 also shows many GO terms related extracellular region, extracellular exosome, etc., suggesting that transitions of proteins to extracellular regions are active for fulfilling the reprogramming of fibroblast.
The genes in G5 show gradually decreased expression levels in the reprogramming of fibroblast cells. This cluster has some enriched GO terms related to cytoplasm, cytoplasmic part, cellular process, protein binding and binding (see Fig. 1f), suggesting that the genes in G5 are mainly involved in the normal growth of cells.
The transcription dynamics of TEs in the reprogramming of fibroblast cells
Because TEs contribute to the reprogramming of cell lines, we carefully examined the transcriptional patterns of TEs in the reprogramming of fibroblast cells in Macaca mulatta. After removing TEs with very low expression levels and constant expression levels, we kept 2067 TEs with abundance of at least 5 FPKM in at least one of the RNA-Seq profiles and with standard deviations of at least 5 FPKM in the 14 profiles. Then, TEs that overlapped to coding genes were excluded and finally 495 TEs were identified as dynamically regulated TEs during the reprogramming of fibroblast cells (as listed in Additional file 1: Table S18). These 495 TEs were then grouped to 6 clusters based on their expression patterns in the reprogramming of fibroblast cells with the SOM algorithm in the GeneCluster 2 package [24] (see Fig. 2a).
These 6 clusters mainly have three expression patterns. The first group, i.e., T0 to T1, shows gradual increases of expression during the conversion process. The second group, i.e., T3, has a clear increase of expression at the second and third time point during the conversion process, but the expressions of these TEs were completely silenced from the fourth time point (CHIR-P6), i.e., after the fibroblast cells were converted to NESCs. The third group (T2, T4 and T5) only shows slight fluctuations or has limited TEs.
As shown in Fig. 2b, approximately 80% TEs in Clusters T0, T1, T3 and T4 are Class I retrotransposons, i.e., ERV elements, LINEs and SINEs. The percentage of LTR/ERV elements in the T3 cluster was significantly larger than other clusters (Fig. 2b) and thirty two of the 55 LTR/ERV elements in Cluster T3 are MacERV3 integrase elements (Additional file 1: Table S18).
As shown in Fig. 2c, a region near DRAXIN on Chr2 included 9 TEs, all of which showed gradually increased expression during the reprogramming procedure of fibroblast and were included in the Cluster T0 in Part A of Fig. 2. A low complexity TE (GA-rich_E21909) near POU3F3 was only expressed in the last three cell lines (Fig. 2d) and was included in Cluster T0 as well. Similarly, three TEs near IGFBPL1 were only expressed in the last 4 cell lines and were grouped in Cluster T1 (Fig. 2e). An LTR (LTR21A_E28) on Chr5 had limited expressed in fibroblast and was gradually upregulated during the reprogramming progressed (Fig. 2f) and was grouped in Cluster T1 too.
We examined several MacERV3 integrase elements in Cluster T3 as well. As shown in Fig. 2g, MacERV3_int-int_E18 was only expressed in the second and third cell lines, i.e., in the early stages of reprogramming of fibroblast. MacERV3_LTR2_25 beside MacERV3_int-int_E18 was not found in the 495 dynamically regulated TEs, but showed similar expression patterns as MacERV3_int-int_E18. Similarly, MacERV3_int-int_E114/E115 in Fig. 2h and MacERV3_int-int_E79 in Fig. 2i were detected in the second and third cell lines, however the LTRs beside these MacERV3 integrase elements were not identified as dynamically regulated TEs.
LTRs of macERV3 are transiently activated in the reprogramming of fibroblast cells
More than 30 MacERV3 integrase elements were activated at the second and third time point as in Cluster T3 of Fig. 2a, but only one LTR of MacERV3 was identified in the same cluster of TEs (see Additional file 1: Table S18). We guessed that LTRs could not be sensitively detected with the featureCounts program because some LTRs were of relatively smaller sizes. Therefore, we recalculated the abundance of all LTRs with a more sensitive method to take all reads that covered the LTRs into account (see “Materials and methods” section), and kept LTRs with average expression levels of at least 5 FPKM in at least one of the 7 time points and standard deviation values of at least 5 FPKM. After removing LTRs overlapped to coding genes, we finally obtained 98 LTRs (as listed in Additional file 1: Table S19) that were clustered with the SOM algorithm in the GeneCluster 2 package [24]. As shown by the LTR clusters in Fig. 3a, we found that L0 and L3 with a total of 60 LTRs were activated at the second and third time points but were silenced after the third time point, which had similar expression patterns to the MacERV3 integrase elements in the Cluster T3 in Fig. 2. Most of these 60 (48/60) LTRs were LTR1 and LTR2 of MacERV3s. Actually, MacERV3_LTR2_25 in Fig. 2g, MacERV3_LTR2_3 and MacERV3_LTR2_31 in Fig. 2h were grouped in Cluster L3 of LTRs in Fig. 3a, and MacERV3_LTR2_6 in Fig. 2i was included in Cluster L0 of LTRs in Fig. 3a. In summary, these results suggest that the LTRs and integrates of MacERV3s were transiently activated in the early stages when converting monkey fibroblast cells to NESCs, but their expression levels were decreased to approximately 0 after the reprogramming procedure is finished.
We performed phylogenetic analysis for LTRs in Clusters L0+L3, L1 and L4. As shown in Fig. 3b to d, LTRs in L0+L3 were mainly LTRs of MacERV3, but L1 and L4 include very diverse types of LTRs.
The LTRs in Clusters L0 and L3 have very clear activation at the second and third cell lines (Fig. 3e). As shown in Fig. 3f and G, two solo LTRs were activated at the second and third cell lines. In comparison, the LTRs in Cluster L4 show almost constant expressions during the reprogramming procedure (Additional file 2: Figure S1).
We then examined the genomic contexts of the 60 LTRs in Clusters L0+L3. We found that 10 of these 60 LTRs locate beside annotated genes (Additional file 1: Table S19), with distances of smaller than 5 thousand basepairs, and were potentially transcribed as parts of long transcripts; and the other 50 are solo LTRs. The expression levels of 9 of these 10 LTRs near genes are positively correlated with those of their adjacent genes (Additional file 1: Table S19), suggesting their promoter functions to neighboring genes. Furthermore, five of these 10 LTRs locate in the upstream regions of annotated genes and potentially serve as promoters of these genes (Additional file 1: Table S19).
Validating the expressions of MacERV3 LTRs with PCR experiments
To further confirm the transient expressions of MacERV3 LTRs at early stages of NESC conversion procedure, we selected 13 LTRs of MacERV3 (in Additional file 1: Table S20) for validation with PCR experiments with a set of primers shared by these LTRs (Additional file 2: Figure S2). These selected LTRs were successfully amplified in all of the samples of RF-iN-d5 and RF-iN-d11 cells. As shown in Fig. 4a, these LTRs showed clear activation at the early stages of the conversion procedure and reached maximal expressions in the cell line of RF-iN-d5, which was consistent to their expressions in the RNA-Seq profiles (as shown in Fig. 4b). The expression levels examined with gel image also have similar patterns as those detected by the qRT-PCR assays (Fig. 4c). We also sequenced the obtained product in the PCR experiments with Sanger sequencing. As shown in Fig. 4d and e, the obtained sequence located in the expected region between two primers from MacERV2_LTR2_34.
KLF4 and ETV5 potentially activated MacERV3 LTRs in the reprogramming of fibroblast cells
Our results show that LTRs and integrase elements of MacERV3 are activated in the reprogramming of monkey fibroblast cells, and existing results in human and mouse show that LTRs contain cis-regulatory motifs of key TFs of stem cells [2–5, 12]. Therefore, we identified enriched cis-regulatory motifs in the activated LTRs and putative TFs that bind to these motifs with MEME [26]. As shown in Fig. 5a and c, we identified five significantly enriched motifs (E-values <10−150) from LTRs in Clusters L0 and L3 of Fig. 3a, and these motifs were related to some TFs with TOMTOM [27].
Since different members of the same TF family may share very similar cis-regulatory motifs, but only a few members may really be expressed and functional in the reprogramming of monkey fibroblast cells. We first filtered all putative TFs to keep 21 TFs with at least 5 FPKM in at least one of the 7 time points and standard deviation values of at least 5 FPKM (as shown in Fig. 5b). To further validate the putative regulatory relations between the remaining 21 TFs and the 60 MacERV3 LTRs in Cluster L0 and L3 of Fig. 3a, we calculated the correlation coefficients between the TFs and these MacERV3 LTRs, then performed a bidirectional hierarchical clustering using the obtained correlation coefficient matrix. As shown in Fig. 5d, KLF4 and ETV5 (ETS variant 5) had very high positive correlation coefficient values with almost all the MacERV3 LTRs examined, suggesting that KLF4 and ETV5 activated these MacERV3 LTRs in the early stages of the reprogramming of monkey fibroblast cells. Actually, the correlation coefficient values between the expression levels of KLF4 and the 48 LTRs of MacERV3 in Clusters L0 and L3 of Fig. 3a range from 0.875 to 0.998, all of which are very significant (P<0.01). The expression levels of ETV5 and the same 48 LTRs are also significantly correlated with correlation coefficient values from 0.749 to 0.923, and 47/48 of these values are significant (P<0.05).
KLF4 is one of the key TFs for inducing iPSCs [8–11]. Thus, it is reasonable that KLF4 activated the expressions of MacERV3 LTRs. Beside KLF4, ETV5 is also identified as a putative regulator of MacERV3 LTRs. ELK1 and ELK3 also show large positive correlation coefficient values with most MacERV3 LTRs examined. ETV5, ELK1 and ELK3 belong to the ETS TF family. The members of this family have been implicated in the development of different tissues as well as cancer progression. ETS genes also play important roles in the specification and differentiation of dopaminergic neurons in both C. elegans and olfactory bulbs of mice [28]. Our results suggest that the ETS members, ETV5, ELK1 and ELK3, might contribute to the reprogramming of monkey fibroblast cells by joining the activation of LTRs of MacERV3s.
As shown in Fig. 6, when inducing human iPSCs, OCT3/4, SOX2, and KLF4 transiently hyperactivated LTR7s of Human ERV-H (HERV-H) through direct occupation on LTR7 sites [12, 13]. When producing mouse iPSCs, MuSD (Class II ERV) and MERVL (Class III ERV) were transiently activated in reprogramming to iPSCs [13]. Our results suggest that KLF4 activates LTRs and integrates of MacERV3 through its binding sites on LTRs.
As shown in Fig. 6, after ERVs were activated, KRAB-ZFPs bind to LTRs and recruit TRIM28 to induce hetereochromatin to silence the LTRs [3, 29]. For example, an murine KRAB-ZFP, ZFP809, represses murine leukemia virus (MLV) in embroynic stem cells and recruits Trim28 (also known as KAP1) to the LTRs of MLV [3]. In human, ZNF91 and ZNF93 were found to repress two retrotransposons SVA and L1, respectively [30]. The expression level of TRIM28 gradually increases in the reprogramming procedure of human CD34+ cells, putatively as a mechanism to repress HERVH that is transiently activated when the reprogrammed cells approach iPSC stage [2, 13]. We noticed that TRIM28 also showed gradually increased expression levels on and after the third cell line (RF-iN-d11) in the 7 cell lines of this study (see Fig. 1b and Additional file 1: Table S1), potentially to repress expressions of MacERV3 elements. In the future, it is thus interesting to further explore whether a KRAB-ZFP also binds to MacERV3 LTRs and recruits TRIM28 to these loci in the reprogramming of monkey fibroblast cells.
MacERV3, as well as HERV-H and MLV, belongs to the Class I ERV [2, 31]. In summary, our results suggest that KLF4 is the key transcription factor in activating Class I ERVs in Macaca mulatta during the reprogramming procedures of somatic cells toward iPSCs or NESCs. Subsequently, TRIM28 is potentially recruited by an unknown KRAB-ZFP to silence the MacERV3 elements.