Preimplantation development regulatory pathway construction through a text-mining approach



The integration of sequencing and gene interaction data and subsequent generation of pathways and networks contained in databases such as KEGG Pathway is essential for the comprehension of complex biological processes. We noticed the absence of a chart or pathway describing the well-studied preimplantation development stages; furthermore, not all genes involved in the process have entries in KEGG Orthology, important information for knowledge application with relation to other organisms.


In this work we sought to develop the regulatory pathway for the preimplantation development stage using text-mining tools such as MedlineRanker and PESCADOR to reveal biointeractions among the genes involved in this process. The genes present in the resulting pathway were also used as seeds for SeedServer, software developed by our group, to create clusters of homologous genes. These homologues allowed the determination of the last common ancestor for each gene and revealed that the preimplantation development pathway consists of a conserved ancient core of genes with the addition of modern elements.


The generation of regulatory pathways through text-mining tools allows the integration of data generated by several studies for a more complete visualization of complex biological processes. Using the genes in this pathway as “seeds” for the generation of clusters of homologues, the pathway can be visualized for other organisms. The clustering of homologous genes together with determination of their ancestry leads to a better understanding of the evolution of such processes.


Bioinformatics tools currently allow research to focus on the integration of large-scale data generated by sequencing, differential expression analysis, gene interaction studies and others. Several initiatives exist to organize this knowledge in secondary databases, thus allowing easier access and visualization. Databases containing interaction information are a good source for novel research. iHOP [1] allows users to tag gene names of interest and browse through the related PubMed literature with highlighted keywords. Another interaction database is STRING [2], which contains physical interactions and functional associations between proteins and integrates data retrieved from literature (PubMed), genomic context, large-scale experiments and conserved co-expression. Text-mining, therefore, has a fundamental role in these tools and allows access to interactions spread throughout the literature. The extraction of biological events from literature through text-mining tools is essential not only for updating the interaction databases but also for the creation and annotation of pathways.

Metabolic and regulatory pathways are an example of organized knowledge that allows a better visualization of a complex system and can be found in databases such as iPath [3], BioCyc [4] or KEGG Pathways [5]. When orthology information is added to pathways, the same process can be represented in different organisms. Orthology is also an important tool for sequence annotation. Current orthologue databases such as COG and KOG [6], eggNOG [7], OrthoMCL [8] and KEGG Orthology [5] all provide a good source for manually curated clusters of orthologues defined for organisms with complete genomes. We developed a procedure to enrich the COG database with UniRef50 clusters from the UniProt database [9], creating the UECOG database [10]. Recently, a similar procedure was applied to the KEGG Orthology database, creating the enriched UEKO database (unpublished, Fernandes et al.).

The available tools described raise the possibility of integrating current information and generating complex regulatory pathways. Previous publications individually reported the regulatory interactions that control preimplantation embryo development [11-15]. However, a complete preimplantation development regulatory pathway has never been built.

In humans, the preimplantation phase of embryonic development is a period of approximately six days after fertilization prior to attachment of the embryo to the uterine wall. Implantation can occur before or in the seventh embryonic day (E7), a time during which the uterus is receptive [16]. Mammalian embryonic development has been thoroughly studied in mice and the blastomeres remain totipotent, able to generate any other cell, up to the eight-cell stage, unlike other animals [17]. After fertilization, successive cleavages take place during the first two days of development, resulting in the eight-cell embryo. The next stage of development is called the morula stage. An increase in cell-cell contact results in formation of a compacted morula. The subsequent divisions increase the complexity of the embryo and cells may be located on the inside, surrounded by other cells, or on the outside, in contact with the environment. The identification of the initial cells for each lineage has shown that the trophectoderm (TE) is derived mostly from the outer cells, whereas the inner cells give rise to the inner cell mass (ICM). Later, the ICM divides into the primitive endoderm (PE) and the epiblast (EPI). During the differentiation of the TE from the ICM, the blastocoel is formed through a process of cavitation. The embryo is called a blastocyst when all three structures are present (TE, ICM and blastocoel). Twenty-four hours after blastocyst formation occurs, the last stage of preimplantation development takes place when the PE differentiates from the ICM. The three lineages thus formed in preimplantation development present different fates during subsequent embryonic development. While the epiblast, which forms from the ICM following implantation, is still undifferentiated and will give rise to the fetus itself, the trophectoderm will become the fetal portion of the placenta and the primitive endoderm (as part of the extraembryonic endoderm) will form the yolk sac [14]. 
Complex regulatory processes such as animal development are a result of the interaction of many different gene products and elements that control the expression of these genes. Traditional experiments that determine the function of one or a few genes are essential, but do not result in a comprehensive view of complex systems. A complex regulatory network should be able to portray specific and general aspects of development, such as the embryonic fate of certain cells [18].

In this work, we noticed the absence in databases of a pathway describing the preimplantation phase of embryo development and sought to develop the given pathway using text-mining tools, complementing it with orthology information. The resulting pathway comprises 86 genes and the interactions between them. Clusters of orthologous groups were generated for each gene represented in the pathway and provided the necessary information to determine the last common ancestor. This determination revealed that the preimplantation development pathway is an ancient Chordata pathway with addition of modern elements throughout evolution.



Initially, we used the PubMed platform to search for articles related to embryo preimplantation development (query: “preimplantation development”) and obtained 3524 entries. To obtain a more focused set of articles relevant to our work, these entries were submitted to MedlineRanker [19]. This software computes discriminating words by comparing a set of user-selected abstracts indicated as highly relevant to a background set and then scores any abstract according to its content of those discriminating words. After the classification, we selected the top 1000 abstracts for further analysis; these presented a p-value lower than 0.01 and, by manual inspection, provided a large amount of information when uploaded to PESCADOR. Since human and mouse embryo development are highly similar, it was plausible to use abstracts from work on both organisms as a source of information for the preimplantation pathway construction, paying attention to any possible conflict.
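The ranking idea can be illustrated with a minimal sketch. It is not MedlineRanker's implementation: it assumes a simple smoothed log-odds weight per word, and the function names and toy abstracts are hypothetical.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase an abstract and return its set of distinct words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def word_weights(relevant, background):
    """Smoothed log-odds of a word occurring in relevant vs. background abstracts."""
    rel_counts = Counter(w for a in relevant for w in tokenize(a))
    bg_counts = Counter(w for a in background for w in tokenize(a))
    weights = {}
    for word in set(rel_counts) | set(bg_counts):
        p_rel = (rel_counts[word] + 1) / (len(relevant) + 2)
        p_bg = (bg_counts[word] + 1) / (len(background) + 2)
        weights[word] = math.log(p_rel / p_bg)
    return weights

def score(abstract, weights):
    """Sum the weights of the distinct words found in an abstract."""
    return sum(weights.get(w, 0.0) for w in tokenize(abstract))

def rank(abstracts, weights, top_n):
    """Return the top_n abstracts, highest score first."""
    return sorted(abstracts, key=lambda a: score(a, weights), reverse=True)[:top_n]

# Toy data (hypothetical): two "relevant" and two "background" abstracts.
relevant = ["Cdx2 represses Oct4 in the trophectoderm of the blastocyst",
            "Nanog expression in the inner cell mass of the blastocyst"]
background = ["Embryo culture media and pregnancy rates in the clinic",
              "Clinic outcomes after embryo transfer"]
weights = word_weights(relevant, background)
candidates = ["Pregnancy rates in a fertility clinic",
              "Oct4 and Nanog in the blastocyst inner cell mass"]
top = rank(candidates, weights, top_n=1)
```

In this toy setting, words enriched in the relevant set (such as "blastocyst") receive positive weights, so abstracts containing them rise to the top of the ranking.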

Using these 1000 highly informative abstracts as our input, PESCADOR (manuscript in preparation, Barbosa-Silva et al.), an online platform for friendly operation of the LAITOR software [20], was used to tag gene names and extract biointeractions from each abstract. As a result, 722 gene names were tagged and 223 type 1 biointeractions were highlighted, as well as other informative biointeractions. Biointeractions are classified by LAITOR [20] as type 1 when, in the same sentence, the software encounters a gene name, a biointeraction word and another gene name, in that order (e.g. CDX2 downregulates NANOG). From these tagged abstracts we manually curated the information and constructed the pathway for preimplantation embryo development, describing 86 genes and numerous interactions between them during the early developmental stages, trophectoderm differentiation from the inner cell mass and subsequent extraembryonic endoderm differentiation. A sample abstract tagged by PESCADOR and the manual extraction of the information it contains are shown in Figure 1. The pathway shown in Figure 2 was constructed according to KGML (KEGG Markup Language). The large decrease from the initial number of genes tagged in the abstracts is mainly due to redundancy between abstracts (the same genes being mentioned) and also to genes tagged in type 3 and 4 biointeractions, which do not always yield information useful for pathway building.
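The type 1 pattern (gene name, biointeraction word, gene name, in that order within one sentence) can be sketched as follows. This is only an illustration of the pattern, not LAITOR's code; the two small dictionaries and the function name are hypothetical stand-ins for the curated lexicons a real tool would use.

```python
import re

# Hypothetical, tiny dictionaries used only for this illustration.
GENES = {"CDX2", "NANOG", "OCT4", "TEAD4"}
INTERACTION_WORDS = {"downregulates", "activates", "represses"}

def type1_biointeractions(abstract):
    """Return (gene, word, gene) triples found in that order within one sentence."""
    triples = []
    # Naive sentence split on terminal punctuation followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", abstract):
        tokens = re.findall(r"[A-Za-z0-9]+", sentence)
        for i, first in enumerate(tokens):
            if first.upper() not in GENES:
                continue
            for j in range(i + 1, len(tokens)):
                if tokens[j].lower() not in INTERACTION_WORDS:
                    continue
                for k in range(j + 1, len(tokens)):
                    if tokens[k].upper() in GENES:
                        triples.append((first.upper(), tokens[j].lower(), tokens[k].upper()))
    return triples

found = type1_biointeractions(
    "CDX2 downregulates NANOG. TEAD4 is expressed early. It activates CDX2."
)
```

Only the first sentence yields a triple here; the last one contains a biointeraction word and a gene name but no preceding gene name in the same sentence, which is the kind of case that still requires manual curation.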

Figure 1

Biointeraction extraction from PESCADOR. Top: Sample abstract tagged by PESCADOR. Gene or protein names (terms) recognized are highlighted in violet and the biointeraction words in yellow. The platform allows users to search for their interactions of interest by terms, abstracts or concepts of interest added initially by the user. Bottom: Manual curation of the information presented in the abstract and its graphical representation in the form of a regulatory pathway.

Figure 2

Preimplantation development pathway. The figure shows a pathway representation of the genes involved in the regulation of preimplantation development and the interactions between them. Some functions are also detailed in the grey rectangles. The upper part of the figure contains genes involved in the early stages of development (until blastocyst formation). Below these, the left part corresponds to the regulations that occur in the inner cell mass, the portion of cells that remains undifferentiated for a longer period of time; part of these cells will give rise to the primitive endoderm, and the genes that regulate this process are shown in the bottom left. On the right are the genes involved in the development of the outer cells of the blastocyst, which differentiate to form the trophectoderm. The interactions are described in the text. KEGG Markup Language was used for pathway representation. The developmental stage figures were adapted from Yamanaka et al. 2006 [14]. The pathway genes are represented according to their ancestry based on the determination of their Last Common Ancestor. Genes considered recent are shown in green while genes of more ancient origin are shown in lilac. Genes that present an ortholog in D. melanogaster are marked (*). This is further addressed in the text section “Pathway ancestry”. DPPA1 and hRSCP are shown in grey because the lack of a corresponding Swiss-Prot annotated gene product to be used as seed prevented their use in this analysis.

Preimplantation pathway

The pathway obtained after the analysis of all the abstracts from the PESCADOR output is represented in Figure 2 and the regulations are reviewed below.
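As an illustration of how a curated regulation might be written in a KGML-style document, the sketch below encodes a single repression using Python's standard xml.etree.ElementTree module. The element names (pathway, entry, graphics, relation, subtype) and the GErel relation type follow KGML conventions, but the pathway name, organism code, pathway number and entry names are placeholders chosen for this example, not identifiers from the actual pathway file.

```python
import xml.etree.ElementTree as ET

def build_kgml(title, genes, relations):
    """Build a KGML-style <pathway> element from gene symbols and signed relations."""
    # name/org/number are placeholder values, not real KEGG identifiers.
    pathway = ET.Element("pathway", name="path:custom", org="ko",
                         number="00000", title=title)
    ids = {}
    for index, symbol in enumerate(genes, start=1):
        ids[symbol] = str(index)
        entry = ET.SubElement(pathway, "entry", id=str(index),
                              name="gene:" + symbol, type="gene")
        ET.SubElement(entry, "graphics", name=symbol, type="rectangle")
    for source, target, subtype in relations:
        relation = ET.SubElement(pathway, "relation", entry1=ids[source],
                                 entry2=ids[target], type="GErel")
        # "--|" marks repression and "-->" expression in KGML subtype values.
        value = "--|" if subtype == "repression" else "-->"
        ET.SubElement(relation, "subtype", name=subtype, value=value)
    return pathway

pathway = build_kgml("Preimplantation development",
                     ["CDX2", "NANOG"],
                     [("CDX2", "NANOG", "repression")])
kgml_text = ET.tostring(pathway, encoding="unicode")
```

Each gene becomes an entry with a numeric id, and each regulation becomes a relation that points from the regulator to its target, so that further curated interactions can be appended to the same lists.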

First embryonic cleavages

The oncogene c-MYC is an important transcriptional regulator and its expression is observed in the initial stages of development, where it is present in embryonic cells until the morula stage and repressed thereafter [21]. Two additional genes recently associated with these early developmental stages are BORIS and ECSA. BORIS is involved in early development following fertilization and is repressed soon afterwards, whereas ECSA expression begins in the blastocyst exclusively in the cells of the inner cell mass (ICM). The presence of these genes was compared to the expression pattern of the Oct4 transcription factor, which is present in the early cleavages, repressed after this initial stage, and then stimulated again in the blastocyst [22]. The expression of the gametogenesis-associated gene Gse was also recently identified in cells of the early embryo; later this protein is found only in the ICM, suggesting a role in the specification of cell lineage [23].

Methylation patterns and correct preimplantation development

Genomic methylation patterns in mammalian cells depend on Dnmt1 (DNA methyltransferase-1). In the mouse, an embryo-specific variant called Dnmt1o is expressed in the early stages of development. At the 8-cell stage this protein relocates to the cell nucleus, where it maintains essential methylation patterns, allowing embryos to complete early developmental events [24]. It was recently shown [25] that the inability of Dnmt1o to properly relocate not only results in a developmental arrest at the 5-7 cell stage, but is also responsible for the downregulation of five genes involved in the formation of gap and tight junctions (Cx31, Cx43, Cx45, Cdh1 and Ctnnb1). These junctions are crucial for early processes such as compaction of the 8-cell embryo and cavitation of the blastocoel.

TE versus ICM dichotomy: key role of Lats controlling Tead4 co-activator Yap

Cells destined to become part of the ICM are marked by repression of two genes (aPKC and PARD3) [26] and by upregulation of Sox2 [11]. In these cells, the major pluripotency transcription factors, including Nanog and Oct4, remain active due to the expression of an important player and member of the Hippo signaling pathway: Lats. This serine/threonine protein kinase is responsible for phosphorylating Yap, leading to its cytoplasmic localization and thus preventing its association with the transcription factor Tead4.

Triggering TE differentiation: Tead4/Yap target Cdx2 to repress Nanog and Oct4

Conversely, in the outer cells that will differentiate and form the trophectoderm, Yap is unphosphorylated, remains in the nucleus and associates with Tead4, leading to the activation of Cdx2, a key repressor of Nanog and Oct4 [27]. Repression of Oct4 and Nanog transcription by Cdx2 then releases the inhibition that these two key factors were exerting on many different genes, in turn activating these targets [28, 29]. Activation of Cdx2 requires release from basal repression; Nanog [30] and Oct4 [31] repress basal levels of Cdx2 and induction of higher levels of Cdx2 by Tead4/Yap overcomes this repression, allowing Cdx2 to play its role [28]. Tead4 was also recently determined to activate another trophectoderm differentiation factor, GATA3 [32], which acts alongside Cdx2 and affects transcription of a number of genes independent of Cdx2. The Tead4-dependent activation of GATA3 seems to be independent of Yap, suggesting Tead4 interacts with another partner as well as Yap. Also required for high level expression of Cdx2 in trophectoderm cells is the cell motility protein Arp3; experiments with complete knockdown of this protein show trophoblast cells unable to develop properly, possibly undergoing apoptosis as a result of loss of Cdx2 [33]. The TGFbeta pathway is another important pathway for trophectoderm differentiation; TGFbeta signaling is stimulated by BMP4, which leads to the activation of SMAD proteins. These proteins can also stimulate transcription of Cdx2 [34], and BMP4 is known to inhibit Id2, an inhibitor of differentiation [35], and to activate Hand1, which is involved in trophoblast cell differentiation [36].

In the absence of Oct4 and Nanog

The downregulation of Oct4 in the outer cells of the embryo leads to the activation of a positive regulator of TE cell fate, Eomes (T-box protein eomesodermin) [29, 37], which is also a possible Cdx2 target [38]. The subsequent differentiation of these cells into trophectoderm is accompanied by the expression of several genes, such as the glycoprotein PSG2 [39] and the marker KRT18. PSG2 and KRT18 expression are among the first signs that a blastomere has lost its totipotent competence, prior to any visible differentiation [33]. Removal of Oct4-dependent repression also results in activation of genes such as ETIF2B and Rps14 [40], allowing these cells to engage in an intense translation routine. Knockdown studies targeting Oct4 also show that it represses the expression of Gcm1, which is normally placenta specific [41], and of the hCG hormone’s beta chain [42].

Concurrently, Nanog downregulation allows the expression of a number of genes associated with both trophectoderm (GATA2, hCG-alpha and hCG-beta) and extraembryonic endoderm (GATA4, GATA6, LAMB1 and AFP) [30]. These latter genes will in turn initiate the formation of tissues such as the primitive endoderm, a component of the yolk sac. From the early blastocyst stage on, desmosomes are assembled in the trophectoderm in response to desmocollin (DSC2), which is also not expressed in the ICM [43].

Thus, Tead4/Yap activation of Cdx2, accompanied by the subsequent repression of Nanog and Oct4, describes a scenario for the TE differentiation.
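The logic of this switch can be summarized as a toy Boolean model. The sketch below is a deliberate simplification made for illustration (it ignores basal repression of Cdx2 by Nanog and Oct4, GATA3, Arp3 and BMP4 signaling), and the function and key names are hypothetical.

```python
def cell_state(lats_active, tead4_present=True):
    """Toy Boolean reading of the TE/ICM switch described in the text."""
    # Lats phosphorylates Yap, which keeps Yap in the cytoplasm.
    yap_nuclear = not lats_active
    # Nuclear Yap associates with Tead4 and activates Cdx2.
    cdx2_high = tead4_present and yap_nuclear
    # Cdx2 represses Nanog and Oct4.
    pluripotency_on = not cdx2_high
    return {
        "Yap_nuclear": yap_nuclear,
        "Cdx2": cdx2_high,
        "Oct4_Nanog": pluripotency_on,
        "fate": "TE" if cdx2_high else "ICM",
    }

inner = cell_state(lats_active=True)   # inner cells: Lats active
outer = cell_state(lats_active=False)  # outer cells: Yap unphosphorylated
```

With Lats active the model keeps Oct4 and Nanog on (ICM), whereas without Lats activity Cdx2 is induced and the cell is assigned to the TE.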

Underneath the maintained activation of Oct4 and Nanog

Back in the ICM, the main pluripotency genes remain active and form a complex regulation pathway. Recently it was discovered that transcription of Nanog is further stimulated by the presence of compounds such as retinol [44]. Klf2, Klf4 and Klf5 exert a redundant role in the activation of Nanog. These Krüppel-like factors were described as essential for the maintenance of pluripotency. Indeed, Klf4 was already known for this role and is commonly used in reprogramming of differentiated cells into induced pluripotent stem cells. However, only the simultaneous depletion of Klf2, Klf4 and Klf5 results in the differentiation of stem cells, indicating functional redundancy [45]. Other proteins known to activate Nanog include the two other main pluripotency regulators, Oct4 [37, 46] and Sox2 [47]. The estrogen-related receptor ESRRB is also reported to be involved in the activation of Nanog by Oct4 and Sox2 [47]. Conversely, Nanog can activate Oct4 [46], and ESRRB is necessary to maintain Oct4 promoter activity [48].

Each of the three key factors, Oct4, Sox2 and Nanog, also acts as a self-activator, e.g. the partners Oct4 and Sox2 bind and activate Oct4 transcription [49]. Another key transcription factor involved in the maintenance of cell pluripotency is Sall4 [50]. Sall4 binds to the conserved regulatory region in the Pou5f1 (the Oct4 gene) distal enhancer and activates its transcription [31]. Studies with microRNA interference of Sall4 show that the loss of this factor leads to reduction of Oct4 mRNA levels and significant expression of Cdx2 in the ICM [31]. b-MYB, a gene expressed in proliferating cells, is also a positive regulator of Oct4 and studies report early differentiation of ICM in the absence of b-MYB [51].

The Notch signaling pathway is a conserved pathway involved in cellular communication and correct cell fate decisions, and it also has a role in ICM development [52]. Nle protein, a direct regulator of this pathway, is essential for survival of the ICM [53]. Another protein associated with development and survival of the ICM is Tbn (Taube nuss), whose absence promotes cell apoptosis in the ICM [54].

Expression of the platelet and endothelial cell adhesion molecule (PECAM1 or CD31) was detected by immunofluorescence confocal microscopy in the blastocyst and restricted to the ICM cells. Subsequently, PECAM1 remains only in the pluripotent epiblast cells, disappearing the moment these cells undergo differentiation [55], and indicating a new role for this molecule during embryo development.

Activation, but with moderation

Other control pathways maintain expression of these genes at a steady-state concentration and balance these many mechanisms for activation and upregulation of transcription. A complex regulation feedback loop consists of FOXD3, Nanog and Oct4 [46]. To keep Oct4 and Nanog expression within steady-state levels, these three genes interact so that (i) expression of Nanog activates FOXD3 and Oct4 but not above steady-state levels due to Oct4 exerted repression; and (ii) FOXD3 and Nanog activate Oct4 expression but not above steady-state levels due to Oct4 self-repression.

Dax1 is an orphan nuclear hormone receptor recently identified as a repressor of Oct4 transcription [56].

Dax1 expression was also capable of reducing Nanog and Rex1 expression. Assays show that Dax1 binds to Oct4 and abolishes its DNA binding activity, thus decreasing the transcription of Nanog and Rex1, targets of Oct4 activation.

Another repressor in the ICM is Tcf3, a Wnt signaling pathway effector. TLE2 (a Groucho family protein) and CtBP (C-terminal binding protein) are key partners of Tcf3 in mediating this repressive effect. Tcf3 binds to and represses the Oct4 promoter, and this repressive effect requires both the Groucho and CtBP interacting domains of Tcf3 [13]. Tcf3 also limits the steady-state levels of Nanog mRNA, protein, and promoter activity in self-renewing embryonic stem cells (ESCs); the Tcf3 Groucho domain is involved in this repression [57]. Thus, Tcf3 is critical for maintaining the appropriate levels of both Oct4 and Nanog in ESCs. Experiments show that loss of Tcf3 by RNA interference (RNAi) knockdown blocks the ability of ESCs to differentiate [13], emphasizing the importance of this interaction.

Downstream of Oct4 and Sox2

Oct4 activates embryonic stem cell-specific gene 1 (Esg1), which encodes an RNA binding protein present in the ICM that is responsible for regulating several specific target transcripts [58]. Oct4 and Sox2 are also responsible for the regulation of the fibroblast growth factor 4 (FGF4) [49]. Expression of FGF4, therefore, requires the combined activity of these two transcription factors that bind to adjacent sites on the FGF4 enhancer DNA region [59]. Once expressed, the FGF4 protein can interact with its receptor FGFR2 and activate ICM and adjacent TE cell proliferation, activating extraembryonic endoderm cells as well in later stages.

Several other genes with important functions in embryonic development are also targets of Oct4-dependent activation. These include growth factor TDGF1, growth inhibitor SAP18, regulator of nonsense transcripts RENT1, two proteins involved in stem cell self-renewal DPPA4 and DPPA1 (developmental pluripotency associated), anterior visceral endoderm (AVE) markers LEFTY1 and LEFTY2, surface antigen THY1, and other genes encoding proteins involved in specialized cellular processes (DPP3, ATP6AP2, DDB1) and hypothetical proteins (GK003, hRscp) [37, 40].

The master regulation exerted by Sox2 and Oct4 during mammalian embryogenesis is believed to operate through their cooperative binding to DNA regulatory regions composed of adjacent HMG and POU motifs (HMG/POU cassettes) [60]. Exemplifying this arrangement, DPPA4 is one such gene with the presence of an HMG/POU cassette in its promoter region [61].

Downstream of Nanog and STAT3

Activation of the JAK/STAT pathway also makes an important contribution to pluripotency. In mice, the LIF/STAT3 pathway [44, 62, 63] for maintenance of cell pluripotency comprises LIF and the LIF receptor, which deliver intracellular signaling through STAT3. STAT3, a signal transducer and activator of transcription, is activated by the JAK1 kinase and binds to several promoters, inducing transcription of pluripotency-related genes [64]. Nanog and Stat3 were found to bind to and synergistically activate Stat3-dependent promoters [64]. Nanog also functions as a transcriptional inhibitor of NFκB, a factor known to have pro-differentiation activity [64]. Nanog is also responsible for SMAD1 repression, thereby preventing BMP4-induced differentiation through the TGFbeta signaling pathway, for which SMAD1 is a key signal transducer [65].

Extraembryonic endoderm differentiation from ICM cells

Prior to embryo implantation one more differentiation takes place. Certain cells from the ICM give rise to the primitive endoderm, the first morphologically distinct cell type of the extraembryonic endoderm. The extraembryonic endoderm comprises the primitive, parietal and visceral endoderm components and will become the yolk sac during later developmental stages.

Wnt6 was recently identified as an inducer of primitive endoderm and this induction is accompanied by translocation of beta-catenin (CTNNB1) and Snail1 to the nucleus [66]. This study also showed that up-regulation of protein kinase A (PKA) induces markers of parietal endoderm. Another Wnt family member, Wnt9a, is expressed only in ICM cells that surround the blastocoel [67] and induces repositioning of the cells expressing GATA6, which is necessary for formation of primitive endoderm [68].

Sox7 plays a major role in parietal endoderm differentiation. Through studies with short interfering RNA molecules, it was established that Sox7 is responsible for transcription induction of GATA4 and GATA6 [69]. Individual or combined silencing of Sox7, GATA4 and GATA6 results in suppression of cell shape changes and production of laminin-1 (LAMB1), characteristic changes present in parietal endoderm differentiation [69]. Gata4 was previously identified as a transcription factor responsible for the activation of FGF3 [70]. Sox7 also activates the FGF3 promoter. Conversely, Sox2 can negatively modulate the GATA4-dependent activation of FGF3, which is supported by the role of this factor in ICM pluripotency [71]. Another Sox family member, Sox17, is responsible for the differentiation of the extraembryonic endoderm in the final steps of preimplantation development [72]. The Runx1 factor is associated with the expression of Sox17 and is also specific to the extraembryonic endoderm [73]. HNF4 is a transcription factor specific to the extraembryonic endoderm with subsequent roles in post-implantation development and organogenesis [74]. Its expression may result from BMP4-induced differentiation [75]. Finally, the Dab2 protein is indispensable for the development of visceral endoderm; though its exact role is still not established, it is perhaps related to correct cell positioning [76, 77]. The expression of Cer1, a marker of the anterior visceral endoderm (AVE), commences before embryo implantation in the subset of cells that comprise the primitive endoderm. This ancestral population includes both cells expressing Cer1 and cells in which Cer1 expression begins after implantation and formation of the AVE [60].

Search for homologues

To establish an ortholog database and provide sequence information for the genes contained in the preimplantation pathway, amino acid sequences corresponding to the human and mouse gene products were used as seeds for the software SeedServer (Guedes et al., unpublished, see Methods for details). In fact, only the UniProt identifier for these proteins is necessary to execute SeedServer: gene symbols were verified in the NCBI Gene database and converted to the corresponding geneID, and the desired identifiers were obtained afterwards from the UniProt database. For each gene a cluster of homologues was generated comprising from 2 to 260 sequences (Additional file 1).

The recruited sequences contained in each cluster can be Swiss-Prot annotated or unreviewed TrEMBL sequences. In total, 25% of the cluster sequences are Swiss-Prot, the great majority of clusters being composed of TrEMBL sequences (75%). The search for homologues through SeedServer therefore provides a large number of candidates for manual curation in Swiss-Prot. Furthermore, SeedServer can recruit sequences from organisms without a complete genome due to its use of UEKO (UEKO is built on top of KEGG Orthology homologues, just as UECOG [10] was built on top of the COG database) and bidirectional best hit (BBH) searches conducted by SeedLinkage [78], and in fact only 27% of the sequences present in all clusters are from organisms with a complete genome. The ortholog clustering by SeedServer was only performed for genes that had a corresponding Swiss-Prot annotated gene product to be used as seed; therefore hRSCP and DPPA1, which are described in the pathway, did not go through this analysis.
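The bidirectional best hit criterion itself is simple to state: two sequences from different organisms are paired when each is the other's highest-scoring match. The sketch below shows this criterion only; it is not SeedLinkage's implementation, and the similarity scores and function names are hypothetical.

```python
def best_hits(scores):
    """Map each query to its highest-scoring subject.

    scores is a dict of {(query, subject): similarity score}.
    """
    best = {}
    for (query, subject), value in scores.items():
        if query not in best or value > best[query][1]:
            best[query] = (subject, value)
    return {query: subject for query, (subject, _) in best.items()}

def bidirectional_best_hits(a_vs_b, b_vs_a):
    """Return pairs (a, b) where a's best hit is b and b's best hit is a."""
    forward = best_hits(a_vs_b)
    reverse = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)

# Hypothetical similarity scores (e.g. alignment bit scores) in both directions.
human_vs_fly = {("LATS1", "wts"): 900, ("LATS1", "trc"): 400, ("YAP1", "yki"): 300}
fly_vs_human = {("wts", "LATS1"): 880, ("trc", "STK38"): 700,
                ("trc", "LATS1"): 390, ("yki", "YAP1"): 310}
pairs = bidirectional_best_hits(human_vs_fly, fly_vs_human)
```

In this toy example trc is not paired with LATS1, because its own best hit in the reverse search is a different protein.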

Pathway ancestry

We then focused on the putative origin of these genes, determining which clade in the human lineage (e.g. class, order, family) shares each gene. The generation of ortholog clusters allowed for the determination of the last common ancestor (LCA) for each of the genes in the pathway. Figure 2 shows the genes according to their origin. Genes were arbitrarily considered ancient for this analysis if their last common ancestor originated before the divergence of the clade Euteleostomi and are coloured grey. Genes with a LCA belonging to the clade Euteleostomi or originated after divergence of Euteleostomi are considered recent genes and are coloured blue. Ancient origin genes with an ortholog in Drosophila melanogaster are marked with a red asterisk. This arbitrary classification was meant to attract attention to the two key pluripotency controlling genes, Nanog (ancient) and Oct4 (modern).
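The determination of the LCA and the ancient/recent classification can be sketched as follows, assuming each sequence in a cluster comes with its taxonomic lineage ordered from root to leaf. The lineages shown are heavily abbreviated and the function names are hypothetical; this illustrates the idea rather than the procedure actually implemented.

```python
def last_common_ancestor(lineages):
    """Return the deepest clade shared by all lineages (each ordered root to leaf)."""
    shared = None
    for clades in zip(*lineages):
        if len(set(clades)) != 1:
            break
        shared = clades[0]
    return shared

# Abbreviated human lineage, root first; real lineages contain many more ranks.
HUMAN_LINEAGE = ["Eukaryota", "Metazoa", "Eumetazoa", "Bilateria", "Chordata",
                 "Euteleostomi", "Mammalia", "Eutheria", "Homo sapiens"]

def classify(lca, boundary="Euteleostomi"):
    """Label a gene 'ancient' if its LCA predates the boundary clade, else 'recent'."""
    if HUMAN_LINEAGE.index(lca) < HUMAN_LINEAGE.index(boundary):
        return "ancient"
    return "recent"

# Toy cluster: the human seed, a mouse homologue and a fly homologue.
cluster = [
    HUMAN_LINEAGE,
    ["Eukaryota", "Metazoa", "Eumetazoa", "Bilateria", "Chordata",
     "Euteleostomi", "Mammalia", "Eutheria", "Mus musculus"],
    ["Eukaryota", "Metazoa", "Eumetazoa", "Bilateria", "Arthropoda",
     "Drosophila melanogaster"],
]
lca = last_common_ancestor(cluster)
label = classify(lca)
```

A cluster containing a fly homologue yields an LCA that predates Euteleostomi and is therefore labelled ancient, whereas a cluster restricted to human and mouse sequences would be labelled recent.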

The graph shown in Figure 3 represents the distribution of all the genes in the pathway according to their origin with respect to clades of the human lineage. It may be observed that a large number of genes originated in certain periods, as seen in Eumetazoa, Coelomata, Euteleostomi and Eutheria. The reasons for this wavelike origin need to be further analysed. On the other hand, the apparent origin of complex structures that characterize all descendants from a certain moment of evolution might have occurred simultaneously with the specialization of gene groups. The coverage of genomic sequences in the database is far from homogeneous and can influence the shape of this graph [79]. In any case, the pattern observed agrees with the expansion of protein families related to stem cell markers observed in the ray-finned fish, that is, after divergence of the Euteleostomi [80].

Figure 3

Gene origin in human evolution. Distribution of the genes in the preimplantation pathway according to their origin in clades of the human lineage, based on the determination of the Last Common Ancestor for the ortholog clusters generated by SeedServer. The y-axis represents the number of genes and the x-axis represents the taxonomical groups in which the genes originated.

Furthermore, we searched for functional information related to the D. melanogaster orthologues in order to determine whether these functions are similar or related to the functions of the corresponding pathway genes. This was done through a second text-mining approach similar to the first, and from the information recovered a secondary pathway was generated simply to illustrate the ortholog genes and their relative functional roles (Additional file 2). The regulatory pathways in which these genes are involved show that they are all related to some part of Drosophila embryo development, some of them with highly conserved functions still observed in the preimplantation pathway described. An example is the Hippo signalling pathway, which is extremely conserved, showing Wts (Lats ortholog) phosphorylating Yki (Yap ortholog); this modification prevents Yki interaction with Sd (Tead4 ortholog). The correlation between the human gene names and corresponding D. melanogaster ortholog names can be found in Additional file 3, together with the PMID reference for the gene function in Drosophila development.


The use of text-mining tools for the generation of regulatory pathways is an effective approach and it is important for the current interest in gathering data related to an organism or biological process. The search for information related to a specific concept such as “preimplantation development” resulted in the selection of data related to this process only. When other tools such as iHOP [1] and STRING [2] are used to search for biointeractions, it is necessary to know the names of the genes of interest before the information can be retrieved. Moreover, in the case of iHOP, the information retrieved consists of a large list of papers related to the gene of interest, which need to be manually analysed to extract the information related to the specific process. In the case of STRING, the result of a query is a network of direct associations to other genes, which can be activations, repressions, or unknown, but it is not possible to restrict the query to the specific process in which the involvement of a given gene is to be determined.

The approach described in this work (using PubMed, MedlineRanker and PESCADOR), summarized in Figure 4, allows the researcher to begin the study of a pathway without knowing exactly which genes are involved, simply by selecting the published information related to the process of interest. The manual curation required to create a pathway through this approach is considerably reduced. However, verification of all the interactions highlighted by the tool remains essential. Text-mining cannot eliminate the selection of false interaction pairs; in the case of LAITOR (contained in the PESCADOR platform), the type 3 and 4 interactions can include genes for which no association is specified in the text [20].

Figure 4

Pathway construction flowchart. The initial step consists of a PubMed search on the subject of interest (e.g. preimplantation development). The list of PubMed identifiers (PMIDs) obtained in the search is then used in the web tool Medline Ranker as the background set, along with a list of PMIDs of manually selected abstracts considered informative, which form the test set. The tool generates a list of abstracts ranked by relevance. The 1000 best-ranked abstracts are recovered and their corresponding PMIDs are then entered into the PESCADOR platform. Abstracts are tagged by PESCADOR and provide a source of biointeractions for manual curation and pathway construction. UniProt IDs for the products of the genes present in the final pathway are obtained and used as seeds in SeedServer. The software recruits homologues for each gene and creates the final clusters. Taxonomy IDs from each cluster can be used for Last Common Ancestor (LCA) determination.

The text-mining data contribute to the complete description of the pathway in the form of a literature review, a necessary step for the validation of the regulations represented and for the inclusion of the pathway in a specific database, such as KEGG Pathway. Establishing this procedure for pathway generation allows future work to extend knowledge to subjects not yet addressed, such as regulatory pathways for several types of cancer, mechanisms of pathogen resistance in plants and the response to abiotic stresses in plants, among other themes of interest.

The inclusion of the preimplantation pathway in databases such as KEGG will allow automatic annotation for several other organisms, as is usually done in this database. Concurrently, a laboratory with a specific interest can promptly build a similar pathway for its local use. Of the 86 genes present in the pathway, 20 do not possess entries in KEGG Orthology and would constitute important additions. Considering that KEGG contributes only 25% of the total number of sequences recruited into the SeedServer clusters, organisms that are evolutionarily divergent from the ones represented in KEGG begin to play a more relevant role in a more efficient annotation of new sequences. It is relevant to stress that only the SeedLinkage and UEKO components of SeedServer are capable of clustering sequences originating from organisms without a complete genome project. Moreover, the linkage of recruited sequences to seed sequences is verified with PSI-BLAST.

Another important contribution of the ortholog clustering by SeedServer is the identification of candidates for Swiss-Prot annotation. Swiss-Prot annotation depends on the correct association of sequences with gene families and proteins of known function, using the available literature as a reference. The annotation is facilitated because each of the genes is associated with PubMed Identifiers (PMIDs) stored in the PESCADOR tool, which are important references for the related orthologs.

The search for functional information on the D. melanogaster orthologues revealed the involvement of these genes in processes related to embryonic development. It also served as a good validation of the clustering by SeedServer, since all sequences from D. melanogaster that clustered with the initial human and mouse genes have a function related to embryo development.

Generation of correct clusters is essential for the correct determination of gene ancestry, but it is not the sole limiting factor. Sequencing of key organisms from taxonomic outgroups relative to those with complete genome sequences available will be a crucial source of sequences that will allow a re-evaluation of gene ancestry. Meanwhile, additional sequences clustered by software (SeedLinkage) and database enrichment (UEKO) improve the inspection of ancestry.

Determination of the ancestry of the genes in the preimplantation pathway was nonetheless a central analysis, given the expectation that this pathway would be formed mainly by more contemporary components. Our data suggest that an ancient fraction of the pathway, including Nanog and Sox2, originated before Chordata, whereas a modern fraction, including Oct4 and LIF, appeared near the origin of Eutheria, the placental mammals. Thus, an important transcriptional pathway comprising ancient and modern members has been characterized with text mining, and the homologue search with SeedServer promptly allowed LCA determination.


Generation of regulatory pathways through text-mining tools allows the integration of data generated by previous studies for a more complete view of a biological process. If the genes present in the pathway are associated with clusters of orthologues, this information is added to the pathway, making the visualization of the same process available for different organisms. The analysis of orthology also permits determination of the ancestry of the genes involved, leading to a better understanding of the evolution of the process.


Text-mining and pathway construction

NCBI’s PubMed database was used as the source of available literature for the text-mining approach. The search query used was “preimplantation development” and the PubMed identification numbers (PMIDs) of the selected papers were saved as a text file. Ten papers were selected manually by us to be used in the Medline Ranker software [19]. These papers (references [23, 26, 28, 29, 31, 50, 59, 81, 82]) were considered by us to be highly informative because they describe numerous gene regulations concerning preimplantation development. We used the PMIDs retrieved by the PubMed search as the background set and the 10 manually selected PMIDs as the training set. After ranking by relevance, we selected the 1000 best-ranked abstracts, all presenting a p-value < 0.01, for further analysis. These abstracts were then submitted to PESCADOR (manuscript under preparation, Barbosa-Silva et al.), an online platform for the software LAITOR [20]. The PESCADOR results were manually curated and the gene biointeractions recovered were used to build a regulatory pathway in Keynote (Mac OS) according to the markup language used by KEGG for pathway construction (KGML). This process consisted mainly of finding the highlighted interaction in the abstract tagged by PESCADOR, confirming its involvement in preimplantation development by checking the corresponding paper, and drawing this interaction in the pathway picture.
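The PubMed retrieval step can be sketched in Python as follows. This is an illustration only: the search described above may well have been run through the PubMed web interface, and the use of the NCBI E-utilities `esearch` endpoint, the `retmax` value and the function names are assumptions made here. The sketch builds the query URL and extracts the PMID list from an `esearch` JSON response; it does not perform the network call itself.

```python
import json
from urllib.parse import urlencode

# NCBI E-utilities search endpoint (its use here is an assumption of
# this sketch; the article does not state how the search was run).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(term, retmax=10000):
    """Return an esearch URL asking for PMIDs that match `term` as JSON."""
    params = {"db": "pubmed", "term": term, "retmax": retmax, "retmode": "json"}
    return ESEARCH_URL + "?" + urlencode(params)


def parse_pmids(esearch_json):
    """Extract the list of PMIDs (as strings) from an esearch JSON reply."""
    return json.loads(esearch_json)["esearchresult"]["idlist"]


def pmids_to_text(pmids):
    """Format PMIDs one per line, ready to be saved as a text file."""
    return "".join(pmid + "\n" for pmid in pmids)


if __name__ == "__main__":
    print(build_esearch_url('"preimplantation development"'))
    # To run the search for real, the URL could be fetched with
    # urllib.request.urlopen(url).read() and passed to parse_pmids().
```

The resulting text file of PMIDs corresponds to the background set described above; the ranking itself is performed by Medline Ranker and is not reproduced here.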

SeedServer search for homologues

UniProt IDs for the human and mouse gene products corresponding to each of the genes represented in the preimplantation pathway were used as seeds in the SeedServer software (not published, Guedes et al.). SeedServer is a web application that searches for homologous sequences through two components: the program SeedLinkage [78], and the KEGG Orthology (KO) database [5] together with its enriched version UEKO (unpublished; developed by Fernandes et al. by applying to the KEGG Orthology database the procedure described for enriching COG [10]). Clustering was verified by PSI-BLAST searches using the seed sequences as queries and the recruited proteins as the database, and any false positives were discarded (1.5% of the recruited sequences).
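The bookkeeping behind this verification step can be sketched as below. The sketch assumes that the PSI-BLAST searches have already been run and that their hits are available as a set of sequence identifiers; the function names and the identifiers in the example are hypothetical, and no PSI-BLAST call is made.

```python
def verify_cluster(recruited, psiblast_hits):
    """Split recruited sequences into confirmed and discarded lists.

    A recruited sequence is kept only if it also appears among the
    hits of the PSI-BLAST search seeded with the cluster's seed.
    """
    hits = set(psiblast_hits)
    confirmed = [seq for seq in recruited if seq in hits]
    discarded = [seq for seq in recruited if seq not in hits]
    return confirmed, discarded


def discarded_percentage(recruited, discarded):
    """Percentage of recruited sequences that were discarded."""
    if not recruited:
        return 0.0
    return 100.0 * len(discarded) / len(recruited)


if __name__ == "__main__":
    # Hypothetical identifiers, for illustration only.
    recruited = ["SEQ_A", "SEQ_B", "SEQ_C", "SEQ_D"]
    hits = {"SEQ_A", "SEQ_B", "SEQ_D"}
    confirmed, discarded = verify_cluster(recruited, hits)
    print(confirmed, discarded, discarded_percentage(recruited, discarded))
```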

LCA determination

Clusters generated for each of the pathway genes were used to determine the Last Common Ancestor (LCA) of each gene. Each cluster provided a list of Taxonomy IDs corresponding to the organisms in which orthologs of the pathway genes were found. The clade in the human lineage that comprised these Taxonomy IDs as leaves in the Taxonomy Tree was considered to bear the LCA.
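The LCA rule described above can be illustrated with a minimal Python sketch. It assumes a child-to-parent map of the taxonomy; the map below is a deliberately simplified toy tree that uses clade names instead of NCBI Taxonomy IDs and omits most intermediate ranks, and the function names are hypothetical.

```python
# Simplified toy taxonomy (child -> parent), for illustration only.
PARENT = {
    "Homo sapiens": "Primates",
    "Primates": "Eutheria",
    "Mus musculus": "Rodentia",
    "Rodentia": "Eutheria",
    "Eutheria": "Mammalia",
    "Mammalia": "Chordata",
    "Danio rerio": "Chordata",
    "Chordata": "Bilateria",
    "Drosophila melanogaster": "Arthropoda",
    "Arthropoda": "Bilateria",
    "Bilateria": "Metazoa",
}


def lineage(taxon):
    """Return the path from `taxon` up to the root, most specific first."""
    path = [taxon]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def lca_in_human_lineage(cluster_taxa, human="Homo sapiens"):
    """Return the most specific clade of the human lineage that
    contains every taxon of the cluster."""
    lineages = [set(lineage(taxon)) for taxon in cluster_taxa]
    for clade in lineage(human):
        if all(clade in members for members in lineages):
            return clade
    raise ValueError("no clade of the human lineage contains all taxa")


if __name__ == "__main__":
    print(lca_in_human_lineage({"Homo sapiens", "Mus musculus"}))
    print(lca_in_human_lineage({"Homo sapiens", "Drosophila melanogaster"}))
```

Walking the human lineage from the most specific clade upwards means the first clade that contains all cluster members is the one considered to bear the LCA.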

Note added in proof

PESCADOR, referred to in the text as in preparation, is now published: PESCADOR, a web-based tool to assist text-mining of biointeractions extracted from PubMed queries. Barbosa-Silva A, Fontaine JF, Donnard ER, Stussi F, Ortega JM, Andrade-Navarro MA. BMC Bioinformatics. 2011 Nov 9;12(1):435. [Epub ahead of print] PMID: 22070195 [83].


  1. 1.

    Hoffmann R, Valencia A: A gene network for navigating the literature. Nat Genet. 2004, 36: 664-10.1038/ng0704-664.

    PubMed  Article  Google Scholar 

  2. 2.

    Jensen LJ, Kuhn M, Stark M, Chaffron S, Creevey C, Muller J, Doerks T, Julien P, Roth A, Simonovic M, et al: STRING 8--a global view on proteins and their functional interactions in 630 organisms. Nucleic Acids Res. 2009, 37: D412-416. 10.1093/nar/gkn760.

    PubMed  PubMed Central  Article  Google Scholar 

  3. 3.

    Letunic I, Yamada T, Kanehisa M, Bork P: iPath: interactive exploration of biochemical pathways and networks. Trends Biochem Sci. 2008, 33: 101-103. 10.1016/j.tibs.2008.01.001.

    PubMed  Article  Google Scholar 

  4. 4.

    Karp PD, Ouzounis CA, Moore-Kochlacs C, Goldovsky L, Kaipa P, Ahren D, Tsoka S, Darzentas N, Kunin V, Lopez-Bigas N: Expansion of the BioCyc collection of pathway/genome databases to 160 genomes. Nucleic Acids Res. 2005, 33: 6083-6089. 10.1093/nar/gki892.

    PubMed  PubMed Central  Article  Google Scholar 

  5. 5.

    Kanehisa M, Goto S: KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 2000, 28: 27-30. 10.1093/nar/28.1.27.

    PubMed  PubMed Central  Article  Google Scholar 

  6. 6.

    Tatusov RL, Fedorova ND, Jackson JD, Jacobs AR, Kiryutin B, Koonin EV, Krylov DM, Mazumder R, Mekhedov SL, Nikolskaya AN, et al: The COG database: an updated version includes eukaryotes. BMC Bioinformatics. 2003, 4: 41-10.1186/1471-2105-4-41.

    PubMed  PubMed Central  Article  Google Scholar 

  7. 7.

    Muller J, Szklarczyk D, Julien P, Letunic I, Roth A, Kuhn M, Powell S, von Mering C, Doerks T, Jensen LJ, Bork P: eggNOG v2.0: extending the evolutionary genealogy of genes with enhanced non-supervised orthologous groups, species and functional annotations. Nucleic Acids Res. 2010, 38: D190-195. 10.1093/nar/gkp951.

    PubMed  PubMed Central  Article  Google Scholar 

  8. 8.

    Li L, Stoeckert CJ, Roos DS: OrthoMCL: identification of ortholog groups for eukaryotic genomes. Genome Res. 2003, 13: 2178-2189. 10.1101/gr.1224503.

    PubMed  PubMed Central  Article  Google Scholar 

  9. 9.

    Suzek BE, Huang H, McGarvey P, Mazumder R, Wu CH: UniRef: comprehensive and non-redundant UniProt reference clusters. Bioinformatics. 2007, 23: 1282-1288. 10.1093/bioinformatics/btm098.

    PubMed  Article  Google Scholar 

  10. 10.

    Fernandes GR, Barbosa DV, Prosdocimi F, Pena IA, Santana-Santos L, Coelho Junior O, Barbosa-Silva A, Velloso HM, Mudado MA, Natale DA, et al: A procedure to recruit members to enlarge protein family databases--the building of UECOG (UniRef-Enriched COG Database) as a model. Genet Mol Res. 2008, 7: 910-924. 10.4238/vol7-3X-Meeting008.

    PubMed  Article  Google Scholar 

  11. 11.

    Guo G, Huss M, Tong GQ, Wang C, Li Sun L, Clarke ND, Robson P: Resolution of cell fate decisions revealed by single-cell gene expression analysis from zygote to blastocyst. Dev Cell. 2010, 18: 675-685. 10.1016/j.devcel.2010.02.012.

    PubMed  Article  Google Scholar 

  12. 12.

    Medvedev SP, Shevchenko AI, Mazurok NA, Zakiian SM: [OCT4 and NANOG are the key genes in the system of pluripotency maintenance in mammalian cells]. Genetika. 2008, 44: 1589-1608.

    PubMed  Google Scholar 

  13. 13.

    Tam WL, Lim CY, Han J, Zhang J, Ang YS, Ng HH, Yang H, Lim B: T-cell factor 3 regulates embryonic stem cell pluripotency and self-renewal by the transcriptional control of multiple lineage pathways. Stem Cells. 2008, 26: 2019-2031. 10.1634/stemcells.2007-1115.

    PubMed  PubMed Central  Article  Google Scholar 

  14. 14.

    Yamanaka Y, Ralston A, Stephenson RO, Rossant J: Cell and molecular regulation of the mouse blastocyst. Dev Dyn. 2006, 235: 2301-2314. 10.1002/dvdy.20844.

    PubMed  Article  Google Scholar 

  15. 15.

    Zhou Q, Chipperfield H, Melton DA, Wong WH: A gene regulatory network in mouse embryonic stem cells. Proc Natl Acad Sci USA. 2007, 104: 16438-16443. 10.1073/pnas.0701014104.

    PubMed  PubMed Central  Article  Google Scholar 

  16. 16.

    Wang H, Dey SK: Roadmap to embryo implantation: clues from mouse models. Nat Rev Genet. 2006, 7: 185-199. 10.1038/nrg1808.

    PubMed  Article  Google Scholar 

  17. 17.

    Johnson MH, McConnell JM: Lineage allocation and cell polarity during mouse embryogenesis. Semin Cell Dev Biol. 2004, 15: 583-597. 10.1016/j.semcdb.2004.04.002.

    PubMed  Article  Google Scholar 

  18. 18.

    Davidson EH, Rast JP, Oliveri P, Ransick A, Calestani C, Yuh CH, Minokawa T, Amore G, Hinman V, Arenas-Mena C, et al: A genomic regulatory network for development. Science. 2002, 295: 1669-1678. 10.1126/science.1069883.

    PubMed  Article  Google Scholar 

  19. 19.

    Fontaine JF, Barbosa-Silva A, Schaefer M, Huska MR, Muro EM, Andrade-Navarro MA: MedlineRanker: flexible ranking of biomedical literature. Nucleic Acids Res. 2009, 37: W141-146. 10.1093/nar/gkp353.

    PubMed  PubMed Central  Article  Google Scholar 

  20. 20.

    Barbosa-Silva A, Soldatos TG, Magalhaes IL, Pavlopoulos GA, Fontaine JF, Andrade-Navarro MA, Schneider R, Ortega JM: LAITOR--Literature Assistant for Identification of Terms co-Occurrences and Relationships. BMC Bioinformatics. 2010, 11: 70-10.1186/1471-2105-11-70.

    PubMed  PubMed Central  Article  Google Scholar 

  21. 21.

    Suzuki T, Abe K, Inoue A, Aoki F: Expression of c-MYC in nuclear speckles during mouse oocyte growth and preimplantation development. J Reprod Dev. 2009, 55: 491-495. 10.1262/jrd.09-069A.

    PubMed  Article  Google Scholar 

  22. 22.

    Monk M, Hitchins M, Hawes S: Differential expression of the embryo/cancer gene ECSA(DPPA2), the cancer/testis gene BORIS and the pluripotency structural gene OCT4, in human preimplantation development. Molecular Human Reproduction. 2008, 14: 347-355. 10.1093/molehr/gan025.

    PubMed  Article  Google Scholar 

  23. 23.

    Mizuno S, Sono Y, Matsuoka T, Matsumoto K, Saeki K, Hosoi Y, Fukuda A, Morimoto Y, Iritani A: Expression and subcellular localization of GSE protein in germ cells and preimplantation embryos. J Reprod Dev. 2006, 52: 429-438. 10.1262/jrd.18005.

    PubMed  Article  Google Scholar 

  24. 24.

    Chung YG, Ratnam S, Chaillet JR, Latham KE: Abnormal regulation of DNA methyltransferase expression in cloned mouse embryos. Biol Reprod. 2003, 69: 146-153. 10.1095/biolreprod.102.014076.

    PubMed  Article  Google Scholar 

  25. 25.

    Yu JN, Xue CY, Wang XG, Lin F, Liu CY, Lu FZ, Liu HL: 5-AZA-2'-deoxycytidine (5-AZA-CdR) leads to down-regulation of Dnmt1o and gene expression in preimplantation mouse embryos. Zygote. 2009, 17: 137-145. 10.1017/S0967199408005169.

    PubMed  Article  Google Scholar 

  26. 26.

    Plusa B, Frankenberg S, Chalmers A, Hadjantonakis AK, Moore CA, Papalopulu N, Papaioannou VE, Glover DM, Zernicka-Goetz M: Downregulation of Par3 and aPKC function directs cells towards the ICM in the preimplantation mouse embryo. J Cell Sci. 2005, 118: 505-515. 10.1242/jcs.01666.

    PubMed  Article  Google Scholar 

  27. 27.

    Ralston A, Rossant J: Cdx2 acts downstream of cell polarization to cell-autonomously promote trophectoderm fate in the early mouse embryo. Dev Biol. 2008, 313: 614-629. 10.1016/j.ydbio.2007.10.054.

    PubMed  Article  Google Scholar 

  28. 28.

    Nishioka N, Inoue K, Adachi K, Kiyonari H, Ota M, Ralston A, Yabuta N, Hirahara S, Stephenson RO, Ogonuki N, et al: The Hippo signaling pathway components Lats and Yap pattern Tead4 activity to distinguish mouse trophectoderm from inner cell mass. Dev Cell. 2009, 16: 398-410. 10.1016/j.devcel.2009.02.003.

    PubMed  Article  Google Scholar 

  29. 29.

    Strumpf D, Mao CA, Yamanaka Y, Ralston A, Chawengsaksophak K, Beck F, Rossant J: Cdx2 is required for correct cell fate specification and differentiation of trophectoderm in the mouse blastocyst. Development. 2005, 132: 2093-2102. 10.1242/dev.01801.

    PubMed  Article  Google Scholar 

  30. 30.

    Hyslop L, Stojkovic M, Armstrong L, Walter T, Stojkovic P, Przyborski S, Herbert M, Murdoch A, Strachan T, Lako M: Downregulation of NANOG induces differentiation of human embryonic stem cells to extraembryonic lineages. Stem Cells. 2005, 23: 1035-1043. 10.1634/stemcells.2005-0080.

    PubMed  Article  Google Scholar 

  31. 31.

    Zhang J, Tam WL, Tong GQ, Wu Q, Chan HY, Soh BS, Lou Y, Yang J, Ma Y, Chai L, et al: Sall4 modulates embryonic stem cell pluripotency and early embryonic development by the transcriptional regulation of Pou5f1. Nat Cell Biol. 2006, 8: 1114-1123. 10.1038/ncb1481.

    PubMed  Article  Google Scholar 

  32. 32.

    Ralston A, Cox BJ, Nishioka N, Sasaki H, Chea E, Rugg-Gunn P, Guo G, Robson P, Draper JS, Rossant J: Gata3 regulates trophoblast development downstream of Tead4 and in parallel to Cdx2. Development. 2010, 137: 395-403. 10.1242/dev.038828.

    PubMed  Article  Google Scholar 

  33. 33.

    Vauti F, Prochnow BR, Freese E, Ramasamy SK, Ruiz P, Arnold HH: Arp3 is required during preimplantation development of the mouse embryo. FEBS Lett. 2007, 581: 5691-5697. 10.1016/j.febslet.2007.11.031.

    PubMed  Article  Google Scholar 

  34. 34.

    Hayashi Y, Furue MK, Tanaka S, Hirose M, Wakisaka N, Danno H, Ohnuma K, Oeda S, Aihara Y, Shiota K, et al: BMP4 induction of trophoblast from mouse embryonic stem cells in defined culture conditions on laminin. In Vitro Cell Dev Biol Anim. 2010, 46: 416-430. 10.1007/s11626-009-9266-6.

    PubMed  PubMed Central  Article  Google Scholar 

  35. 35.

    Kondo M, Cubillo E, Tobiume K, Shirakihara T, Fukuda N, Suzuki H, Shimizu K, Takehara K, Cano A, Saitoh M, Miyazono K: A role for Id in the regulation of TGF-beta-induced epithelial-mesenchymal transdifferentiation. Cell Death Differ. 2004, 11: 1092-1101. 10.1038/sj.cdd.4401467.

    PubMed  Article  Google Scholar 

  36. 36.

    Riley P, Anson-Cartwright L, Cross JC: The Hand1 bHLH transcription factor is essential for placentation and cardiac morphogenesis. Nat Genet. 1998, 18: 271-275. 10.1038/ng0398-271.

    PubMed  Article  Google Scholar 

  37. 37.

    Babaie Y, Herwig R, Greber B, Brink TC, Wruck W, Groth D, Lehrach H, Burdon T, Adjaye J: Analysis of Oct4-dependent transcriptional networks regulating self-renewal and pluripotency in human embryonic stem cells. Stem Cells. 2007, 25: 500-510. 10.1634/stemcells.2006-0426.

    PubMed  Article  Google Scholar 

  38. 38.

    Marikawa Y, Alarcón VB: Establishment of trophectoderm and inner cell mass lineages in the mouse embryo. Mol Reprod Dev. 2009, 76: 1019-1032. 10.1002/mrd.21057.

    PubMed  PubMed Central  Article  Google Scholar 

  39. 39.

    Adjaye J, Herwig R, Brink TC, Herrmann D, Greber B, Sudheer S, Groth D, Carnwath JW, Lehrach H, Niemann H: Conserved molecular portraits of bovine and human blastocysts as a consequence of the transition from maternal to embryonic control of gene expression. Physiol Genomics. 2007, 31: 315-327. 10.1152/physiolgenomics.00041.2007.

    PubMed  Article  Google Scholar 

  40. 40.

    Shin MR, Cui XS, Jun JH, Jeong YJ, Kim NH: Identification of mouse blastocyst genes that are downregulated by double-stranded RNA-mediated knockdown of Oct-4 expression. Mol Reprod Dev. 2005, 70: 390-396. 10.1002/mrd.20219.

    PubMed  Article  Google Scholar 

  41. 41.

    Yamada K, Ogawa H, Tamiya G, Ikeno M, Morita M, Asakawa S, Shimizu N, Okazaki T: Genomic organization, chromosomal localization, and the complete 22 kb DNA sequence of the human GCMa/GCM1, a placenta-specific transcription factor gene. Biochem Biophys Res Commun. 2000, 278: 134-139. 10.1006/bbrc.2000.3775.

    PubMed  Article  Google Scholar 

  42. 42.

    Matin MM, Walsh JR, Gokhale PJ, Draper JS, Bahrami AR, Morton I, Moore HD, Andrews PW: Specific knockdown of Oct4 and beta2-microglobulin expression by RNA interference in human embryonic stem cells and embryonic carcinoma cells. Stem Cells. 2004, 22: 659-668. 10.1634/stemcells.22-5-659.

    PubMed  Article  Google Scholar 

  43. 43.

    Collins JE, Lorimer JE, Garrod DR, Pidsley SC, Buxton RS, Fleming TP: Regulation of desmocollin transcription in mouse preimplantation embryos. Development. 1995, 121: 743-753.

    PubMed  Google Scholar 

  44. 44.

    Chen L, Yang M, Dawes J, Khillan JS: Suppression of ES cell differentiation by retinol (vitamin A) via the overexpression of Nanog. Differentiation. 2007, 75: 682-693. 10.1111/j.1432-0436.2007.00169.x.

    PubMed  Article  Google Scholar 

  45. 45.

    Jiang J, Chan YS, Loh YH, Cai J, Tong GQ, Lim CA, Robson P, Zhong S, Ng HH: A core Klf circuitry regulates self-renewal of embryonic stem cells. Nat Cell Biol. 2008, 10: 353-360. 10.1038/ncb1698.

    PubMed  Article  Google Scholar 

  46. 46.

    Pan G, Li J, Zhou Y, Zheng H, Pei D: A negative feedback loop of transcription factors that controls stem cell pluripotency and self-renewal. FASEB J. 2006, 20: 1730-1732. 10.1096/fj.05-5543fje.

    PubMed  Article  Google Scholar 

  47. 47.

    van den Berg DL, Zhang W, Yates A, Engelen E, Takacs K, Bezstarosti K, Demmers J, Chambers I, Poot RA: Estrogen-related receptor beta interacts with Oct4 to positively regulate Nanog gene expression. Mol Cell Biol. 2008, 28: 5986-5995. 10.1128/MCB.00301-08.

    PubMed  PubMed Central  Article  Google Scholar 

  48. 48.

    Zhang X, Zhang J, Wang T, Esteban MA, Pei D: Esrrb activates Oct4 transcription and sustains self-renewal and pluripotency in embryonic stem cells. J Biol Chem. 2008, 283: 35825-35833. 10.1074/jbc.M803481200.

    PubMed  Article  Google Scholar 

  49. 49.

    Okumura-Nakanishi S, Saito M, Niwa H, Ishikawa F: Oct-3/4 and Sox2 regulate Oct-3/4 gene in embryonic stem cells. J Biol Chem. 2005, 280: 5307-5317.

    PubMed  Article  Google Scholar 

  50. 50.

    Cauffman G, De Rycke M, Sermon K, Liebaers I, Van de Velde H: Markers that define stemness in ESC are unable to identify the totipotent cells in human preimplantation embryos. Hum Reprod. 2009, 24: 63-70.

    PubMed  Article  Google Scholar 

  51. 51.

    Holzinger M, Bouffier L, Villalonga R, Cosnier S: Adamantane/beta-cyclodextrin affinity biosensors based on single-walled carbon nanotubes. Biosens Bioelectron. 2009, 24: 1128-1134. 10.1016/j.bios.2008.06.029.

    PubMed  Article  Google Scholar 

  52. 52.

    Adjaye J, Huntriss J, Herwig R, BenKahla A, Brink TC, Wierling C, Hultschig C, Groth D, Yaspo ML, Picton HM, et al: Primary differentiation in the human blastocyst: comparative molecular portraits of inner cell mass and trophectoderm cells. Stem Cells. 2005, 23: 1514-1525. 10.1634/stemcells.2005-0113.

    PubMed  Article  Google Scholar 

  53. 53.

    Cormier S, Le Bras S, Souilhol C, Vandormael-Pournin S, Durand B, Babinet C, Baldacci P, Cohen-Tannoudji M: The murine ortholog of notchless, a direct regulator of the notch pathway in Drosophila melanogaster, is essential for survival of inner cell mass cells. Mol Cell Biol. 2006, 26: 3541-3549. 10.1128/MCB.26.9.3541-3549.2006.

    PubMed  PubMed Central  Article  Google Scholar 

  54. 54.

    Voss AK, Thomas T, Petrou P, Anastassiadis K, Schöler H, Gruss P: Taube nuss is a novel gene essential for the survival of pluripotent cells of early mouse embryos. Development. 2000, 127: 5449-5461.

    PubMed  Google Scholar 

  55. 55.

    Robson P, Stein P, Zhou B, Schultz RM, Baldwin HS: Inner cell mass-specific expression of a cell adhesion molecule (PECAM-1/CD31) in the mouse blastocyst. Dev Biol. 2001, 234: 317-329. 10.1006/dbio.2001.0274.

    PubMed  Article  Google Scholar 

  56. 56.

    Sun C, Nakatake Y, Akagi T, Ura H, Matsuda T, Nishiyama A, Koide H, Ko MS, Niwa H, Yokota T: Dax1 binds to Oct3/4 and inhibits its transcriptional activity in embryonic stem cells. Mol Cell Biol. 2009, 29: 4574-4583. 10.1128/MCB.01863-08.

    PubMed  PubMed Central  Article  Google Scholar 

  57. 57.

    Pereira L, Yi F, Merrill BJ: Repression of Nanog gene transcription by Tcf3 limits embryonic stem cell self-renewal. Mol Cell Biol. 2006, 26: 7479-7491. 10.1128/MCB.00368-06.

    PubMed  PubMed Central  Article  Google Scholar 

  58. 58.

    Tanaka TS, Lopez de Silanes I, Sharova LV, Akutsu H, Yoshikawa T, Amano H, Yamanaka S, Gorospe M, Ko MS: Esg1, expressed exclusively in preimplantation embryos, germline, and embryonic stem cells, is a putative RNA-binding protein with broad RNA targets. Dev Growth Differ. 2006, 48: 381-390. 10.1111/j.1440-169X.2006.00875.x.

    PubMed  Article  Google Scholar 

  59. 59.

    Ambrosetti DC, Schöler HR, Dailey L, Basilico C: Modulation of the activity of multiple transcriptional activation domains by the DNA binding domains mediates the synergistic action of Sox2 and Oct-3 on the fibroblast growth factor-4 enhancer. J Biol Chem. 2000, 275: 23387-23397. 10.1074/jbc.M000932200.

    PubMed  Article  Google Scholar 

  60. 60.

    Torres-Padilla ME, Richardson L, Kolasinska P, Meilhac SM, Luetke-Eversloh MV, Zernicka-Goetz M: The anterior visceral endoderm of the mouse embryo is established from both preimplantation precursor cells and by de novo gene expression after implantation. Dev Biol. 2007, 309: 97-112. 10.1016/j.ydbio.2007.06.020.

    PubMed  PubMed Central  Article  Google Scholar 

  61. 61.

    Chakravarthy H, Boer B, Desler M, Mallanna SK, McKeithan TW, Rizzino A: Identification of DPPA4 and other genes as putative Sox2:Oct-3/4 target genes using a combination of in silico analysis and transcription-based assays. J Cell Physiol. 2008, 216: 651-662. 10.1002/jcp.21440.

    PubMed  Article  Google Scholar 

  62. 62.

    Saito S, Liu B, Yokoyama K: Animal embryonic stem (ES) cells: self-renewal, pluripotency, transgenesis and nuclear transfer. Hum Cell. 2004, 17: 107-115.

    PubMed  Article  Google Scholar 

  63. 63.

    De Felici M, Farini D, Dolci S: In or out stemness: comparing growth factor signalling in mouse embryonic stem cells and primordial germ cells. Curr Stem Cell Res Ther. 2009, 4: 87-97. 10.2174/157488809788167391.

    PubMed  Article  Google Scholar 

  64. 64.

    Torres J, Watt FM: Nanog maintains pluripotency of mouse embryonic stem cells by inhibiting NFkappaB and cooperating with Stat3. Nat Cell Biol. 2008, 10: 194-201. 10.1038/ncb1680.

    PubMed  Article  Google Scholar 

  65. 65.

    Suzuki A, Raya A, Kawakami Y, Morita M, Matsui T, Nakashima K, Gage FH, Rodríguez-Esteban C, Izpisúa Belmonte JC: Nanog binds to Smad1 and blocks bone morphogenetic protein-induced differentiation of embryonic stem cells. Proc Natl Acad Sci USA. 2006, 103: 10294-10299. 10.1073/pnas.0506945103.

    PubMed  PubMed Central  Article  Google Scholar 

  66. 66.

    Krawetz R, Kelly GM: Wnt6 induces the specification and epithelialization of F9 embryonal carcinoma cells to primitive endoderm. Cell Signal. 2008, 20: 506-517. 10.1016/j.cellsig.2007.11.001.

    PubMed  Article  Google Scholar 

  67. 67.

    Kemp C, Willems E, Abdo S, Lambiv L, Leyns L: Expression of all Wnt genes and their secreted antagonists during mouse blastocyst and postimplantation development. Dev Dyn. 2005, 233: 1064-1075. 10.1002/dvdy.20408.

    PubMed  Article  Google Scholar 

  68. 68.

    Meilhac SM, Adams RJ, Morris SA, Danckaert A, Le Garrec JF, Zernicka-Goetz M: Active cell movements coupled to positional induction are involved in lineage segregation in the mouse blastocyst. Dev Biol. 2009, 331: 210-221. 10.1016/j.ydbio.2009.04.036.

    PubMed  PubMed Central  Article  Google Scholar 

  69. 69.

    Futaki S, Hayashi Y, Emoto T, Weber CN, Sekiguchi K: Sox7 plays crucial roles in parietal endoderm differentiation in F9 embryonal carcinoma cells through regulating Gata-4 and Gata-6 expression. Mol Cell Biol. 2004, 24: 10492-10503. 10.1128/MCB.24.23.10492-10503.2004.

    PubMed  PubMed Central  Article  Google Scholar 

  70. 70.

    Murakami A, Thurlow J, Dickson C: Retinoic acid-regulated expression of fibroblast growth factor 3 requires the interaction between a novel transcription factor and GATA-4. J Biol Chem. 1999, 274: 17242-17248. 10.1074/jbc.274.24.17242.

    PubMed  Article  Google Scholar 

  71. 71.

    Murakami A, Shen H, Ishida S, Dickson C: SOX7 and GATA-4 are competitive activators of Fgf-3 transcription. J Biol Chem. 2004, 279: 28564-28573. 10.1074/jbc.M313814200.

    PubMed  Article  Google Scholar 

  72. 72.

    Shimoda M, Kanai-Azuma M, Hara K, Miyazaki S, Kanai Y, Monden M, Miyazaki J: Sox17 plays a substantial role in late-stage differentiation of the extraembryonic endoderm in vitro. J Cell Sci. 2007, 120: 3859-3869. 10.1242/jcs.007856.

    PubMed  Article  Google Scholar 

  73. 73.

    Kurimoto K, Yabuta Y, Ohinata Y, Ono Y, Uno KD, Yamada RG, Ueda HR, Saitou M: An improved single-cell cDNA amplification method for efficient high-density oligonucleotide microarray analysis. Nucleic Acids Res. 2006, 34: e42-10.1093/nar/gkl050.

    PubMed  PubMed Central  Article  Google Scholar 

  74. 74.

    Duncan SA, Manova K, Chen WS, Hoodless P, Weinstein DC, Bachvarova RF, Darnell JE: Expression of transcription factor HNF-4 in the extraembryonic endoderm, gut, and nephrogenic tissue of the developing mouse embryo: HNF-4 is a marker for primary endoderm in the implanting blastocyst. Proc Natl Acad Sci USA. 1994, 91: 7598-7602. 10.1073/pnas.91.16.7598.

    PubMed  PubMed Central  Article  Google Scholar 

  75. 75.

    Coucouvanis E, Martin GR: BMP signaling plays a role in visceral endoderm differentiation and cavitation in the early mouse embryo. Development. 1999, 126: 535-546.

    PubMed  Google Scholar 

  76. 76.

    Morris SM, Tallquist MD, Rock CO, Cooper JA: Dual roles for the Dab2 adaptor protein in embryonic development and kidney transport. EMBO J. 2002, 21: 1555-1564. 10.1093/emboj/21.7.1555.

    PubMed  PubMed Central  Article  Google Scholar 

  77. 77.

    Yang DH, Smith ER, Roland IH, Sheng Z, He J, Martin WD, Hamilton TC, Lambeth JD, Xu XX: Disabled-2 is essential for endodermal cell positioning and structure formation during mouse embryogenesis. Dev Biol. 2002, 251: 27-44. 10.1006/dbio.2002.0810.

    PubMed  Article  Google Scholar 

  78. 78.

    Barbosa-Silva A, Satagopam VP, Schneider R, Ortega JM: Clustering of cognate proteins among distinct proteomes derived from multiple links to a single seed sequence. BMC Bioinformatics. 2008, 9: 141-10.1186/1471-2105-9-141.

    PubMed  PubMed Central  Article  Google Scholar 

  79. 79.

    Perez-Iratxeta C, Palidwor G, Andrade-Navarro MA: Towards completion of the Earth's proteome. EMBO Rep. 2007, 8: 1135-1141. 10.1038/sj.embor.7401117.

    PubMed  PubMed Central  Article  Google Scholar 

  80. 80.

    Krzyzanowski PM, Andrade-Navarro MA: Identification of novel stem cell markers using gap analysis of gene expression data. Genome Biol. 2007, 8: R193-10.1186/gb-2007-8-9-r193.

    PubMed  PubMed Central  Article  Google Scholar 

  81. 81.

    Scaffidi P, Bianchi ME: Spatially precise DNA bending is an essential activity of the sox2 transcription factor. J Biol Chem. 2001, 276: 47296-47302. 10.1074/jbc.M107619200.

    PubMed  Article  Google Scholar 

  82. 82.

    Lim CY, Tam WL, Zhang J, Ang HS, Jia H, Lipovich L, Ng HH, Wei CL, Sung WK, Robson P, et al: Sall4 regulates distinct transcription circuitries in different blastocyst-derived stem cell lineages. Cell Stem Cell. 2008, 3: 543-554. 10.1016/j.stem.2008.08.004.

  83.

    Barbosa-Silva A, Fontaine JF, Donnard ER, Stussi F, Ortega JM, Andrade-Navarro MA: PESCADOR, a web-based tool to assist text-mining of biointeractions extracted from PubMed queries. BMC Bioinformatics. 2011, 12: 435. 10.1186/1471-2105-12-435.



This article has been published as part of BMC Genomics Volume 12 Supplement 4, 2011: Proceedings of the 6th International Conference of the Brazilian Association for Bioinformatics and Computational Biology (X-meeting 2010). The full contents of the supplement are available online.

Author information



Corresponding author

Correspondence to J Miguel Ortega.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

ERD and JMO conceived the project and wrote the paper. ERD performed the research and pathway construction. MJK curated the pathway biointeractions. ABS (author of SeedLinkage and LAITOR) and MAAN designed the PESCADOR platform. RLMG designed the SeedServer software and conducted the ortholog search. HV was responsible for the LCA determination. GRF constructed the UEKO database. All authors read and approved the final manuscript.

Electronic supplementary material


Additional file 1: Homolog clusters. Clusters of homologous sequences found by SeedServer for each of the genes in the preimplantation pathway. For each gene, the left column shows the UniProt ID of the clustered sequence and the right column shows the Taxonomy ID for this sequence. (PDF 278 KB)

Additional file 2: Ortholog functions in Drosophila melanogaster. This figure represents the D. melanogaster orthologs found by SeedServer and their respective interactions and functions in fruit fly development. Note that these orthologs are involved in processes related to D. melanogaster embryo development. See Additional file 3 for a table with gene name correspondence between the genes in this figure and those in Figure 3. (PDF 118 KB)

Additional file 3: Gene correspondence table. Human and Drosophila melanogaster gene name correspondence for the orthologs grouped by SeedServer. Column 3 lists the PubMed identifiers (PMIDs) of the papers where the functions described in Additional file 2 were found. (PDF 253 KB)

Rights and permissions

Open Access This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

Donnard, E., Barbosa-Silva, A., Guedes, R.L. et al. Preimplantation development regulatory pathway construction through a text-mining approach. BMC Genomics 12, S3 (2011).


Keywords

  • Inner Cell Mass
  • Preimplantation Development
  • Primitive Endoderm
  • Last Common Ancestor
  • Inner Cell Mass Cell