Analysis of two mechanisms of telomere maintenance based on the theory of g-Networks and stochastic automata networks

* Background Telomeres, which are composed of repetitive nucleotide sequences at the end of chromosomes, behave as a division clock that measures replicative senescence. Under the normal physiological condition, telomeres shorten with each cell division, and cells use the telomere lengths to sense the number of divisions. Replicative senescence has been shown to occur at approximately 50–70 cell divisions, which is termed the Hayflick’s limit. However, in cancer cells telomere lengths are stabilized, thereby allowing continual cell replication by two known mechanisms: activation of telomerase and Alternative Lengthening of Telomeres (ALT). The connections between the two mechanisms are complicated and still poorly understood. * Results In this research, we propose that two different approaches, G-Networks and Stochastic Automata Networks, which are stochastic models motivated by queueing theory, are useful to identify a set of genes that play an important role in the state of interest and to infer their previously unknown correlation by obtaining both stationary and joint transient distributions of the given system. Our analysis using G-Network detects five statistically significant genes (CEBPA, FOXM1, E2F1, c-MYC, hTERT) with either mechanism, contrasted to normal cells. A new algorithm is introduced to show how the correlation between two genes of interest varies in the transient state according not only to each mechanism but also to each cell condition. * Conclusions This study expands our existing knowledge of genes associated with mechanisms of telomere maintenance and provides a platform to understand similarities and differences between telomerase and ALT in terms of the correlation between two genes in the system. This is particularly important because telomere dynamics plays a major role in many physiological and disease processes, including hematopoiesis.


Background
The introduction of Stochastic Automata Networks (SANs) [1] has led researchers to explore an efficient solution of finite multi-component Markov chains of potentially very high dimension. The main idea of the SAN formalism is to employ Kronecker algebra operations for generating the infinitesimal generator (also known as an intensity matrix or transition rate matrix) of a Markov chain in a product form in order to ease the problem of dimensionality and the complexity of the vector-matrix product [2,3]. Specifically, a compact representation using tensor product and sums of matrices of the form, which is formally proved in [1] provides the transition rates of a time-continuous Markov chain of the type just described. Notation: N is the number of automata, R is the number of synchronizations, and L i and M r,i are matrices containing information about the local transitions and the effect of the synchronization r on the automaton i, respectively. N r,i is the normalizing matrix of M r,i , and ⊗ and ⊕ denote, respectively, the (generalized) tensor product and tensor sum [4]. Notation will be discussed in detail in the following section. For a decade after SAN introduction, many outstanding analytical results have been proved. For example, Plateau and Stewart [5] proved in year 2000 that a product-form steady-state distribution for SANs without synchronizations exists, as long as some numerical conditions related to local balance are satisfied. Prior to that, Boujdaine et al. in 1997 [6] considered a special class of SAN having limited synchronization, and proved a sufficient condition for existence of stationary distribution, by applying properties of Kronecker sum. Note that classic queuing networks such as Jackson's networks and G-Networks with positive and negative customers [7] fall under this special class.
Most analyses and applications of SANs focus on finding steady-state distribution of queuing system models. To our knowledge, correlation analysis in SANs is rarely examined, although correlation quantifies the degree of inter-relatedness of two automata. It contributes to the understanding of how the association of two automata changes with time according to the state of interest, with the behavior of other automata in the system simultaneously considered. In the current study, we first examine the exact-form of an infinitesimal generator for G-Networks with positive and negative customers. Then it is applied to a gene regulatory network (GRN) having five genes related to the onset of cancer in order to show the time-dependent dynamics of transition rates and correlation between two particular genes of interest. Throughout this paper, our analysis is based on time-continuous Markov chains, but we note that most results are valid in the time-discrete case provided complications such as periodicity are excluded.

Stochastic automata network
The SAN consists of a number of automata, which are correlated by synchronizing transitions. Each automaton has states and transitions. The state space of the system is the Cartesian product of the states of the automata [6]. There exist two types of transitions: local and synchronizing.
1. Transition in one automaton may initiate a new simultaneous transition in another automaton. Such transitions are collectively referred to as synchronizing transitions. In one synchronizing transition, two automata are paired. The first is called a master automaton; it affects the state of another automaton as its state changes. The second is called a slave automaton; it is affected by its master automaton [8]. 2. In any given automaton, transitions not classified as synchronizing transitions are local transitions. Such transitions only change the state of one automaton.
In the SAN, two assumptions are made. First, the time to transition is exponentially distributed, and second, that a multidimensional Markov chain represents the SAN, although a particular automaton may not be Markovian itself. Then the (global) infinitesimal generator (the matrix of time derivatives at 0 of the transition probabilities), which reflects a change in transition probability from one state to another, is given by Eq. 1, with notation explained in Table1. Specifically, each matrix in Q is defined as follows: • L i is a local normalized transition rate matrix of automaton i; thus • For each synchronizing transition, for all i ∈ {1, 2, · · · , N}, M r,i = I K , where I K is an identity matrix of size K, except for the case when r is two indices corresponding to master and slave automaton. Therefore where msr(r) and sl(r) are indices for master automaton and slave automaton, respectively, in the r th synchronization. Here, D r,msr(r) denotes the transition rate matrix due to the r th synchronizing event on the master automaton, andD r,msr(r) is its diagonal matrix. E r,sl(r) , the transition probability matrix, is the effect of the r th synchronizing event on the automaton sl(r).
whose the l th diagonal element is the negative sum of the l th row of N i=1 M r,i .

G-Networks with positive and negative customers
Gelenbe-or G-Networks [9], devised by Erol Gelenbe, are Markovian stochastic models grounded in queueing network theory [10]. They were extensively applied to study probability models of various objects [11][12][13][14][15][16][17][18]. However they are particularly appropriate to construct GRNs, as they introduce a novel notion called a 'negative customer' , which can be biologically interpreted as a repression signal [19]. G-Networks model the number of positive customers, which is mRNA expression level biologically, in an automaton or a gene having an infinite state space. Positive and negative customers synchronously move from a master gene to a slave gene within the system with transition probabilities, p + msr(r),sl(r) and p − msr(r),sl(r) , respectively. It implies that there are two types of the synchronizing transition. Note that the behavior of positive customers identical to that of positive customers in Jackson networks [20]. In contrast, negative customers function distinctively as follows: they are not accumulated and instantaneously leave a queue after the completion of their tasks. If a negative customer arrives at a non-empty queue, it destroys one current positive customer. However if the queue is empty, then the negative customer disappears without locally affecting the queue. Both types of customers arrive at the i th gene from outside of the system at rates λ + i and λ − i , respectively. It is assumed that service disciplines are the same for all queue [21], and the service times for each queue i are independent and identically exponentially distributed with rates μ i ∈ (0, ∞) for i = 1, 2, · · · , N. Figure 1 visually illustrates four activities of mRNA and/or protein molecules, and Table 2 suggests how the biological terms that refer to these activities can be identified with those used in G-Networks.
G-Networks have an advantageous property of being analytically tractable because of the existence of productform stationary distributions under the typical Markov Chain assumptions as shown in Eq. 2 [7,22]. Let x = {x 1 , x 2 , · · · , x N } be the vector of nonnegative integers representing the state of the network. Then the time-dependent vector, {x(t) : t ≥ 0}, is a continuous-time Markov chain, which satisfies the system of Chapman-Kolmogorov equations [23]. Element x i (t), , is the number of customers (or the mRNA expression level) of gene i at time t. It has been proved that the joint steadystate distribution π (x) of x(t), has the form of a product of stationary distributions of each queue, each satisfying the balance equation of the G-Networks [23][24][25], i.e.
for i = 1, 2, · · · , N. It implies that the stationary probability for each positive recurrent state is expressed in the terms of a product of functions depending solely on the state of a single queue. According to [8], matrices constituting the global infinitesimal generator Q are defined as follows in the  The first and second columns contain biological and G-Networks terminologies of gene regulation, respectively. The third column includes the corresponding notations used in this study. The subscript i indicates the ith gene terms of customer-related rates: , if a customer leaves a master automaton and moves to a slave automaton as a positive customer , if a customer leaves a master automaton and moves to a slave automaton an as a negative customer Low , if a negative customer arrives at a slave automaton sl(r) if a slave automaton receives a positive customer

Results
This section consists of two parts. The first part shows the mathematical result that derives the correlation between two genes utilizing Kronecker algebra. The second part contains the biological results based on the application of the mathematical result to the gene regulatory network associated with telomere biology.

Mathematical result: derivation of joint transient distribution of two automata
If Q is the infinitesimal generator and P(t) is the transition probability matrix of a finite Markov chain, we can derive a set of differential equations called Kolmogorov's forward equations, in the form of P (t) = P(t) · Q. Solution of the forward equation is given in this case by the power series expansion that converges for any square matrix Q: The resulting 1 × K N -dimensional time-dependent state probability vector at time t has the form where π(t) is a state probability vector at time t [26]. In order to find a joint state probability vector π i,j (t) of two automata, say i and j, we introduce two matrices C i,j and C * i,j , which are K N × K 2 -and K 2 × K N -dimensional, respectively. After several steps of calculation, we obtain where, for all i = j , i, j = 1, 2, · · · , N, C i,j and C * i,j are defined by where , if n = i and n = j . Therefore Eq. 6 finally becomes Equation 8 implies that a time-dependent joint state probability vector π i,j (t) can be easily calculated, once the infinitesimal generator, Q, of the entire system is obtained.
We can then form a (K × K)-dimensional table that describes the joint probabilities for each time point t.
Based upon it, the association between two genes for all t is evaluated using the Pearson product-moment correlation coefficient [27]. It implies the correlation of a pair of genes in the regulatory system varies by each time (each generation of cells) as shown in Fig. 2.

Application to the gene regulatory networks GRNs and parameters from g-Networks
In this research, we use a GRN provided in Fig. 3. It consists of 5 statistically significant genes associated with telomere maintenance, which are discovered by (modified) Abnormal Pathway Detection Algorithm (APDA) based on G-Networks [18]. (See Methods for detailed information on the modification). We explore the correlation between a pair of genes, e.g., CEBPA and hTERT, which is not apparent from Fig. 3. On the biology side, Kirwan et al. in 2009 [28] reported that mutations in hTERT within 4 of 20 families were identified in familial acute myeloid leukemia (AML). It is known that the mutation of CEBPA is one of the important factors in AML and its prognosis [29]. However the direct relationship between CEBPA and AML is still not satisfactorily understood. The joint state probability vector of two genes, which can be calculated from Eq. 9, is expected to show the flow of probabilities for the K 2 -tuples of mRNA expression levels at each fixed time t, and thereby illustrates the relationship between any two genes.
In this case, the number of automata is equal to N = 5, while the number of synchronizing events is equal to R = 7. We limit the number of gene expression levels to K = 7 for two reasons. The first is that, because the data are normalized, there are few genes with a mRNA expression level of 7 or higher. Furthermore, the memory limitations enforce it. In this paper, we assume that pairs of genes are highly positively correlated at the initial time point; that is Note that this matrix needs to be transformed into a 1 × K 2 -dimensional joint state probability vector as   Table 3 allow to calculate a joint probability vector π i,j (t), under the normal condition, with the Table 4 contains information for malignant (ALT or active telomerase) cells. The three families of parameters, λ + i , λ − i and μ i for all i, are identical for both normal and ALT cells, while transition probabilities, p + ij , p − ij and d i , are different between the cell types. Transition probabilities are deciding factors in correlation coefficients based on Eq. 9 and the rates of convergence to stationarity in G-Networks. Table 3 Values of the parameters needed to determine the infinitesimal generator (Q) using 4 normal cell lines, where  The "service" (or firing) rate of gene i, denoted by μ i , represents the protein-protein interactions, e.g., phosphorylation and ubiquitination. Gene i activates and inhibits gene j with probability p + ij and p − ij , respectively. Genes in the rows correspond to the "starting" genes, while those in columns to the ending genes The definitions and representations of d i , μ i , p + ij and p − ij are the same as in Table 3 Simulation study using estimated parameters Estimated and/or assumed values of the parameters listed in Tables 3 and 4 determine the infinitesimal generator, Q, and the steady-state distribution of each gene. Developed on the global balance equation of G-Networks [23] and the exponentially-distributed holding times of Markov chains [30], the empirical cumulative density function (ECDF) was exploited to confirm that parameter estimation via the APDA is appropriate aligning with the theory of G-Networks. We compare the ECDF of simulated data utilizing values in Tables 3 and 4 and the theoretical cumulative density function (CDF) based on q i in Eq. 3. The algorithm for the simulation is explained in detail in Methods. The ECDF, usually denoted byF n (x), is associated with cumulative frequency of observations (empirical sampled data). It is a non-parametric estimator of the underlying CDF of a random variable of interest and is formally defined as follows: where 1 [·] is an indicator function. That is, the ECDF is a step function that increases by 1 n at each datum. The stationary probability distribution of each gene in G-Networks is in the geometric product-form as stated in Eq. 2, which results in the following tail of the theoretical CDF, T (x i ): Figure 4 (and Fig. 9) illustrates how the ECDF of hTERT utilizing estimated/assumed values of the parameters (Tables 3 and 4) converges to the theoretical CDF regardless of cell types. Similar patterns are obtained for all 5 statistically significant genes in this research.

Correlation between a pair of genes
We discuss the similarities or differences in the correlation of two genes between normal and malignant cells with either active telomerase or ALT. Figure 5 based on Eq. 9 illustrates the following: • First, Fig. 5 demonstrates the change of correlation between a pair of genes in the given system over time in normal cells and ALT (malignant) cells. We can then infer both the trend of correlation between a pair of genes and the speed to the transient state, i.e., before it reaches a steady state. Due to the product-form stationary probability distribution of network state in the theory of G-Networks, the correlation coefficient becomes 0 when the system becomes stationary [31]. Assuming that the pair of genes are positively correlated at time 0, the positive correlation persists over time in both types of cells (normal and ALT) and practically reaches 0 (the steady state) approximately at t = 0.8. As shown in the figure, the patterns of correlation between each pair of genes do not noticeably differ among the types of cells (normal or malignant with either ALT or active telomerase shown), although they are dissimilar from one pair to another within the same type of cells. For example, the correlation between CEBPA and FOXM1 in a transient state is stronger than that of c-Myc and hTERT, and it reaches a steady state more slowly in all cell types. • Second, Eq. 9 uncovers the interactions between (1) CEBPA and hTERT, (2) FOXM1 and hTERT and (3) FOXM1 and E2F1, which are unknown in Fig. 3. The pattern in Fig. 5, which is analogous to Fig. 7, delineates that the relationship of the three pairs of genes is similar regardless of the mechanisms by which telomeres maintain their sufficient lengths. However, within the same condition of cells, the Note that if we conversely assume that a pair of genes are highly negatively correlated at the initial time point, then the negative correlation prevails over time as shown in Fig. 8.

Discussion
One important advantage of using mathematical modeling to understand dynamics of a biological system is that the model can be a cost-effective and time-saving supplement or even a substitute for laboratory experiments in which patterns of the system are delicate and complex. Especially the stochastic approach is useful to quantify the role of fluctuations in the behavior of the system of interest. In this paper, we propose that a stochastic paradigm involving G-Networks and Stochastic Automata Networks both stemming from queuing theory, can contribute to the analysis of similarities and differences between telomerase activation and ALT using genes related to telomere maintenance. However, there are the following two caveats.
• First, correlation coefficient, whose sign relies upon the initial joint state probability vector π ij (0) in our example, evaluates the strength of the evidence for a relationship of two genes but does not determine causal relationships. In other words, even if a positive correlation between CEBPA and hTERT in ALT cells is revealed, it is not known which gene activates another one.
• Second, the dimension of the infinitesimal generator, Q, which is a function of tensor products, can be enormously large, with the corresponding matrix being sparse depending on the number of states of each queue. The state space used in this research is relatively small. Computations in larger dimensions can be made practical by using Compressed Sparse Row and a number of methods for the approximation of the matrix exponential [32] such as Krylov-type techniques.

Conclusion
In spite of the caveats discussed in the previous section, the APDA based on G-Networks initially introduced in [18] can suggest a set of statistically significant genes (CEBPA, FOXM1, E2F1, c-MYC, hTERT) in cells with either telomerase or ALT compared to normal cells from preexisting GRNs. We further confirmed that the APDA satisfactorily estimates the parameters, which decide the rates of convergence to stationarity as the ECDF of the simulated data using the estimated parameters approaches to theoretical CDF using q i∈{1,2,··· ,N} as time progresses. Further, our correlation analysis based on SANs helps to infer the link between genes in various conditions that has not yet been discovered through experiments. We prove that our mathematical expression (Eq. 9) allows analytically calculating the correlation coefficients between any pair of genes in the given system at every time point and compare them in various types of cells for an order relation. Specifically in our example, the trend of correlation among genes in the system is not influenced by the mechanism (telomerase vs. ALT) by which telomeres maintain their sufficient lengths, even though the trend differs from one pair to another within the same type of cells. To conclude, this study provides a platform to detect significant genes and infer the previously unknown connection among them by applying modeling techniques borrowed from queueing theory. This is particularly important because telomere dynamics plays a major role in many physiological and disease processes, including hematopoiesis.

Detailed proof of Eq. 9
If Q is the infinitesimal generator and P(t) is the transition probability matrix of a finite Markov chain, we can derive a set of differential equations called Kolmogorov's forward equations, in the form of P (t) = P(t) · Q d dt P(t) ≡ P (t) = P(t) · Q . P(0) initial condition is P(0) = I, where I is an identity matrix having the same dimension as Q [33], consistent Solution of the forward equation for finite Markov chains is given by the power series expansion that always converges for any square matrix Q: where ||·|| denotes l 1 -norm, and H(t) is is a linear operator with || H(t) || = o(t), as t → 0. When the infinitesimal generator, Q, is irreducible, its transition probability matrix, P(t) defined in Eq. 11, is strictly positive for each time t > 0. It signifies that as the trajectory length N approaches to the limit, at least one transition between any pair of states will almost surely happen without regard to the sparsity of Q [34]. Note that 0 = π · Q are a set of |S| linear equations, which is usually used to detect the stationary distribution π, if such one exists [26]. A formal solution to the time dependent state probability vector is defined as where π(t) is a state probability vector at time t [35,36]. In order to find a joint state probability vector π i,j (t) of two automata, say i and j, we introduce two matrices called C i,j and C * i,j , which are K N × K 2 -and K 2 × K Ndimensional matrices, respectively. Then based on Eq. 12, we achieve Here, C i,j and C * i,j are defined as follows respectively: where U i,j,n = I K , if n = i or n = j 1 K , if n = i and n = j Note that 1 T K indicates the matrix transpose of 1 K , and we have This results in , if n = i and n = j Therefore Eq. 13 finally becomes Equation 14 implies that a time-dependent joint state probability vector π i,j (t) can be easily calculated, once the (global) infinitesimal generator of the entire system is obtained.
Once π i,j (t) of two genes is obtained, we then can form a (K × K)-dimensional contingency table that describes the joint probability for each time point t. Based upon it, the association between two genes for all t is evaluated using the Pearson product-moment correlation coefficient.

Application to telomere maintenance Data description
Gene expression data (GSE14533) were obtained from the National Center for Biotechnology Information Gene Expression Omnibus (http://www.ncbi.nlm.nih.gov/geo/). Data includes expression gene levels of 3 types of cells: 18  [37]. However, only 5 significant genes (CEBPA, E2F1, FOXM1, c-MYC, hTERT) (connected to statistically significant edges) in either telomerase-positive or ALT cells compared to the normal cells, discovered by APDA based on G-Networks [18], have been included in this research. Note that the data were first normalized to the 50 th percentile to guarantee the identical medians across all samples. Then they were normalized again and scaled with mean 3 and variance 1 [18,38]. Table 5 and Fig. 6 outline the data in detail.

Modification of the aPDA
We adopt APDA introduced originally in [18], but slight modification of assumptions on λ + i and λ − i is added in this research as follows: • Define initiating genes as genes which do not receive any customer from other genes but send a customer to others within the system. • The arrival rate of negative customers from the external system, denoted by λ − i , of the initiating genes is assumed to be 0. This assumption enables the initiating genes to maintain their customers; thereby to be consistently activated.
• The arrival rate of positive customers from the external system is defined as follows: wherex i is the average mRNA expression level of the gene i in normal cell lines. Here, D in i represents the in-degree of gene i, which is the number of edges incoming to node i. The purpose of adding 3 to the non-initiating genes is to prevent the numerator of q i from becoming 0. The definition of λ + i for the initiating genes stems from the following logic: the in-degree, arrival rate of the negative customers from the external system and transition probabilities of an initiating gene are all 0; that is, In such cases, the parameters of stationary distribution in Eq. 3 are q i = λ + i μ i , resulting in λ + i = q i × μ i for all i ∈ Indices of initiating genes . Because the numerical value of q i is unknown yet, we estimate it based on the property of the geometric distribution with parameter (1 − q i ) as follows: Equation 16 finally leads to the definition of λ + i for i ∈ Indices of initiating genes . Note that, according to Eq. 3,q i describes the estimate of the steady-state probability that there is at least one mRNA of the gene i [18]. Table 2) in this research are identical to those in [18].

Algorithm: simulation study
According to the global balance equation of G-Networks [23][24][25], a movement of mRNA expression levels of a gene in small time t is decided by one of the 4 cases as shown in Table 6.
Algorithm 1 contains the details of the simulation assessing the estimated parameter values from the APDA. Algorithm 1 Simulation assessing the estimated parameter values from the APDA 1: Calculate the transition rates for each case. For Case 1 and Case 2, the dimension of the transition rate matrix will be 1 × n, while it is n × n for the others. 1 represents an indicator function.
-Case 1 : Track the time to the next movement, T, which follows the exponential distribution [30]: 3: Obtain the transition probabilities for each case as follows: : Set the mean mRNA expression level of each gene as initial points, x 0 = {x 1 ,x 2 , · · · ,x N }. 5: Select one case from 4 total cases according to the appropriate transition probabilities obtained in Step 3. 6: If either Case 1 or Case 2 is selected, then randomly choose a number from {1, · · · , n}. If not, then choose two numbers without replacement. 7: Update the mRNA expression levels according to the case chosen.