 Methodology article
 Open Access
 Published:
A skellam model to identify differential patterns of gene expression induced by environmental signals
BMC Genomics volume 15, Article number: 772 (2014)
Abstract
Background
RNAseq, based on deepsequencing techniques, has been widely employed to precisely measure levels of transcripts and their isoforms expressed under different conditions. However, robust statistical tools used to analyze these complex datasets are lacking. By grouping genes with similar expression profiles across treatments, cluster analysis provides insight into gene functions and networks that have become increasingly important.
Results
We proposed and verified a cluster algorithm based on a skellam model for grouping genes into distinct groups based on the pattern of gene expression in response to changing conditions or in different tissues. This algorithm capitalizes on the skellam distribution to capture the count property of RNAseq data and clusters genes in different environments. A twostage hierarchical expectationmaximization (EM) algorithm was implemented to estimate the optimal number of groups and mean expression levels of each group across two environments. A procedure was formulated to test whether and how a given group shows a plastic response to environmental changes. The model was used to analyze an RNAseq dataset measured from reciprocal crosses of early Arabidopsis thaliana embryos that respond differently based on the extent of maternal and paternal genome contributions, from which genes associated with maternal and paternal contributions were identified. Simulation studies were also performed to validate the statistical behavior of the model.
Conclusions
This model is a useful tool for clustering gene expression data by RNAseq, thus facilitating our understanding of gene functions and networks.
Background
The transcriptome is the total set of transcripts in a given organism at a specific developmental stage or under external environmental condition. Understanding the transcriptome is therefore essential to interpret the relationship between genome and organism function. Transcriptomics can be used to gain considerable biological insight by cataloguing all species of transcripts, determining the transcriptional structure of genes, and quantifying the changing expression levels of each transcript under various conditions [1–3]. RNAseq, a nextgeneration sequencing technique, quantifies the transcriptome at a given moment in time, allowing for a better understanding of genome structure, gene expression patterns and gene regulatory networks [4, 5]. The organism can alter transcriptome levels and pattern responses to environmental changes [6, 7]. RNAseq is a powerful tool used to identify specific genes associated with adaptive environments; such studies can assess genes involved in adaptation to environmental changes, particularly under different stresses or in various developmental stages. We hypothesized that, while an organism responds to growth conditions, particular environmental cues cause differential expression of its genes at a level that can be detected by RNAseq. By profiling transcriptional changes induced by environmental changes, it is possible to identify gene regions or pathways that are likely to be targets of selection. This information is important to enable researchers to assess variation across gene regions, on a landscape scale, to predict the capacity of organisms to adapt to different conditions. Recently, RNAseq experiments have evaluated differential mRNA processing events along the developmental gradient, as well as in different tissues, to account for the reaction norms of gene expression profiles [8–10]. In addition, RNAseq has been used to assess the physiological response of organisms at different spatial scales and gain more insight into adaption mechanisms [11].
To better understand responses of gene expression to growth conditions, cluster analysis has been used as a powerful computational tool to divide genes into groups according to their expression patterns. In biology, cluster analysis is implied by the basic assumption that a gene expression profile may have similar features within the group [12–14]. Despite their widespread use, traditional approaches, such as hierarchical clustering algorithms and kmeans algorithms, are largely heuristic, lacking a stringent inference about the underlying biological mechanisms. On the other hand, a modelbased clustering approach assumes that the data are generated by a mixture of the underlying probability distribution components, in which a different group or cluster represents a component [15–18]. Also, this approach is flexible in choosing the component distribution and obtaining density estimation for each cluster. Nevertheless, most existing approaches for modelbased cluster analysis have several limitations. First, the level of gene expression determined by RNAseq is represented by the abundance of short reads, mapped to the reference, which is defined as a set of exons [19]. In practice, modelbased cluster analysis is computationally difficult, especially because some genes are expressed at a very high level. In general, to discover important biological changes in expression and eliminate calculative hardship, normalization continues to be an essential step in the analysis, but most normalization methods neglect data features [20]. As a type of count data, three discrete probability distributions: binomial, Poisson and negative binomial (NB), have been used to model RNAseq data [21–23].
Second, a regular RNAseq experiment designed is to compare gene expression levels between test conditions. By comparing differential expression across treatments, one can characterize key genes that regulate the pattern of an organism’s response to rapid and stochastic environmental changes. Joint clustering for expression amounts in different treatments has been developed [24], but this strategy may not be sensitive to identify the differential response of genes to environmental changes, i.e., phenotypic plasticity [15]. The phenotypic plasticity of a gene can be expressed as the difference or ratio of expression amounts of the gene between two particular treatments. Since the difference and ratio of two Poisson variables requires totally different treatments of statistical modeling, we will, in this study, focus on modelbased clustering for treatmentdependent differences to accommodate environmental impact.
Although some attempt has been made to overcome the first limitation [24], simultaneous treatment of the two limitations has not been explored in the literature. Here, we developed a computational model that clusters the differences between two statistically independent random variables, each having a Poisson distribution. Since the difference of two Poisson variables follows a skellam distribution [25], skellam parameters were implemented within a mixture model framework in which each component is represented by a distinct pattern of expression differentiation. Model parameters are estimated through the twostage hierarchical expectation–maximization (EM) algorithm. Mean level of gene expression for a group is calculated for different environments, allowing us to compare the response level of gene expression to environmental changes. Results from this skellam model will obtain diverse insight into the genetic basis underlying adaptation to environments. The skellam model was used to analyze an RNAseq dataset collected for early Arabidopsis thaliana embryos derived from reciprocal crosses in the onetotwocell stage [26]. By comparing it with conventional kmeans and selforganizing mapping approaches, we show that the new model is statistically more powerful for gene clustering.
Methods
Mixture modelbased likelihood
The most common type of transcriptome study is carried out to measure the response of organisms to two treatments. This type of analysis is especially useful for comparison of expression in different organs, treated versus untreated conditions in the same tissue, or studying the difference between reciprocal crosses, etc. Suppose we obtain a transcriptome dataset in which the organism is measured for reads of n genes with two treatments (1 and 2), and expression reads of gene i are denoted as X_{ i } and Y_{ i }, respectively. In general, genes that are differentially expressed can be identified by determining differential expression between treatments. To assess gene expression changes across treatments, cluster analysis is a powerful tool for analyzing gene expression levels according to different patterns of gene expression. Therefore, we can discern different groups of genes per their functional similarities and differences in their plastic responses to changes in environment.
For any gene i, it should arise from one of the J groups that are classified on the basis of two expression values with two treatments. The joint likelihood of the expression data z_{ i } = (X_{ i }  Y_{ i }) of n genes is written as
where θ represents a set of unknown parameters, π_{ j } represents the probability of group j(j = 1, …, J) in the total genes, and f_{ j }(z_{ i }) represents the density function of two expression difference values for gene i that belongs to group j with the two treatments.
We used a skellam distribution function to describe f_{ j }(z_{ i }), which is specified by the mean values of gene expression with treatment 1(θ_{j1}) and 2(θ_{j2}). Let X_{ i } and Y_{ i } denote two independent random Poisson variables with mean θ_{j1}, θ_{j2} for group j, respectively. The two variables are expressed as one independent random variable: z_{ i } = X_{ i }  Y_{ i }. A skellam distribution of z_{ i } for gene i is described by a probability density function, expressed as
where θ_{j1} and θ_{j2} represent the mean expression values of all genes that belong to group j in treatment 1 and 2, respectively, with the two parameters arrayed in Λ_{ J } = (θ_{j1}, θ_{j2}). Here, f_{ j }(z_{ i }) in the mixture model (1) is specified by f_{ j }(Z = z_{ i }Λ_{ j }).
Estimation via the EM algorithm
Maximumlikelihood (ML) estimation is more complicated since the likelihood involves the modified Bessel function. If the true data X_{ i } and Y_{ i } are observed, then the estimation is straightforward since their means would be the ML estimates for Poisson parameters. Here, an EMtype algorithm is constructed based on the missing data representation of difference values Z. Unlike a general skellam model, the likelihood of z_{ i } is formulated within a mixturemodel framework (1), whose estimation is based on implementation of the EM algorithm. Thus, we implemented a twostage hierarchical EM algorithm to estimate the parameters Λ_{ j } of the likelihood (1).
In the E step, we calculate the conditional expectation of X_{ i } by
where f* is the density of joint distribution of (X_{ i }, Y_{ i }). Meanwhile, we calculate the posterior probability of gene i that belongs to group j,
In the M step, we obtained the estimates of parameters π_{ j } and Λ_{ j } by using
where E and M steps are iterated between equations (3–7) until the estimates of the unknown parameters converge to stable values. Estimates obtained this way represent the maximumlikelihood estimates (MLEs) of the parameters.
Choosing an optimal number of groups
One important question in the implementation of modelbased clustering analysis is to determine the actual number of clusters using a model selection criterion, such as BIC. For a given number of clusters J, we calculate the likelihood L by (1) and the BIC by  2 log(L) + J log(n), where n is the number of genes in the model. A low value of BIC corresponds to an optimal number of clusters.
Hypothesis tests
After an optimal number of gene clusters is determined, we tested whether genes are expressed differentially between treatments. Three biologically meaningful tests were formulated as follows:

(i)
For a given group j, we want to know whether its genes are differently expressed between the two treatments. This can be tested using the following equation:
If the H_{0} is accepted, then the group of genes expressed between the two treatments is stable. Otherwise, they exhibit differential expression across treatments, in which case, they can be used as a predictor of environmentalinduced changes.

(ii)
For a pair of groups j and l, we want to know whether they interact with each other to determine environmentalinduced changes. This can be determined using the following equation:
If the H_{0} is rejected, then these two groups of genes have significant interaction effects on biological changes between treatments.

(iii)
For a particular group j, we want to know whether changes in gene expression for a group are consistent with the extent of change of the environment. This can be determined using the following equation:
where c represents the difference between the environmental signals between treatments. If H_{0} is rejected, then the change in gene expression for the group is consistent with a change in the environment between treatments.
For each of the hypotheses (8–10), the likelihood ratio test statistics (LR) between these two hypotheses H_{0} and H_{1} are calculated. Since the H_{0} is nested within H_{1}, the LR value can be thought of being chisquare distributed, with the degree of freedom equaling the difference between the numbers of parameters to be estimated under the two hypotheses. The LR value is compared with a critical threshold to determine the acceptance or rejection of the null hypothesis. If these tests are incorporated by a particular environmental signal, e.g., temperature or nutritional level, we can better understand the relationship between gene expression and environmental change.
Results
Working example
The prevailing theory for the maternaltozygotic transition in plants proposes that most early embryonic mRNAs are maternally derived, resulting either from maternal inheritance or from higher transcriptional activity of maternally derived genes until the globular stages. However, this theory is difficult to reconcile with reports of equivalent maternal and paternal expression of interrogated genes at the preglobular stage. Recently, a study aimed to determine the origins of embryonic transcripts globally by reciprocally crossing polymorphic Col0 and Cvi0 Arabidopsis thaliana accessions Col0 × Cvi0 and Cvi0 × Col0; the transcriptomes of embryos with onetotwo cells were then measured for two reciprocal crosses [26], from which a total of 1,521 differential expression genes were gained.
The skellam model was used to analyze this data, clustering 1521 DE genes into distinct groups. We used BIC to determine an optimal number of gene groups. From the plot of BIC value against the number of groups, 13 was found to be an optimal number of groups (Figure 1). For each group j, the mean values of gene expression (θ_{j1} and θ_{j2}) in reciprocal crosses were estimated, with reasonable good standard errors, by a resampling approach (Table 1). In practical calculations, the estimate of θ_{ j } is sensitive to the choice of initial values. To obtain a global maximum, multiple initial values have been selected and compared. Figure 2 illustrates mean expression values of each group in two crosses; 13 groups not only display differential levels of gene expression, but also vary dramatically in terms of the difference of expression between reciprocal crosses. In Figure 3, we showed the pattern of how genes are differently expressed over different crosses. As can be seen, 13 groups of genes did not parallel, exhibiting significant gene–environment interactions under reciprocal cross conditions.
The hypothesis test (8) provided information regarding the significance of expression differences between treatments to determine the extent of the maternal and paternal contributions. Of these 13 groups, gene expression levels from group 3 (accounting for nearly 84% of genes) tended to be stable between reciprocal crosses, although change in gene expression was statistically significant (P < 0.05) (Table 2). This indicates that most genes of maternal and paternal genomes contribute slightly differently to Arabidopsis thaliana embryos at the onetotwo cell stage. Approximately 6% of genes (groups 1, 5, 6, 7, 9, 10, 12, and 13) and about 10% (groups 2, 4, 8, and 11) were clearly down or upregulated from Col0 × Cvi0 to Cvi0 × Col0, respectively, suggesting that they were preferentially inherited from one parent in onetotwo cell embryos. Hypothesis test (9) was used to determine whether a particular pair of gene groups interacts with the environment. Table 3 lists the significance test used for such gene–gene interactions. All pairs of gene groups exhibited significant gene–environment interactions (P < 0.05). Hypothesis test (10) was utilized to investigate whether gene expression was consistent with environmental change. Except for group 2, all groups conform to the extent of environmental change (Table 4). All calculations and hypothesis tests done above took about 24 h in a 225nodes computing cluster.
The data were also analyzed by traditional approaches, kmeans and selforganization mapping (SOM). Kmeans is a partitioning approach, whereas SOM is a method based on a machine learning algorithm that uses a competition and cooperation mechanism to achieve unsupervised learning, processed as implemented in the R package yasomi [27, 28]. It was observed that kmeans and the skellam model produce a similar result, different from that by SOM (Figure 4). Since these three approaches have different underlying principles, they can be interpreted differently. Kmeans clustering tends to identify clusters of similar spatial extents, whereas SOM is typically used as an artificial neural network that is trained using unsupervised learning to produce a lowdimensional, discretized representation of the input space of the training samples. The skellam model identifies clusters based on their pattern of gene expression in response to treatment.
Computer simulation
Simulation studies were conducted to test the statistical power of the skellam model. By assuming three up or downregulated expression patterns, we simulated 2000 genes expressed in two treatments. The treatmentdependent means of groups and their probabilities were given in Table 5.
Table 6 gives the maximumlikelihood estimates of θ_{j1} and θ_{j2}, in a comparison with their true values. In general, mean gene expression values in different treatments can be reasonably well estimated. The estimated curves of gene expression for each group were broadly consistent with the true curves (Figure 5), suggesting that our model was fully powered.
We used kmeans and SOM to analyze the same simulation data. Overall, the skellam model performs better than SOM since the former correctly clusters all genes into their underlying groups whereas the latter provides incorrect clusters for about 20% of genes. Like the skellam model, Kmeans can correctly discern three groups and clusters all genes into correct groups. The advantage of skellam over kmeans lies in its capacity to provide biologically testable hypotheses (8) – (10), thus being of greater value from a biological perspective.
Discussion
Recently, RNAseq has become a highly popular technology for measurement of transcript levels in response to different environment conditions. Here, we propose a statistical model to group RNAseq data in response to changing environmental conditions based on a skellam distribution. The skellam model is able to identify and cluster coexpression patterns of genes derived from different treatments. The same group of coregulated genes responds to environmental change through a similar function; therefore, a set of model responses can be estimated and tested in a functional space. These can then be used to characterize the functional relationship between genes and the environment. The model has three features that differentiate it from traditional clustering methods. First, traditional methods cluster genes based on their expression at single points in time or their joint expression at multiple points in time [22], ignoring the mechanism by which genes are differentially expressed in response to environmental conditions. By determining the differences in expression among treatments as the expression plasticity of a gene, the new model clusters genes into different groups based on their capacity to respond to environmental changes. This peculiarity makes the model particularly useful for understanding the changes in gene expression in response to different treatment conditions.
Second, classical clustering approaches are based largely on continuous expression data measured by microarrays [29, 30], whereas gene reads measured by RNAseq are count data, which are believed to follow a Poisson distribution [20]. Our model has considered the Poisson property of reads. Third, the skellam model treats the coexpression of genes under different condition as a system and integrates their capacity to corespond to environmental changes into clustering procedures. This treatment facilities our understanding of gene plasticity induced by environmental cues.
The skellam model has successfully clustered genes of early Arabidopsis thaliana embryos into groups based on their response to different conditions. Of the genes with a statistically significant change, group 9 is associated with adenosine triphosphate (ATP)involved ATP synthase 9, ATP synthase subunit C family protein and ATPase, F1 complex, alpha subunit protein [31], and group 8 is related to arabinogalactan protein 21, pathogenesisrelated thaumatinlike protein, and ribonuclease 1 [32]. Although both maternal and paternal genomes are active and contribute substantially to the embryonic transcriptome during the onetotwocell stage, some active gene sets are clearly derived from one parent.
We provided a general framework for gene clustering based on the Poisson function. Given a complex data with great variability in different treatments, i.e., overdispersion, the Poisson distribution with one free parameter is too simple to allow for the variance to be adjusted independently of the mean for such a data. Other more sophisticated distributions should be incorporated to provide a better flexibility of fit. These include negative binomial distribution as a natural extension of Poisson distribution and generalized Poisson distribution [33]. In general, clustering of genes with differential expression is not the final step of the analysis. Other analyses, such as gene set testing, gene network construction and knowledge databases should follow. A comprehensive model of integrating gene clustering and these followup analyses should be derived, which would enable geneticists to extract biological insight from gene expression data.
We used the difference of gene expression as a measure of gene plasticity over different environments. This measure can characterize the amount of environmentinduced response, but it cannot well discern the slope of differentiation expression, i.e., the sensitivity of a gene environmental change per its expression unit). Such a slope can be described by the ratio of gene expression over different environments. In theory, the clustering model can be extended to cluster genes expressed under multiple conditions, and provides greater understanding of the mechanistic relationships between gene expression and environmental changes. The extended model allows for the classification of different trajectories of reaction norm in response to an environmental gradient. In addition, most studies of gene expression by RNAseq are performed in a static state, but the role of dynamic gene expression in constructing regulatory networks is being recognized [14, 15]. To model dynamic changes in gene expression in response to environmental stimuli, more advanced statistical model such as longitudinal data analysis integrating the multivariate skellam distribution [34] is required; this warrants further investigation.
Conclusion
As a deepsequencing technique, RNAseq has proven to be powerful for precisely measuring levels of transcripts and their isoforms expressed under different conditions. We have developed a computational algorithm that clusters genes into distinct groups based on the differences of RNA counts between different treatments. The algorithm is based on the Poisson distribution of counts, making use of the skellam function that specifies the distribution of the differences between two independent Poisson variables. A twostage hierarchical EM algorithm was implemented to estimate the optimal number of groups and mean expression levels of each group across two environments. In a comparison with traditional clustering approaches, such as kmeans and selforganization mapping, the new skellam model has more biological relevance, equipped with a capacity to test whether a given group is responsive to environmental changes and how this plastic response is related with, or induced by, an environmental cue. The skellam model provides a useful tool for clustering gene expression data by RNAseq, thereby enhancing our understanding of gene functions and networks.
References
 1.
Metzker ML: Sequencing technologies–the next generation. Nat Rev Genet. 2010, 11: 3146. 10.1038/nrg2626.
 2.
Morozova O, Hirst M, Marra MA: Applications of new sequencing technologies for transcriptome analysis. Annu Rev Genomics Hum Genet. 2009, 10: 135151. 10.1146/annurevgenom082908145957.
 3.
Morozova O, Marra MA: Applications of nextgeneration sequencing technologies in functional genomics. Genomics. 2008, 92: 255264. 10.1016/j.ygeno.2008.07.001.
 4.
Wang Z, Gerstein M, Snyder M: RNASeq: a revolutionary tool for transcriptomics. Nature Rev Genet. 2009, 10: 5763. 10.1038/nrg2484.
 5.
Marioni JC, Mason CE, Mane SM, Stephens M, Gilad Y: RNAseq: An assessment of technical reproducibility and comparison with gene expression arrays. Genome Res. 2008, 18: 15091517. 10.1101/gr.079558.108.
 6.
Zhou S, Campbell TG, Stone EA, Mackay TF, Anholt RR: Phenotypic plasticity of the Drosophila transcriptome. PLoS Genet. 2012, 8: e100259310.1371/journal.pgen.1002593.
 7.
Viñuela A, Snoek LB, Riksen JAG, Kammenga JE: Genomewide gene expression regulation as a function of genotype and age in C. elegans. Genome Res. 2010, 20: 929937. 10.1101/gr.102160.109.
 8.
Tarazona S, GarcíaAlcalde F, Dopazo J, Ferrer A, Conesa A: Differential expression in RNAseq: a matter of depth. Genome Res. 2011, 21: 22132223. 10.1101/gr.124321.111.
 9.
Oshlack A, Robinson MD, Young MD: From RNAseq reads to differential expression results. Genome Biol. 2010, 11: 22010.1186/gb20101112220.
 10.
Anders S, Huber W: Differential expression analysis for sequence count data. Genome Biol. 2010, 11: R10610.1186/gb20101110r106.
 11.
Place SP, Menge BA, Hofmann GE: Transcriptome profiles link environmental variation and physiological response of Mytilus californianus between Pacific tides. Funct Ecol. 2012, 26: 144155. 10.1111/j.13652435.2011.01924.x.
 12.
Eisen MB, Spellman PT, Brown PO, Botstein D: Cluster analysis and display of genomewide expression patterns. Proc Natl Acad Sci U S A. 1998, 95: 1486314868. 10.1073/pnas.95.25.14863.
 13.
Sturn A, Quackenbush J, Trajanoski Z: Genesis: cluster analysis of microarray data. Bioinformatics. 2002, 18: 207208. 10.1093/bioinformatics/18.1.207.
 14.
Ramoni MF, Sebastiani P, Kohane IS: Cluster analysis of gene expression dynamics. Proc Natl Acad Sci U S A. 2002, 99: 91219126. 10.1073/pnas.132656399.
 15.
Wang YQ, Xu M, Wang Z, Tao M, Zhu JJ, Wang L, Li RZ, Berceli SA, Wu RL: How to cluster gene expression dynamics in response to environmental signals. Brief Bioinform. 2012, 13: 162174. 10.1093/bib/bbr032.
 16.
Pan W, Lin J, Le CT: Modelbased cluster analysis of microarray geneexpression data. Genome Biol. 2002, 3: 0009.10009.8.
 17.
McLachlan GJ, Bean RW, Peel D: A mixture modelbased approach to the clustering of microarray expression data. Bioinformatics. 2002, 18: 413422. 10.1093/bioinformatics/18.3.413.
 18.
Fraley C, Raftery AE: How many clusters? Which clustering method? Answers via modelbased cluster analysis. Comput J. 1998, 41: 578588. 10.1093/comjnl/41.8.578.
 19.
Bullard J, Purdom E, Hansen K, Dudoit S: Evaluation of statistical methods for normalization and differential expression in mRNASeq experiments. BMC bioinformatics. 2010, 11: 9410.1186/147121051194.
 20.
Robinson MD, Oshlack A: A scaling normalization method for differential expression analysis of RNAseq data. Genome Biol. 2010, 11: R2510.1186/gb2010113r25.
 21.
Kvam VM, Liu P, Si Y: A comparison of statistical methods for detecting differentially expressed genes from RNAseq data. Am J Bot. 2012, 99: 248256. 10.3732/ajb.1100340.
 22.
Di Y, Schafer DW, Cumbie JS, Chang JH: The NBP negative binomial model for assessing differential gene expression from RNAseq. Stat Appl Genet Mol Biol. 2011, 10: 128.
 23.
Srivastava S, Chen L: A twoparameter generalized Poisson model to improve the analysis of RNAseq data. Nucleic Acids Res. 2010, 38: e170e170. 10.1093/nar/gkq670.
 24.
Wang NT, Wang YQ, Hao H, Wang LJ, Wang Z, Wang JX, Wu RL: A biPoisson model for clustering gene expression profiles by RNAseq. Brief Bioinform. 2013, 15: 534541.
 25.
Alzaid AA, Omair MA: On the poisson difference distribution inference and applications. Bull Malaysian Math Sci Soc. 2010, 8: 1745.
 26.
Nodine MD, Bartel DP: Maternal and paternal genomes contribute equally to the transcriptome of early plant embryos. Nature. 2012, 482: 9497. 10.1038/nature10756.
 27.
Törönen P, Kolehmainen M, Wong G, Castrén E: Analysis of gene expression data using selforganizing maps. FEBS Lett. 1999, 451: 142146. 10.1016/S00145793(99)005244.
 28.
Reichardt J, Bornholdt S: Statistical mechanics of community detection. Phys Rev E. 2006, 74: 016110
 29.
Smith EN, Kruglyak L: Geneenvironment interaction in yeast gene expression. PLoS Biol. 2008, 6: e8310.1371/journal.pbio.0060083.
 30.
Li Y, Alvarez OA, Gutteling EW, Tijsterman M, Fu J, Riksen JA, Hazendonk E, Prins P, Plasterk RH, Jansen RC, Breitling R, Kammenga JE: Mapping determinants of gene expression plasticity by genetical genomics in C. elegans. PLoS Genet. 2006, 2: e22210.1371/journal.pgen.0020222.
 31.
Lin X, Kaul S, Rounsley S, Shea TP, Benito MI, Town CD, Fujii CY, Mason T, Bowman CL, Barnstead M, Feldblyum TV, Buell CR, Ketchum KA, Lee J, Ronning CM, Koo HL, Moffat KS, Cronin LA, Shen M, Pai G, Van Aken S, Umayam L, Tallon LJ, Gill JE, Adams MD, Carrera AJ, Creasy TH, Goodman HM, Somerville CR, Copenhaver GP, et al: Sequence and analysis of chromosome 2 of the plant Arabidopsis thaliana. Nature. 1999, 402: 761768. 10.1038/45471.
 32.
Theologis A, Ecker JR, Palm CJ, Federspiel NA, Kaul S, White O, Alonso J, Altafi H, Araujo R, Bowman CL, Brooks SY, Buehler E, Chan A, Chao Q, Chen H, Cheuk RF, Chin CW, Chung MK, Conn L, Conway AB, Conway AR, Creasy TH, Dewar K, Dunn P, Etgu P, Feldblyum TV, Feng J, Fong B, Fujii CY, Gill JE, et al: Sequence and analysis of chromosome 1 of the plant Arabidopsis thaliana. Nature. 2000, 408: 816820. 10.1038/35048500.
 33.
Karlis K, Meligkotsidou L: Finite mixtures of multivariate Poisson distributions with application. J Stat Plan Infer. 2007, 137: 19421960. 10.1016/j.jspi.2006.07.001.
 34.
Bulla J, Chesneau C, Kachour M: On the bivariate Skellam distribution. 2012, Hal00744355, version 1
Acknowledgments
This work is supported by Special Fund for Forest Scientific Research in the Public Welfare (201404102), NSF/IOS0923975, Changjiang Scholars Award and “Thousandperson Plan” Award.
Author information
Affiliations
Corresponding author
Additional information
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
LJ derived the model, performed simulation studies, conducted data analysis and wrote the manuscript. KM interpreted results and participated in model derivation. RW conceived the model and wrote the manuscript. All authors read and approved the final manuscript.
Authors’ original submitted files for images
Below are the links to the authors’ original submitted files for images.
Rights and permissions
About this article
Cite this article
Jiang, L., Mao, K. & Wu, R. A skellam model to identify differential patterns of gene expression induced by environmental signals. BMC Genomics 15, 772 (2014). https://doi.org/10.1186/1471216415772
Received:
Accepted:
Published:
Keywords
 RNAseq
 Skellam distribution
 EM algorithm
 Arabidopsis thaliana embryos