 Methodology article
 Open Access
 Published:
A measure of agreement across numerous conditions: assessing when changes in network structures are tissuespecific
BMC Genomics volume 20, Article number: 26 (2019)
Abstract
Background
There is great interest to study how gene pathways change their structure across different tissues. The assessment of interstudy reliability of pathway changes across tissues can inform on the fraction of tissues with specific functional changes in network structure. However, there is a lack of agreement measures among studies that independently observe how a group of observations change across conditions. We, therefore, propose λ, a new interstudy reliability measure that determines the consistency to distinguish observations by condition.
Results
We derived λ’s distributional characteristics, determine its reliability properties and compared it with Cohen’s κ. We studied the coexpression structure of 287 gene pathways across four brain regions in two transcriptomic studies and applied λ to assess the interstudy reliability of the pathways’ brainregional changes. Brainrelated pathways showed highest λ; the top value was for the nicotine addiction pathway whose structure was reliably distinguishable among regions with dopaminergic projections.
Conclusion
Our results offer novel substantial evidence that changes in network structure across tissues can be inferred independently of samples, algorithms and experiments (RNAsequencing or microarrays). Reliability measures, such as λ, can inform on the tissues where changes in a network’s structure are likely functional. An R package is available at https://github.com/isglobalbrge/lambda .
Background
Reproducibility is a pressing issue in biomedical research that particularly worries a large number of researchers in the field [1]. Research guidelines from leading journals and the American Statistician Association urge for the need of confirmation studies and accurate statistical reporting [2, 3]. In systems biology, gene interaction networks are often derived from the integration of highthroughput data with the aim to determine gene structures with probable biological functionality. Networks, therefore, also need to be reproduced between independent studies in order to contain valid scientific content [4]. Inferred networks or established gene pathways may be additionally assessed under different and numerous conditions to understand the physiological changes of their associated biological functions. The interest here is, for instance, to determine for which tissues a pathway has a specific function [5]. Therefore, if the pathway is physiological only in specific tissues, the interstudy reproducibility of its structure across other tissues, where it is not functional, is expected to be low. As such, neurobiological networks would be expected to meaningfully change in structure between nervous tissues but not between connective tissues like blood [6]. While pairwise comparisons of a network reproducibility between two studies can be applied on a single tissue [7], there is a need to measure the reproducible changes in network structure across multiple tissues which, in particular, informs on the fraction of tissues for which the changes may underlie tissuespecificity. We, therefore, aimed to propose a measure that allows us to determine the degree to which the changes in network structure among a range of tissues are reproducible across studies; Fig. 1 shows a schematic representation of the data and the type of network reliability we intent to assess.
Classical statistics include numerous ways to measure the reliability of an observation [8]. Reliable observations are reproducible and accurate. Agreement measures between two independent experiments are used to assess the consistency of the observations being made. If observations are categorical, Cohen’s κ and its generalizations are typically used [9, 10]; if observations are continuous then a number of correlation measures can be used, such as intraclass correlations [11]. These and other agreement measures are suitable when experiments are performed under comparable, unchanged conditions. When studies are designed to test how a group of individuals change under a range of varying conditions, it is of interest to assess how reliable the changes across conditions are. However, there is a lack of recommended interrater measures that assess reliable changes among conditions [8, 12].
A first question that can be asked is the degree to which the most similar observations between studies correspond to those under the same condition, and hence have a measure of how reliably the observations can be distinguished by condition. Similarity can be assessed by any statistic, namely a preservation statistic, that compares two sets of observations at any particular property. A correlation of the observations between two conditions is an example of one preservation statistic. If a preservation statistic for observations between studies is defined [4], we can ask whether the preservation is maximum when the conditions between studies match. The question can be addressed mathematically by computing a preservation matrix, whose elements are the preservation statistics between conditions across studies, and assess the extent to which the diagonal terms are maxima across rows and columns. In particular, a preservation matrix across conditions can be defined for crosstabulated tables used in classical interstudy reliability studies and, therefore, the ability to correctly pair observations by condition can be studied in those cases. We thus aimed to construct a reliability measure for the correct condition pairing of observations between studies, compare it with Cohen’s κ and apply it to assess the interstudy reproducibility of changes in coexpression network structure across tissues.
The GTEx project is an unprecedented effort to study the gene expression in tens of tissues in hundreds of subjects using RNAsequencing [13]. It is, therefore, a strong candidate for becoming a preferred benchmark for interaction networks across different tissues. As the number of studies with expression data in multiple tissues is expected to increase, agreement measures against GTEx may serve to assess the reproducibility of network changes across tissues [3]. Some studies that measure gene expression in different tissues are publicly available, one of which is the BRAINEAC project [14]. Here, the gene expression using microarray data was measured in hundreds on individuals, free of neurodegenerative disorders at the time of death, in nine different brain tissues, four of which overlap with those in GTEx. Using our agreement measure between the studies, we assessed the fraction of brain regions for which 287 KEGG pathways [15] reliably changed across the cerebellum, frontal cortex, hippocampus and putamen.
Results
For two studies that measure the same group of units (subjects, coexpression pairs in a network, etc.) across k conditions, we propose to construct the k×k preservation matrix Z between conditions. The matrix element \(z_{ij'}\phantom {\dot {i}\!}\) (i,j=1,...k) is the correlation (or any preservation statistic) of the units’ observations between condition i in the first study and condition j in the second study. When the observations are the connections between all nodepairs in a network then \(\phantom {\dot {i}\!}z_{ij'}\) may be any normally distributed statistic where pairwise topologies are compared. Figure 1 shows the case in which a coexpression network A of m gene transctipts is inferred for n subjects in 6 tissues (1,...6) in 2 different experiments (E1 and E2). The matrix Z is given by all pairwise comparisons between tissues for the network connections across both experiments. The arrows show the values of Z that are maxima across the rows and columns that intersect at each diagonal term.
We are interested in measuring how reproducible the changes of units observations, such as network edges, are. In the case of perfect interstudy reproducibility across all conditions/tissues then the matrix Z looks diagonal. That is, \(\phantom {\dot {i}\!}z_{ii'}\) is maximum across row i and column i^{′}, for all i. In general, this is not the case and the preservation corresponding to row i and column j^{′}≠i^{′} may be maximum across the elements in row i and column i^{′}. Figure 1 shows, for instance, that the element of Z at tissue 3 in E1 and tissue 2 in E2 is the maximum in row 3 and column 3 of Z. As the network structure of tissue 3 in E1 is paired with that of tissue 2 in E2, network connections in tissue 3 have not been reliably reproduced between experiments. By contrast, all the structures at tissues 1, 2, 4 and 5 are correctly paired between experiments, as the corresponding diagonal terms are their row and column maxima and, therefore, their differences are reproduced between experiments.
We define a measure λ on the preservation matrix that tells us how diagonal Z is. In general, we assume that the units’ observations corresponding to row i can be paired by similarity with those in column j^{′} if the preservation statistic \(\phantom {\dot {i}\!}z_{ij'}\) is maximum across row i and column i^{′}. Note that, as a consequence, Z is nonsymmetric. We then define λ as the expected fraction of correct condition pairings between experiments (R), given by
where p_{ii} is the probability that the diagonal term \(z_{ii'}\phantom {\dot {i}\!}\) is maximum within its row and column elements. λ measures the fraction of conditions that can be reliably distinguished by structural changes, taking into account a probability for random pairing. We modeled p_{ii} assuming independence among the similarities of observations between conditions, and normality in the preservation statistic (see “Materials and methods” section). As an agreement measure, we derived its reliability properties and compare it with a classical measure of interstudy agreement, Cohen’s κ. We then applied it to determine which of 287 KEGG pathways presented higher rates of correct condition pairing across four brain regions between two independent transcriptomic studies.
Comparison between λ and Cohen’s κ in simulation studies
We studied the properties of λ with extensive simulations. While λ is applicable to more general situations than those covered by Cohen’s κ, we compared the performance of λ with κ in simulated cases where both measures can be used. We simulated numerous agreement tables between studies, in which 500 units were classified by both studies into numerous categories/conditions, under different scenarios. From the crosstabulation table of agreement, see expression 14 in “Materials and methods” section, we computed κ: the interstudy reliability of classification that accounts for random agreement. From the corresponding preservation matrix obtained from Eqs. 17 and 18 in “Materials and methods” section, we computed λ: the fraction of conditions that can be correctly distinguished between experiments. We also computed P_{0}: the proportion of units correctly classified between experiments; and r: the observed fraction of conditions that are correctly paired between experiments.
We recreated the reproducibility assessment between two raters/experiments across varying number of conditions/tissues k=(5,7,15), where three types of crosstabulation tables between raters across conditions were allowed (expression 14 in “Materials and methods” section). Those were tables with: 1) marginal equiprobabilities 1/k of both experiments across all tissues (scenario 1), 2) marginal probabilities ∼ 1/j for tissue j (scenario 2) and 3) marginal probabilities ∼ 1/j^{2} (scenario 3), where j=1,...k. Simulation scenarios 2 and 3 were designed to test situations where agreement between studies tend to concentrate around one tissue j=1. We performed 10,000 simulations per number of conditions and scenario. Simulations were extracted from the permutation of 50 diagonal crosstabulation tables (perfect agreement) and 50 crosstabulation tables with null diagonals (perfect disagreement). Conditions across subjects were permuted a total of 30,000 for each table, preserving their marginal frequencies, given by each simulated scenario. Every 30 permutations, agreement measures were computed obtaining a total of 5000 values for perfect agreement (Fig. 2) and 5000 for perfect disagreement. Simulations thus covered the entire interval between perfect agreement (P_{0}=1) and null agreement (P_{0}=0). See Additional file 1: Table S1 and “Materials and methods” section for further details.
Simulations showed that when null agreement is expected (P_{0}=0) λ is 0, while λ is 1 for full agreement (P_{0}=1). Consistently with this, we observed that λ increased monotonically with κ for all the simulation scenarios, see Additional file 2: Figure S1. The functional dependence was highly stable under different scenarios, revealing, as expected, high λ agreement for fair values of κ (0.2,0.4), given that the latter is a measure of exact agreement rather than discriminative agreement. Therefore, the agreement measured by λ is consistently higher than the agreement measured by κ. We also observed that for low values, λ tends to zero when κ takes small negative values, a situation already described in Cohen’s work [9]. For a given κ, we observed a sizable range of λ values, in particular, as tissues become less equiprobable (scenario 3). We noted that if the number of tissues is small (k=5) and the marginal distribution greatly concentrates around one single tissue (j=1 for scenario 3), then λ tends to 1/k (0.2) because the experiments can clearly distinguish that tissue from the rest. In this case, κ tends to zero.
We then studied the relationship between the agreement measures that account for random agreement with those that do not. In our simulations, we confirmed that κ is lower than the proportion of agreement P_{0} (Fig. 3); which illustrates the initial motivation for κ’s definition as a measure that corrects for random agreement [9]. Similarly, r, the observed fraction of times the diagonal terms in the preservation matrix are row and column maxima (Eq. 20 “Materials and methods” section), is higher than λ, a distributional estimate of such fraction. Note also that r, as a realization of the random variable R, is discontinuous with k+1 possible values, while λ is a continuous variable ranging from 0 to 1.
We additionally studied the variance σ of the fraction R, for which r is a realization and λ its expected value. We observed that σ decreased with the number of tissues and with departure from marginal equiprobability (Additional file 3: Figure S2). We observed that for a given λ, multiple values of σ are allowed (Additional file 4: Figure S3). However, the mean of κ, for a single draw of the binomial process in Eq. (16), is a function of its variance. In particular, κ’s variance is minimum at extreme values (0,1) and is inversely proportional to the number of subjects [16]. In contrast, we observed that σ decreases towards zero for a range of values of λ. This occurs when λ tends to r, that is, when the probabilities that the diagonal terms of the preservation matrix are maxima tend either to zero or to one, see Eq. (13) in “Materials and methods” section. We also computed the variability of λ from the 5% confidence intervals (CI) obtained from bootstrapping the units observations in the crosstabulation tables. For simulations corresponding to 5 equiprobable conditions and varying number of subjects, we observed that the length of λ’s CI was proportional to σ and independent of the number of subjects (Additional file 5: Figure S4).
Overall, these observations show that the fraction of tissues for which network changes are reproducible can be determined with high precision, low σ and therefore small CIs for λ. From a practical point of view, low or moderate values of λ can be used to select the tissues for which the network changes can be reliably measured, and are likely tissuespecific. Consider, for instance, a situation where the structure of a gene pathway is inferred in 50 different tissues. If we expect that the network is biologically functional in only 5 of the tissues, then λ will be at most 0.1. In addition, low σ will also indicate that the network structure in 90% of tissues could be discarded on grounds of measurement reliability. By contrast, κ is maximally informative only as it tends to 1, that is when all the conditions are distinguishable and the preservation matrix is fully diagonal.
We finally performed a power test for λ, where we simulated a true agreement scenario in which 3/5 tissues could be reliably paired across two studies. We simulated a true scenario using a multivariate distribution for 10 random variables corresponding to 5 networks per experiment; 3 networks between experiments correlated with strength of 0.5 while all other possible network correlations, across tissues and studies, varied from 0 to 0.5. We also varied the number of genes N in the network (10, 20, 40) corresponding to N(N−1)/2 nonredundant connections (45, 190, 780). We observed that λ had sufficient power (> 80%) to detect true agreement when the difference between matching tissue correlations and background correlations is 0.05 for 40 gene networks or 0.2 for networks consisting of 10 genes (Additional file 6: Figure S5).
KEGG pathways
The Kyoto encyclopedia of genes and genomes (KEGG) offers a list of experimentally characterized biochemichal pathways. We computed λ between two independent transcriptomic studies, GTEx (RNAsequencing) and BRAINEAC (microarray), that measured gene expression levels with different technologies in the cerebellum, frontal cortex, hippocampus and putamen.
For each of the 287 KEGG pathways, we selected the commonly annotated genes in each study. We computed the coexpression networks, where genes were hubs and the correlations between the expression levels were edges between hubs. We then derived the pathways’ adjacency matrices from the absolute value of the genepair correlations in each tissue. As a pairwise preservation statistic, we used the adjacency correlation (cor.cor) that quantifies how similar the connectivity is between two network structures [4], see “Materials and methods” section. We then computed the preservation matrices of network connectivities for each pathway. As a result, we obtained a λ per pathway. Table 1 shows the pathways with the top λ values (> 0.5). While no pathway correctly matched its structure between studies across all four tissues, we observed seven pathways (2%) with agreement between (0.5, 0.75); those are pathways that correctly paired their structures across two to three tissues between studies. Remarkably, five of these pathways are directly linked with signaling processes specific to brain, suggesting that the differences between network structures across brain regions may be tissuespecific. We further observed that in the complete list of all KEGG pathways (Additional file 8: Table S2), brain pathways were highly ranked by λ. In the topranked pathways, we also observed MicroRNAs in cancer and adrenergic signaling in cardiomyocytes. While these are not brainspecific pathways, numerous MicroRNAs have been shown to express in brain [17] and adrenergic signaling plays an important role in longterm potentiation in hippocampus [18]. We also computed the bootstrap 5% CI of λ and estimated σ for each pathway. We observed a linear relationship between the length of the interval and σ, confirming our previous observation (Additional file 5: Figure S4) that the variability of λ is a function of σ and independent of the number of genes in the pathways; see Additional file 7: Figure S6).
Figure 4 illustrates the structure of the strongest edges, across tissues and studies, of the nicotine addiction pathway, which was the top hit with λ=0.67 and σ^{2}=0.02. We observed that the structure of the network was preserved between studies in which, for instance, GABRB3 was an important hub across tissues and studies while some genes like GABRR2 and CHRNA6 were consistently unlinked from the network. While some reliable differences between tissues can be noticed, the preservation matrix in Fig. 5 clearly shows that the frontal cortex, the hippocampus and the putamen can be correctly paired between studies, while the cerebellum cannot. The lower value of λ=0.67 from the observed R=0.75 is due to the variability of the preservation values and, perhaps, to the closeness in structure between the hippocampus and the frontal cortex.
We also computed benchmark λ as a measure of benchmarking the pathway’s structures obtained from BRAINEAC against GTEx, (Table 2 and Additional file 8: Table S3). Benchmark λ corresponded to the expected value of tissues from BRAINEAC that were correctly paired to those in GTEx, given by the maximum over rows of the preservation matrix at a given column (see “Materials and methods” section). We observed higher values of benchmark λ, and in particular, 20 pathways with benchmark λ>0.7, 15 of which were brainrelated. Seven of those pathways are involved in synaptic signaling (glutamatergic, cholinergic, GABAergic and dopaminergic synapse, and longterm potentiation, retrograde endocannabinoid and calcium signaling pathways), and four in addiction processes (nicotine, morphine, amphetamine and alcohol).
Transcriptomewide gene network
We inferred the transcriptomewide coexpression network for 9071 genes across the GTEx and BRAINEAC studies in the four brain tissues. The network was fully characterized by 8.2×10^{7} genepair correlations. We observed λ=0.5 for the transcriptomewide network across all four tissues. Figure 5 illustrates the preservation matrix between studies across tissues. We observed that all the preservation statistics in the matrix were similar in size, between 0.38 and 0.49, and that the cerebellum and frontal cortex diagonals were correctly paired between studies. We noted that the diagonal terms at the hippocampus and putamen were the second maxima after their correlations with the frontal cortex in GTEx. Therefore, the transcriptomewide networks cannot clearly disentangle the frontal cortex from the hippocampus and putamen, suggesting that a large amount of the 8.2×10^{7} genepair correlations may not be brainregion specific.
We finally benchmarked the transcriptomewide network obtained in BRAINEAC against GTEx across the four brain regions. In this case, we confirmed that all diagonal terms were their column maxima (see Fig. 5) and therefore benchmark λ=1. These results show that the transcriptomewide network can correctly pair BRAINEAC tissues with those of GTEx. As the benchmarking only takes into account the variability between tissues of GTEx and not BRAINEAC, the observation: benchmark λ=1, is consistent with the consideration that RNAsequencing (GTEx) may provide better characterization of the transcriptome than microarray technology (BRAINEAC) [19].
Discussion
We proposed a new measure, λ, of agreement between studies. The motivation of the measure is the assessment of agreement between studies that measure the effects of varying conditions on a set of units (subjects, coexpression pairs, etc). We showed that the measure conformed to the properties of agreement measures and used simulations to compare it with Cohen’s κ. Our results illustrate the large potential of λ, in particular, for studying the changes in genenetwork structure across numerous tissues. The measure is particularly useful to determine the fraction of tissues for which network structures can be reliably distinguished, and thus informs on the ratio of structural changes that may be tissuespecific.
In our application to coexpression networks in brain, we observed that the structure of a number of brainrelated pathways could be reliably matched by tissue across GTEx and BRAINEAC. We observed that the nicotine addiction pathway presented the highest λ in which the frontal cortex, the hippocampus and the putamen were reliably distinguished. Interestingly the cerebellum, with a limited role in addiction, could not be reliably distinguished. Benchmarking of BRAINEAC against GTEx, confirmed that the differences in the nicotine pathway structure across relevant brain regions were highly reliable, as were those in glutamatergic, cholinergic, GABAergic, dopaminergic synapse pathways, whose complex interactions are critical to nicotine addiction [20]. The reliability of other addiction pathways (morphine, amphetamine and alcohol) supports these findings and indicates that structural changes of those pathways in regions with dopamine projections, which are essential in addiction processes (frontal cortex, hippocampus and putamen), are likely tissuespecific [21].
The agreement measure λ is based on the preservation matrix between tissues. Langfelder and colleagues thoroughly studied a number of pairwise preservation statistics of Cholesterol Biosynthetic Process network across sex and tissue differences in mice [4]. Their approach to assessing the reproducibility of the data was to test the preservation of network structure against a reference set of conditions (female/liver), allowing them to identify the condition (male/liver) for which structures are most similar. However, their data show, more generally, that sex changes are much smaller than tissue changes and, therefore, that differences in network structures can reliably pair tissues between sex changes. In this case, the interstudy reliability for changes in the network structure across a range of tissues is clearly high, as it would be summarized by a by high λ. While a pairwise preservation statistic assesses the reliability of a network’s structure between two conditions [4], the preservation matrix registers all pairwise changes of the structure among numerous conditions. λ is a measure on the preservation matrix that allows the assessment of agreement between studies to reliably discriminate between conditions, measuring the probability that the diagonal terms of the matrix are their row and column maxima.
We are unaware of similar measures of agreement on akin preservation matrices that test the changes in the structure of a network across tissues. Other functions of the preservation matrix could also be investigated. We observed, in particular, that λ is a measure of interobserver agreement that is consistently higher than Cohen’s κ. Perfect agreement for κ is exclusively given by diagonal tables, while perfect agreement for λ is given by maxima diagonal terms in tables where their elements, irrespective of their magnitude, are estimated with sufficient low variability. λ does not test the level of pairwise preservation but the extent to which structural patterns are correctly classified by conditions. This is an important difference between the measures, which allows λ to be used in more general situations where the elements of crosstabulated tables between raters/studies are inferences. In particular, we observed that λ is informative for intermediate values, as it can be used to select the fraction of tissues where network changes are reproducible and likely to be tissuespecific.
Currently, the validity of a gene or protein network derived from highthroughput data is often benchmarked against networks derived from current knowledge of specific interactions, given by curated pathways, specific experiments or even text mining of published articles [22]. This type of confirmatory analysis extracts networks that are a mixture of interactions that have been individually reported on different tissues. Therefore, while validity is investigated, in terms of consistency with previous knowledge, reproducibility of network structure in an independent confirmatory study is not being assessed. We observed important preservation of network patterns between two equivalent, independent studies, based on different technologies (RNAseq and microarray), analysis methods and subjects. As such, our results fully test interobserver reproducibility and provide support that inferred network structures are biological entities independent of numerous sources of variability and heterogeneity in the studies. Previous work has tested the agreement of transcription measurements between RNAsequencing and microarray experiments on the same subject samples ([23, 24]). These are important studies to validate experimental techniques. However, fully testing interobserver reproducibility of network inferences, as we have done, must be based on independent confirmatory experiments on different population samples, experiments and analyses. Finally, recent efforts have been made to assess Bayesian network reproducibility, where intrastudy preservation statistics have been proposed [25]. While there is still a need to apply those metrics to assess interstudy reproducibility, note that λ will also be applicable in such a case.
Conclusions
We propose a novel interstudy measure of agreement λ to determine the fraction of conditions for which structural changes in a set of observations may be deemed reliable. The measure revealed that changes in the coexpression structure of addictionrelated pathways can correctly distinguish the frontal cortex, hippocampus and putamen but not cerebellum between studies. Reliable tissue differences in network structure can help to identify tissuespecific pathwaybiology and increase the reproducibility of network inferences. More generally, our agreement measure can be used in a set of independent studies that measure how the same group of units changes across numerous conditions.
Materials and methods
We propose an interstudy measure of agreement to discriminate conditions. While the measure can be applied to different research studies, such as those with factorial designs, we illustrate how its need arises from an example in current functional genomic research. We consider two studies that measure the same units (subjects, genepair correlations, etc.) across multiple conditions. We aim to construct a measure that informs on the fraction of conditions that can be reliably distinguished by structural changes in the observations.
The problem
Let us assume that we have two studies that measure the expression of a set of genes in two different population samples, in the same range of tissues and using different experimental setups. For instance, one experiment may use RNAseq and the other a microarray technology. We may be interested in inferring the coexpression relationships between the genes across tissues and determine whether the changes in the relationships are consistent between experiments. The coexpression between two genes in the network is given by their correlation over the subjects’ gene expression levels. Figure 1 illustrates the situation where the coexpression between the genes of a network is inferred for 6 tissues in two different studies. Our aim is then to propose a measure that tells us the fraction of tissues that can be reliably distinguished by the network structure across studies. The measure will then inform on the fraction of tissues for which the changes in gene relationships may be tissuespecific.
A network A of m genes (nodes) can be represented by a m×m adjacency matrix between genes, where the entries are the edges linking a pair of nodes, see Fig. 1. We are interested in genenetwork modules, such as pathways or genesets associated with a biological function. There are numerous structural properties of networks that can be used to test the preservation of a network between two studies. Preservation statistics quantify given aspects of withinnetwork topology that are preserved between studies, see Langfelder et al. for a review [4]. Given a preservation statistics z, we aim to assess the interstudy reliability of a network’s structure between studies E1 and E2 across numerous conditions.
For studies that compute a network across different conditions like tissues (i=1,...k), we want to assess the preservation between studies of the network changes across tissues. We thus form the preservation matrix Z across conditions where the entries are pairwise preservation statistics \(z\left (A_{i},A'_{j}\right)=z_{ij'}\phantom {\dot {i}\!}\) of network A between tissues i and j in E1 and E2, respectively. For three tissues we thus have
We would then like to have a measure of interstudy agreement from Z, which can tell how diagonal Z is or, more explicitly, the extent to which i can be correctly paired with i^{′}, for all i.
A solution
In Z, we can pair by similarity two network structures corresponding to condition i in E1 and condition j in E2 if the element \(\phantom {\dot {i}\!}z_{ij'}\) is maximum within row i and column j. As such, we are interested in measuring the extent to which two structures can be paired when the conditions match between studies, that is when j^{′}=i^{′}. We then propose to measure the probability that the diagonal terms in the preservation matrix are their row and column maxima. For Z in equation 3, we thus compute

\(p_{11}=Pr(z_{11'}> z_{12'}, z_{13'}, z_{21'}, z_{31'}) \phantom {\dot {i}\!}\),

\(p_{22}=Pr(z_{22'}> z_{21'}, z_{23'}, z_{12'}, z_{32'})\phantom {\dot {i}\!}\) and

\(p_{22}=Pr(z_{33'}> z_{31'}, z_{32'}, z_{13'}, z_{23'})\phantom {\dot {i}\!}\),
where p_{ii} is the probability that the network structures in tissue i between studies are correctly paired by maximum similarity. Taken the preservation statistic as a similarity measure, we can assume that the similarity of one structure (i) to another (i^{′}) is independent of its similarity to a third (j^{′}). Therefore, the probabilities p_{ii} can be computed as the product of the individual pairwise probabilities
where the first factor is the probability that \(\phantom {\dot {i}\!}z_{ii'}\) is the maximum over row i of Z, and the the second factor is the probability that \(z_{ii'}\phantom {\dot {i}\!}\) is the maximum over column i^{′}. We can write down an explicit form for the pairwise probabilities, further assuming that the preservation statistic distributes normally
Following the independence of structural similarities between experiments in different tissues, we then have that the pairwise probability \(\phantom {\dot {i}\!}Pr(z_{ii'}>z_{ij'})\) can be derived from the suitable integration of the joint distribution of \(\phantom {\dot {i}\!}z_{ii'}\) and \(\phantom {\dot {i}\!}z_{ij'}\)
where erf is the error function [26]. Therefore, we have that the probability that the diagonal term \(\phantom {\dot {i}\!}z_{ii'}\) is the maximum in the row i is
The probability that the diagonal term \(z_{ii'}\phantom {\dot {i}\!}\) is the maximum in the column i follows a similar form.
Our agreement measure then follows from the fraction R of diagonal terms in Z that are their row and column maxima. The fraction R distributes as the fraction of successes in k Bernoulli trials each of which has its own probability p_{ii}; that is, as a Poisson binomial distribution for the success fraction with mean and variance
The agreement measure λ is then the expected fraction of conditions with the correct matching of network structures between studies. In the case that E1 is the benchmark for experiment E2, then one is interested in testing whether the diagonal terms are the maxima of their columns only, generalizing the concepts of sensitivity and specificity. In this case λ can be computed by simply setting \(Pr(z_{ii}>z_{ji'})=1\phantom {\dot {i}\!}\) for j=1,...k. Note also that it is straightforward to generalize the measure for more than two studies by expanding the products in Eq. (4).
Reliability properties of λ
Suitable reliability measures satisfy three basic properties: i) their values range from 0 to 1; ii) they tend to 0 when no agreement is expected and tend to 1 when a full agreement is expected and iii) they account for random agreement.
Regarding property i), λ clearly ranges from 0 to 1, as it is given by the fraction between the sum of k probabilities p_{ii} and k. To show properties ii) and iii), we look at the explicit model for p_{ii} given by the substitution of Eq. (7) in (4)
erf(x) is the error function which, importantly, defines the Heaviside step function in the limit: \(H(x)= 1/2*{\lim }_{t\to 0} \text{erf} (x/t)\) [26]. Therefore, when {σ_{ii},σ_{ji},σ_{ij}}→0, p_{ii} tends to
That is, when there is no variability in z_{ij}, p_{ii} is 1 if \(\phantom {\dot {i}\!}z_{ii'}\) is maximum across rows and columns and 0 otherwise. Therefore, regarding property ii), null and perfect reliabilities are clearly expected with null variability. For such cases all p_{ii} are either 0 or 1 according to Eq. (13). In the case of null reliability none of the μ_{ii} is maximum, all p_{ii} are 0 and λ=0. Whereas, in the case of perfect reliability all μ_{ii} are maxima, all p_{ii} are 1 and λ=1. Regarding property iii), we can see that under null variability, λ tends to the observed fraction of conditions r that are correctly paired between experiments. The fraction r, which is a realization of R, can be considered as a reliability measure but does not account for random agreement. By contrast, λ, as an expected value of R under a probability distribution, incorporates a variability that accounts for correct random pairing.
Comparison between measures of agreement
Let us take two raters who observe l units in k categories. Then, for instance, if k=3 and categories are labeled a, b and c, the crosstabulation matrix of observaions takes a similar form of expression (3),
where n_{i,j} is the number of units observed in category i and j by the first and second rater respectively, and \({\sum \nolimits }_{i,j=1}^{k} n_{ij}=l\). Agreement is typically measured by Cohen’s κ
where P(i,i)=n_{ii}/l is the observed frequency of units that were measured in category i by both raters and P_{d}(i) is the frequency of units in i observed by rater d (d=1,2). The sum \(P_{0}= \sum _{i} P(i,i)\) is the total fraction of agreement: the proportion of observations that falls in the diagonal, which does not account for random agreement. Cohen’s κ measures the fraction of discordant observations expected by chance that are actually observed in agreement.
While λ can be applied to a wider range of reliability studies than κ, we compared the two measures in cases where both of them can be computed. Note that a matrix Z in Eq. (3) can be computed from the crosstabulation table in expression (14). Given that in row i, in expression (14), the number of observed units is n_{i}=P_{1}(i)∗l, we can then assume that n_{ij} is one draw of a binomial process
where \(\hat {\theta }_{ij}=n_{ij}/n_{i}\), and the mean and variance of the mean are given by
For for large n_{i} the binomial distribution tends to a normal distribution. Therefore, the values in Eqs. 17 and 18 can be used in Eq. 7. With a similar computation for the column elements, the measure λ can be obtained for a table in the form (14) and compared with the value of κ for varying values of the total fraction of agreement P_{0}.
As κ is an agreement measure that corrects P_{0} for random agreement, we compared λ with the observed fraction of diagonal elements that are their row and column maxima r, explicitly given by
Simulations
We performed a series of simulations to study the properties of λ with respect to κ and r. Simulations were obtained for three total number of conditions/tissues k=(5,7,15), and three fifferent forms for the marginal frequencies P_{1}(i) and P_{2}(i) (i=1,...k) across studies 1 and 2

Scenario 1 (equiprobable): \(P_{1}(i)=P_{2}(i)=\frac {1}{k}\), ∀i

Scenario 2: \(P_{1}(i)=P_{2}(i)=\frac {1}{i} / {\sum \nolimits }_{j=1}^{k} \frac {1}{j} \)

Scenario 3 (the least equiprobable): \(P_{1}(i)=P_{2}(i)=\frac {1}{i^{2}} / {\sum \nolimits }_{j=1}^{k} \frac {1}{j^{2}} \)
We set the number of observations to l=500. For each scenario, we simulated 50 cases of perfect agreement tables (P_{0}=1), i.e. diagonal matrices, and 50 cases of perfect disagreement (P_{0}=0); those are tables with zeros on the diagonal terms except for the cell of maximum joint probability. For each case, we permuted a pair of observations 3000 times, such that the original marginal frequencies were conserved. After each set of 30 permutations, we computed the four agreement measures. This procedure allowed the assessment of 10,000 simulations, in each scenario and tissue level, covering the whole agreement interval of P_{0}. The simulation scheme is shown in Fig. 2 and further details are given in Additional file 1: Table S1.
We used R.3.30 and the package psych to perform calculations and compute Cohen’s κ.
Gene expression data
We downloaded expression data from the GTEx project obtained from RNAseq [27]. Reads per kilobase per million mapped reads (RPKM) of version 6 were obtained for all brain tissues. Covariates for each tissue were also downloaded.
We also downloaded the brain expression data of the BRAINEAC project [28] obtained from winsorized values of exon array data (Affymetrix Human Exon 1.0 ST array). Downloaded data had been previously normalized and corrected for batch effects.
We identified four brain tissues common in both datasets and for which GTEx had covariates’ information. Those were cerebellum (CRBL) with 125 individuals in GTEx and 130 in BRAINEAC, frontal cortex (FCTX) with 108 and 135 individuals, (HIPP) hippocampus with 94 and 130 individuals, and putamen (PUTM) with 82 and 135 individuals, respectively.
Between the two studies, we mapped 9071 common genes to compute all genepair correlations of expression levels.
Coexpression networks
For each tissue in GTEx, we downloaded its corresponding table of covariates. These included 20 variables, 3 of which are genomewide principal components that inform on ancestry, 15 surrogate transcriptomic variables that inform on batch effects, one variable containing gender and another platform. As expression levels for GTEx are derived from count data, we performed Spearman’s partial correlations of the expression levels between each genepair, among 9071 genes. We used the par.cor function of the ppcor R package to test the rank correlations adjusting for covariates. A transcriptomewide adjacency matrix was constructed for each tissue, where the entries, corresponding to the edges of the genomewide network, were the absolute value of all genepair correlations.
For the BRAINEAC data, similar adjacency matrices per tissue were obtained. In this case, however, the downloaded gene expression values had already been normalized and corrected for batch effects. As gene expression levels are close to normality, we performed a Pearson’s correlation between all genepairs and used these values to compute the adjacency matrices.
We computed λ to test the interstudy agreement of the coexpression networks of 287 pathways, downloaded from https://www.genome.jp/kegg/. For each pathway, we extracted a total of eight adjacency matrices (4 tissues × 2 studies) whose elements were the absolute value of the expression correlations between all gene pairs within the pathway. We computed the pathways’ vectorized adjacency matrices, which are the vectors with nonredundant components
where m in the number of genes in the pathway A. As a preservation statistic between studies, we used the adjacency correlation (cor.cor in [4]), based on the Pearson’s correlation between the vectorized matrices of pathway A
between tissues i and j, across experiments E1 and E2, respectively. The elements z_{ij} of the preservation matrix Z for each pathway A were obtained by the Fisher’s Ztransformation of cor.cor(A_{i},Aj′).
Abbreviations
 B i n o m i a l(n,p):

Binomial distribution of n trials with success probability p
 BRAINEAC study:

The brain eQTL almanac study
 CRBL:

Cerebellum
 erf(x):

Error function of x
 FCTX:

Frontal cortex
 GTEx study:

Genotypetissue expression study
 HIPP:

Hippocampus
 KEGG:

The Kyoto encyclopedia of genes and genomes
 netA:

Network A
 PUTM:

Putamen
 RPKM:

Reads per kilobase per million mapped reads
 RNAseq:

RNA sequencing
 N(μ,σ ^{2}):

Normal distribution with mean μ and variance σ^{2}
References
 1
Baker M. 1500 scientists lift the lid on reproducibility. Nature. 2016; 533(7604):452–4.
 2
Wasserstein RL, Lazar NA. The asa’s statement on pvalues: context, process, and purpose. Am Stat. 2016; 70(2):129–33.
 3
Mogil JS, Macleod MR. No publication without confirmation. Nature. 2017; 542(7642):409–11.
 4
Langfelder P, Luo R, Oldham MC, Horvath S. Is my network module preserved and reproducible?. PLoS Comput Biol. 2011; 7(1):1001057.
 5
Oldham MC, Konopka G, Iwamoto K, Langfelder P, Kato T, Horvath S, Geschwind DH. Functional organization of the transcriptome in human brain. Nat Neurosci. 2008; 11(11):1271.
 6
Cai C, Langfelder P, Fuller TF, Oldham MC, Luo R, van den Berg LH, Ophoff RA, Horvath S. Is human blood a good surrogate for brain tissue in transcriptional studies?. BMC Genomics. 2010; 11(1):589.
 7
Chen C, Cheng L, Grennan K, Pibiri F, Zhang C, Badner JA, Gershon ES, Liu C. Two gene coexpression modules differentiate psychotics and controls. Mol Psychiatry. 2013; 18(12):1308.
 8
Hallgren KA. Computing interrater reliability for observational data: an overview and tutorial. Tutor Quant Methods Psychol. 2012; 8(1):23.
 9
Cohen J. A coefficient of agreement for nominal scales. Educ Psychol Meas. 1960; 20(1):37–46.
 10
Banerjee M, Capozzoli M, McSweeney L, Sinha D. Beyond kappa: A review of interrater agreement measures. Can J Stat. 1999; 27(1):3–23.
 11
Shrout PE, Fleiss JL. Intraclass correlations: uses in assessing rater reliability. Psychol Bull. 1979; 86(2):420.
 12
Gwet KL. Handbook of Interrater Reliability: The Definitive Guide to Measuring the Extent of Agreement Among Raters, 4th edn.Gaithersburg, USA: Advanced Analytics, LLC; 2014.
 13
Lonsdale J, Thomas J, Salvatore M, Phillips R, Lo E, Shad S, Hasz R, Walters G, Garcia F, Young N, et al.The genotypetissue expression (gtex) project. Nat Genet. 2013; 45(6):580–5.
 14
Trabzuni D, Ryten M, Walker R, Smith C, Imran S, Ramasamy A, Weale ME, Hardy J. Quality control parameters on a large dataset of regionally dissected human control brains for whole genome expression studies. J Neurochem. 2011; 119(2):275–82.
 15
Kyoto encyclopedia of genes and genomes. http://www.genome.jp/kegg. Accessed 12 Apr 2018.
 16
Fleiss JL, Cohen J, Everitt B. Large sample standard errors of kappa and weighted kappa. Psychol Bull. 1969; 72(5):323.
 17
Krichevsky AM, King KS, Donahue CP, Khrapko K, Kosik KS. A microrna array reveals extensive regulation of micrornas during brain development. Rna. 2003; 9(10):1274–81.
 18
O’Dell TJ, Connor SA, Guglietta R, Nguyen PV. βadrenergic receptor signaling and modulation of longterm potentiation in the mammalian hippocampus. Learn Mem. 2015; 22(9):461–71.
 19
Zhao S, FungLeung WP, Bittner A, Ngo K, Liu X. Comparison of rnaseq and microarray in transcriptome profiling of activated t cells. PLoS ONE. 2014; 9(1):78644.
 20
Markou A. Phil Trans R Soc B Biol Sci. 2008; 363(1507):3159–68.
 21
Volkow ND, Wang GJ, Fowler JS, Tomasi D, Telang F, Baler R. Addiction: decreased reward sensitivity and increased expectation sensitivity conspire to overwhelm the brain’s control circuit. Bioessays. 2010; 32(9):748–55.
 22
Szklarczyk D, Franceschini A, Wyder S, Forslund K, Heller D, HuertaCepas J, Simonovic M, Roth A, Santos A, Tsafou KP, et al.String v10: protein–protein interaction networks, integrated over the tree of life. Nucl Acids Res. 2014; 43:D447–52.
 23
Trost B, Moir CA, Gillespie ZE, Kusalik A, Mitchell JA, Eskiw CH. Concordance between rnasequencing data and dna microarray data in transcriptome analysis of proliferative and quiescent fibroblasts. R Soc Open Sci. 2015; 2(9):150402.
 24
Guo Y, Sheng Q, Li J, Ye F, Samuels DC, Shyr Y. Large scale comparison of gene expression levels by microarrays and rnaseq using tcga data. PloS ONE. 2013; 8(8):71462.
 25
Cohain A, Divaraniya AA, Zhu K, Scarpa JR, Kasarskis A, Zhu J, Chang R, Dudley JT, Schadt EE. Exploring the reproducibility of probabilistic causal molecular network models. In: PACIFIC SYMPOSIUM ON BIOCOMPUTING 2017. Singapore: World Scientific: 2017. p. 120–31.
 26
Weisstein ew. heaviside step function. in mathworld–a wolfram web resource. 2018. http://mathworld.wolfram.com/heavisidestepfunction.html. Accessed 12 Apr 2018.
 27
Genotype tissueexpression project. http://www.gtexportal.org. Accessed 12 Apr 2018.
 28
Braineac project. http://www.braineac.org/. Accessed 12 Apr 2018.
Acknowledgments
We thank the reviewers of the manuscript for their contribution.
Funding
This work was partially supported by Red Española de Supercomputación (BCV201630002) and by Ministerio de Economía e Innovación (Spain) (MTM201568140R).
Availability of data and materials
The entire code to fully reproduce this work is given in Additional file 9: Computer Code and all required pre and postprocessed data has been deposited in Figshare and assigned the identifier https://figshare.com/account/home#/projects/20065, see README file within computer code. The function lambda.agreement has been implemented in an R package available in https://github.com/isglobalbrge/lambda.
Author information
Affiliations
Contributions
AC designed the study, analyzed the data and wrote the manuscript, JRG supervised the work and revised the manuscript. All authors have read and approved the manuscript.
Corresponding author
Correspondence to Alejandro Cáceres.
Ethics declarations
Ethics approval and consent to participate
Not Applicable.
Consent for publication
Not Applicable.
Competing interests
The authors declare that they have no competing interests.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Additional files
Additional file 1
Table S1. Simulation details. (52.6 KB)
Additional file 2
Figure S1. Relationship between λ and Cohen’s κ across all simulation scenarios. (406 KB)
Additional file 3
Figure S2. Relationship between λ and σ across all simulated scenarios. (401 KB)
Additional file 4
Figure S3. Left) Relationship between κ and its variance for the simulated scenario with 5 conditions and a single subject, corresponding to a single draw of the binomial process in Eq. (16). The functional dependence of κ variance with κ is unchanged but reduced in magnitude with increasing N. Right) Relationship between λ and σ in the same scenario. (120 KB)
Additional file 5
Figure S4. Bootstrap 5% confidence intervals for λ for a simulation with 5 conditions and a varying number of individuals. We see that the length of the CI is proportional to σ and independent of the number of subjects. (7.61 KB)
Additional file 6
Figure S5. Power estimations for true agreement r=3/5 when the true correlations between matching tissues (diagonal terms in the preservation matrix) are fixed at cor=0.5 and all other background correlations c0, given by all other elements in the preservation matrix, varied from 0 to 0.5. The figure shows the differences between fixed and varying correlations (corDiff=cor−cor0) by a different number of genes N in the network. Networks were simulated as random variables from multivariate distributions. (4.90 KB)
Additional file 7
Figure S6. Bootstrap 5% confidence intervals for λ, between GTEx and BRAINEAC, for 287 KEGG pathways. We see that the length of the CI is proportional to σ, regardless of the different number of genes of each pathway. (13.4 KB)
Additional file 8
Tables S2S3. Agreement measure λ between GTEx and BRAINEAC for 287 KEGG pathways across four brain regions. (56.0 KB)
Additional file 9
Computer Code. Entire computer code in R to reproduce the results discussed in the manuscript. (12.1 KB)
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License(http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Received
Accepted
Published
DOI
Keywords
 Brain
 Transcriptome
 Coexpression networks
 Reliability
 Coehn’s kappa
 GTEx
 RNAsequencing
 Addiction
 Nicotine