Skip to main content

Long non-coding RNA pairs to assist in diagnosing sepsis



Sepsis is the major cause of death in Intensive Care Unit (ICU) globally. Molecular detection enables rapid diagnosis that allows early intervention to minimize the death rate. Recent studies showed that long non-coding RNAs (lncRNAs) regulate proinflammatory genes and are related to the dysfunction of organs in sepsis. Identifying lncRNA signature with absolute abundance is challenging because of the technical variation and the systematic experimental bias.


Cohorts (n = 768) containing whole blood lncRNA profiling of sepsis patients in the Gene Expression Omnibus (GEO) database were included. We proposed a novel diagnostic strategy that made use of the relative expressions of lncRNA pairs, which are reversed between sepsis patients and normal controls (eg. lncRNAi > lncRNAj in sepsis patients and lncRNAi < lncRNAj in normal controls), to identify 14 lncRNA pairs as a sepsis diagnostic signature. The signature was then applied to independent cohorts (n = 644) to evaluate its predictive performance across different ages and normalization methods. Comparing to common machine learning models and existing signatures, SepSigLnc consistently attains better performance on the validation cohorts from the same age group (AUC = 0.990 & 0.995 in two cohorts) and across different groups (AUC = 0.878 on average), as well as cohorts processed by an alternative normalization method (AUC = 0.953 on average). Functional analysis demonstrates that the lncRNA pairs in SepsigLnc are functionally similar and tend to implicate in the same biological processes including cell fate commitment and cellular response to steroid hormone stimulus.


Our study identified 14 lncRNA pairs as signature that can facilitate the diagnosis of septic patients at an intervenable point when clinical manifestations are not dramatic. Also, the computational procedure can be generalized to a standard procedure for discovering diagnostic molecule signatures.

Peer Review reports


Sepsis is a severe disease that threatens patients’ life and is the main cause of death in Intensive Care Unit (ICU). Based on the report from Centers for Disease Control and Prevention, over 1.7 million people get sepsis each year, up to 270,000 Americans are killed by the disease per year, and one in three patients who die in a hospital have sepsis [1]. It is occasioned by the over response of the immune system to the infection. The chemicals released by the immune system diffuse throughout all the body and lead to inflammation. Septic shock is a subtype of sepsis when the blood circulation and cellular metabolism of the patients become deadly abnormal [2]. The diagnosis of sepsis in clinic bases on the symptoms and a series of medical tests involving blood, urine, wound secretion, and mucus secretion tests. Taking these typical tests may result in delays in diagnosis and intervention. Besides, it is challenging to distinguish sepsis from non-infectious inflammation based on existing tests [3].

Molecular detection provides a rapid way for early diagnosis and initial evaluation of sepsis. Moreover, it can determine whether a patient with systemic inflammation has an underlying infection by identifying biomarkers or signatures for sepsis. Procalcitonin (PCT) is widely studied as a “standard of care” component for sepsis and has been applied in blood tests for quick diagnosis [4]. However, some studies on PCT in intra-abdominal infections led to contradictory results and restricted the application of PCT as a diagnostic biomarker [5,6,7]. Protein-coding genes are also investigated as diagnostic signatures for sepsis with high accuracy according to a series of in silico experiments [8]. sNIP [9] and SeptiCyte Lab [10] are the two most effective genetic signatures for sepsis diagnosis proposed recently. SeptiCyte Lab is based on the sum of two ratio classifiers, each of which consists of a ratio of two genes (PLAC8/PLA2G7 or LAMP1/CEACAM4) and performed AUC of 0.92 in average in three validation cohorts while PCT only attained AUC of 0.81. It was regarded as the benchmark of sepsis diagnosis [11]. Scicluna et al. identified three genes and developed the sNIP score [(NLRP1 - IDNK)/PLAC8] from a discovery cohort including 60 abdominal sepsis patients and 42 controls. sNIP achieved AUC of 0.97 in the discovery cohort, which outperformed SeptiCyte (AUC = 0.89).

Long non-coding RNA (lncRNA) is a category of RNAs longer than 200 base pairs in length with little potential of encoding proteins [12]. lncRNAs are involved in the mediation of transcriptional and post-transcriptional regulation, which is a canonical epigenetic mechanism of cell dysfunction linked to a variety of immune-related diseases [13,14,15]. Ho et al. reviewed the regulatory non-coding RNAs in sepsis including lncRNA MIR210HG, linc-ATP13A4–8, linc-KIAA1737–2, AL132709.5, CTC-459I6.1 and IL7R [16]. Reddy et al. identified lncRNA E33 to regulate expression of inflammation related gene in macrophage and the response to the inflammatory signals through diabetic mice model [17]. Chen et al. investigated MALAT1 in rat sepsis model and found it regulates sepsis-induced cardiac inflammation [18]. Q. Huang et al. [19] and S. Huang et al. [20] showed that NEAT1 is overexpressed in sepsis patients and the expression level is correlated to severity of sepsis. Furthermore, ZFAS1 was detected to be downregulated in sepsis patients comparing to normal controls and achieved an Area Under the Receiver Operating Characteristic (AUROC) of 0.814 in sepsis diagnosis [21].

At present, machine learning methods are widely used for the biological modeling and biomarker detection. For instance, Wang et al. reviewed the machine learning tools in CRISPR gRNA design [22] and proposed GNL-Scorer to predict the CRISPR on-target activity with well cross-species generalization using Gradient Boosted Regression Tree and Bayesian Ridge regression [23]. B. Elkarami et al. proposed a cost-sensitive classifier and ensembling based on random forest method for handling the imbalanced quantity of miRNAs versus other non-coding RNA [24]. Liu et al. applied decision tree to identify gut microbial biomarkers as potential therapeutic target for atherosclerosis patients [25]. Liu et al. identified a 28-lncRNA signature for sepsis diagnosis using least absolute shrinkage and selection operator (LASSO) [26].

To the best of our knowledge, all the sepsis related transcriptome studies are based on the absolute expression abundance of the detected transcripts, e.g., mRNAs or lncRNAs. Nevertheless, using the absolute abundance for diagnosis is not reliable and even biased due to the systematic experimental bias and technical variation. Moreover, the investigation of biomarkers using absolute expression values may suffer from batch effect and the effect of pre-processing methods. Here, we proposed a signature for sepsis diagnosis using the relative expressions of lncRNA pairs within sample. We firstly performed the intra-sample comparison to obtain the relative expressions of all possible lncRNA pairs and conducted cross-sample analysis to screen out lncRNA pairs that significantly altered in direction between sepsis and control samples. Then, the identified lncRNA signature (lncRNA pairs) was evaluated in eight sepsis expression cohorts and compared with a series of machine learning models and benchmark signatures. Finally, these lncRNA pairs were investigated for their biological functions using enrichment and semantic analysis.

Material and methods

Expression datasets

We collected nine sepsis expression datasets with normal controls from the Gene Expression Omnibus (GEO) database. Only the platform Affymetrix Human Genome U133 Plus 2.0 Array (AffyU133p2) was considered, because it is the most comprehensive array platform with the largest transcript coverage, which is able to reannotate as many lncRNAs as possible from the arrays originally designed for coding genes. Detailed description for each dataset was shown in Table 1. For fair comparison, all the raw data were normalized by the Robust Multi-array Average (RMA) method [35] and MicroArray Suite 5.0 (MAS5.0) [36], although the normalization step is not necessary for our proposed algorithm [37].

Table 1 Datasets included in the analysis

Reannotation of lncRNAs

Based on the annotation files of the RefSeq database (release 79), the NetAffx file (release 36, 7/12/16), and the GENCODE project (release 25), the probes in the AffyU133p2 array platform were reannotated to achieve the expression profiles of lncRNAs [12]. Expression data of 3746 lncRNAs summarized by 4602 probe sets were obtained according to the selection criteria, (1) probe sets with RefSeq IDs labeled “NR_” and annotated with “long non-coding RNA” in the RefSeq; (2) probe sets with Ensembl gene IDs annotated as “long non-coding RNA” in GENCODE. Probe sets meet any of the two criteria were recruited as lncRNAs.


An algorithm, individualized Pair Analysis of Gene Expression (iPAGE), was used to screen the differential discoveries. The absolute expression abundance of genes is frequently varied by plenty of technical variations including experimental designs, sample handling, RNA amounts and extraction procedures, library preparation steps, as well as normalization methods and batch effects. As evidenced by the previous studies [38, 39], the relative expression between genes within a sample is reliable and much more powerful in detecting biological signals. So, we took advantage of the expression levels between every possible pair of lncRNAs to retrieve disease-related lncRNA pairs. The reverse pairs are defined as the lncRNA pairs with the same relative expression order (lncRNAi > lncRNAj) in normal cases while the opposite order (lncRNAi < lncRNAj) in sepsis patients. After that, we identified the top reverse lncRNA pairs and used them as a signature for the diagnostic prediction of sepsis. The workflow of iPAGE is described in Fig. 1.

Fig. 1
figure 1

The workflow of individualized pair analysis of gene expression (iPAGE). a The long non-coding RNA (lncRNA) expression data including both sepsis and normal samples. b Relative expression between every possible lncRNA pairs were calculated in each sample. c Statistical analysis across all samples in the discovery cohort. d Top lncRNA pairs with significant differentiation ability were identified as the signature. e Validation using independent datasets. f Diagnostic application of the proposed signature

In our iPAGE algorithm, exhaustive comparisons were carried out between every lncRNAs based on their absolute abundance. As shown in Fig. 1a, the lncRNAs detected by a sample are represented by a vector \( {G}^{(k)}=\left({RNA}_1^{(k)},{RNA}_2^{(k)},\dots, {RNA}_m^{(k)}\right) \), where \( {RNA}_1^{(k)},{RNA}_2^{(k)},\dots, {RNA}_m^{(k)} \) are the absolute abundance of lncRNAs and the superscript (k) represents the k-th sample in a given dataset. The relative expression of a lncRNA pair \( \left({RNA}_i^{(k)},{RNA}_j^{(k)}\right) \) is defined as

$$ {r}^{(k)}=I\left({RNA}_i^{(k)}-{RNA}_j^{(k)}\right) $$

\( \mathrm{where}\ \mathrm{I}(x)=\left\{\begin{array}{c}1, if\ x\ge 0\\ {}-1, if\ x<0\end{array}\right. \) is an indicator revealing whether x is greater or less than zero. If \( {RNA}_i^{(k)} \) is greater than or equal to \( {RNA}_j^{(k)} \), the relative expression of the lncRNA pair \( \left({RNA}_i^{(k)},{RNA}_j^{(k)}\right) \) will be 1. Otherwise, the relative expression will be − 1. To convert the absolute expression abundance into relative expression within each sample, subtraction was performed between every two lncRNAs (Fig. 1b), which was \( {RNA}_i^{(k)}-{RNA}_j^{(k)},\forall i,j\in \left\{1,\dots, \mathrm{m}\right\},\mathrm{i}\ne \mathrm{j} \). R(k) is a vector constituted by the relative expressions of all lncRNA pairs within the k-th sample (Eq. 2).

$$ {\displaystyle \begin{array}{l}{R}^{(k)}=\left({r}_{12}^{(k)},{r}_{13}^{(k)},\dots, {r}_{1m}^{(k)},{r}_{23}^{(k)},{r}_{24}^{(k)},\dots, {r}_{2m}^{(k)},\dots, {r}_{\left(m-1\right)m}^{(k)}\right)\\ {}=\Big(I\left({RNA}_1^{(k)}-{RNA}_2^{(k)}\right),I\left({RNA}_1^{(k)}-{RNA}_3^{(k)}\right),\dots, I\left({RNA}_1^{(k)}-{RNA}_m^{(k)}\right),\\ {}\begin{array}{l}\dots, I\left({RNA}_2^{(k)}-{RNA}_3^{(k)}\right),I\left({RNA}_2^{(k)}-{RNA}_4^{(k)}\right),\dots, I\left({RNA}_2^{(k)}-{RNA}_m^{(k)}\right),\\ {}\dots, I\left({RNA}_{m-1}^{(k)}-{RNA}_m^{(k)}\right)\Big)\end{array}\end{array}} $$

R(k) is in \( {C}_m^2 \) dimensions, as there are totally \( {C}_m^2=\frac{m!}{2!\left(m-2\right)!} \) lncRNA pairs.

The relative expression R(k) of lncRNAs together with label Y(k) forms a group of training data S = {(R(1), Y(1)), (R(2), Y(2)), …, (R(n), Y(n))}, where Y(k) equals to 0 for a normal sample or 1 for a sepsis sample. The relative expression of lncRNAs pairs within sample is stable, even though the absolute expression value may bias among samples ubiquitously. After the relative expression within sample extracted, the cross-sample analysis was conducted following the steps below.

To screen out the intra-sample signature for sepsis, cross-sample analysis was performed between sepsis and normal samples (Fig. 1c). The lncRNA pairs with a significant difference in their relative expression \( {r}_{ij}^{(k)} \) between sepsis and normal samples were extracted. Two situations, \( {r}_{ij}^{(k)}=1 \) (\( {RNA}_i^{(k)}>{RNA}_j^{(k)} \)) and \( {r}_{ij}^{(k)}=-1 \) (\( {RNA}_i^{(k)}<{RNA}_j^{(k)} \)), were taken into consideration. For each lncRNA pairs in the normal group, the number of RNAi > RNAj across n samples was calculated as

$$ \mathrm{a}=\frac{1}{2}{\sum}_{k=1}^n\left({r}_{ij}^{(k)}+1\right)\times \left(1-{Y}^{(k)}\right), $$

and the number of RNAi < RNAj is

$$ \mathrm{b}=\frac{1}{2}{\sum}_{k=1}^n\left(1-{r}_{ij}^{(k)}\right)\times \left(1-{Y}^{(k)}\right). $$

For the sepsis group, the number of RNAi > RNAj is

$$ \mathrm{c}=\frac{1}{2}{\sum}_{k=1}^n\left({r}_{ij}^{(k)}+1\right)\times {Y}^{(n)}, $$

while RNAi < RNAj is

$$ \mathrm{d}=\frac{1}{2}{\sum}_{k=1}^n\left(1-{r}_{ij}^{(k)}\right)\times {Y}^{(n)}. $$

Hence, the contingency table was obtained as follows,

  \( {RNA}_i^{(k)}>{RNA}_j^{(k)} \) \( {RNA}_i^{(k)}<{RNA}_j^{(k)} \)
Normal \( \mathrm{a}=\frac{1}{2}\sum \limits_{k=1}^n\left({r}_{ij}^{(k)}+1\right)\times \left(1-{Y}^{(k)}\right) \) \( \mathrm{b}=\frac{1}{2}\sum \limits_{k=1}^n\left(1-{r}_{ij}^{(k)}\right)\times \left(1-{Y}^{(k)}\right) \)
Sepsis \( \mathrm{c}=\frac{1}{2}\sum \limits_{k=1}^n\left({r}_{ij}^{(k)}+1\right)\times {Y}^{(n)} \) \( \mathrm{d}=\frac{1}{2}\sum \limits_{k=1}^n\left(1-{r}_{ij}^{(k)}\right)\times {Y}^{(n)} \)

After that, Fisher’s Exact Test was utilized to measure the ability of differentiating sepsis ones from the control samples for each lncRNA pair. The p value was calculated by

$$ \mathrm{p}=\frac{\left(\begin{array}{c}a+b\\ {}a\end{array}\right)\left(\begin{array}{c}c+d\\ {}c\end{array}\right)}{\left(\begin{array}{c}n\\ {}a+c\end{array}\right)}=\frac{\left(a+b\right)!\left(c+d\right)!\left(a+c\right)!\left(b+d\right)!}{a!b!c!d!n!} $$

, where n = a + b + c + d.

Bonferroni correction was then applied for multiple comparison correction. Base on the adjusted p values, the significantly altered lncRNA pairs (SALPs) were screened out between normal samples and septic samples as the signature (Fig. 1d). In this study, the SALPs rij was selected with \( {p}_{r_{ij}} \) smaller than 1 × 10−16.

Classification and evaluation

lncRNA pairs selected in the previous step may serve as a diagnostic signature. A classifier was built based on the relative expressions of lncRNA pairs, which were represented by r1, r2, …, rl for simplification. r is assigned 1 when a pair RNAi > RNAj indicates sepsis and r is assigned −1 otherwise. A risk score ρ indicating the sepsis possibility was calculated by taking the sum of all the differential lncRNA pairs, which is \( \uprho =\frac{1}{2}{\sum}_{q=1}^l\left({r}_q+1\right). \) The classifier was evaluated by Area Under the Receiver Operating Characteristic (AUROC) on eight independent validation cohorts (Fig. 1e) and then can be applied for diagnosis prediction (Fig. 1f). All the above experiments were conducted using Python 2.7.

Functional analysis

The interactions between lncRNAs and proteins were obtained from the RNAinter database [40]. Proteins interacted with a specific lncRNA were assumed to be involved in the similar biological functions with the lncRNA. We used hypergeometric test to evaluate the statistical significance of functional enrichment, mining functional terms or pathways that a set of proteins are overrepresented [41,42,43]. The functional similarity between two sets of genes or proteins was measured by semantic similarity using Wang’s algorithm [44]. These calculations were carried out using clusterProfiler [45] and GOSemSim [46] in R environment.


Data curation

We performed a systematic search for microarray datasets that collected whole blood from sepsis samples and identified three adult datasets and six children’s datasets (Table 1). The dataset GSE95233 with most sepsis samples was selected as discovery cohort. The other two adult datasets served as independent validation sets for the same group and six children’s dataset were utilized for divergent group validation. After Robust Multiarray Averaging (RMA) [35] normalization and reannotation, 3745 lncRNAs were obtained for individualized pair analysis of gene expression (iPAGE) to identify the significantly altered lncRNA pairs (SALPs). Moreover, the eight validation datasets normalized by MicroArray Suite 5.0 (MAS5.0) [36] were applied to evaluate the model performance across distinct normalization methods.

Intra-sample signature discovery

A total of 7,010,640 intra-sample lncRNA pairs were acquired from 3745 lncRNAs using exhaustive comparison between every lncRNA. Among these pairs, 14 lncRNA pairs are the most significantly different between the septic patients and control samples after performing cross-sample analysis (P value < 1 × 10−16, Fisher’s Exact test). The 14 lncRNA pairs are named SepSigLnc and can be used as a transcriptional signature for sepsis diagnosis (listed in Fig. 2b). In the discovery cohort, the lncRNAs on the left are smaller than the ones on the right column in expression level among all the normal samples, whereas the relative expression reverse in the sepsis samples, i.e., from RNAi < RNAj to RNAi > RNAj. Three lncRNAs are involved in multiple pairs among the 14 pairs, resulting in 19 individual lncRNAs. Specifically, ECRP, AC090627.1, and LOC101927974 are implicated in five, four, and three pairs, respectively.

Fig. 2
figure 2

Statistics and characteristics of 14 lncRNA pairs in SepSigLnc. a A circular plot showing the location of the identified lncRNA pairs in chromosome. Nodes represent lncRNAs and edges represent the lncRNA pairs between sepsis and normal states. Red and blue bars represent the mean absolute abundance in discovery cohort. b A grid of pie charts showing the percentage of the lncRNA pairs’ relative expression between sepsis and normal samples in the eight validation cohorts. Each pie represents the percentage of samples in a cohort whose lncRNA pair follow the relative expression (equal to − 1) shown on the left. c The genetic information of the lncRNAs in SepSigLnc

A circos plot shows the location of SepSigLnc on chromosomes with each line linking a pair of lncRNAs (Fig. 2a). Mean expression levels of lncRNAs in normal and sepsis samples among discovery cohort are displayed in blue and red bars, respectively. The expression levels of lncRNAs in a pair reverse between sepsis and control samples. For instance, C5orf66 has higher expression level than AC090627.1 among the sepsis patients, whereas it expresses lower than AC090627.1 among the normal samples. For the eight validation cohorts, the identified 14 lncRNA pairs illustrate distinct expression patterns between normal and sepsis samples (Fig. 2b). Specifically, the proportion of samples with relative expression equal to − 1 among normal samples is no less than that among sepsis patients in each cohort. In another aspect, normal individuals have fewer reverse lncRNA pairs than sepsis ones (Supplementary Figure S2 in Additional file). Figure 2c provides the detailed information of the 19 lncRNAs in SepSigLnc.

Model comparison

For comparison, we also trained machine learning methods on the discovery cohort including logistic regression, nearest neighbors classifier, linear support vector machine (SVM), gaussian process classification (GPC), random forest, and neural network. As the workflow presented in Supplementary Figure S1 (see Additional file), the top 19 significantly differential lncRNAs, the same number of lncRNAs as SepSigLnc, were extracted by independent student’s T-test as features for machine learning models. After tuning with 10-fold cross validation and grid search, the optimal number of estimators for random forest was 10, maximum depth was 2. As for neural network, two hidden layers with 20 neurons for each layer had the best performance through grid search. Also, we compared SepSigLnc with existing genetic signatures SeptiCyte [10] and sNIP [9]. The AUROCs for all the methods on the discovery cohort are shown in Fig. 3a.

Fig. 3
figure 3

AUROC of SepSigLnc, machine learning methods, and benchmarks in discovery cohort and validation data sets. a AUROC of SepSigLnc, machine learning methods, and benchmarks on discovery cohort GSE95233. b Performance on adult validation cohort GSE57065 measured by AUROC. c Performance on adult validation cohort GSE28750 measured by AUROC. d Boxplots of performance in six children’s validation cohorts measured by AUROC. e Boxplots of performance measured by AUPRC on totally eight validation cohorts normalized with RMA and MAS5.0

Validation by adult data sets

GSE57065 and GSE28750, the same age group as the training cohort, were employed as independent validation cohorts to test the performance of SepSigLnc on adults. SepSigLnc achieved the AUROC of 0.995 on GSE57065 and AUROC of 0.990 on GSE28750 (Fig. 3b and c). We compared the performance of SepSigLnc with other machine learning methods using the same number of lncRNAs. Logistic regression and linear SVM had the highest AUROC of 1 on GSE57065, while the SepSigLnc achieved a comparable result of 0.995. Nevertheless, SeptiCyte and sNIP performed not as well as SepSigLnc. As for GSE28750, SepSigLnc performed better than all the other models listed, especially logistic regression (AUROC = 0.600) and linear SVM (AUROC = 0.650). Taking the results on GSE57065 and GSE28750 into overall consideration, SepSigLnc has the comparably best performance.

Validation by children’s data sets

To test the performance of relative expression on various groups of cohorts, we applied SepSigLnc to six children’s data sets GSE8121, GSE9692, GSE13904, GSE26378, GSE4607, and GSE26440. SepSigLnc attained AUROCs over 0.900 on most of the cohorts except GSE9692 and GSE13904(Fig. 3d, supplementary Table S1, and Figure S3 in Additional file). Compared with other machine learning models, SepSigLnc performed better overall. Besides, it surpassed SeptiCyte and was comparative to sNIP. Although the nearest neighbor classifier performed better with AUROC = 0.933 on GSE9692 and AUROC = 0.865 on GSE13904, it almost randomly classified sepsis and normal cases on GSE4607. Neural network achieved AUROC between 0.790 and 0.900 among all the children’s cohorts. Comprehensively, no model can dominate on all cohorts, but SepSigLnc is superior to the others on average.

Validation by MAS normalization

Validation was also conducted on data with different normalization methods to evaluate the robustness of SepSigLnc. We exerted SepSigLnc and machine learning methods on the same independent adult and children’s cohorts as the previous experiments but normalized by MAS5.0. For the MAS5.0 normalized cohorts, SepSigLnc outperformed machine learning models and other signatures (Fig. 3e, supplementary Table S2, and Figure S4 in Additional file). Furthermore, SepSigLnc appeared to maintain the same level of performance between the two normalization methods, while machine learning methods, SeptiCyte, and sNIP declined on MAS5.0 normalization in comparison to RMA normalization. In consequence, SepSigLnc is superior to the listed machine learning models, SeptiCyte and sNIP across different normalization methods.

Functional analysis of SepSigLnc

We investigated the functional mechanisms of the lncRNAs in SepSigLnc using the rule of guilt by association. First, the interactions between lncRNAs and proteins were established based on the RNAInter database [40]. Then, the common proteins interacted by a pair of lncRNAs were used for function enrichment analysis (Fig. 4a). Specifically, the lncRNA RP11-533E19.7 interacts with 1095 proteins and AC100830.4 interacts with 307 proteins, 273 out of which are shared by the lncRNA pair (Fig. 4b). Gene Ontology (GO) enrichment analysis shows the overlapping proteins are implicated in the biological processes of covalent chromatin modification, histone modification, DNA-templated transcription and initiation, cell fate commitment, steroid hormone mediated signaling pathway, etc. (Fig. 4c).

Fig. 4
figure 4

Functional analysis of the lncRNAs in SepSigLnc. a The common proteins targeted by a pair of lncRNAs were used to derive the functions of the lncRNA pair. b Venn diagram of the proteins targeted by RP11-533E19.7 and AC100830.4. c The most significantly enriched biological processes for the overlap proteins targeted by RP11-533E19.7 and AC100830.4. The node color represents the enrichment significance and the node size refers to the number of genes in the functional category. d Venn diagram of the proteins targeted by AC008753.4 and LOC101928817. e The most significantly enriched biological processes for the overlap genes targeted by AC008753.4 and LOC101928817

Concentrating on the common target proteins of the AC008753.4 - LOC101928817 pair (Fig. 4d), we found most of the enriched functions are the same as the RP11-533E19.7 - AC100830.4 pair, including DNA-templated transcription and initiation, cell fate commitment, steroid hormone mediated signaling pathway, etc. (Fig. 4e). Interestingly, the two lncRNA pairs have an extremely high function similarity, revealing that the SepSigLnc members tend to be implicated in the same pathway associated with sepsis.

On top of the two independent pairs mentioned above, the other 12 pairs composed three connected motifs, including five, four, and three pairs, respectively (Fig. 5a). For instance, the largest motif contains five lncRNA pairs among six lncRNAs, i.e., ECRP, CTD-2012 K14.6, LOC101926943, MCM3AP-AS1, MGC27345, and STARD7-AS1. All the five pairs share a single lncRNA ECRP. Importantly, a substantially high proportion of target proteins are shared by the five lncRNA pairs (Fig. 5b), indicating that the motif tends to be involved in similar biological processes. Apart from one pair without common target proteins, indeed, the other four lncRNA pairs functions are remarkably similarly based on the enrichment analysis (Fig. 5c). Specifically, ECRP separately shares 62, 46, 66, and 62 proteins with STARD7-AS1, LOC101926943, MCM3AP-AS1, and CTD-2012 K14.6, the functions enriched by which are pretty consistent, including covalent chromatin modification, histone modification, DNA-templated transcription and initiation, cell fate commitment, cellular response to steroid hormone stimulus, etc. (Fig. 5c).

Fig. 5
figure 5

Functional analysis of three connected motifs in SepSigLnc. a Twelve out of 14 lncRNA pairs in SepSigLnc compose three connected motifs. b Venn diagram of the proteins targeted by ECRP, CTD-2012 K14.6, LOC101926943, MCM3AP-AS1, and STARD7-AS1. c The most significantly enriched biological processes for the overlap proteins targeted by ECRP, CTD-2012 K14.6, LOC101926943, MCM3AP-AS1, and STARD7-AS1. d The heat map of function similarity between the above five lncRNAs. e Semantic similarity (SS) between lncRNAs. Blue line represents the distribution of the SS between all lncRNAs pairs. Bright red boxplot shows the SS between the lncRNAs in SepSigLnc. Dark red boxplot displays the SS between lncRNA ECRP, CTD-2012 K14.6, LOC101926943, MCM3AP-AS1, and STARD7-AS1. f The heat map of function similarity between 14 lncRNA pairs in SepSigLnc

We further used semantic similarity (SS) to measure the functional similarity between different groups of proteins. As shown in Fig. 5d, the SS scores of all these lncRNA pairs are close to 1 except MGC27345 (about 0.6), which is significantly higher than the randomly picked up lncRNA pairs (Ranksum test, P < 1.34e-22, Fig. 5e). To simulated the SS of random pairs, we grabbed around 20,000 proteins from the RNAinter database [40] and calculated the similarity among any two arbitrarily picked-up protein sets with size ranging from ten to 100 for 10,000 times. Moreover, we evaluated the functional similarity of all the lncRNAs in SepSigLnc and observed a high consistency of them in semantic similarity. Most of the SS scores were over 0.7 (Fig. 5f), which is also significantly higher than the simulated data (Ranksum test, P < 1.51e-18, Fig. 5e). Since no target proteins have been detected yet for AC090627 and LOC101927974, the other two motifs were not deciphered in this study. These results indicated that an expression alteration of a pair of lncRNAs may be involved in critical biological pathways that affect sepsis progression through disrupting the balance of the lncRNA-gene regulatory network.


Rapid diagnosis through molecular detection contributes to early intervention for rescue which can reduce the morbidity and mortality of sepsis. Studies revealed that lncRNAs regulate inflammation-related genes and may serve as potential biomarkers or signatures for sepsis diagnosis. Nevertheless, prediction using the absolute expression levels of lncRNAs may deviate since high-throughput platforms are sensitive to various forms of technical bias. The generated continuous measurements were not measurable and comparable, even though they were preprocessed by plausible normalization methods. Using individualized pairwise analysis of gene expression (iPAGE) based on relative expression instead of the absolute abundance, we identified 14 lncRNA pairs named SepSigLnc. The signature’s relative expression of normal controls and sepsis patients are complements to each other respectively and very stable. Importantly, we examined the SepSigLnc on two independent adult as well as six children validation cohorts and demonstrated our model overall outperforms machine learning models and existing biomarkers on the basic of absolute gene expression level. Therefore, SepSigLnc is a reliable and robust signature for the diagnosis and initial evaluation of sepsis. Our results also revealed that the relative expression is more reliable than the absolute abundance in the sepsis high throughput data, which further confirm the previous discoveries [39, 47].

We further tested the diagnosis capability of SepSigLnc across different normalization methods. As shown in Fig. 3e, the performance of SepSigLnc does not reduce (or is even better) when examined on the expression cohorts normalized using another method MAS5.0, whereas it does for other machine learning methods as well as Septicyte and sNIP. To retrieve the absolute abundance of genes from systematic bias of experiments, normalization methods such as RMA and MAS5.0 adjust the expression value according to specific assumptions. Using relative expression of lncRNAs pairs obviates the disturbance brought by normalization, since different normalization methods result in different gene expression patterns, based on disparate assumptions [37, 48, 49].

Theoretically, iPAGE is able to perform cross-platform analysis by integrating expression cohorts from different resources. The absolute expression levels of genes may vary when detected by different high throughput technics such as RNA-seq and microarray, different profiling platforms, or different production batches. iPAGE extracts the relative expression of lncRNA pairs within samples for diagnosis instead of using the absolute abundance and consequently stable across experimental assays and platforms. Validation for the assumption are left for future work.

From the function enrichment analysis of SepSigLnc, the members in a five-lncRNA motif are significantly similar and they share a large amount of interacting proteins, which are implicated in the biological processes of covalent chromatin modification, histone modification, DNA-templated transcription and initiation, cell fate commitment, steroid hormone mediated signaling pathway, etc. (Figs. 4 and 5). In particular, steroid hormone mediated signaling pathway involves in the biological process related to sepsis, which shows high consistence with the previous clinical findings. Sepsis patients often show acute alterations in hormones, including downregulation of hormones such as thyroid stimulating hormone, triiodothyronine, testosterone and estrogen, and upregulation of other hormones such as cortisol and vasopressin [50]. The alteration of hormones partially reflects the physiological response of sepsis patients and it has significant effects on clinical outcomes. For instance, thyroid hormone levels were reduced by different mechanisms during sepsis. Patients with mild illness often have euthyroid sick syndrome, i.e., a decrease in serum levels of triiodothyronine and a mild increase in thyroxine. Moreover, the degree of hormone alteration is associated with disease severity and has been served as a predictor of poor outcome in sepsis [51].

Notably, the function analysis of lncRNA pairs in SepSigLnc implies the regulation mechanism may not only rely on the upregulation or downregulation of a single lncRNA or protein-coding gene, but also through the switch of ordering between lncRNAs in a pair. For instance, SPI1 and GATA1 is a well-studied pair in the hematopoietic system. SPI1 specifies the myeloid lineage, where SPI1> > GATA1, while GATA1 specifies the erythroid lineage, featured by GATA1> > SPI1 [52].

Although we constructed a large-scale study covering nine cohorts and 768 samples, all the transcriptome datasets were detected using microarray instead of sequencing techniques. We call for large-scale RNA-seq datasets of sepsis to promote and facilitate the development molecular detection. Additionally, converting the lncRNAs into relative expressions loses information contained in absolute value and hence needs more lncRNA pairs for accurate diagnosis. In the near future, we will validate SepSigLnc through reverse transcription-polymerase chain reaction (RT-PCR) for clinical utility. While the specific role of SepSigLnc in sepsis diagnosis remains unclear, the regulation mechanism needs to be investigated through in vivo experiments such as the cecal ligation and puncture (CLP) model of mouse.

Availability of data and materials

All the datasets supporting the conclusions of this article are collected from Gene Expression Omnibus (GEO) repository and public available at their stated accession IDs: GSE95233, GSE57065, GSE28750, GSE8121, GSE9692, GSE13904, GSE26378, GSE4607 and GSE26440 (Table 1).



Intensive care unit




Long non-coding RNAs


Sepsis signature lncRNA pairs


Area under the receiver operating characteristic


Gene expression omnibus


Robust multi-array average


Microarray suite 5.0


Individualized pair analysis of gene expression


Significantly altered lncRNA pairs


Support vector machine


Gaussian process classification


Gene ontology


Semantic similarity


Reverse transcription-polymerase chain reaction


Cecal ligation and puncture


  1. Sepsis: data & reports. Accessed 25 May 2020.

  2. Annane D, Bellissant E, Cavaillon JM. Septic shock. Lancet. 2005;365(9453):63–78.

    CAS  Article  PubMed  Google Scholar 

  3. Cohen J, Vincent JL, Adhikari NK, Machado FR, Angus DC, Calandra T, et al. Sepsis: a roadmap for future research. Lancet Infect Dis. 2015;15(5):581–614.

  4. Meynaar IA, Droog W, Batstra M, Vreede R, Herbrink P. In critically ill patients, serum Procalcitonin is more useful in differentiating between Sepsis and SIRS than CRP, Il-6, or LBP. Crit Care Res Pract. 2011;2011:594645.

    PubMed  PubMed Central  Google Scholar 

  5. Hoenigl M, Raggam RB, Wagner J, Prueller F, Grisold AJ, Leitner E, et al. Procalcitonin fails to predict bacteremia in SIRS patients: a cohort study. Int J Clin Pract. 2014;68(10):1278–81.

  6. Cruciani M. Meta-analyses of diagnostic tests in infectious diseases: how helpful are they in the intensive care setting? HSR Proc Intensive Care Cardiovasc Anesth. 2011;3(2):103–8.

    CAS  PubMed  PubMed Central  Google Scholar 

  7. Tang BMP, McLean AS, Dawes IW, Huang SJ, Lin RCY. The use of gene-expression profiling to identify candidate genes in human sepsis. Am J Respir Crit Care Med. 2007;176(7):676–84.

    CAS  Article  PubMed  Google Scholar 

  8. Cheng L, Fan K, Huang Y, Wang D, Leung K-S. Full characterization of localization diversity in the human protein Interactome. J Proteome Res. 2017;16(8):3019–29.

    CAS  Article  PubMed  Google Scholar 

  9. Scicluna BP, Wiewel MA, van Vught LA, Hoogendijk AJ, Klarenbeek AM, Franitza M, et al. Molecular biomarker to assist in diagnosing abdominal sepsis upon ICU admission. Am J Respir Crit Care Med. 2018;197(8):1070–3.

  10. McHugh L, Seldon TA, Brandon RA, Kirk JT, Rapisarda A, Sutherland AJ, et al. A molecular host response assay to discriminate between Sepsis and infection-negative systemic inflammation in critically ill patients: discovery and validation in independent cohorts. PLoS Med. 2015;12(12):e1001916.

  11. Sweeney TE, Khatri P. Benchmarking sepsis gene expression diagnostics using public data. Crit Care Med. 2017;45(1):1–10.

    Article  PubMed  PubMed Central  Google Scholar 

  12. Cheng L, Nan C, Kang L, Zhang N, Liu S, Chen H, et al. Whole blood transcriptomic investigation identifies long non-coding RNAs as regulators in sepsis. J Transl Med. 2020;18(1):217.

  13. Liu X, Xu Y, Wang R, Liu S, Wang J, Luo Y, et al. A network-based algorithm for the identification of moonlighting noncoding RNAs and its application in sepsis. Brief Bioinform. 2021;22(1):581–8.

  14. Cheng L, Leung KS. Identification and characterization of moonlighting long non-coding RNAs based on RNA and protein interactome. Bioinformatics. 2018;34(20):3519–28.

    CAS  Article  PubMed  Google Scholar 

  15. Cheng L, Leung KS. Quantification of non-coding RNA target localization diversity and its application in cancers. J Mol Cell Biol. 2018;10(2):130–8.

    CAS  Article  PubMed  Google Scholar 

  16. Ho J, Chan H, Wong SH, Wang MH, Yu J, Xiao Z, et al. The involvement of regulatory non-coding RNAs in sepsis: a systematic review. Crit Care. 2016;20(1):383.

  17. Reddy MA, Chen Z, Park JT, Wang M, Lanting L, Zhang Q, et al. Regulation of inflammatory phenotype in macrophages by a diabetes-induced long noncoding RNA. Diabetes. 2014;63(12):4249–61.

  18. Chen H, Wang X, Yan X, Cheng X, He X, Zheng W. LncRNA MALAT1 regulates sepsis-induced cardiac inflammation and dysfunction via interaction with miR-125b and p38 MAPK/NFkappaB. Int Immunopharmacol. 2018;55:69–76.

    CAS  Article  PubMed  Google Scholar 

  19. Huang Q, Huang C, Luo Y, He F, Zhang R. Circulating lncRNA NEAT1 correlates with increased risk, elevated severity and unfavorable prognosis in sepsis patients. Am J Emerg Med. 2018;36(9):1659–63.

    Article  PubMed  Google Scholar 

  20. Huang S, Qian K, Zhu Y, Huang Z, Luo Q, Qing C. Diagnostic value of the lncRNA NEAT1 in peripheral blood mononuclear cells of patients with Sepsis. Dis Markers. 2017;2017:7962836.

    PubMed  PubMed Central  Google Scholar 

  21. Xu Y, Shao B. Circulating long noncoding RNA ZNFX1 antisense RNA negatively correlates with disease risk, severity, inflammatory markers, and predicts poor prognosis in sepsis patients. Medicine (Baltimore). 2019;98(9):e14558.

    CAS  Article  Google Scholar 

  22. Wang J, Zhang X, Cheng L, Luo Y. An overview and metanalysis of machine and deep learning-based CRISPR gRNA design tools. RNA Biol. 2020;17(1):13–22.

    CAS  Article  PubMed  Google Scholar 

  23. Wang J, Xiang X, Bolund L, Zhang X, Cheng L, Luo Y. GNL-scorer: a generalized model for predicting CRISPR on-target activity by machine learning and featurization. J Mol Cell Biol. 2020;12(11):909–11.

  24. Elkarami B, Alkhateeb A, Rueda L. Cost-sensitive classification on class-balanced ensembles for imbalanced non-coding RNA data. In: 2016 IEEE EMBS International Student Conference (ISC): 29–31 May 2016; 2016. p. 1–4.

    Google Scholar 

  25. Liu S, Zhao W, Liu X, Cheng L. Metagenomic analysis of the gut microbiome in atherosclerosis patients identify cross-cohort microbial signatures and potential therapeutic target. FASEB J. 2020;34(11):14166–81.

    CAS  Article  PubMed  Google Scholar 

  26. Liu X, Zheng X, Wang J, Zhang N, Leung K-S, Ye X, et al. A long non-coding RNA signature for diagnostic prediction of sepsis upon ICU admission. Clin Transl Med. 2020;10(3):e123.

  27. Venet F, Schilling J, Cazalis MA, Demaret J, Poujol F, Girardot T, et al. Modulation of LILRB2 protein and mRNA expressions in septic shock patients and after ex vivo lipopolysaccharide stimulation. Hum Immunol. 2017;78(5–6):441–50.

  28. Cazalis MA, Lepape A, Venet F, Frager F, Mougin B, Vallin H, et al. Early and dynamic changes in gene expression in septic shock patients: a genome-wide approach. Intensive Care Med Exp. 2014;2(1):20.

  29. Sutherland A, Thomas M, Brandon RA, Brandon RB, Lipman J, Tang B, et al. Development and validation of a novel molecular biomarker diagnostic test for the early detection of sepsis. Crit Care. 2011;15(3):R149.

  30. Shanley TP, Cvijanovich N, Lin R, Allen GL, Thomas NJ, Doctor A, et al. Genome-level longitudinal expression of signaling pathways and gene networks in pediatric septic shock. Mol Med. 2007;13(9–10):495–508.

  31. Cvijanovich N, Shanley TP, Lin R, Allen GL, Thomas NJ, Checchia P, et al. Validating the genomic signature of pediatric septic shock. Physiol Genomics. 2008;34(1):127–34.

  32. Wong HR, Cvijanovich N, Allen GL, Lin R, Anas N, Meyer K, et al. Genomic expression profiling across the pediatric systemic inflammatory response syndrome, sepsis, and septic shock spectrum. Crit Care Med. 2009;37(5):1558–66.

  33. Wong HR, Cvijanovich NZ, Allen GL, Thomas NJ, Freishtat RJ, Anas N, et al. Corticosteroids are associated with repression of adaptive immunity gene programs in pediatric septic shock. Am J Respir Crit Care Med. 2014;189(8):940–6.

  34. Wong HR, Cvijanovich N, Wheeler DS, Bigham MT, Monaco M, Odoms K, et al. Interleukin-8 as a stratification tool for interventional trials involving pediatric septic shock. Am J Respir Crit Care Med. 2008;178(3):276–82.

  35. Irizarry RA, Hobbs B, Collin F, Beazer-Barclay YD, Antonellis KJ, Scherf U, et al. Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics. 2003;4(2):249–64.

  36. Hubbell E, Liu WM, Mei R. Robust estimators for expression analysis. Bioinformatics. 2002;18(12):1585–92.

    CAS  Article  PubMed  Google Scholar 

  37. Liu X, Li N, Liu S, Wang J, Zhang N, Zheng X, et al. Normalization methods for the analysis of unbalanced Transcriptome data: a review. Front Bioeng Biotechnol. 2019;7:358.

  38. Heinaniemi M, Nykter M, Kramer R, Wienecke-Baldacchino A, Sinkkonen L, Zhou JX, et al. Gene-pair expression signatures reveal lineage control. Nat Methods. 2013;10(6):577–83.

  39. Wang H, Sun Q, Zhao W, Qi L, Gu Y, Li P, et al. Individual-level analysis of differential expression of genes and pathways for personalized medicine. Bioinformatics. 2015;31(1):62–8.

  40. Lin Y, Liu T, Cui T, Wang Z, Zhang Y, Tan P, et al. RNAInter in 2020: RNA interactome repository with increased coverage and annotation. Nucleic Acids Res. 2020;48(D1):D189–97.

  41. Li L, Liu M, Yue L, Wang R, Zhang N, Liang Y, et al. Host-guest protein assembly for affinity purification of Methyllysine proteomes. Anal Chem. 2020;92(13):9322–9.

  42. Cheng L, Liu P, Wang D, Leung KS. Exploiting locational and topological overlap model to identify modules in protein interaction networks. BMC Bioinformatics. 2019;20(1):23.

    Article  PubMed  PubMed Central  Google Scholar 

  43. Cheng L, Liu P, Leung KS. SMILE: a novel procedure for subcellular module identification with localisation expansion. IET Syst Biol. 2018;12(2):55–61.

    Article  PubMed  PubMed Central  Google Scholar 

  44. Wang JZ, Du Z, Payattakool R, Yu PS, Chen C-F. A new method to measure the semantic similarity of GO terms. Bioinformatics. 2007;23(10):1274–81.

    CAS  Article  PubMed  Google Scholar 

  45. Yu G, Wang L-G, Han Y, He Q-Y. clusterProfiler: an R package for comparing biological themes among gene clusters. Omics. 2012;16(5):284–7.

    CAS  Article  PubMed  PubMed Central  Google Scholar 

  46. Yu G, Li F, Qin Y, Bo X, Wu Y, Wang S. GOSemSim: an R package for measuring semantic similarity among GO terms and gene products. Bioinformatics. 2010;26(7):976–8.

    CAS  Article  PubMed  Google Scholar 

  47. Guan Q, Chen R, Yan H, Cai H, Guo Y, Li M, et al. Differential expression analysis for individual cancer samples based on robust within-sample relative gene expression orderings across multiple profiling platforms. Oncotarget. 2016;7(42):68909–20.

  48. Cheng L, Lo LY, Tang NL, Wang D, Leung KS. CrossNorm: a novel normalization strategy for microarray data in cancers. Sci Rep. 2016;6(1):18898.

    CAS  Article  PubMed  PubMed Central  Google Scholar 

  49. Cheng L, Wang X, Wong PK, Lee KY, Li L, Xu B, et al. ICN: a normalization method for gene expression data considering the over-expression of informative genes. Mol BioSyst. 2016;12(10):3057–66.

  50. Gheorghita V, Barbu AE, Gheorghiu ML, Caruntu FA. Endocrine dysfunction in sepsis: a beneficial or deleterious host response? Germs. 2015;5(1):17–25.

    CAS  Article  PubMed  PubMed Central  Google Scholar 

  51. Stathatos N, Levetan C, Burman KD, Wartofsky L. The controversy of the treatment of critically ill patients with thyroid hormone. Best Pract Res Clin Endocrinol Metab. 2001;15(4):465–78.

    CAS  Article  PubMed  Google Scholar 

  52. Huang S, Guo YP, May G, Enver T. Bifurcation dynamics in lineage-commitment in bipotent progenitor cells. Dev Biol. 2007;305(2):695–713.

    CAS  Article  PubMed  Google Scholar 

Download references


We are grateful to Mr. Lei Zhu and Mr. Ziliang Lai for providing the server for our computational experiment.


This work was supported by the Guangdong Basic and Applied Basic Research Foundation [grant numbers: 2019A1515110097 to L.C].

Author information

Authors and Affiliations



X. Z. and L. C. conceived and designed the algorithms and experiments. X. Z. performed the experiments under the supervision of M.H. W., K.S. L., and L. C.. L.C. preprocessed the data and analyzed the results. X. Z. and L. C. jointly wrote the manuscript. K.S. L. reviewed and helped writing the manuscript. The author(s) read and approved the final manuscript.

Corresponding author

Correspondence to Lixin Cheng.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

We declare that we do not have any commercial or associative interest that represents a conflict of interest in connection with the work submitted.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Additional file 1: Figure S1.

Workflow of machine learning methods. Figure S2. The number of reversal lncRNA pairs in SepSigLnc between normal and sepsis samples in eight validation sets. Figure S3. AUROC curves on discovery cohort (GSE95233) and eight validation cohorts (others) normalized by RMA. Figure S4. AUROC curves on discovery cohort (GSE95233) normalized by RMA and eight validation cohorts (others) normalized by MAS5.0 normalization. Table S1. Performance measured by AUROC of SepSigLnc and machine learning methods on independent validation cohorts with RMA normalization. Table S2. Performance measured by AUROC of SepSigLnc and machine learning methods on independent validation cohorts with MAS5.0 normalization.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit The Creative Commons Public Domain Dedication waiver ( applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Reprints and Permissions

About this article

Verify currency and authenticity via CrossMark

Cite this article

Zheng, X., Leung, KS., Wong, MH. et al. Long non-coding RNA pairs to assist in diagnosing sepsis. BMC Genomics 22, 275 (2021).

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI:


  • Sepsis
  • Diagnostics
  • Signature
  • Long non-coding RNA
  • Relative expression