 Research
 Open Access
 Published:
Sequencespecific bias correction for RNAseq data using recurrent neural networks
BMC Genomics volume 18, Article number: 1044 (2017)
Abstract
Background
The recent success of deep learning techniques in machine learning and artificial intelligence has stimulated a great deal of interest among bioinformaticians, who now wish to bring the power of deep learning to bare on a host of bioinformatical problems. Deep learning is ideally suited for biological problems that require automatic or hierarchical feature representation for biological data when prior knowledge is limited. In this work, we address the sequencespecific bias correction problem for RNAseq data redusing Recurrent Neural Networks (RNNs) to model nucleotide sequences without predetermining sequence structures. The sequencespecific bias of a read is then calculated based on the sequence probabilities estimated by RNNs, and used in the estimation of gene abundance.
Result
We explore the application of two popular RNN recurrent units for this task and demonstrate that RNNbased approaches provide a flexible way to model nucleotide sequences without knowledge of predetermined sequence structures. Our experiments show that training a RNNbased nucleotide sequence model is efficient and RNNbased bias correction methods compare well with thestateoftheart sequencespecific bias correction method on the commonly used MAQCIII data set.
Conclustions
RNNs provides an alternative and flexible way to calculate sequencespecific bias without explicitly predetermining sequence structures.
Background
In recent years, deep learning techniques have been successfully applied to a number of challenging problems, including speech recognition, image processing, and machine translation. One major advantage of deep learning over traditional machine learning methods is the automatic extraction of a hierarchical feature representation for raw data and subsequent learning of a mapping to some desired output space. This is especially useful for those datasets for which manual feature extraction is difficult. In the bioinformatics community, more and more effort has been devoted to the application of deep learning methods to specific problems, such as gene expression [1, 2], alternative splicing [3, 4], gene regulation [5–8], and protein secondary structure prediction [9]. Progress of this kind has encouraged researchers to further explore other challenging bioinformatics problems in a deep learning way. In this work, we focus on the sequencespecific bias problem for RNAseq data and proposed to solve it using RNNs.
RNA sequencing (RNAseq) is a widely used wholegenome scale transcriptome profiling tool, which provides a highthroughput approach to measure gene and transcript expression for cell activities. For biological experiments using RNAseq, many downstream analyses are heavily dependent on RNAseq results. Therefore, the accuracy of expression estimation for RNAseq data is important for downstream analysis.
Currently, most RNAseq protocols require the preparation of a cDNA library, which coverts target RNA molecules to cDNA molecules for sequencing. In different stages of cDNA library preparation, such as RNA extraction and fragmentation, reverse transcription and amplification, different technical biases are incorporated. These biases inevitably effect the resulting expression estimation for RNAseq data. Characterizing these technical biases and correcting their effect can go a long way to improve expression estimation accuracy. Here, we focus on one particular type of technical bias called sequencespecific bias. This bias, which happens when a read being selected for sequencing is affected by nucleotide sequences of primers, is introduced during reverse transcription due to differing binding efficiencies of random primers.
In the previous work, Hansen et al. [10] proposed to use a ratio of heptamer (7mer) frequencies at the beginning and the middle position of a read as the sequencespecific bias weight. Cufflinks [11] uses a variable length Markov model to model sequencespecific bias and estimate the bias jointly with isoform expression estimation. Compared with Hansen’s method, Cufflinks takes the upstream nucleotides of a read’s aligned position into account and models positionspecific bias cross transcript as well. Jones et al. [12] trained a Bayesian network with limited restrictions to determine the nucleotide sequence structure (nucleotide dependency topology in a sequence) that best distinguishes between biased and unbiased sequences. Read bias weights are then calculated based on the estimated sequence probabilities given the determined sequence structure.
In this paper, we propose to use Recurrent Neural Networks to estimate sequencespecific bias for RNAseq reads. We are motivated to make use of the advantages of RNNs — namely that any arbitrary dependency of elements in a sequence can be automatically learned from data. Instead of explicitly determining sequence structure in advance, we use RNNs to characterize nucleotide sequences as a characterbased language modelling task and calculate read bias weights according to RNN predicted sequence scores.
Method
In this section, we describe the RNNbased sequencespecific bias correction pipeline for gene expression estimation (Additional file 1: Figure S1).
The pipeline can be divided into two major parts: (1) RNNbased bias estimation, and (2) gene expression quantification with bias reweighing. We describe each part in the following sections.
RNNbased sequencespecific bias estimation for RNAseq reads
In the process of cDNA library preparation, due to differing random primer binding efficiencies, a sequencespecific bias is introduced that results in the likelihood of a read being selected for sequencing is effected by the nucleotide sequence surrounding its read startend.
To characterize sequencespecific bias, we follow a general approach introduced in previous work [10–12]. We sample a set of biased sequences and unbiased sequences, and train two sequence models for describing these sequences separately. The biased sequences are extracted surrounding the reference genomic coordinates where read startends are located. The unbiased sequences are extracted by randomly offsetting biased positions. We refer to biased sequences as foreground sequences, and unbiased sequences as background sequences. For a given read r, we use foreground and background models to calculate the sequence probability for the context nucleotide sequence of the read. The bias of the read is defined as the ratio of context genomic sequence probabilities as predicted by background and foreground models
where p _{ f } and p _{ b } are foreground and background sequence probabilities, respectively. The function contextSeq returns the context genomic sequence surrounding the read startend location loci(r) in reference genome.
The major difference among these types of sequencespecific bias correction methods is the model used to calculate nucleotide sequence probabilities. For example, Hansen et al. [10] counted hexamer (7mer) positional statistics at the beginning and middle positions of a read as p _{ f } and p _{ b }. In the 7mer case, there are 16383 (4^{7}−1) parameters for different positions so that the training data may not always be large enough to ensure accurate parameter estimation. Cufflinks [11] circumvents this problem by using a variable length Markov model, which only considered higher order dependencies for positions surrounding read startend positions. Jones et al. [12] trained a Bayesian network to find principle structures of sequences that can be used to best distinguish between foreground and background sequences. Here, we propose to use RNNs for modeling sequence probabilities.
RNNs are artificial neural networks featured with directedcycle connections between hidden neuron units. In the recurrent loop connection, shown in Additional file 1: Figure S1(c), a hidden unit from a previous timestep is connected to a current hidden unit. The information carried in the previous hidden unit is deemed as the “memory” of any previously visited sequence elements. Formally, at each timestep of the RNN an observed input vector s _{ t } and hidden state vector are used as inputs to calculate the current hidden vector h _{ t } through the following recursive operation:
Here, and are model parameters. The output y _{ t } is calculated with the softmax function on hidden units:
Here, c and b are bias units for the output and hidden layer, respectively.
Although an RNN has the ability to model arbitrarily long dependencies in theory, it is not easy to train an RNN, because of the nonlinear connection between hidden units. In the backward step of the training phase, the gradient update at the end of a sequence may not be able to be backpropagated to the beginning of the sequence for a parameter update. This is called “gradient vanish” [13]. A commonly used strategy to solve this problem involves incorporating a gating mechanism for recurrent hidden units in an RNN, which forces a linear connection between recurrent units. In our pipeline, we use two types of gating RNNs: Long Shortterm Memory Model (LSTM) [14] and Gated Recurrent Unit (GRU) [15].
Formally, given a trained foreground or background RNN model, the sequence probability of S=s _{1} s _{2}…s _{ m } is calculated according to
where m is the sequence length. The conditional probability p(s _{ t }s _{1}…s _{ t−1}) is expanded in the RNN as follows:
The sequence probability factorizes elementwise for each timestep in the sequence, and each pointwise probability is calculated based on hidden units h _{ i } in the current timestep.
Bias correction for gene expression
After estimating sequencespecific bias weight for all reads, we correct the gene expression levels in light of sequencespecific bias. In the pipeline, after reads are mapped to a reference genome, we count uniquely mapped reads for each position in annotated exon regions. In the nonbias correction case, we can derive raw gene expression levels based on the counted reads in the gene with normalization. In particular, we perform FPKM normalization [16] normalized with gene length and the number of totally mapped reads:
where i is the exon genomic coordinate of the gene. In the bias correction case, the read counts are reweighted with according bias weight in each position:
Experiments and results
We evaluated the RNNbased sequence specific bias correction method using the MAQCIII dataset (GSE47774) [17]. In the dataset, each sample contains multiple expression level measurements from different technology platforms. We used the RNAseq and qPCR data from Universal Human Reference RNA (UHRR) sample A and Human Brain Reference (HBRR) sample B for evaluation.
According to the pipeline, as shown in Additional file 1: Figure S1, we first aligned RNAseq reads to the reference human genome (hg19/GRCg37) using “STAR” [18]. Over 87% of the approximately 100 million 100bp pairedend reads were uniquely mapped for samples A and B. Based on the alignment results, we counted the uniquely mapped reads at each position for all exonic regions and sampled sets of foreground and background sequences for our training RNN models using the sequence sampling function in the seqBias (1.20.0) package. It will be noted that we used the seqBias function to ensure that subsequent comparisons of our method to alternatives are fair. However, we can use our own sampling method in our implementation; see Additional file 1: Figure S1. In total, we sampled 100 k sequences for each of the training foreground and background models. These sequences are also used for training the Bayesian network based sequencespecific bias correction method seqBias. The sequence length is 21bp by default, which covers the 10 nucleotides before and after the read startend positions in the reference genome. We filtered out any sequences near exon boundaries that are shorter than 21bp. We trained the foreground and background models in parallel, and estimated bias weights for the counted reads in exonic regions. The gene expression levels were estimated based on the reweighted read counts using the estimated bias weights.
RNN network scale
We trained two types of RNN models (LSTM and GRU) on the foreground and background sequences, respectively. The 100 k sequences are split into training/validation/test setsaccording to the proportions 90%:5%:5%. We used the character RNN language package^{1} to train our RNN models. The RNN parameters are tested for predefined parameter candidates and selected based on the minimum prediction error rate on validation data. As for the network scale, we use a “lightweight” RNN consisting of layer with the number of hidden nodes not more than 20 for each timestep. Based on trail and error, we found that increasing model complexity through the adding more layers and hidden nodes improves sequence modelling performance by lowering prediction error and perplexity (\(Perplexity(S)=2^{\frac {1}{N}\sum _{i=1}{N}log_{2}p(s_{i})}\), where s _{ i } is a sequence in the test set S and N is the number of sequences in S. The lower perplexity is, the better model is). But for the sequencespecific bias correction task, the performance of correlations (correlation score between qPCR and predicted values) did not improves accordingly and significantly, while at the same time adding more computational cost. A simpler model is preferable on the grounds that it is more likely to be robust enough to capture the difference between foreground and background sequences and be less sensitive to the overfitting problem. Therefore, we choose “lightweight” RNN structures in our experiments. Table 1 shows training times of the GRU and LSTM models for different numbers of hidden units. For the sequence modelling aspect, the perplexities of RNN models on the test set decrease as more hidden number units are included.
Bias correction effect of RNNbased methods
To access the bias correction effect, we used the TaqMan RTPCR data as the gold standard and evaluated the correlation between gene expression values (with/without bias correction) and corresponding nonzero RTPCR values. We evaluated logadjusted Pearson correlation, Spearman correlation, and r ^{2} values for the of goodnessoffit for evaluation. We compared the performance of our method with the stateoftheart seqBias method and the raw normalized counts.
As shown in Tables 2 and 3, both seqBias and RNNbased methods improve correlations between original predictions and RTPCR values. For both sample A and B, GRUbased bias correction models (GRUBC) yield results that are competitive with seqBias. In the evaluation with Pearson correlation and r ^{2} metrics, the GRUbased bias correction model with 10 hidden units performs slightly better, but this is not significant. The Pearson correlation between seqBias and GRU(4104) is 0.9978 (logscaled), and is 0.9989 between GRU(4104) and LSTM(4104), which indicates the three bias correction methods perform very similarly on the dataset.
Differences can also be observed between the RNNbased approaches and seqBias. First, RNNbased bias correction methods are more conservative to change original predictions when compared with seqBias. The Euclidean distances between gene expression predictions before and after bias correction are 0.028 for GRU(4104), 0.036 for LSTM(4104), and 0.137 for seqBias. We think this is on account of the simple structure of RNNs that we have used. In Additional file 1: Figure S2, we observe that the RNN model bias corrected predictions are not too far away from the original predictions, when compared with seqBias. In the bottomleft region of the figure, we observe more differences for less abundant genes. Second, the proportion of increased and decreased changes in gene expression is also different. seqBias is rightskewed on the dataset with 867 decreased changes and 55 increased changes. In the histograms of logfold changes in Additional file 1: Figure S3, the majority genes with decreased expression exhibit changes under zero, while for the RNNbased models, decreased changes and increased changes are approximately similar in a more balance proportion. Although which one is better can not be directly evaluated without knowledge of the true sequencespecific bias. That said, this result stands as an interesting difference between RNNbased bias correction methods and seqBias.
The sequencespecific bias of an RNAseq read is examined by the context genomic sequence surrounding its read startend. Usually, we make the length of the context window large enough to cover the positions that have a nonuniform positional nucleotide distribution. We extended the context window length from 21bp to 41bp on sample B and investigated whether longer context sequences effect bias correction results. As shown in Table 3, the performance of bias correction methods exhibit similar performance for both length scales. The LSTMBC acquires additional improvements with longer sequences used as training samples. As LSTM has more parameters than GRU, extending the the context genomic sequence length incorporates more training data, which explains the improvement of LSTMBC model. But the bias correction result does not improve significantly when compared with GRUBC in Table 3.
Discussion and conclusion
The biggest challenge in the sequence specific bias correction task is modelling the difference between foreground and background sequences. One simple approach is to describe the differences with a positional weight matrix that reflects the nucleotide distribution independently calculated at each position of the sequence. Although the sequence probability is easily calculated using a positional weight matrix, some complicate sequence patterns with differing foreground and background sequences might be ignored. On the other hand, directly counting whole sequences or enumerating long kmers is liable to make the data too sparse to fit a model with a large number of parameters. To solve this tradeoff, sequence structures are usually predetermined heuristically or in datadriven way. RNNs make for an attractive alternative because information about past history is automatically encoded in hidden units so that no explicit structures need be determined in advance.
Of course, uncovering potential bias patterns in foreground sequences is no easy task. Training an RNN model on foreground sequences is more difficult than on background sequences, as the training and validation errors are prone to be higher on foreground sequences for the same model setting in our experiments. Simply increasing RNN model complexity runs the risk of overfitting, which, in this context, means that the model learns unexpected overfitted pattern for both background and foreground sequences. To mitigate this risk we used RNN models in a “shallow” way that use fewer layer and neurons for each recurrent hidden unit. What is more, shallow structures make training and testing an RNN model more efficient for the bias correction task.
In summary, we propose an alternative approach to current that employs RNNs to solve the sequencespecific bias correction problem for RNAseq data. Our RNNbased model for bias weight estimation is lightweight and can be trained efficiently. The biggest advantage of this approach is that sequence structure is learned automatically during the training of RNNs. We conducted experiments on the MAQCIII dataset and achieved competitive results in comparison with the stateoftheart Bayesian network method seqBias.
Endnote
^{1} http://github.com/karpathy/charrnn
Abbreviations
 GRU:

Gated recurrent unit
 LSTM:

Long short term memory networks
 MAQC/SEQC:

MicroArray/sequencing quality control
 RNNs:

Recurrent neural networks
References
 1
Tan J, Hammond JH, Hogan DA, Greene CS. Adage analysis of publicly available gene expression data collections illuminates pseudomonas aeruginosahost interactions. bioRxiv. 2015. doi:10.1101/030650. http://biorxiv.org/content/early/2015/11/05/030650.full.pdf.
 2
Chen Y, Li Y, Narayan R, Subramanian A, Xie X. Gene expression inference with deep learning. Bioinformatics. 2016; 32(12):1832–1839. doi:10.1093/bioinformatics/btw074. http://bioinformatics.oxfordjournals.org/content/32/12/1832.abstract.
 3
Leung MK, Xiong HY, Lee LJ, Frey BJ. Deep learning of the tissueregulated splicing code. Bioinformatics. 2014; 30(12):i121–i129.
 4
Lee B, Lee T, Na B, Yoon S. DNAlevel splice junction prediction using deep recurrent neural networks. CoRR. 2015;abs/1512.05135. http://arxiv.org/abs/1512.05135.
 5
Kelley DR, Snoek J, Rinn J. Basset: Learning the regulatory code of the accessible genome with deep convolutional neural networks. bioRxiv. 2015. doi:10.1101/028399. http://biorxiv.org/content/early/2015/10/05/028399.
 6
Quang D, Xie X. DanQ: a hybrid convolutional and recurrent deep neural network for quantifying the function of DNA sequences. Nucleic Acids Res. 2016;:gkw226.
 7
Alipanahi B, Delong A, Weirauch MT, Frey BJ. Predicting the sequence specificities of DNAand RNAbinding proteins by deep learning. Nat Biotechnol. 2015; 33(8):831–8.
 8
Zhou J, Troyanskaya OG. Predicting effects of noncoding variants with deep learningbased sequence model. Nat Methods. 2015; 12(10):931–4.
 9
Sønderby SK, Winther O. Protein secondary structure prediction with long short term memory networks. arXiv preprint arXiv:1412.7828. 2014. https://arxiv.org/abs/1412.7828.
 10
Hansen KD, Brenner SE, Dudoit S. Biases in illumina transcriptome sequencing caused by random hexamer priming. Nucleic Acids Res. 2010; 38(12):e131–e131.
 11
Roberts A, Trapnell C, Donaghey J, Rinn JL, Pachter L, et al. Improving rnaseq expression estimates by correcting for fragment bias. Genome Biol. 2011; 12(3):R22.
 12
Jones DC, Ruzzo WL, Peng X, Katze MG. A new approach to bias correction in rnaseq. Bioinformatics. 2012; 28(7):921–8.
 13
Hochreiter S, Bengio Y, Frasconi P, Schmidhuber J. Gradient flow in recurrent nets: the difficulty of learning longterm dependencies. A field guide to dynamical recurrent neural networks. IEEE Press; 2001.
 14
Hochreiter S, Schmidhuber J. Long shortterm memory. Neural Comput. 1997; 9(8):1735–80.
 15
Cho K, van Merriënboer B, Bahdanau D, Bengio Y. On the properties of neural machine translation: Encoderdecoder approaches. arXiv preprint arXiv:1409.1259. 2014. https://arxiv.org/abs/1409.1259.
 16
Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B. Mapping and quantifying mammalian transcriptomes by rnaseq. Nat Methods. 2008; 5(7):621–8.
 17
Seqc/MaqcIii Consortium, et al. A comprehensive assessment of rnaseq accuracy, reproducibility and information content by the sequencing quality control consortium. Nat Biotechnol. 2014; 32(9):903–14.
 18
Dobin A, Davis CA, Schlesinger F, Drenkow J, Zaleski C, Jha S, Batut P, Chaisson M, Gingeras TR. Star: ultrafast universal rnaseq aligner. Bioinformatics. 2013; 29(1):15–21.
Acknowledgments
We are grateful to the reviewers for their valuable comments on this work. We also thank Paul Sheridan for his help proofreading and revising this manuscript.
Declarations
This article has been published as part of BMC Genomics Volume 18 Supplement 1, 2016: Proceedings of the 27th International Conference on Genome Informatics: genomics. The full contents of the supplement are available online at http://bmcgenomics.biomedcentral.com/articles/supplements/volume18supplement1.
Funding
The publication costs for this article were funded by the project of conquering cancer through neodemential systems understanding (GrantinAid for Scientific Research on Innovative Areas from the Ministry of Education, Culture, Sports, Science and Technology, Japan).
Availability of data and materials
Raw RNAseq data of sample A and B from MAQCIII dataset are accessible through GEO series accession number GSE47774.
Authors’ contributions
The work presented here was carried out in collaboration between all authors. RY, SI and SM directed the research project. YZ conceived the study. RY accumulated and organized the source data. YZ performed the experiments. YZ drafted different sections of the manuscript. All authors read and approved the final manuscript.
Competing interests
The authors declare that they have no competing interests.
Consent for publication
Not applicable.
Ethics approval and consent to participate
Not applicable.
Author information
Additional information
From The 27th International Conference on Genome Informatics Shanghai, China.35 October 2016
Additional file
Additional file 1
Supplement Figures. (PDF 577 kb)
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License(http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Published
DOI
Keywords
 RNAseq
 Recurrent neural network
 Sequencespecific bias
 Gene expression analysis