Sequence data submission
[EMBL:E-ERAD-244, EMBL:E_ERAD-121, EMBL:E-ERAD-91, EMBL:E-ERAD-384].
Sample collection
Breeding zebrafish (Danio rerio) were maintained at 28.5 °C on 14h light/10h dark cycle. Fertilised eggs were obtained from pairs of heterozygous fish carrying nonsense mutations in transcripts of specific genes [16] by natural spawning. They were then grown in incubators at 28 °C (except samples prefixed with zmp_ph9 which were grown at 32 °C), separated into morphologically abnormal and morphologically normal sibling embryos at the correct developmental stage and snap frozen in dry ice or liquid nitrogen. Over the course of the protocol development we have refined the RNA extraction protocol to allow increased sample throughput. RNA was extracted for samples prefixed lamc1
sa379, mdn1
sa1349, zmp_ph35 and zmp_ph40 by lysis in Trizol (Invitrogen). The lysate was mixed with 0.2 volumes of chloroform and processed in Phase Lock Gel heavy 2 ml tubes (5 Prime) according to the manufacturer’s instructions. For lamc1
sa379 and mdn1
sa1349 the aqueous phase was transferred to an RNase free 1.5ml tube (Ambion) and precipitated using 0.5 ml isopropyl alcohol per 1 ml Trizol reagent used. The samples were then spun at 2–8 °C for 10 min at 12,000 rpm. The supernatant was discarded and the pellet was washed twice with 75 % ethanol. The pellet was then dried at room temperature for 10 min and dissolved in 30 μl of RNAse free water. RNA was quantified using a Nanodrop. For samples zmp_ph35 and zmp_ph40 the aqueous phase from the phase lock column was collected, mixed with equal volumes of 70 % ethanol and applied to an RNeasy MinElute column (Qiagen). The columns were spun for 15 s at 8000 x g at room temperature in a centrifuge. The columns were then washed with 500 μl of RPE buffer (Qiagen) followed by 500 μl of 80 % ethanol. After drying the columns at full speed in a centrifuge, RNA was eluted from them in 16 μl of nuclease free water. RNA was quantified using Qubit RNA HS assay (Invitrogen). For samples zmp_ph45 and zmp_ph46 the embryos were lysed in 100 μl of RLT buffer (Qiagen) containing 1 μl of 14.3M beta mercaptoethanol (Sigma) in 2ml RNAse free tubes. The lysate was mixed with 1.8 volumes of Agencourt RNAClean XP (Beckman Coulter) beads, mixed by pipetting and allowed to bind for 10 min. The tubes were then applied to a magnetic rack (Invitrogen) until the solution turned clear and the supernatant was removed without disturbing the beads. While still on the magnet, the beads were washed three times with 70 % ethanol and air dried for 10 mins. The beads were re-suspended in 50 μl of RNAse free water by pipetting. The RNA was then eluted from the beads by applying the tubes to the magnetic rack. RNA was quantified using Qubit RNA HS assay (Invitrogen).
Library preparation and sequencing
Libraries for zmp_ph35, zmp_ph40, zmp_ph45 and zmp_ph46 were made in 96 well plates. DNA was removed from 300 ng of total RNA by treatment with 2 units of RNase-Free DNase I (NEB) in 100 μl reaction using the manufacturer’s buffer for 10 min at 37 °C and heated to 75 °C for 90 min to fragment the RNA. For each library 12 μl of Streptavidin magnetic beads were washed in 1x wash/binding buffer (20 mM Tris pH 7.5, 0.5 M NaCl, 1 mM EDTA) and 1 μl of 10 μM biotinylated polyT primer (B-TAATGCGGCCGCABCBTBTCAGTCTTTTTTTTTTTTTTVN (note the sequence 5′ of the polyT including the NotI site is not required for this protocol and an anchor polyT30 will suffice) was added. The primer was bound with rotation for 5 min at room temperature and the beads washed. After adding an equal volume of 2x wash/binding buffer and 40 units of RNase Inhibitor (NEB) the cold RNA was added and allowed to bind for 20 min at room temperature with rotation. The beads were washed twice in 1x wash/binding buffer, once in cold low salt buffer (0.15 M NaCl, 20 mM Tris-HCl pH 7.5) and suspended in water. The RNA was phosphorylated with 1 unit of T4 Polynucleotide Kinase (3′ phosphatase minus) (NEB) for 30 min at 37 °C with 40 units of RNase Inhibitor and the RNA oligo stRSSA4 (5′ Am-CUCGGCAUUCCUGCUGAACCGCUCUUCCGAUCU; all Illumina adapter sequences are from [34]) ligated with 20 units of T4 RNA ligase (NEB) in the presence of 20 % PEG 8000 (Promega) for 120 min at 37 °C. After adding an equal volume of 2x wash/binding buffer and incubating at room temperature for 2 min the beads were washed twice in 1x wash/binding buffer, once in cold low salt buffer and re-suspended in water. The RNA was eluted from the beads by heating at 80 °C for 2 min and separated on a magnet. One of 96 indexed primers (8mer_SC_TC 1 to 96 - ACACTCTTTCCCTACACGACGCTCTTCCGATCTNNNNBBBBNNNNXXXXXXXXCGTTTTTTTTTTTTTTVN - generic primer where X represents an 8 base index as described in Additional file 1, N is A, C, G or T, B is C, G or T and V is A, C or G) was added to each RNA sample (1 μl of 10 μM), then heated to 70 °C and snap chilled on ice. Reverse transcription was performed using SuperScript II (Invitrogen) in the presence of 40 units of RNase Inhibitor according to the manufacturer’s instructions, followed by the addition of 1 unit of Exo1 (NEB), incubated at 37 °C for 30 min and then 80 °C for 20 min and finally cleaned with the QIAgen PCR clean-up kit. Libraries were amplified to complete the Illumina adapter sequence using SAPCRS. 1 (5′-AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGA-3′) and SAPCRS. 2 (5′-CAAGCAGAAGACGGCATACGAGATCGGTCTCGGCATTCCTGCTGAAC-3′) in a 50 μl reaction containing 35 μl of library, 5 μl of 10X KOD buffer, 5 μl of 2mM dNTPs, 2 μl MgSO4 , 2 μl of 10 μM primer mix and 1 μl of KOD HOT start polymerase (Novagen) by incubating in a pre-heated DNA Engine Tetrad (MJ Research) at 94 °C for 2 min, then 94 °C for 15 s, 60 °C for 30 s and 68 °C for 3 min for 20 cycles and finishing with 68 °C for 5 min. Libraries were cleaned with the QIAgen PCR clean-up kit, quantified using a BioPhotometer (Eppendorf), mixed in equimolar quantities, size selected with Spri beads for an insert size of 70-270 bases and quantified by qPCR. Sequencing was performed on an Illumina HiSeq 2500. The lamc1
sa379 and mdn1
sa1349 libraries were made in 1.5 ml RNase free tubes (Ambion) using an earlier protocol which is the same as that described above for zmp_ph35 except 5 μg of total RNA from pools of embryos were used for each library, total RNA was treated with DNase for 10 min at 37 °C followed by inactivation at 75 °C for 10 min in EDTA, then ethanol precipitated, fragmented with Ambion fragmentation reagent for 5 min at 70 °C, pulled down with 62.5 μl of streptavidin beads, the reverse transcription primer contained the sequence NNNNB instead of NNNNBBBBNNNN, there was no ExoI digestion step and only 15 cycles of amplification. The libraries were sequenced on HiSeq 2000.
DeTCT analysis pipeline
The source code for the DeTCT pipeline is available from DeTCT github [35]. Prior to running the DeTCT pipeline, the sequencing reads were processed with the detag_fastq.pl script, which trimmed reads to improve quality and rejected read pairs where the first read of the pair did not begin with the unique molecular identifier (UMI), followed by a sample specific index sequence and polyT. These sequences were removed from the read, and the index and UMI were added to the read name. The reads were aligned to the Zv9 zebrafish reference genome [36] with BWA 0.5.10 [37] and converted to BAM format with SAMtools [38]. The resulting BAM files were processed with Picard MarkDuplicates [26] to fix mate information and add read groups. Duplicate reads were identified using a modified version of Picard’s MarkDuplicates called Picard-detct [26, 27], which took into account the UMI in the read name. The final BAM files were used as input for the DeTCT pipeline. In the first stage, an HMM-based peak caller, HPeak [39], was used to identify regions where read 2 of each read pair was aligned. All the second reads in an experiment were put into 100 bp bins, with duplicate reads and reads with more than 2 mismatches being ignored, and these bins were used as input to HPeak. The output is the probability that each bin represents a peak, with adjacent bins being merged to create regions. For each region, all the read 2s aligned in that region were identified and then the read 1 associated with each read 2 was determined. Due to the library construction method, the alignment of read 1 marks the 3′ end of a transcript. Transcript counting 3′ ends (TC 3′ ends) were ignored if they were supported by fewer than 3 reads or if the 10 bp sequence downstream was significantly enriched in A bases (4 As at the start or more than 6 As in total or matching one of the following empirically determined patterns: AAABAAABBB, AAABAABABB, AAABABAABB, AABAAAABBB, AABAAABABB, AABABAAABB, ABAAAAABBB, ABAAAABABB, ABAAABAABB, ABAABAAABB, ABABAAAABB, AABAABAABB). The TC 3′ end with the highest read count was associated with each region. Finally, the number of read 2s aligned in each region was determined for each sample and these counts were used for differential expression analysis using DESeq2 [28]. All regions with a TC 3′ end were associated with Ensembl gene annotation based on the nearest transcript in the appropriate direction on the correct strand. The final output was a table (in CSV, TSV or HTML formats) containing region coordinates, associated TC 3′ end coordinates and read counts, differential expression p-value and adjusted p-value, gene and transcript annotation, distance of TC 3′ end to nearest Ensembl 3′ end, count data, normalised count data and log2 fold changes.
Technical replicate
Twelve transcript counting libraries were prepared using 1 μg of total RNA extracted from a pool of zebrafish embryos. As recommended by the manufacturer for 1 μg of total RNA a 1:100 dilution of spike mix 1 (Ambion) was made and 2 μl added to the three x1 samples. We added 5 times this quantity to the x5, one fifth to the x0.2 and one tenth to the x0.1. Sequencing was performed on an Illumina MiSeq. The reads were aligned to the Zv9 reference genome for the technical replicate and to the Zv9 reference genome including the spike reference sequences for the differential abundance test. For the technical replicate the duplicate-flagged sequence was passed through the DeTCT pipeline, the results filtered as described in additional file 5 and the resulting normalised count data compared by calculating Pearson’s product moment correlation coefficient using R’s cor.test function. For the differential abundance test the DeTCT pipeline was run on all 12 samples to get count data in triplicate for the four conditions (x5, x1, x0.2, x0.1). Then all six pairwise comparisons of the four conditions were run from the DESeq2 step onwards with the relevant six libraries. The mean of the log2 fold change of the spikes was calculated and compared to the expected log2 fold change.
Sample number
The 11 mutant and 11 wild-type zmp_ph46 BAM files were downsampled to 6 million read pairs each using the downsampling pipeline of the DeTCT pipeline [35] and then merged using Picard [26] to create two BAM files, one containing 66 million read pairs from the 11 mutant samples and one containing 66 million read pairs from the 11 wild-type samples. The real index sequences in each BAM file were then replaced with fake index sequences using Pseudo bam files [40] in order to simulate assigning the reads at random to between 2 to 11 different pseudo-samples. Each of these 10 permutations was repeated 10 times, to give 100 simulations in total. Each simulation was then run through the DeTCT pipeline.
Comparison of RNA-seq and transcript counting
For two different knockout alleles we made six non-directional TruSeq PE Cluster Kit v3 RNA-seq and six TC libraries from three wild-type and three mutant zebrafish total RNA samples, plus two or six additional libraries using the TC protocol (Table 1). Libraries were sequenced using paired-end 75 bp Illumina HiSeq 2000 systems, with reads trimmed to 54 bp using the DeTCT pipeline or the FASTX-Toolkit. Read 2s were mapped to the Zv9 reference genome with BWA 0.5.10 (row 1). Duplicate reads were identified (row 2) using the modified version of Picard’s MarkDuplicates called Picard-detct [26, 27]. RNA-seq read counts for read 2 were obtained with htseq-count using the Ensembl 74 annotation, whilst the DeTCT pipeline was used to extract TC read counts (row 3). Counts mapped to the mitochondrial genome were excluded from further analysis (row 4). The proximity of the Transcript counting 3′ end (TC 3′ end) and an Ensembl transcript were filtered at a high stringency (between -100 and +100 bases) or low stringency (between -100 and +5000 bases) (row 5). For the RNA-seq analysis all counts match a gene (rows 6 and 8), whereas in the TC analysis only counts where the TC 3′ end is associated with an Ensembl gene (rows 6 and 8) or transcripts (rows 8 and 10) are used and the remainder represent un-annotated genes, alternative 3′ ends or experimental artefact. In order to ensure a one to one correspondence between RNA-seq genes and TC genes, 28 Ensembl v74 transcripts thought to be fallacious were added to a blacklist [41]. Transcripts on the blacklist were not used to assign TC 3′ ends to Ensembl genes, ensuring a single gene was not annotated to different TC 3′ ends. In addition one RNA-seq read matching an exon in a gene is sufficient to call a gene, whereas multiple reads are required to call a TC 3′ end in DeTCT. DESeq2 was run on both sets of count data. Only protein-coding genes where a p-value was identified by DESeq2 were considered further (row 10) and from these a subset was also awarded an adjusted p-value (row 11). Genes showing differential transcript abundance between mutant and wild type with an adjusted p-value <= 0.05 (row 12) were further filtered for those with a fold change >= 2 (row 13). Before the comparison of the RNA-seq and TC methods the number of protein-coding genes with an adjusted p-value called by both methods (row 15) and the number showing an adjusted p-value <= 0.05 plus a fold change >= 2 (row 16) were identified.
Ethical statement
Zebrafish were maintained in accordance with UK Home Office regulations, UK Animals (Scientific Procedures) Act 1986, under project licence 70/7606, which was reviewed by the Wellcome Trust Sanger Institute Ethical Review Committee.