NanoCAGE-XL and CapFilter: an approach to genome wide identification of high confidence transcription start sites
© Cumbie et al. 2015
Received: 22 January 2015
Accepted: 29 May 2015
Published: 13 August 2015
Identifying the transcription start sites (TSS) of genes is essential for characterizing promoter regions. Several protocols have been developed to capture the 5′ end of transcripts via Cap Analysis of Gene Expression (CAGE) or linker-ligation strategies such as Paired-End Analysis of Transcription Start Sites (PEAT), but often require large amounts of tissue. More recently, nanoCAGE was developed for sequencing on the Illumina GAIIx to overcome these difficulties.
Here we present the first publicly available adaptation of nanoCAGE for sequencing on recent ultra-high throughput platforms such as Illumina HiSeq-2000, and CapFilter, a computational pipeline that greatly increases confidence in TSS identification. We report excellent gene coverage, reproducibility, and precision in transcription start site discovery for samples from Arabidopsis thaliana roots.
nanoCAGE-XL together with CapFilter allows for genome wide identification of high confidence transcription start sites in large eukaryotic genomes.
KeywordsTranscription start site (TSS) Promoter nanoCAGE Capped analysis of gene expression
Accurate identification of transcription start sites (TSS) for RNA polymerase II (pol-II) genes is critical for determining promoter location, which in turn facilitates accurate determination of functional sequence control elements [1–3]. While standard RNA-Seq methodologies can provide some insight into the nature of full-length transcripts, a heavy 3′ bias in data outcomes must be addressed with a 5′ cap-trapping strategy. Several overall strategies have been published for application to animal tissues and cell lines; these are well-characterized by three protocols. The Paired-End Analysis of Transcription start sites (PEAT) protocol  enriches for capped transcripts in two steps: an initial dephosphorylation of uncapped transcripts, followed by the ligation of a short “tag” sequence to the 5′ ends of capped transcripts. The Cap Analysis of Gene Expression (CAGE) method  actively “traps” the 5′ N7-Methylguanosine-triphosphate (7mG-p-p-p-N) modification common to all pol-II generated transcripts, known as the “cap”, with streptavidin beads. Both PEAT and CAGE have been widely employed in animal studies [4, 6, 7], and recently we have successfully applied the PEAT strategy to plant tissues . The nanoCAGE protocol aims to reduce the required amount of total RNA from the 50 to 150 μg necessary with PEAT and CAGE to the level of nanograms by using a combination of template switching and semi-suppressive PCR, and has been reported to have a level of sensitivity 1000 times higher than that of CAGE [8, 9].
In contrast to PEAT and CAGE, nanoCAGE does not use restriction enzymes or linker ligation, and is therefore not influenced by restriction enzyme inefficiency, competition for naturally occurring restriction sites, or sequence biases resulting from ligation inefficiency of the 5′ linker. Additionally, nanoCAGE, like PEAT and CAGE, uses random primers as opposed to polyA priming, increasing the chance of 5′ end capture in the case of long transcripts. Challenges remain for broad use in current applications to plant and animal organisms with tens of thousands of genes, as the nanoCAGE protocol was developed for the Illumina GAIIx platform and it is reported to yield approximately one million raw reads per library for multiplexed libraries [8, 9]. NanoCAGE also generates false TSSs due to premature template switching before reaching the very 5′ end of transcripts . While it has not yet been widely applied in published data outcomes, the strategy has demonstrated strong potential for reducing the quantity of input total RNA if sequencing depth and template-switching artifacts can be addressed. Currently, one published study  has used the nanoCAGE protocol [8, 9] on the Illumina HiSeq-2000, but little was presented on the methodological differences as the sequencing was done commercially.
In this study using RNA from Arabidopsis roots we present the first publicly available adaptation of nanoCAGE for sequencing libraries on the powerful Illumina HiSeq-2000 platform, thus increasing sequencing depth and achieving excellent genome coverage. These samples are ideal for protocol development, as they present a challenging case where both rRNA removal and sequencing depth are critical for appropriate data coverage of a large eukaryotic genome. We develop and analyze a series of important protocol changes, and provide information on barcode effects on library performance. Additionally, we develop an annotation-free computational filter based on identification of TSS peaks (read clusters) derived from capped mRNAs that greatly increases confidence in TSS discovery. Importantly, this computational filter does not require removal of sequenced reads in pre-processing, and therefore allows the end-user to balance sensitivity with precision based on experimental needs. We provide a complete experimental protocol [Additional file 1], an analysis of gene coverage and precision of TSS identification, and description of the computational procedure with software implementation for the identification and removal of false TSSs.
Linker sequence and template input in nanoCAGE introduce important trade-offs in sequencing outcome
Summary of sequencing and alignment results
Number of Raw Reads (×106)
Uniquely Mapped Readsc (×106)
% rRNA Readsd
Average # of Genes with TSS Peaks (×103)f
Average # of Peaks per Geneg
ATCGTG GCTATA GGG
GATCGA GCTATA GGG
TCGAGC GCTATA GGG
In experiment 3, we adapted the Illumina TruSeq barcodes and used them as per Illumina’s TruSeq sample pooling guide . We selected barcodes containing as few “G”s as possible in order to avoid barcode biases caused by strand invasion , and pooled six different libraries to increase sequence diversity and address the low mappability observed in experiment 2. The linker was removed in order to evaluate whether using Illumina-recommended barcodes would improve read mappability. As expected, read mappability improved: a slight increase in percent of mapped raw reads was observed as compared to experiment 1, and a four to five-fold increase as compared to experiment 2 (Table 1). Experiment 3 had a comparable number of mapped genes to experiment 2 (Fig. 1).
While high read mappability is desirable, a representative genomic read distribution is also essential. If too many reads are generated from strand invasion or other artifacts (i.e., noise), then the number of true TSS reads needed to produce identifiable TSS peaks (read clusters) will not be sufficiently large even with high read mappability. To further explore this relationship, we used a peak calling algorithm developed for TSS-Seq reads  to analyze the distribution of reads along the length of annotated genes without filtering any of the peaks called. In experiment 1, we found that because of greater sequencing depth, the number of genes with an identifiable peak within 500 bases of the TAIR10  annotated TSS or in the 5′ untranslated region (5′ UTR) was slightly greater than with libraries that were multiplexed: 15,000 genes for experiment 1, as compared to 13,000 and 11,000 genes on average for experiments 2 and 3 respectively (data not shown). Additionally, in all experiments we found an inconsistent average number of peaks per gene prior to peak filtering: 1.9 for experiment 1, 1.2-1.5 for experiment 2, and 1.5-1.7 for experiment 3. To assess the source of this variation in average peak number, we plotted the read distribution along the normalized length of a gene within each library (Fig. 1). We found that those libraries which incorporate a linker sequence (experiment 2) and utilize more input template in their preparation display a much greater bias for reads towards the 5′ ends of genes. At the same time, while these libraries with more template have a 5′ read bias, they also have fewer peak calls at the 5′ ends of genes. This is one indicator that fewer false positives are identified in these libraries, which is confirmed in further analyses discussed in the “CapFilter” section below. Overall, our analysis illustrates that inclusion of the low-“G” linker sequence and use of more template together served to reduce the number of reads resulting from strand invasion and other artifacts.
As expected, the overall sequencing depth in experiment 1, which sequenced one library in one lane, was greater than in experiments 2 and 3 based on the total number of genes with mappable reads (Fig. 1). However, when peaks were filtered in order to identify the most confident TSS peaks, we found that libraries in experiment 2, which utilized the low-“G” linker plus the optimal template input, performed best. On average, we identified 10,000 - 11,000 genes with a high confidence TSS peak per library in experiments 1 and 3, as compared to 13,000 genes in experiment 2 (Table 1). Experiment 3 libraries, which were prepared with less mRNA input and had higher rRNA content in the template due to the presence of total RNA, also had higher rRNA contamination in the library (16 % to 27 %), apparently contributing to a reduction of the number of productive reads and an increase in read redundancy. This is in contrast to much lower contamination level overall for experiments 1 and 2 (0.29 % and 1.93 - 2.08 % respectively) (Table 1).
Cap signature identifies high confidence TSS peaks
Because of this cap signature, namely the presence of unencoded 5′ “G”s, many reads derived from capped mRNA may fail to align to the genome, potentially resulting in a loss of the most informative reads. In agreement with this expectation, for all reads that began with a “G”, in ~37 % of these cases the “G” was unencoded. This is in contrast to reads beginning with any other base where collectively an average of ~4 % of those cases began with unencoded instance of that base [Additional file 2: Figure S1 and Additional file 3: Table S1]. This average dropped to ~2 % for all libraries at the second nucleotide, regardless of base [Additional file 2: Figure S1 and Additional file 3: Table S1]. This indicated that a “C” complementary to the 5′ cap was frequently incorporated and converted into a “G” during the second strand synthesis in reads derived from capped mRNA, with additional nucleotides much less likely to be incorporated. The percentage of unencoded “G”s was on average higher for the libraries in experiment 2 (~66 %) as compared to experiments 1 and 3 combined (~24 %) [Additional file 2: Figure S1 and Additional file 3: Table S1], suggesting that using the low-“G” linker in addition to having a more optimal mRNA template input increases the chances of the TS-oligo to anneal to the cap-derived cytosine (see Fig. 1). To overcome the potentially lower mappability of reads derived from capped mRNAs, we removed the first base from each read prior to alignment, while keeping track of the removed base. When examining read alignments relative to their position along the length of genes, a bias was observed toward reads mapping to the 5′-most portion of the gene regardless of their beginning base (Fig. 3d). This was likely due to the TS oligo annealing directly to the cap-derived cytosine (Fig. 3a). However, the bias for reads starting with a “G”, either encoded or unencoded, was remarkably pronounced within the 5′-most portion of the gene. Thus, reads beginning with this “cap signature” display this strong “G”-bias, and are more likely to be true TSSs. This observation allows for the development of filtering approaches for the elimination of potential false positives and other background noise in downstream analyses.
CapFilter: a software program to provide G’-filtering of peaks allows for annotation-free TSS identification
As an additional measure for testing the precision of our G’-filtering approach, we generated sequence logos using Web Logo 3  from the nucleotides surrounding the peak mode, the genome coordinate most commonly sequenced in a peak, to identify enriched nucleotides at the most-preferred TSS positions. Previous research has shown that in plants there is a preference for a pyrimidine (Y: C or T) at the −1 nucleotide, i.e., the nucleotide immediately upstream of the TSS, and a purine (R: A or G) at the +1 position, which is referred to as the “YR rule” . We found that prior to G’-filtering, the +1 and +2 nucleotides were enriched primarily for a “GG” motif (Fig. 5b). This occurred even when using the in silico filter for strand invasion artifacts developed in , indicating that the presence of “C”-rich regions alone in mRNA without additional complementarity to the TS oligo may be sufficient for promoting strand invasion. After G’-filtering, the primary bases enriched at the −1 and +1 nucleotide positions were “T,C” and “A,G”, respectively, consistent with the “YR rule” , indicating that G’-filtering was indeed capturing peak modes that were high confidence TSSs in our data set. In summary we find that the use of CapFilter greatly enriches for peaks composed predominantly of true TSSs. These CapFilter results include a non-negligible portion of low-coverage peaks (10–20 reads), highlighting its ability to capture true TSS peaks regardless of coverage.
NanoCAGE-XL is highly reproducible after G’-filtering
Reproducibility is an important metric for the quality of TSS-seq. To address the reproducibility of our nanoCAGE-XL data, we examined the correlation in read coverage for individual peaks, the consistency between TSS peak modes (the position most commonly sequenced in a read cluster) of biological replicates, and the consistency between nanoCAGE-XL peak modes and those in a PEAT data set generated using the same sample type .
For the read coverage correlation analysis, we used a similar approach to that in  to perform pair-wise comparisons for all of our data sets [Additional file 2: Figure S1 and Additional file 5: Table S3]. Briefly, the number of reads per million (RPM) was calculated for each peak in one library of our data set, and this number was compared to the RPM of the corresponding peak in a separate library using a Spearman’s rank correlation analysis. Peaks were considered to be corresponding based on the overlap of their start/end genomic coordinates. The average Spearman’s rho of 0.67 showed a moderate level of positive correlation, a value in agreement with previous levels of correlation reported for nanoCAGE . This value was even greater for comparisons between libraries using the common linker sequence (average Spearman’s rho of 0.88) [Additional file 2: Figure S1 and Additional file 5: Table S3], indicating that the common linker reduced biases based on barcode difference, consistent with results reported in .
Peak mode analysis
Correct identification of where transcription begins is a critical factor in studying both the process of transcription itself as well as its regulation. For example, due to the degenerate nature of transcription factor binding sites, identifying their precise genomic locations is especially difficult without first correctly identifying their corresponding TSSs [2, 3]. Our previous work has demonstrated the feasibility of this task using bioinformatically selected PEAT data sets combined with logistic regression models that incorporate spatial sequence information for well characterized transcription factors and their respective binding sites [2, 3]. However, the use of PEAT TSS-Seq data precludes the analysis of samples where limited tissue is available due to the high RNA input requirement for this protocol. Additionally, the PEAT protocol removes the 5′ cap before library construction, which prevents noise filtering based on the presence or absence of a cap signature. NanoCAGE dramatically reduces the amount of input RNA needed to interrogate transcription start sites genome wide, although this comes at the expense of low sequencing depth and genome coverage [8, 9]. We were able to address these caveats in the nanoCAGE-XL protocol, and to show that nanoCAGE-XL is appropriate even for plant samples, a notorious technical challenge due to high rRNA content. All raw sequencing data and TSS peak calls are publicly available in the Short Read Archive (SRA) . By upscaling library preparation, using rRNA-depleted templates, and selecting appropriate barcodes, we were able to prepare libraries suitable for sequencing on Illumina’s HiSeq-2000 platform while achieving excellent genome coverage in Arabidopsis root samples. Commensurate with RNA-Seq studies  and microarray studies  in developing whole Arabidopsis root samples, as summarized in , our nanoCAGE-XL libraries covered ~20,000 of the ~27,000 total protein coding genes in the Arabidopsis genome; over 16,000 of these genes have a high confidence TSS peak. Importantly, using the cap signature common in nanoCAGE reads that are derived from capped mRNAs allowed us to develop a new bioinformatic filtering methodology for the identification of true TSSs.
We found that considerations related to library barcoding must be taken into account when preparing nanoCAGE libraries for HiSeq-2000. In the non-barcoded libraries, the presence of a 3G-tail at the start of each read generates low read diversity that is detrimental for HiSeq-2000 sequencing, as compared to sequencing on the older GAIIx platform originally used by nanoCAGE . Introducing inline barcodes, increasing the number of differentially barcoded libraries per lane, and using Illumina-recommended barcode combinations overcomes this problem and increases the number of mappable reads generated. A potential further improvement is to use indexing that incorporates the barcode sequences at the end of reads and requires a separate sequencing step, which has recently been introduced in a number of Hi-Seq 2000 applications. However, such an approach would also require the design of a custom sequencing primer that would anneal over the 3G-tail of the TS oligo. Generally, primers with high G/C content at their 3′ end do not perform well in PCR-based applications. Therefore, such an approach has not yet been tested.
During the RT step employed in generating first strand cDNA in nanoCAGE, two events occur: 1) the RT enzyme incorporates a cytosine nucleotide complementary to the 5′ cap of capped mRNA , and 2) one to two additional cytosine bases are commonly added beyond the 5′ cap by the enzyme’s template-free activity [13, 23–25]. In both nanoCAGE and CAGE, the unencoded “C”s are converted into “G”s in the final sequenced reads, creating a “cap signature” [9, 14–16]. This is in contrast to the PEAT protocol, which removes the cap prior to library construction [3, 4]. In agreement, we found that nearly 40 % of our raw reads started with a “G” that did not match the reference genome. While a small portion of this unencoded “G” bias may arise from sequencing error, the fact that the remaining portion of the same reads matched the reference genome suggests that the majority of these “G”s represent the “cap signature”. We took advantage of this easily identifiable marker that distinguishes reads generated from capped mRNA products to develop a computational approach that selects for true TSS peaks. While it may be intuitive to simply select for reads starting with an unencoded “G” and disregard all other reads, we found that a large number of peaks located near the 5′ ends of TAIR-annotated genes also had a high proportion of reads that began with encoded “G”s. For these encoded “G”s it is impossible to determine whether they are cap-derived or instead arise from artifacts such as strand invasion. Therefore, instead of selecting for reads beginning with unencoded “G”s, we based our filtering approach on selecting for TSS peaks that are marked with a high proportion of reads starting with unencoded “G”s.
We also investigated a threshold that could be used to efficiently filter for high-confidence TSSs by examining the proportion of reads with unencoded “G”s which mapped within annotated genes. We determined that a cutoff of 50 % of such reads within a peak was stringent enough to identify high confidence TSS peaks and was sufficient to eliminate artifactual peaks. Using this cap-signature filter is especially effective because 1) it does not require a priori genome annotation, and 2) it applies to all peaks, regardless of coverage or position along a gene’s sequence. This second point is especially important since some of the TSS peaks identified within our data sets after G’-filtering, as well as those reported for PEAT samples  or other organisms as diverse as human [14, 16], Drosophila  and zebrafish , map to unexpected gene regions (e.g., the 3′ UTR or the coding sequence) or outside of genic regions entirely. In all of these cases, the presence of novel or unexpected transcription starting sites is an open question. The simple procedure of G’-filtering that we report here should provide an unbiased approach for examination and filtering of genuine TSSs in many organisms as long as the experimental protocol preserves the cap signature, as in nanoCAGE  and CAGE . The stringency of this filter can be modified depending on experimental needs. For example, stringency can be lowered to allow for increased gene coverage or higher representation of non-canonical TSS locations, bearing in mind the trade-offs with increased false positive risk.
In this paper we introduced the first publicly available protocol adapting nanoCAGE for the HiSeq-2000 sequencing platform, making TSS sequencing of low input samples practical where significant depth of coverage is required. Using CapFilter, we were able to demonstrate that the reproducibility of nanoCAGE-XL TSS peak calls was very high, with identical peak mode positions found for a substantial portion of all peaks. All cases of non-identical peak mode positions fell within a short distance of each other, indicating that nanoCAGE-XL with CapFilter achieves truly nucleotide level resolution for identified TSS peaks genome wide. Not only is this method precise and internally reproducible, but peak comparisons show a robust cross-platform reproducibility with the PEAT protocol for peaks with moderate to high coverage, although PEAT does not allow for G’-filtering. For loci with low coverage, our analysis suggests that CapFilter provides an advantage in detecting peaks predominantly composed of capped transcripts. When combining nanoCAGE-XL libraries within a given experiment, we find that nanoCAGE-XL reaches saturation at a higher number of genes than PEAT, highlighting the possibility for greater coverage with nanoCAGE-XL for low input samples.
Plant material and growth conditions
Roots of 9 day-old Col-0 plants grown on vertically oriented agar plates were used in all experiments. Seeds were surface-sterilized in 20 % commercial bleach for 30 min, and rinsed four times for 10 min with sterile water. Sterilized seeds were planted on media containing 1X Murashige and Skoog basal medium with vitamins , 1 % sucrose, 10 mM MES buffer pH 5.7 and 0.8 % agar. Plates were incubated at 4 °C for 2 days to ensure even germination after which plants were grown at 21 °C under continuous light conditions with light intensity of 26 μE m−2 s−1.
Library construction, quantification and sequencing
Tissue was ground in liquid N2 and RNA extracted with TRIzol reagent . Total RNA (RIN 9.0 to 9.6) was treated with 1X RNA Secure Reagent  at 65 °C for 10 min, treated with DNase I  for 10 min at 37 °C, and purified with RNAeasy kit  following the manufacturer’s instructions. rRNA was depleted with Ribo-Zero Magnetic kit (Plant Seed/Root)  following the manufacturer’s instructions. Purity and concentration of the resulting mRNAs was determined using Bioanalyzer .
RT was performed as in  except that either 40 μl reactions with ~200 ng mRNA (experiments 1 and 3), or 20 μl reactions with 50 ng mRNA + 100 ng total RNA (experiment 2) were performed. Reactions were purified with Agencourt RNAClean XP kit  as per the manufacturer’s instructions, and first strand cDNA eluted from the beads with 80 μl H2O. The optimal number of semisuppressive PCR cycles, usually between 17 and 25, required for second-strand synthesis were determined by quantitative real-time PCR as in . No-template controls were included at this step to test for potential contaminants. Next, for each library, 400 μl semisuppressive PCR reactions were performed using 60 μl of first strand cDNA and the selected number of cycles, after which the reactions were purified with Agencourt AMPure XP kit . Concentration of PCR products was determined with Qubit dsDNA HS Assay kit , and products diluted to 10 ng/μl. For each library, addition of sequencing adaptors was performed in 700 μl PCR mixtures. At this step, the ExTaq polymerase was replaced with Phusion Hot Start II High-Fidelity DNA Polymerase  in order to generate blunt cDNA ends. The cycling conditions were: 1) 98 °C for 1 min, 2) 1 cycle of 98 °C for 15 s, 55 °C for 10 s, 68 °C for 2 min, 3) 9 cycles of 98 °C for 15 s, 65 °C for 10 s, 68 °C for 2 min. After completion of the PCR, remaining primers were purified by Exonuclease I digestion as in . Namely, 5 μl of Exo I (20 U/ μl)  was added per 700-μl PCR mixture and the mixtures incubated at 37 °C for 30 min. Each 700 μl PCR mixture was then mixed with 3.5 ml (5 V) of PB buffer, purified by running through a single column of QIAquick PCR purification kit , and each library eluted with 25 μl H2O.
Library concentrations were determined with Qubit dsDNA HS Assay kit , and library molecular size distributions determined using Bioanalyzer. The optimal amounts of libraries were then determined as per the Illumina qPCR quantification guide , and libraries sequenced at concentrations of 1.3 to 2.3 nM.
Sequence processing and alignment
All alignments were made against the TAIR10 version of the Arabidopsis genome . Prior to alignment, all reads within a library had the TS oligo sequence removed. For experiment 1, only reads starting with “GGG” were accepted, and the “GGG” sequence was removed before alignment. For libraries that were barcoded (experiments 2 and 3) and often had mismatches in the barcode portion of the read, a custom Perl script was used to assign a barcode to a library only if 1) the barcode had the fewest mismatches compared to any other barcode, and 2) no more than three mismatches total were found . Once a barcode was identified, both the adapter with the barcode, and the following “GGG” portion of the adapter was removed before alignment. For experiment 1, which sequenced 101 nucleotides as opposed to 51 nucleotides in experiments 2 and 3, the last 41 nucleotides were trimmed prior to alignment in order to make sequence lengths more comparable across samples. rRNA reads were removed, and only uniquely mapped reads were used for analyses. CapFilter, described below, was used to pre-process all sequence files prior to alignment, but after adapter removal. After applying CapFilter, all reads were aligned to the TAIR10 reference genome, using Tophat version 2.0.12 , with the parameter settings ‘--bowtie1 -N 2 -i 50 -I 5000’ and the ‘--segment-length’ option set to half the length of the aligned read for those libraries where reads were < 51 nucleotides.
Strand invasion filtering
Strand invasion artifacts were removed post-alignment using the same procedure outlined in  with some minor modifications. Briefly, the nine nucleotides directly upstream of the mapped reads were pulled from the reference genome and this sequence was compared to the last nine nucleotides of the TS oligo used for preparing the library. The last three nucleotides of the reference genome sequence had to have no less than two of the three “G”s present in the TS oligo tail, and no more than two mismatches overall. This same alignment requirement was then applied to subsequences one to three nucleotides upstream and downstream of the read alignment start position, and if at least one subsequence matched the last 9 nucleotides of the TS oligo, then the read was considered as showing evidence of strand invasion.
All annotated protein coding genes were partitioned into four evenly spaced quartiles, relative to the orientation of the gene, starting at 100 bp upstream of the TAIR10 TSS and ending at the TAIR10 annotated end of the gene . Quartiles were ordered from the most 5′ (quartile 1) to the most 3′ (quartile 4). The total number of reads whose 5′-most base resided within a given quartile was then calculated, and the percent of genes with the majority of reads found within a quartile was plotted.
Peak identification and G’ filtering with CapFilter
CapFilter is a two-step process used to identify high confidence peaks. The first step pre-processes reads prior to alignment by trimming the first nucleotide of a given sequence and modifying the FASTQ identifier to keep track of this nucleotide. This FASTQ file is then aligned against the reference, and the subsequent BAM file produced by TopHat . This BAM file is then filtered for strand invasion artifacts, and peak identification is performed using a previously developed peak calling program . CapFilter then takes as input the BAM file, the generated peak file, and the reference genome sequence, and generates as output a final peak file containing high confidence TSS peaks. For this, CapFilter enumerates the percent of reads within a peak that had a 5′ unencoded “G” removed (based on the nucleotide identified in the “qname” field of the SAM alignment). CapFilter then creates a final peak file with high confidence peaks and adds an additional “%-Capped” column to the output to denote the percent of reads which had a 5′ unencoded “G”. Each unencoded “G” was identified by comparing the first base of the FASTQ file (which was removed prior to alignment) to the reference genome nucleotide that was one base pair upstream of the mapped read.
Read correlation analysis
The total number of reads assigned to each peak was calculated using the same procedure outlined above for G’-filtering, and these totals were normalized to the number of reads per million (RPM) found within a given library. For all pairwise comparisons, all peaks were then paired based on overlapping coordinates; only those peaks that could be paired uniquely between libraries were compared. A Spearman’s rank correlation analysis was then performed for all peaks that could be paired between each library.
Peak mode analysis
For all pairwise comparisons, all peaks were paired based on overlapping start/end coordinates. Only those peaks which could be paired uniquely were used, and the median and average distance between the identified peak modes was calculated. For those peaks in nanoCAGE-XL that were comparable to PEAT peaks (overlapping), only peaks within 500 bases of the TAIR10 annotated TSS or within the 5′ UTR were used for this comparison based on the criteria for PEAT peak selection . For all comparisons within nanoCAGE-XL data uniquely paired peaks from all regions were used prior to filtering. After filtering, only uniquely paired peaks that had at least 50 % of reads beginning with an unencoded “G” were considered.
Availability of supporting data
A detailed experimental protocol for nanoCAGE-XL is provided in Additional file 1. All raw sequencing data and TSS peak calls (made for each library individually and for each experiment combined) supporting the results of this article are available at Short Read Archive in [http://www.ncbi.nlm.nih.gov/sra] with accession [PRJNA270670]. The Perl script for assigning barcodes is publicly available as an open source command-line tool at . The CapFilter program for identifying high confidence peak calls is publicly available as an open source command-line tool .
We would like to thank Charles Plessy of the RIKEN Center for Life Science Technologies and Jenn To of Grassroots Biotechnology for technical advice on the nanoCAGE protocol. We would also like to thank Mark Dasenko of the Center for Genome Research and Biocomputing at Oregon State University for troubleshooting assistance in sample preparation for sequencing. This work was supported by the National Institutes of Health [GM097188 to MM]; and by startup funds from Oregon State University.
Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
- Plessy C, Bertin N, Takahashi H, Simone R, Salimullah M, Lassmann T, et al. Linking promoters to functional transcripts in small samples with nanoCAGE and CAGEscan. Nat Methods. 2010;7:528–34.PubMed CentralView ArticlePubMedGoogle Scholar
- Megraw M, Pereira F, Jensen ST, Ohler U, Hatzigeorgiou AG. A transcription factor affinity-based code for mammalian transcription initiation. Genome Res. 2009;19:644–56.PubMed CentralView ArticlePubMedGoogle Scholar
- Morton T, Petricka J, Corcoran DL, Li S, Winter CM, Carda A, et al. Paired-end analysis of transcription start sites in Arabidopsis reveals plant-specific promoter signatures. Plant Cell. 2014;26:2746–60.PubMed CentralView ArticlePubMedGoogle Scholar
- Ni T, Corcoran DL, Rach EA, Song S, Spana EP, Gao Y, et al. A paired-end sequencing strategy to map the complex landscape of transcription initiation. Nat Methods. 2010;7:521–7.PubMed CentralView ArticlePubMedGoogle Scholar
- Takahashi H, Kato S, Murata M, Carninci P. CAGE (cap analysis of gene expression): a protocol for the detection of promoter and transcriptional networks. Methods Mol Biol. 2012;786:181–200.PubMed CentralView ArticlePubMedGoogle Scholar
- Batut P, Dobin A, Plessy C, Carninci P, Gingeras TR. High-fidelity promoter profiling reveals widespread alternative promoter usage and transposon-driven developmental gene expression. Genome Res. 2013;23:169–80.PubMed CentralView ArticlePubMedGoogle Scholar
- Nepal C, Hadzhiev Y, Previti C, Haberle V, Li N, Takahashi H, et al. Dynamic regulation of the transcription initiation landscape at single nucleotide resolution during vertebrate embryogenesis. Genome Res. 2013;23:1938–50.PubMed CentralView ArticlePubMedGoogle Scholar
- Salimullah M, Sakai M, Plessy C, Carninci P. NanoCAGE: a high-resolution technique to discover and interrogate cell transcriptomes. Cold Spring Harb Protoc. 2011;2011:pdb prot5559.PubMed CentralView ArticlePubMedGoogle Scholar
- Tang DT, Plessy C, Salimullah M, Suzuki AM, Calligaris R, Gustincich S, et al. Suppression of artifacts and barcode bias in high-throughput transcriptome analyses utilizing template switching. Nucleic Acids Res. 2012;41, e44.PubMed CentralView ArticlePubMedGoogle Scholar
- Marques AC, Hughes J, Graham B, Kowalczyk MS, Higgs DR, Ponting CP. Chromatin signatures at transcriptional start sites separate two equally populated yet distinct classes of intergenic long noncoding RNAs. Genome Biol. 2013;14:R131.PubMed CentralView ArticlePubMedGoogle Scholar
- Illumina. [http://support.illumina.com/sequencing/best_practices.html].
- The Arabidopsis Information Resource. [http://arabidopsis.org].
- Ohtake H, Ohtoko K, Ishimaru Y, Kato S. Determination of the capped site sequence of mRNA based on the detection of cap-dependent nucleotide addition using an anchor ligation method. DNA Res. 2004;11:305–9.View ArticlePubMedGoogle Scholar
- Kawaji H, Lizio M, Itoh M, Kanamori-Katayama M, Kaiho A, Nishiyori-Sueki H, et al. Comparison of CAGE and RNA-seq transcriptome profiling using clonally amplified and single-molecule next-generation sequencing. Genome Res. 2014;24:708–17.PubMed CentralView ArticlePubMedGoogle Scholar
- Yamamoto YY, Yoshitsugu T, Sakurai T, Seki M, Shinozaki K, Obokata J. Heterogeneity of Arabidopsis core promoters revealed by high-density TSS analysis. Plant J. 2009;60:350–62.View ArticlePubMedGoogle Scholar
- Carninci P, Sandelin A, Lenhard B, Katayama S, Shimokawa K, Ponjavic J, et al. Genome-wide analysis of mammalian promoter architecture and evolution. Nat Genet. 2006;38:626–35.View ArticlePubMedGoogle Scholar
- Hestand MS, Klingenhoff A, Scherf M, Ariyurek Y, Ramos Y, van Workum W, et al. Tissue-specific transcript annotation and expression profiling with complementary next-generation sequencing technologies. Nucleic Acids Res. 2010;38, e165.PubMed CentralView ArticlePubMedGoogle Scholar
- Crooks GE, Hon G, Chandonia JM, Brenner SE. WebLogo: a sequence logo generator. Genome Res. 2004;14:1188–90.PubMed CentralView ArticlePubMedGoogle Scholar
- Yamamoto YY, Ichida H, Matsui M, Obokata J, Sakurai T, Satou M, et al. Identification of plant promoter constituents by analysis of local distribution of short sequences. BMC Genomics. 2007;8:67.PubMed CentralView ArticlePubMedGoogle Scholar
- Short Read Archive. [http://www.ncbi.nlm.nih.gov/sra].
- Brady SM, Orlando DA, Lee JY, Wang JY, Koch J, Dinneny JR, et al. A high-resolution root spatiotemporal map reveals dominant expression patterns. Science. 2007;318:801–6.View ArticlePubMedGoogle Scholar
- Li S, Liberman LM, Mukherjee N, Benfey PN, Ohler U. Integrated detection of natural antisense transcripts using strand-specific RNA sequencing data. Genome Res. 2013;23:1730–9.PubMed CentralView ArticlePubMedGoogle Scholar
- Schmidt WM, Mueller MW. CapSelect: a highly sensitive method for 5′ CAP-dependent enrichment of full-length cDNA in PCR-mediated analysis of mRNAs. Nucleic Acids Res. 1999;27, e31.PubMed CentralView ArticlePubMedGoogle Scholar
- Zajac P, Islam S, Hochgerner H, Lonnerberg P, Linnarsson S. Base preferences in non-templated nucleotide incorporation by MMLV-derived reverse transcriptases. PLoS One. 2013;8, e85270.PubMed CentralView ArticlePubMedGoogle Scholar
- Zhu YY, Machleder EM, Chenchik A, Li R, Siebert PD. Reverse transcriptase template switching: a SMART approach for full-length cDNA library construction. Biotechniques. 2001;30:892–7.PubMedGoogle Scholar
- Hoskins RA, Landolin JM, Brown JB, Sandler JE, Takahashi H, Lassmann T, et al. Genome-wide analysis of promoter architecture in Drosophila melanogaster. Genome Res. 2011;21:182–92.PubMed CentralView ArticlePubMedGoogle Scholar
- Phyto Technology. [http://www.phytotechlab.com].
- Invitrogen [http://www.lifetechnologies.com].
- Ambion. [http://www.lifetechnologies.com].
- Qiagen. [http://www.qiagen.com].
- Epicentre. [http://www.epibio.com].
- Agilent Technologies. [http://www.agilent.com/home].
- Beckman Coulter. [https://www.beckmancoulter.com].
- Molecular Probes. [http://www.lifetechnologies.com].
- Thermo Scientific. [http://www.thermoscientificbio.com].
- New England Biolabs. [https://www.neb.com].
- Cumbie JS. CapFilter. [http://megraw.cgrb.oregonstate.edu/software/CapFilter].
- Trapnell C, Pachter L, Salzberg SL. TopHat: discovering splice junctions with RNA-Seq. Bioinformatics. 2009;25:1105–11.PubMed CentralView ArticlePubMedGoogle Scholar
- Stein LD, Mungall C, Shu S, Caudy M, Mangone M, Day A, et al. The generic genome browser: a building block for a model organism system database. Genome Res. 2002;12:1599–610.PubMed CentralView ArticlePubMedGoogle Scholar