We have developed a restriction enzyme-independent Tag-seq method for expression profiling (EXPRSS) and we present evidence that it performs better than existing restriction enzyme- based (SAGE and derivative) methods. The main drawback associated with SAGE-derived NGS expression profiling methods is restriction enzymatic bias. We overcame this bias using Adaptive Focussed Acoustics from Covaris to shear cDNA and using a specific adapter, we could sequence cDNA tags from a defined position in a transcript for expression profiling. Additionally, use of sonication for fragmenting cDNA allowed us to sequence tags from all expressed genes, unlike restriction enzyme based methods, which require the enzyme recognition site in the gene to get a sequence tag. AFA shearing could be replaced with other available physical fragmentation techniques (SonicMan, Bioruptor, Hydroshear etc.), as the prerequisite is to shear cDNA to a desired size.
In principle, the EXPRSS method has the following advantages over the enzyme-based and other existing Tag-seq methods [24, 25, 43, 70–74]. EXPRSS generates one tag per transcript at a relatively defined position from the 3′ end of a gene, ensuring no length-based data transformation and enabling straightforward statistical analysis. Reverse transcription of only the 3′ ~300 bp of mRNA is required to generate a tag. Shearing using AFA, followed by gel purification, allows us to generate unbiased cDNA tags at a user-specified distance from the poly(A) site for mapping to the transcriptome. DNA shearing and DNA ligation are more high-throughput than RNA fragmentation and RNA ligation. A comparative analysis of frequency distribution of lengths of TAIR10 genes (longest transcript taken for a gene with multiple variants) against that of genes detected (1 TPM at least in 4 replicates) and differentially expressed in EXPRSS Tag-seq, revealed no preference for or against longer transcripts (see Additional file 1, Figure S18), unlike RNA-seq . However, transcripts less than 250 bp are slightly depleted in detection. However, 50% of the transcripts less than 250 bp in TARI10 are pre-trna and other non-coding RNAs, suggesting EXPRSS may not lose information on protein coding genes. Tag sequences are longer than existing Tag-seq methods (~30 bp and can be up to 250 bp) thus increasing the accuracy of read mapping to reference; EXPRSS tags are directional and can distinguish sense and antisense transcripts since the strand synthesized from mRNA acts as template for sequencing, thereby providing the same strand information as that of mRNA. SMART cDNA amplification might result in amplification biases and has been shown to have reduced strand specificity , while EXPRSS does not use amplification for cDNA generation. If enough quantity of input is provided or enough samples are pooled, EXPRSS samples do not require any amplification before loading on to the Illumina flow cell, as they have adapters required to bind in the flow cell. Handling losses are minimal, resulting in lower variability which is an essential factor for a high-throughput method. For example in Arabidopsis, (with a mean mRNA length of ~1.5 kb based on TAIR10 transcripts) for every ~200 bp 3′-most fragment of cDNA there are 5 to 6 additional fragments resulting from shearing. These fragments act as carrier DNA until the size selection step, thereby minimizing variation in handling. EXPRSS can be carried out with minimal input of RNA and addition of a carrier DNA, at the end of the second strand cDNA reaction, reduces any losses during subsequent steps. Carrier DNA would not result in tags, as they lack sequences for amplification in the flow cell. We have successfully employed lambda phage DNA as carrier in our sample preparation without any interference. With these advantages, EXPRSS Tag-seq is an excellent method for expression profiling.
Improvements in NGS techniques have resulted in improved and increased sequence yield, thereby providing an opportunity for multiplexed tag sequencing. How should one judge sufficient depth of sampling of the library? A simulated study based on pooled SAGE data  indicated that a cell with a million mRNA copies, requires three million SAGE tags to identify ~90% of expressed genes. According to Zhu et al.,  a million reads would sample ~85% of mRNA tags expressing at 1–5 copies per cell. Further evidence from recent studies [76–79] suggest that a RNA-seq sample resulting in 20–40 million reads would identify ~80% of the genes.
How do Tag-seq methods compare with whole transcriptome sequencing RNA-seq methods? RNA-seq is invaluable for annotation of a transcriptome and detection of splice variants and novel transcripts. However, RNA-seq as a tool for differential expression analysis has its drawbacks. Tag-seq methods generate one tag per transcript, mostly from a defined position, thereby avoiding transcript length biases and length-based data transformation for differential expression analysis. According to TAIR10 transcript data, the mean cDNA length of Arabidopsis is 1.5 kb. Thus, for every one fragment sequenced per transcript by EXPRSS there would be 10 ~ 150 bp fragments from RNA-seq, one could reasonably assume that 2 million reads from EXPRSS are equivalent to ~20 million reads of RNA-seq data, without transcript length and reverse transcription biases. Illumina single end sequencing on GAII provides about 35-40 M reads per lane, while a Hiseq lane provides about ~100-150 M reads. This means sequencing of only 2 RNA-seq samples per lane on GAII and 4–6 RNA-seq samples on Hiseq is required to provide sufficient depth for differential expression analysis. Multiplexing is thus more valuable with Tag-sequencing than RNA-seq. Tag-seq methods like EXPRSS therefore provide an attractive option for expression profiling with superior dynamics and reliability at substantially reduced cost.
Based on pilot experiments using four replicates each of Arabidopsis (Col-0) control and 60 minute flg22-treated leaf discs, we found that EXPRSS captures tags resulting from annotated and novel transcription units and also poly(A) transcripts from antisense strands. Read alignments indicate that cDNA is randomly sheared, as ~90% of tags are aligning around ~150 bp (±150) from the 3′ end. EXPRSS Tag-seq is highly reproducible between biological replicates and was also superior to NlaIII-DGE Tag-seq libraries made from the same RNA samples. EXPRSS revealed expression of 26% more genes compared to NlaIII-DGE in libraries made from the same RNA samples (16619 genes in EXPRSS against 13153 genes in NlaIII-DGE genes with ≥1 TPM in 4 replicates). Differential expression analysis performed on absolute tag counts facilitates interpretation of expression levels. Analysis using EXPRSS has identified more than 80% of the genes found by NlaIII-DGE and ATH1 microarray, though one could argue this is due to the higher numbers of genes found differentially expressed by EXPRSS compared to NlaIII-DGE and ATH1 microarray. A comparative study of flg22-treated microarray data from three different developmental stages in Arabidopsis has shown that EXPRSS data had a better overlap with the three microarray data sets than the overlap observed between the three microarray data sets. Two thirds of the upregulated genes from EXPRSS (spotted on ATH1 chip) were also found differentially expressed in at least one of the three array experiments. This supports the view that EXPRSS is sensitive enough to detect response-specific differential expression.
With respect to downregulated genes, EXPRSS had good overlap with two of the three-microarray data sets; although the three-microarray data sets had poor overlap among themselves. The number of genes found downregulated, using the EXPRSS Tag-seq, was greater than those detected from other microarray experiments. Therefore, one could argue this higher overlap might be due to the higher number of genes identified as downregulated with the EXPRSS method. There appears to be a detection bias against identification of down-regulation with microarrays, explaining the poor overlap between the three microarray data sets tested. The down-regulated genes found in this study could have been previously undetected in microarray analysis. Additionally, qPCR verification of selected genes from EXPRSS data confirmed that differential expression identification is reliable. These results thus emphasize the advantages of EXPRSS Tag-seq sensitive detection and absence of hybridization related issues, unlike microarrays.
Alternative splicing is another interesting phenomenon to detect during gene expression profiling. However, finding alternative splice-junctions is computationally intricate with a tag length of ~30 bp and requires higher depth of sequencing . Variable levels of expression of different alternative transcripts of a gene results in fluctuating tag densities for each variant. Li et al.  found that even at a tag density as high as ~20 million RNA-seq reads, only ~20% of verified alternative splice variants were discovered in the human transcriptome. Due to such insufficient tag densities, quantitative differences among alternative polyadenylation and alternative transcripts are unreliable. However, with the possibility of longer tags (~150 bp with Illumina) EXPRSS tags could be used to find alternative splice junctions, provided they occur within the last ~300 bp from a polyadenylation site.
Through EXPRSS we identified cDNA tags from regions of the genome without prior annotation. Paired end EXPRSS tag-seq could greatly improve annotation of 3′ regions and determine precise poly(A) sites for these rare transcripts. With improvements in technology the length of sequencing reads continues to increase, so that longer cDNA tags can be obtained. Longer cDNA tags not only enable us to match tags with higher confidence, but also to expression profile during host-pathogen interactions, and thus study gene expression patterns of both host and microbe. We are using such methods to address gene expression changes during Arabidopsis interactions with white rust and downy mildew.