The last 10 years have seen the rise of countless functional genomics studies based on Next-Generation Sequencing (NGS). In the vast majority of cases, whatever the species, whatever the experiment, the two first steps of data analysis consist of a quality control of the raw reads followed by a mapping of those reads to a reference genome/transcriptome. Subsequent steps then depend on the type of study that is being made. While some tools have been proposed for investigating data quality after the mapping step, there is no commonly adopted framework that would be easy to use and broadly applicable to any NGS data type.
We present ALFA, a simple but universal tool that can be used after the mapping step on any kind of NGS experiment data for any organism with available genomic annotations. In a single command line, ALFA can compute and display distribution of reads by categories (exon, intron, UTR, etc.) and biotypes (protein coding, miRNA, etc.) for a given aligned dataset with nucleotide precision. We present applications of ALFA to Ribo-Seq and RNA-Seq on Homo sapiens, CLIP-Seq on Mus musculus, RNA-Seq on Saccharomyces cerevisiae, Bisulfite sequencing on Arabidopsis thaliana and ChIP-Seq on Caenorhabditis elegans.
We show that ALFA provides a powerful and broadly applicable approach for post mapping quality control and to produce a global overview using common or dedicated annotations. It is made available to the community as an easy to install command line tool and from the Galaxy Tool Shed.
Software programs such as FastQC  have become routine to get information about raw high-throughput sequenced material. Most of the time, this first analysis step is followed by a second step where the reads are mapped to a reference genome using a sequence aligner such as STAR . However, the subsequent steps are often very specific to the type of NGS experiment that is being run. For instance, in ChIP-Seq  or CLIP-Seq  experiments, peaks will need to be detected prior to further processing; in RNA-Seq, a differential analysis will often be performed on aligned reads; in BS-Seq experiments, a dedicated analysis of sequence will be applied in order to obtain a methylation ratio per nucleotide prior to further processing. To sum up, only two steps are common to a large majority of NGS data analysis pipelines: sequencing quality control and mapping. The subsequent steps are most often specific to the considered application.
Dedicated categorization of reads for specific kinds of NGS data were employed in past studies performing RNA-Seq , Ribosome profiling , ChIP-seq  or miR-seq . However, they were not designed to be broadly applicable. That is, none of these dedicated methods would be suitable for the analysis of another type of NGS data. For instance, a peak calling was a prerequisite to ChIP-Seq data categorization in  which would be meaningless for an RNA-Seq experiment. Alternatively and with some coding effort, programs such as BEDTools , BEDOPS , featureCounts or Bam2x  can be combined to intersect positions with feature intervals and produce read feature distributions (see Table 1 for a comparison). However, combining those is not straightforward and the overall process is often largely sub-optimal . To the best of our knowledge, there is no available tool that provides a broadly applicable and fast quantitative overview of several samples with nucleotide precision.
Here, we introduce ALFA (Annotation Lanscape For Aligned reads), a simple and broadly applicable tool that produces a global overview of the distribution of mapped reads, both in terms of genomic categories (stop codon, 5′-UTR, CDS, intergenic, etc.) and biotypes (protein coding, miRNA, ncRNA, etc.) with nucleotide precision. ALFA turned out to be a very useful systematic post-mapping quality control tool for a broad range of NGS applications we have been dealing with. We describe here the results obtained with ALFA using datasets produced by several types of NGS experiments on various model organisms. Even though its primary purpose was to offer a quality control step right after mapping, we show that ALFA occasionally provided a first global functional insight prior to a more dedicated analysis.
Common genomic databases (e.g. Ensembl) mostly propose two types of functional annotations: genomic categories (5 ́-UTR, CDS, intron, intergenic region, etc.) and gene biotypes (protein-coding, miRNA, IG-gene, etc.). An annotation file with both the category and biotype information is required to run ALFA. In order to compute the proportions of reads from a given NGS dataset falling into these annotated regions, ALFA proceeds in three steps. In a first step, ALFA generates reusable index files from genomic annotations. In a second step, ALFA computes the actual nucleotide count per category from a given set of NGS samples using those index files. Finally, ALFA can display bar plots of the raw and normalized distributions. Importantly, the three steps described in detail below can be processed at once with a single command line call.
Creation of the index files
ALFA first processes a GTF annotation file and converts it to two index files. This step needs to be performed only once for a given annotation file. Its aim is to transform the GTF file content, which is an unordered and overlapping set of associations between intervals and category-biotype pairs, into a list of category-biotype pairs indexed by genomic coordinates. For instance, in a GTF file, a nucleotide could be associated to an exon interval but also to at least one transcript and one gene interval. This hierarchy can be deeper for the most precise features (UTR, CDS, start codon, etc.). Consequently, in a GTF file most genomic coordinates appear several times and in an unordered fashion. In order to suppress this redundancy, this indexing step splits intervals so as to remove overlaps and so that each interval matches a list of category-biotype pairs. It proceeds as follows: for each row of the GTF file, if the category-biotype pair for the considered interval does not overlap with any yet processed intervals, it is created. If a nucleotide sequence belongs to several overlapping features at the same hierarchical level, it is considered as ambiguous by default and discarded from further processing. Optionally, ALFA can equally split the considered counts between the overlapping category-biotype pairs. At the end of the process, the index is made of non overlapping intervals annotated with one or more category-biotype pairs. In this way, all category-biotypes of any given genomic coordinate can be retrieved at once. Furthermore, in order to preserve the DNA strand information, two index files are created. The first one is a stranded index, which contains the features for the sense and for the antisense strands separately. In stranded mode, a nucleotide mapping to an unannotated region on one strand facing an annotated region on the opposite strand will be counted in the category “opposite strand”. The second one is an unstranded index containing all annotations from both strands. Concretely, the systematic production of those two files enables a flexible choice of strandness when processing sample files at the next step using the strand parameter.
Processing of the sample files
The processing of a set of mapped reads samples in BAM format is performed in two steps. First the coverage per genomic position is obtained for each sample using Bedtools genomecov  which produces a bedgraph file per sample. The bedgraph format is a non overlapping list of genomic intervals with counts. Following this, each bedgraph file is intersected with the index previously computed from the GTF file (either the stranded or the unstranded version, according to the strand parameter chosen by the user). This intersection consists in aggregating all counts falling into all intervals of each category-biotype pair. Concretely, it is obtained in the following way: each genomic interval defined in the bedgraph files is split into as many genomic intervals it covers in the index file and the list of category-biotype associations found in those intervals are incremented by the corresponding fraction of the count. As annotation files reflect our current knowledge about a given genome, they are not strictly standardized. As a consequence, elements can be unevenly described. To solve this issue, ALFA adds virtual categories called “undescribed” at the gene and the exon levels. For instance, if ALFA is supposed to display the category distribution up to the UTR regions level but a gene is only annotated up to the exon level, the reads mapping to this exon of this gene will be categorized as “undescribed exon” because there is no way to know which of the deeper categories (“UTR” or “CDS”) it should account for. At the end of this process, a list of nucleotide counts per category-biotype pair is obtained for each sample and stored in one of the output files. Eventually, nucleotides that map to a region that corresponds to two or more distinct features from the same hierarchical level are considered ambiguous and are discarded by default. In this case, the percentage of ignored read counts is displayed by the program directly on the standard output.
Displaying the results
The last step consists in creating plots from the file produced at the previous step. From the counts file, containing the number of nucleotides in the sample for each category-biotype annotation pair, counts are grouped by category and by biotype to produce two images. On each image, two plots are displayed on top of each other. The top plot represents the raw distribution of reads among the selected type of features (categories or biotypes). The bottom plot shows the same counts divided by the total amount of the corresponding features in the reference genome. This second plot provides information about the sample enrichment (or depletion) of categories/biotypes compared to the genomic global distribution.
ALFA can detect inconsistencies between replicates
Comparing feature distributions right after mapping is a simple and powerful way to easily detect possible inconsistencies between replicates. To illustrate this point, we used ALFA to evaluate the consistency of three replicates in a CLIP-Seq experiment on Mus musculus samples. CLIP-Seq, for Cross-Linking ImmunoPrecipitation followed by high throughput sequencing, is used to identify the binding sites of RNA associated proteins. In Fig. 1, the category plots produced by ALFA distinctly highlight that one of the replicates behaves differently from the others, with an enrichment of reads mapping within intergenic regions. In this case, ALFA allowed to quickly point out, after the mapping step, that a replicate should be further investigated, treated with special care or even discarded.
ALFA can detect inconsistencies between experimental studies
Feature distributions can also be used to compare replicates of a similar experiment performed in different places or experimental conditions. To illustrate this point, we used ALFA to compare two BS-Seq experiments performed on Arabidopsis thaliana by two different laboratories ([14, 15]). BS-Seq for Bisulfite treatment followed by high throughput sequencing is used to determine the pattern of methylation on DNA. In Fig. 2, a category plot produced by ALFA permits the detection of discrepancies in the relative amount of reads mapping within coding sequences (CDS) and intergenic regions; indeed, the proportion of reads mapping to intergenic regions was higher in the data from  compared to . Moreover, it also offered a quick overview of the reproducibility between replicates for each of these studies with  showing less divergence.
ALFA can be used to validate a protocol
Feature distributions can also be used to check if a technique is working as expected. To illustrate this point we used ALFA to validate the difference between two Ribosome Profiling protocols on Mus musculus samples. The Ribosome Profiling technique, which consists of the deep sequencing of ribosome-protected mRNA fragments, produces a global snapshot of the translatome. Alongside two replicates using the standard protocol, we produced two extra replicates with an additional harringtonine treatment. Harringtonine is a drug that inhibits the elongation phase of translation, after initiation. In Fig. 3, the category plot produced by ALFA shows that Harringtonine decreased the CDS fraction of reads while increasing the 5’UTR and start codon fractions of reads, thus expectedly shifting the distribution of mapped reads towards the start sites.
ALFA can help choosing between available methods
Comparing feature distributions can be helpful to choose a method that is adapted to the user’s experimental purpose. For instance, two Ribosome Profiling protocols have been proposed, where the footprints are generated either by digestion with MNase or with RNase I. In Fig. 4, a category plot produced with ALFA on samples treated as in  or as in  shows a substantial contamination of the footprints obtained by MNase digestion with non-protein-coding sequences, with an important fraction of reads annotated as intergenic and 3’UTR. In our hands, the MNase procedure was thus less specific of ribosome-protected RNA sequences than the RNAse I-based protocol.
ALFA can detect contaminations
Comparing feature distributions enables to quickly detect a contamination. For instance, ALFA was used to identify a mitochondrial ribosomal RNA (rRNA) contamination in a RNA-Seq experiment on Homo sapiens samples. Contamination by rRNA can reach a few percents to the vast majority of reads depending on the experiment. In Fig. 5, a biotype plot produced by ALFA confirms a substantial rRNA contamination in one of the tissue samples as described in .
ALFA provides a genomic scale overview
While primarily designed as a quality control tool, ALFA can also provide a universal way to compare samples at genomic scale prior to a dedicated analysis. For instance, the user may detect global differences between conditions as seen in Fig. 3. To further illustrate this point, we used ALFA to confirm an enrichment of specific types of genetic loci in a ChIP-Seq experiment of the nuclear pore protein NPP-13 from Caenorhabditis elegans performed in a previous study . ChIP-Seq, for Chromatin ImmunoPrecipitation followed by high throughput sequencing, is used to identify the binding sites of DNA-associated proteins. In Fig. 6, a biotype plot by ALFA confirmed an enrichment of nuclear pore proteins at genetic loci transcribed by RNA polymerase III, such as snoRNAs and tRNAs according to the results obtained in . ALFA additionally highlights that miRNA and ncRNA genes are also preferentially associated with this nuclear pore component. These loci might also correspond to polIII-transcribed genes , even though this would require deeper scrutiny.
ALFA allows to use custom annotations
Whereas in practice most ALFA applications should require ready-made publicly available annotation files, users may choose to define additional dedicated features of interest for specific applications. To illustrate this, we used ALFA to obtain read distributions of singular transcripts in an RNA-Seq experiment on Saccharomyces cerevisiae. In Fig. 7, we show that ALFA can be used with a customized genomic annotation file to observe the relative distribution of dedicated biotypes as for instance unstable (CUTs , NUTs , XUTS ) and stable (SUTs ) transcripts.
ALFA is fast
One of the compelling advantages of ALFA, as a single integrated tool, is that it was optimized to be fast even when run on a regular desktop computer. Furthermore, it can also take advantage of multi-core machines. ALFA needs approximately 7 min to build the Homo sapiens genome index and 6 min to compute distributions of four samples of approximately 40 M reads using four CPUs on a regular standalone computer. Note that computing the genome index needs to be performed only once and can be reused to compute additional sample distributions.
In a single and simple call, ALFA proved to be useful as a post-mapping quality control tool. It was also useful to obtain a preliminary global overview of data distribution into features and functional categories prior to dedicated analysis. In this section, we emphasize the two main limitations of ALFA.
ALFA can only detect genome scale events
ALFA is a broadly applicable and easy to use genome scale quality control tool. However, it cannot be used to perform a differential analysis at the gene level and it is not able to detect effects that would impact only a small fraction of the dataset. ALFA is meant to be a relevant post mapping quality control step but cannot replace a dedicated analysis.
ALFA requires a GTF annotation file
ALFA cannot be used with organisms that have not been annotated yet. Therefore, the use of ALFA is somewhat restricted to organisms that are commonly studied by the scientific community and for which biotype annotations exist (e.g. GTF files from Ensembl). Alternatively, the user can provide a custom-built annotation file.
In functional genomics, a vast majority of NGS experiments have the two first data analysis steps in common: a quality control of the raw sequencing reads and a mapping to the corresponding organism reference genome. We introduced ALFA, a third step that can be easily inserted into a pipeline after mapping in order to reveal genome scale possible artefacts. ALFA produces a global overview of the distribution of the mapped reads across genomic categories and biotypes at nucleotide resolution. ALFA was designed to process data from a wide variety of NGS experiments. In order to illustrate its broad usability, we applied ALFA on 6 sequencing datasets that used 5 different methods on 5 different organisms. Those examples highlighted the effectiveness of ALFA to provide a powerful post mapping quality control step on various type of NGS experiments, together with, in some cases a useful post mapping global overview. ALFA is open source and freely available as a Python script and as a Galaxy tool.
All figures can be recomputed with the data specified in the “Availability of data and material” section below and the following ALFA command lines (optionally using the desired number of CPUs with the “-p” argument):
Calviello L, Mukherjee N, Wyler E, Zauber H, Hirsekorn A, Selbach M, Landthaler M, Obermayer B, Ohler U. Detecting actively translated open reading frames in ribosome profiling data. Nat Methods. 2016;13(2):165–70.
Heinz S, Benner C, Spann N, Bertolino E, Lin YC, Laslo P, Cheng JX, Murre C, Singh H, Glass CK. Simple combinations of lineage-determining transcription factors prime cis-regulatory elements required for macrophage and B cell identities. Mol Cell. 2010;38(4):576–89.
Reid DW, Shenolikar S, Nicchitta CV. Simple and inexpensive ribosome profiling analysis of mRNA translation. Methods. 2015.
Ingolia NT, Ga B, Rouskin S, McGeachy AM, Weissman JS. The ribosome profiling strategy for monitoring translation in vivo by deep sequencing of ribosome-protected mRNA fragments. Nat Protoc. 2012;7:1534–50.
Choy JYH, Boon PLS, Bertin N, Fullwood MJ: A resource of ribosomal RNA-depleted RNA-Seq data from different normal adult and fetal human tissues. Scientific data 2015, 2:150063, [http://www.ncbi.nlm.nih.gov/pubmed/26594381, http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=PMC4640133].
Ikegami K, Lieb JD. Integral nuclear pore proteins bind to pol III-transcribed genes and are required for pol III transcript processing in C. Elegans. Mol Cell. 2013;51(6):840–9.
van Dijk EL, Chen CL, d’Aubenton Carafa Y, Gourvennec S, Kwapisz M, Roche V, Bertrand C, Silvain M, Legoix-Ne P, Loeillet S, Nicolas A, Thermes C, Morillon A. XUTs are a class of Xrn1-sensitive antisense regulatory non-coding RNA in yeast. Nature. 2011;475(7354):114–7.
We thank Zhen Wang, Jérôme Saulière and Vinko Besic for providing the Ribo-Seq and CLIP-Seq datasets. We thanks Mary Ann Letellier for help editing the manuscript and Felipe Delestro Matos for discussion and help in designing the figure plots.
This work has received support under the program “Investissements d’Avenir” implemented by ANR (ANR-10-LABX-54 MEMOLIFE and ANR-10-IDEX-0001-02 PSL University), Fondation pour la Recherche Médicale (FRM-AJE20131128944), Inserm ATIP-Avenir and Mairie de Paris (Programme Émergences – Recherche médicale). The funders were not involved in study design, and collection, analysis, and interpretation of data and in writing the manuscript.
Availability of data and materials
All data presented in the figures are available from the specified data repositories corresponding to the cited publications. A few figures contain unpublished data from the authors’ laboratories that are available on demand. All datasets are listed below.
MB, BFN and VM conceived and implemented ALFA, LB tested ALFA, HLH and AL generated datasets and suggested typical use cases. CB wrote the Galaxy wrapper. AG, BFN and MB wrote the manuscript. All authors have read, edited and approved the manuscript.
The authors declare that they have no competing interests.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.