ALFA: annotation landscape for aligned reads

Background The last 10 years have seen the rise of countless functional genomics studies based on Next-Generation Sequencing (NGS). In the vast majority of cases, whatever the species, whatever the experiment, the two first steps of data analysis consist of a quality control of the raw reads followed by a mapping of those reads to a reference genome/transcriptome. Subsequent steps then depend on the type of study that is being made. While some tools have been proposed for investigating data quality after the mapping step, there is no commonly adopted framework that would be easy to use and broadly applicable to any NGS data type. Results We present ALFA, a simple but universal tool that can be used after the mapping step on any kind of NGS experiment data for any organism with available genomic annotations. In a single command line, ALFA can compute and display distribution of reads by categories (exon, intron, UTR, etc.) and biotypes (protein coding, miRNA, etc.) for a given aligned dataset with nucleotide precision. We present applications of ALFA to Ribo-Seq and RNA-Seq on Homo sapiens, CLIP-Seq on Mus musculus, RNA-Seq on Saccharomyces cerevisiae, Bisulfite sequencing on Arabidopsis thaliana and ChIP-Seq on Caenorhabditis elegans. Conclusions We show that ALFA provides a powerful and broadly applicable approach for post mapping quality control and to produce a global overview using common or dedicated annotations. It is made available to the community as an easy to install command line tool and from the Galaxy Tool Shed.


Background
Software programs such as FastQC [1] have become routine to get information about raw high-throughput sequenced material. Most of the time, this first analysis step is followed by a second step where the reads are mapped to a reference genome using a sequence aligner such as STAR [2]. However, the subsequent steps are often very specific to the type of NGS experiment that is being run. For instance, in ChIP-Seq [3] or CLIP-Seq [4] experiments, peaks will need to be detected prior to further processing; in RNA-Seq, a differential analysis will often be performed on aligned reads; in BS-Seq experiments, a dedicated analysis of sequence will be applied in order to obtain a methylation ratio per nucleotide prior to further processing. To sum up, only two steps are common to a large majority of NGS data analysis pipelines: sequencing quality control and mapping. The subsequent steps are most often specific to the considered application.
Dedicated categorization of reads for specific kinds of NGS data were employed in past studies performing RNA-Seq [5], Ribosome profiling [6], ChIP-seq [7] or miR-seq [8]. However, they were not designed to be broadly applicable. That is, none of these dedicated methods would be suitable for the analysis of another type of NGS data. For instance, a peak calling was a prerequisite to ChIP-Seq data categorization in [9] which would be meaningless for an RNA-Seq experiment. Alternatively and with some coding effort, programs such as BEDTools [10], BEDOPS [11], featureCounts or Bam2x [12] can be combined to intersect positions with feature intervals and produce read feature distributions (see Table 1 for a comparison). However, combining those is not straightforward and the overall process is often largely sub-optimal [13]. To the best of our knowledge, there is no available tool that provides a broadly applicable and fast quantitative overview of several samples with nucleotide precision.
Here, we introduce ALFA (Annotation Lanscape For Aligned reads), a simple and broadly applicable tool that produces a global overview of the distribution of mapped reads, both in terms of genomic categories (stop codon, 5′-UTR, CDS, intergenic, etc.) and biotypes (protein coding, miRNA, ncRNA, etc.) with nucleotide precision. ALFA turned out to be a very useful systematic post-mapping quality control tool for a broad range of NGS applications we have been dealing with. We describe here the results obtained with ALFA using datasets produced by several types of NGS experiments on various model organisms. Even though its primary purpose was to offer a quality control step right after mapping, we show that ALFA occasionally provided a first global functional insight prior to a more dedicated analysis.

Implementation
Common genomic databases (e.g. Ensembl) mostly propose two types of functional annotations: genomic categories (5́-UTR, CDS, intron, intergenic region, etc.) and gene biotypes (protein-coding, miRNA, IG-gene, etc.). An annotation file with both the category and biotype information is required to run ALFA. In order to compute the proportions of reads from a given NGS dataset falling into these annotated regions, ALFA proceeds in three steps. In a first step, ALFA generates reusable index files from genomic annotations. In a second step, ALFA computes the actual nucleotide count per category from a given set of NGS samples using those index files. Finally, ALFA can display bar plots of the raw and normalized distributions. Importantly, the three steps described in detail below can be processed at once with a single command line call.

Creation of the index files
ALFA first processes a GTF annotation file and converts it to two index files. This step needs to be performed only once for a given annotation file. Its aim is to transform the GTF file content, which is an unordered and overlapping set of associations between intervals and category-biotype pairs, into a list of category-biotype pairs indexed by genomic coordinates. For instance, in a GTF file, a nucleotide could be associated to an exon interval but also to at least one transcript and one gene interval. This hierarchy can be deeper for the most precise features (UTR, CDS, start codon, etc.). Consequently, in a GTF file most genomic coordinates appear several times and in an unordered fashion. In order to suppress this redundancy, this indexing step splits intervals so as to remove overlaps and so that each interval matches a list of category-biotype pairs. It proceeds as follows: for each row of the GTF file, if the category-biotype pair for the considered interval does not overlap with any yet processed intervals, it is created. If a nucleotide sequence belongs to several overlapping features at the same hierarchical level, it is considered as ambiguous by default and discarded from further processing. Optionally, ALFA can equally split the considered counts between the overlapping category-biotype pairs. At the end of the process, the index is made of non overlapping intervals annotated with one or more category-biotype pairs. In this way, all category-biotypes of any given genomic coordinate can be retrieved at once. Furthermore, in order to preserve the DNA strand information, two index files are created. The first one is a stranded index, which contains the features for the sense and for the antisense strands separately. In stranded mode, a nucleotide mapping to an unannotated region on one strand facing an annotated region on the opposite strand will be counted in the category "opposite strand". The second one is an unstranded index containing all annotations from both strands. Concretely, the systematic production of those two files enables a flexible choice of strandness when processing sample files at the next step using the strand parameter.

Processing of the sample files
The processing of a set of mapped reads samples in BAM format is performed in two steps. First the coverage per genomic position is obtained for each sample using Bedtools genomecov [10] which produces a bedgraph file per sample. The bedgraph format is a non overlapping list of genomic intervals with counts. Following this, each bedgraph file is intersected with the index previously computed from the GTF file (either the stranded or the unstranded version, according to the strand parameter chosen by the user). This intersection consists in aggregating all counts falling into all intervals of each category-biotype pair. Concretely, it is obtained in the following way: each genomic interval defined in the bedgraph files is split into as many genomic intervals it covers in the index file and the list of category-biotype associations found in those intervals are incremented by the corresponding fraction of the count. As annotation files reflect our current knowledge about a given genome, they are not strictly standardized. As a consequence, elements can be unevenly described. To solve this issue, ALFA adds virtual categories called "undescribed" at the gene and the exon levels. For instance, if ALFA is supposed to display the category distribution up to the UTR regions level but a gene is only annotated up to the exon level, the reads mapping to this exon of this gene will be categorized as "undescribed exon" because there is no way to know which of the deeper categories ("UTR" or "CDS") it should account for. At the end of this process, a list of nucleotide counts per category-biotype pair is obtained for each sample and stored in one of the output files. Eventually, nucleotides that map to a region that corresponds to two or more distinct features from the same hierarchical level are considered ambiguous and are discarded by default. In this case, the percentage of ignored read counts is displayed by the program directly on the standard output.

Displaying the results
The last step consists in creating plots from the file produced at the previous step. From the counts file, containing the number of nucleotides in the sample for each category-biotype annotation pair, counts are grouped by category and by biotype to produce two images. On each image, two plots are displayed on top of each other. The top plot represents the raw distribution of reads among the selected type of features (categories or biotypes). The bottom plot shows the same counts divided by the total amount of the corresponding features in the reference genome. This second plot provides information about the sample enrichment (or depletion) of categories/biotypes compared to the genomic global distribution.

ALFA can detect inconsistencies between replicates
Comparing feature distributions right after mapping is a simple and powerful way to easily detect possible inconsistencies between replicates. To illustrate this point, we used ALFA to evaluate the consistency of three replicates in a CLIP-Seq experiment on Mus musculus samples. CLIP-Seq, for Cross-Linking ImmunoPrecipitation followed by high throughput sequencing, is used to identify the binding sites of RNA associated proteins. In Fig. 1, the category plots produced by ALFA distinctly highlight that one of the replicates behaves differently from the others, with an . Datasets were gathered from two studies performed on the same model but in two different laboratories: Lab1-Rep1 (SRR342381) and Lab1-Rep2 (SRR342391) from [14] and Lab2-Rep1 (ERR046552) and Lab2-Rep2 (ERR046553) from [15]. ALFA highlights laboratory dependent differences between reads falling in CDS (t-test significant at a 5% level with a p-value of 4 × 10 -2 ). enrichment of reads mapping within intergenic regions. In this case, ALFA allowed to quickly point out, after the mapping step, that a replicate should be further investigated, treated with special care or even discarded.

ALFA can detect inconsistencies between experimental studies
Feature distributions can also be used to compare replicates of a similar experiment performed in different places or experimental conditions. To illustrate this point, we used ALFA to compare two BS-Seq experiments performed on Arabidopsis thaliana by two different laboratories ( [14,15]). BS-Seq for Bisulfite treatment followed by high throughput sequencing is used to determine the pattern of methylation on DNA. In Fig. 2, a category plot produced by ALFA permits the detection of discrepancies in the relative amount of reads mapping within coding sequences (CDS) and intergenic regions; indeed, the proportion of reads mapping to intergenic regions was higher in the data from [14] compared to [15]. Moreover, it also offered a quick overview of the reproducibility between replicates for each of these studies with [15] showing less divergence.

ALFA can be used to validate a protocol
Feature distributions can also be used to check if a technique is working as expected. To illustrate this point we used ALFA to validate the difference between two Ribosome Profiling protocols on Mus musculus samples. The Ribosome Profiling technique, which consists of the deep sequencing of ribosomeprotected mRNA fragments, produces a global snapshot of the translatome. Alongside two replicates using the standard protocol, we produced two extra replicates with an additional harringtonine treatment.
Harringtonine is a drug that inhibits the elongation phase of translation, after initiation. In Fig. 3, the category plot produced by ALFA shows that Harringtonine decreased the CDS fraction of reads while increasing the 5'UTR and start codon fractions of reads, thus expectedly shifting the distribution of mapped reads towards the start sites.

ALFA can help choosing between available methods
Comparing feature distributions can be helpful to choose a method that is adapted to the user's experimental purpose. For instance, two Ribosome Profiling protocols have been proposed, where the footprints are generated either by digestion with MNase or with RNase I. In Fig. 4, a category plot produced with ALFA on samples treated as in [16] or as in [17] shows a substantial contamination of the footprints obtained by MNase digestion with non-protein-coding sequences, with an important fraction of reads annotated as intergenic and 3'UTR. In our hands, the MNase procedure was thus less specific of ribosome-protected RNA sequences than the RNAse I-based protocol.

ALFA can detect contaminations
Comparing feature distributions enables to quickly detect a contamination. For instance, ALFA was used to identify a mitochondrial ribosomal RNA (rRNA) contamination in a RNA-Seq experiment on Homo sapiens samples. Contamination by rRNA can reach a few percents to the vast majority of reads depending on the experiment. In Fig. 5, a biotype plot produced by ALFA confirms a substantial Fig. 4 ALFA category plots (raw and normalized) for Ribosome Profiling (Ribo-Seq) data on Homo sapiens (unpublished data from AL available on demand) performed with two different procedures for footprinting: treatment with MNase as in [16] or treatment with RNase I as in [17]. As a preprocessing step, rRNA and mtRNA reads were computationally filtered out. The enrichment of signal in intergenic and 3′-UTR regions shows that treatment with MNase seems to produce a substantial increase of the non-protein coding reads compared to RNase I.
rRNA contamination in one of the tissue samples as described in [18].

ALFA provides a genomic scale overview
While primarily designed as a quality control tool, ALFA can also provide a universal way to compare samples at genomic scale prior to a dedicated analysis. For instance, the user may detect global differences between conditions as seen in Fig. 3. To further illustrate this point, we used ALFA to confirm an enrichment of specific types of genetic loci in a ChIP-Seq experiment of the nuclear pore protein NPP-13 from Caenorhabditis elegans performed in a previous study [19]. ChIP-Seq, for Chromatin ImmunoPrecipitation followed by high throughput sequencing, is used to identify the binding sites of DNA-associated proteins. In Fig. 6, a biotype plot by ALFA confirmed an enrichment of nuclear pore proteins at genetic loci transcribed by RNA polymerase III, such as snoRNAs and tRNAs according to the results obtained in [19]. ALFA additionally highlights that miRNA and ncRNA genes are also preferentially associated with this nuclear pore component. These loci might also correspond to polIII-transcribed genes [20], even though this would require deeper scrutiny.

ALFA allows to use custom annotations
Whereas in practice most ALFA applications should require ready-made publicly available annotation files, users may choose to define additional dedicated features of interest for specific applications. To illustrate this, we used ALFA to obtain read distributions of singular transcripts in an RNA-Seq experiment on Saccharomyces cerevisiae. In Fig. 7, we show that ALFA can be used with a customized genomic annotation file to observe the relative distribution of dedicated biotypes as for instance unstable (CUTs [21], NUTs [22], XUTS [23]) and stable (SUTs [24]) transcripts.

ALFA is fast
One of the compelling advantages of ALFA, as a single integrated tool, is that it was optimized to be fast even when run on a regular desktop computer. Furthermore, it can also take advantage of multi-core machines. ALFA needs approximately 7 min to build the Homo sapiens genome index and 6 min to compute distributions of four samples of approximately 40 M reads using four CPUs on a regular standalone computer. Note that computing the genome index needs to be performed only once and can be reused to compute additional sample distributions. . This dataset is part of a research work where ribosomal RNA depletion is compared between adult and fetal tissues [18]. This study reported that a large portion of transcripts with mitochondrial ribosomal origin was observed, in particular in colon, heart and kidney samples. For clarity, only lung (SRR2014234 and SRR2014235) and heart (SRR2014232 and SRR2014233) replicates from the study are reported here. ALFA enables, with a single command, to quickly confirm that mitochondrial rRNA contamination is more important in the heart samples than in the lung samples (ttest significant at a 1% level with a p-value of 3 × 10 -3 ). Moreover, an intergenic contamination, not revealed in the original work, can also be noticed on Lung Adult-Rep2 with no additional work.

Discussion
In a single and simple call, ALFA proved to be useful as a post-mapping quality control tool. It was also useful to obtain a preliminary global overview of data distribution into features and functional categories prior to dedicated analysis. In this section, we emphasize the two main limitations of ALFA.

ALFA can only detect genome scale events
ALFA is a broadly applicable and easy to use genome scale quality control tool. However, it cannot be used to perform a differential analysis at the gene level and it is not able to detect effects that would impact only a small fraction of the dataset. ALFA is meant to be a relevant post mapping quality control step but cannot replace a dedicated analysis.

ALFA requires a GTF annotation file
ALFA cannot be used with organisms that have not been annotated yet. Therefore, the use of ALFA is somewhat restricted to organisms that are commonly studied by the scientific community and for which biotype annotations exist (e.g. GTF files from Ensembl). Alternatively, the user can provide a custom-built annotation file.

Conclusions
In functional genomics, a vast majority of NGS experiments have the two first data analysis steps in common: a quality control of the raw sequencing reads and a mapping to the corresponding organism reference genome. We introduced ALFA, a third step that can be easily inserted into a pipeline after mapping in order to reveal genome scale possible artefacts. ALFA produces a global overview of the distribution of the mapped reads across genomic categories and biotypes at nucleotide resolution. ALFA was designed to process data from a wide variety of NGS experiments. In order to illustrate its broad usability, we applied ALFA on 6 sequencing datasets that used 5 different methods on 5 different organisms. Those examples highlighted the effectiveness of ALFA to provide a powerful post mapping quality control step on various . This dataset originates from a study [19] where snoRNA and tRNA genetic loci were found to be enriched in the IP (SRR628901) compared to the Inputs (SRR628899 and SRR628900  ALFA biotype plots (raw and normalized) for RNA sequencing (RNA-Seq) data from a Saccharomyces cerevisiae sample (public data available on NCBI: SRA030505 -SRR927165). In this example, a customized GTF annotation file was created to highlight the flexibility of ALFA. Dedicated biotypes characterizing various Saccharomyces cerevisiae stable transcripts (SUTs [24]) and unstable transcripts (CUTs [21], NUTs [22], XUTS [23]) were converted from a BED file. and sets up the script ALFA.py along with all required dependencies. ○ Conda: ALFA can be installed using the command "conda install -c biocomp alfa" which also provides alfa as a standalone executable directly available from the command prompt ○ Galaxy: ALFA can also be installed on Galaxy easily by just searching for "ALFA" in the main Galaxy Tool Shed (https://toolshed.g2.bx.psu.edu/).

Reproducibility
All figures can be recomputed with the data specified in the "Availability of data and material" section below and the following ALFA command lines (optionally using the desired number of CPUs with the "-p" argument):  Author's contributions MB, BFN and VM conceived and implemented ALFA, LB tested ALFA, HLH and AL generated datasets and suggested typical use cases. CB wrote the Galaxy wrapper. AG, BFN and MB wrote the manuscript. All authors have read, edited and approved the manuscript.
Ethics approval and consent to participate N/A

Competing interests
The authors declare that they have no competing interests.

Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Author details 1