NG6: Integrated next generation sequencing storage and processing environment
© Mariette et al.; licensee BioMed Central Ltd. 2012
Received: 25 June 2012
Accepted: 30 August 2012
Published: 9 September 2012
Next generation sequencing platforms are now well established in sequencing centres and some laboratories. Upcoming smaller-scale machines such as the 454 Junior from Roche or the MiSeq from Illumina will increase the number of laboratories hosting a sequencer. In such a context, it is important to provide these teams with an easily manageable environment to store and process the produced reads.
We describe a user-friendly information system able to manage large sets of sequencing data. It includes, on the one hand, a workflow environment already containing pipelines adapted to different input formats (sff, fasta, fastq and qseq), different sequencers (Roche 454, Illumina HiSeq) and various analyses (quality control, assembly, alignment, diversity studies, etc.) and, on the other hand, a secured web site giving access to the results. The connected user can download raw and processed data and browse through the analysis result statistics. The provided workflows can easily be modified or extended and new ones can be added. Ergatis is used as a workflow building, running and monitoring system. The analyses can be run locally or in a cluster environment using Sun Grid Engine.
NG6 is a complete information system designed to answer the needs of a sequencing platform. It provides a user-friendly interface to process, store and download high-throughput sequencing data.
Sequencer manufacturers pursue different objectives with different platforms. First, they release upgrades of second generation platforms that produce more data through updated hardware and sequencing kits. This lowers the sequencing cost per base pair but often focuses these machines on medium or large projects. Second, they introduce new laboratory-scale platforms such as the Illumina MiSeq or the Roche Junior, which target smaller projects. Last, they work on third generation machines which will not depend on amplified material and will therefore avoid some biases. The first two machine types, which are already on the market, come with a wider range of sequencing protocols that enable new studies, and they push towards more sequencing projects and more users.
Once the sequencing is done, the largest part of the work and the longest time period of the project are dedicated to data analysis. Therefore it is important to provide the new smaller production units and the laboratories in which the projects are conducted with efficient and user-friendly processing environments, enabling quality control and routine analysis. These pieces of software should have several features such as access control, metadata storage on the produced reads, quality control including known bias verification and standard analysis. NG6 was developed to match these goals and to be as flexible as possible, in order to follow sequencing technology upgrades.
Laboratory information management systems (LIMS) are often focused on the traceability of the biological material. Some of them, such as PIMS or even SLIMS, have included extensions to monitor the sequencing process. However, few of the open-source LIMS also provide a data processing environment. This feature is present in the Galaxy sample tracking module. It is based on the Galaxy workflow engine and provides users with an interface to create and track sequencing requests. Once the sequences have been produced, users can transfer their data files, then build and run workflows to process them.
NG6 is an extensible, sequencing provider oriented LIMS. It includes read quality control and first level analysis processes which ease the data validation made jointly by the sequencing facility staff and the end-users. It provides a secured user-friendly interface to visualize and download the raw sequence files and the analysis results.
NG6 uses three data types: project, run and analysis. A project is a collection of runs and analyses. A run contains one or several raw files which can be used as inputs of different analyses. A project is owned by a user group and only users within this group are allowed to browse and download data related to this project.
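The three data types and the group-based access rule described above could be sketched as follows. This is only an illustration: the class and field names (`Project`, `Run`, `Analysis`, `owner_group`, `can_access`) are hypothetical and are not taken from the NG6 code or database schema.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Analysis:
    """One processing result, e.g. a quality control or an assembly."""
    name: str


@dataclass
class Run:
    """A sequencing run holding raw files and the analyses that used them."""
    name: str
    raw_files: List[str] = field(default_factory=list)
    analyses: List[Analysis] = field(default_factory=list)


@dataclass
class Project:
    """A collection of runs and analyses owned by a single user group."""
    name: str
    owner_group: str
    runs: List[Run] = field(default_factory=list)
    analyses: List[Analysis] = field(default_factory=list)


def can_access(project: Project, user_groups: Set[str]) -> bool:
    """Only members of the owning group may browse or download the data."""
    return project.owner_group in user_groups


# Hypothetical example data.
run = Run("run_01", raw_files=["region1.sff"], analyses=[Analysis("read_stats")])
project = Project("demo", owner_group="team_a", runs=[run])
```

With this sketch, `can_access(project, {"team_a"})` is true while `can_access(project, {"team_b"})` is false.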
Building and running pipelines
Pipelines are defined by a set of connected Ergatis components. Depending on the links between the components, they are processed in parallel or serially. Most components available in NG6 combine a processing step and a storage step. The storage step writes the resulting files into the appropriate directory structure and saves information such as software version, parameters, links between analyses and resulting figures into the database.
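How links between components determine serial or parallel processing can be illustrated with a small dependency graph. The sketch below does not use Ergatis; it relies on Python's standard `graphlib` module, and the component names and their links are hypothetical. Components returned in the same "wave" have no links between them and could run in parallel, while successive waves must run serially.

```python
from graphlib import TopologicalSorter


def execution_waves(dependencies):
    """Group components into waves.

    `dependencies` maps each component to the set of components it waits for.
    Components in the same wave do not depend on each other.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        waves.append(ready)
        sorter.done(*ready)
    return waves


# Hypothetical pipeline: two independent checks, then cleaning, then assembly.
pipeline = {
    "read_stats": set(),
    "contamination_search": set(),
    "read_cleaning": {"read_stats", "contamination_search"},
    "assembly": {"read_cleaning"},
}

waves = execution_waves(pipeline)
```

Here `waves` holds three groups: the two independent checks first, then `read_cleaning`, then `assembly`.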
In the current version, NG6 offers a set of pipelines adapted to two platforms (Roche 454, Illumina HiSeq) and four file formats (sff, fastq, fasta and qseq), and handles both casava 1.7 and casava 1.8 outputs of the Illumina package. It includes analyses such as quality control, genomic read alignment, BAC assembly, 16S/18S diversity analysis, expression quantification using 16S amplicons. In order to handle multiplexed runs, some pipelines first split the input read file into sample files, then process each of them and collect the results, and finally merge these results in a summary table.
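The split, process and merge pattern used for multiplexed runs can be sketched as below. This is a simplified illustration and not the NG6 implementation: reads are plain `(id, sequence)` tuples, samples are identified by an exact barcode prefix, and the per-sample statistics are limited to read count and mean length. All function names are hypothetical.

```python
from collections import defaultdict


def split_by_barcode(reads, barcodes):
    """Assign each read to a sample using an exact barcode prefix match.

    `barcodes` maps a barcode sequence to a sample name. The barcode is
    trimmed from assigned reads; other reads go to "unassigned".
    """
    samples = defaultdict(list)
    for read_id, seq in reads:
        for barcode, sample in barcodes.items():
            if seq.startswith(barcode):
                samples[sample].append((read_id, seq[len(barcode):]))
                break
        else:
            samples["unassigned"].append((read_id, seq))
    return samples


def sample_stats(reads):
    """Process one sample: count its reads and compute their mean length."""
    lengths = [len(seq) for _, seq in reads]
    mean = sum(lengths) / len(lengths) if lengths else 0.0
    return {"reads": len(lengths), "mean_length": mean}


def summary_table(reads, barcodes):
    """Split, process each sample, then merge the results in one table."""
    samples = split_by_barcode(reads, barcodes)
    return {name: sample_stats(sample_reads)
            for name, sample_reads in sorted(samples.items())}


reads = [("r1", "ACGTTTTT"), ("r2", "ACGTGG"), ("r3", "TGCAAAA"), ("r4", "GGGG")]
table = summary_table(reads, {"ACGT": "s1", "TGCA": "s2"})
```

In a real pipeline each sample would be written to its own file and processed by a separate job; the in-memory version only shows the data flow.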
As an example, the 454_default pipeline processes sff files coming from the Roche sequencer. It first performs the usual statistical analysis on the reads, then searches for contamination against common contaminant databases (E. coli, yeast and phage) using blast, which returns a list of contaminated sequence IDs. Contamination between the different regions is also traced using the sfffile script included in the Roche Newbler package. Sequences with an incorrect MID (Multiplexed ID) are discarded and the number of contaminated sequences is returned to the end-user. Roche 454 sequencing kits include control fragments known as spike-ins within each run. Statistics on the corresponding sequences are used to check whether the run matches the expected quality standard. In the next step, reads are cleaned using the pyrocleaner script. It discards reads according to different criteria such as length, base quality, complexity, number of undetermined bases, multiple copy reads or even faulty paired-ends. The analysis results are presented to the users in a summary table. Finally, a de novo assembly is performed on the cleaned reads using the Newbler runAssembly command. Some basic figures regarding the assembly results, such as contig count, N50 value, contig length distribution or even a diagram of contig length versus sum of read lengths per contig, are presented to the user in order to ease the assembly quality assessment.
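Two of the assembly figures mentioned above, contig count and N50, are easy to compute from a list of contig lengths. The sketch below uses the common definition of N50 (the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the assembled bases). It is an illustration only; the way NG6 or Newbler computes these values is not described here, and the function names are hypothetical.

```python
def n50(contig_lengths):
    """Return the N50 of a list of contig lengths (0 for an empty list)."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        # Stop at the first contig that brings the cumulative sum to half.
        if running * 2 >= total:
            return length
    return 0


def assembly_summary(contig_lengths):
    """Collect basic figures to help assess an assembly."""
    return {
        "contig_count": len(contig_lengths),
        "total_length": sum(contig_lengths),
        "n50": n50(contig_lengths),
    }


summary = assembly_summary([2, 3, 4, 5, 6, 7, 8, 9, 10])
```

For this example the total length is 54, and the three longest contigs (10, 9 and 8) reach 27, exactly half of the total, so the N50 is 8.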
When the pipeline execution is over, all analyses and runs newly added to the system are flagged as hidden. This allows the team in charge of the sequencer to validate the run before the data is released to the end-user.
The analyses provided in NG6 have been designed to limit the disk space used and the number of temporary files. As an example, the bwa alignment of Illumina reads against a reference genome chains bwa and samtools using Unix pipes.
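Chaining two tools through a pipe avoids writing the intermediate alignment file to disk. The sketch below shows one way to do this from Python with the standard `subprocess` module. The exact bwa and samtools subcommands used by NG6 are not given in the text, so the `bwa mem` and `samtools sort` calls in `alignment_commands` are an assumption made for illustration, and both helper names are hypothetical. The demonstration at the end uses stand-in commands so that it runs without either tool installed.

```python
import subprocess
import sys


def run_piped(commands):
    """Run commands as a pipeline and return the last command's stdout."""
    procs = []
    previous_stdout = None
    for cmd in commands:
        proc = subprocess.Popen(cmd, stdin=previous_stdout,
                                stdout=subprocess.PIPE)
        if previous_stdout is not None:
            # Close our copy so the upstream process sees a broken pipe
            # if the downstream process exits early.
            previous_stdout.close()
        previous_stdout = proc.stdout
        procs.append(proc)
    output, _ = procs[-1].communicate()
    for proc in procs[:-1]:
        proc.wait()
    for cmd, proc in zip(commands, procs):
        if proc.returncode != 0:
            raise subprocess.CalledProcessError(proc.returncode, cmd)
    return output


def alignment_commands(reference, reads, out_bam):
    """Assumed command chain: align with bwa, sort to BAM with samtools."""
    return [
        ["bwa", "mem", reference, reads],
        ["samtools", "sort", "-o", out_bam, "-"],
    ]


# Stand-in commands: a producer and a consumer connected by a pipe.
producer = [sys.executable, "-c", "print('acgt')"]
consumer = [sys.executable, "-c",
            "import sys; sys.stdout.write(sys.stdin.read().upper())"]
demo_output = run_piped([producer, consumer])
```

With bwa and samtools installed, `run_piped(alignment_commands("ref.fa", "reads.fastq", "out.bam"))` would produce a sorted BAM file without an intermediate SAM file on disk.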
A cluster environment often has a local, optimized file system. NG6 moves files from the cluster file system to the storage file system using the ng6synchronization component. Until synchronization is completed, a warning message is displayed to inform the end-user.
Browsing and downloading results
As a typo3 plug-in, NG6 can easily be included in any web site built with this CMS. The NG6 plug-in is compliant with the national language support system of typo3. Configuring the system for a new language consists only of translating and adding the corresponding language files. So far, only English and French are supported.
Right accesses and administration
Users and data right management
Data right level
A published project is openly accessible on the web site. For example, our demonstration project can be accessed using the following link: http://ng6.toulouse.inra.fr/index.php?id=3. This feature provides biologists with a fast and easy way to make their data accessible to their community.
Adding new analysis
Results and discussion
NG6 has been in production since September 2009 at the genomic platform of GenoToul and stores more than 950 runs corresponding to 96 projects and occupying 5 TB of disk space. The system stores Illumina and Roche 454 runs produced by different sequencer versions. Pipelines have been configured and launched by the genomic platform staff for one year.
Assessing the quality of the produced reads is an important task for a sequencing center. Making it automatic saves a lot of time. Displaying the analysis results within a user-friendly interface eases the discussions with the end-users.
Other read analysis environments are available to biologists. The most popular today is Galaxy. We have chosen to implement our own system because Galaxy and NG6 have different aims and focus on different users. Galaxy aims at simplifying data processing for researchers. It includes modules processing sequencing data. NG6 is a sequencing provider focused LIMS gathering specialized pipelines and a web site.
NG6 is an information system providing a set of automated analysis pipelines built to process NGS (Next Generation Sequencing) data which can be executed locally or in a cluster environment. It is built upon well documented and extensively used components such as Ergatis and typo3. The current version of NG6 offers several pipelines, and others are under construction: RNAseq using tophat and cufflinks, and miRNA expression analysis.
Availability and requirements
The NG6 code is freely available on the web. To ease the installation, the package and all its dependencies are also available as a virtual machine. Installing and maintaining the system requires expertise in Linux system administration. The project is hosted in a forge environment in order to open it to the developer community.
· Project name: ng6
· Operating system(s): Platform independent
· Programming language: Python/PHP
· Other requirements: VMware or VirtualBox
· License: GNU GPL
· Any restrictions to use by non-academics: none
We would like to acknowledge the GenoToul genomic platform and the CBiB platform of Bordeaux for providing us with useful feedback on the system and for pointing out features worth developing. We thank the reviewers for their insightful and constructive comments.
- Glenn TC: Field guide to next-generation DNA sequencers. Mol Ecol Resour. 2011, 11: 759-769. 10.1111/j.1755-0998.2011.03024.x.
- Troshin PV, Vincent LG P, Denise A, Baldwin SA, McPherson MJ, Barton GJ: PIMS sequencing extension: a laboratory information management system for DNA sequencing facilities. BMC Research Notes. 2011, 4: 48. 10.1186/1756-0500-4-48.
- Van Rossum T, Tripp B, Daley D: SLIMS—a user-friendly sample operations and inventory management system for genotyping labs. Bioinformatics. 2010, 26 (14): 1808-1810. 10.1093/bioinformatics/btq271.
- Giardine B, Riemer C, Hardison RC, Burhans R, Elnitski L, Shah P, Shang Y, Blankenberg D, Albert I, Taylor J, Miller W, Kent WJ, Nekrutenko A: Galaxy: A platform for interactive large-scale genome analysis. Genome Res. 2005, 15: 1451-1455. 10.1101/gr.4086505.
- Orvis J, et al: Ergatis: a web interface and scalable software system for bioinformatics workflows. Bioinformatics. 2010, 10.1093/bioinformatics/btq167.
- Typo3 web site: http://typo3.org/
- Illumina web site: http://www.illumina.com/
- Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J Mol Biol. 1990, 215 (3): 403-410.
- Roche 454 web site: http://www.my454.com/
- Mariette J, Noirot C, Klopp C: Assessment of replicate bias in 454 pyrosequencing and a multi-purpose read-filtering tool. BMC Research Notes. 2011, 4: 149. 10.1186/1756-0500-4-149.
- Cutadapt web site: http://code.google.com/p/cutadapt/
- Li H, Durbin R: Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics. 2009, 25: 1754-1760. 10.1093/bioinformatics/btp324.
- Li H, Durbin R: Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics. 2010, 26 (5): 589-595. 10.1093/bioinformatics/btp698. [PMID: 20080505]
- Fastqc web site: http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/
- Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R, 1000 Genome Project Data Processing Subgroup: The Sequence alignment/map (SAM) format and SAMtools. Bioinformatics. 2009, 25: 2078-2079. 10.1093/bioinformatics/btp352. [PMID: 19505943]
- Schloss PD, et al: Introducing mothur: Open-source, platform-independent, community-supported software for describing and comparing microbial communities. Appl Environ Microbiol. 2009, 75 (23): 7537-7541. 10.1128/AEM.01541-09.
- Sff_extract web site: http://bioinf.comav.upv.es/sff_extract/
- Smarty template engine web site: http://www.smarty.net/
- Jquery web site: http://jquery.com/
- GenoToul web site: http://get.genotoul.fr/
- Trapnell C, Pachter L, Salzberg SL: TopHat: discovering splice junctions with RNA-Seq. Bioinformatics. 2009, 25 (9): 1105-1111.
- Roberts A, Pimentel H, Trapnell C, Pachter L: Identification of novel transcripts in annotated genomes using RNA-Seq. Bioinformatics. 2011, 10.1093/bioinformatics/btr355.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.