Deploying the Atlas2 personal genomic analysis pipeline via the Genboree Workbench (Atlas2 Genboree)
The Atlas2 Suite is a variant detection software package optimized for variant discovery in exome capture data from all three major next-generation sequencing platforms (Roche 454, Illumina and SOLiD) [10]. The suite consists of Atlas-SNP2, for calling Single Nucleotide Polymorphisms (SNPs), and Atlas-Indel2, for calling short INsertions and DELetions (INDELs) (http://www.hgsc.bcm.tmc.edu/cascade-tech-software-ti.hgsc). Both tools have been available for command-line use and have been applied to a number of large-scale projects, including the International 1000 Genomes Project [11], The Cancer Genome Atlas (TCGA) project, and follow-up resequencing in the context of disease genome-wide association studies.
The Genboree Workbench is a platform for deploying genomic tools as a service and is hosted at Baylor College of Medicine (http://www.genboree.org). The Genboree Workbench Graphical User Interface (GUI) relies extensively on Ext-JS, a JavaScript library. Tools within the Workbench communicate asynchronously, using Asynchronous JavaScript and XML (AJAX), with a REST (REpresentational State Transfer) Application Programming Interface (API) hosted on a thin server. Because the Genboree system uses a REST-style architecture for client-server communication, we were able to integrate Atlas2 within a couple of weeks. Genboree is backed by a small cluster of nodes managed by the open-source TORQUE resource manager, with jobs scheduled by Maui (developed by Adaptive Computing). Atlas2 Genboree can be accessed as a Genboree Workbench Toolset. Users from external groups with access to a web browser can 1) upload data onto the cloud, 2) run Atlas2 for variant analysis, and 3) visualize the variant calling results in genome browsers such as the Genboree Browser or the University of California, Santa Cruz (UCSC) Genome Browser [12] (http://www.genome.ucsc.edu) (Figure 1A). Atlas2 Genboree has a web interface with hierarchical click-through steps whose self-explanatory nature reduces the usage overhead. The workflow illustrated in Figure 1B shows the specific steps in running the Atlas2 Suite on the Genboree system.
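The same REST API that backs the GUI can also be called programmatically. The following is a minimal sketch in Python, assuming a hypothetical resource path and authentication parameters; the actual endpoints and authentication scheme are documented on the Genboree site.

    # Minimal sketch of a programmatic call to the Genboree REST API.
    # The resource path and authentication parameters below are hypothetical
    # placeholders, not documented Genboree endpoints.
    import requests

    HOST = "http://www.genboree.org"
    RESOURCE = "/REST/v1/grp/myGroup/db/myDatabase"   # hypothetical resource path

    # The Workbench GUI issues requests like this asynchronously via AJAX;
    # a script can issue the same request directly.
    response = requests.get(HOST + RESOURCE,
                            params={"gbLogin": "myUser", "gbToken": "myToken"})
    response.raise_for_status()
    print(response.json())   # resources are returned as JSON representations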
Genboree Data Selector
The Genboree Workbench organizes data in a hierarchical tree. Before using the Atlas2 Suite, users must define a group and create a database. Within the database are the "Files" and "Tracks" subdirectories: "Files" contains input files uploaded by the user and output files generated by Atlas2, while "Tracks" contains processed output files that can be used for visualization in the Genboree browser. This hierarchical representation is shown in a screenshot of the Genboree Workbench in Figure 2A.
Uploading data onto Atlas2 Genboree
Atlas2 Genboree accepts Binary Alignment/Map (BAM) files as input. Files are uploaded to Genboree by dragging the destination database from the Data Selector to the Output Targets box and selecting "transfer files" under the data tab in the menu. A prompt window allows users to select an input BAM file from their local computer and upload it to the cloud servers. A 24 GB BAM file took approximately one hour to upload over a 50 Mb/s connection.
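As a sanity check on that figure, transfer time scales linearly with file size and inversely with bandwidth:

    # Back-of-the-envelope estimate of the upload time quoted above.
    file_size_gb = 24        # BAM file size in gigabytes
    bandwidth_mbps = 50      # connection bandwidth in megabits per second

    file_size_megabits = file_size_gb * 1024 * 8   # gigabytes -> megabits
    seconds = file_size_megabits / float(bandwidth_mbps)
    print("estimated upload time: %.0f minutes" % (seconds / 60))
    # prints ~66 minutes, consistent with the observed ~1 hour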
Variant calling
The Atlas2 Suite is run by assigning the desired input and output and selecting the appropriate tool (Figure 2A). Atlas2 Genboree allows users to specify parameter cutoffs in the job parameter-setting window (Figure 2B), where one can choose among the three sequencing platforms and tune platform-specific parameters.
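For users who prefer the command line, an equivalent job can be launched outside the Workbench. The sketch below wraps an illustrative Atlas-SNP2 invocation in Python; the flag names are assumptions based on typical Atlas-SNP2 usage and should be checked against the Atlas2 documentation.

    # Illustrative command-line equivalent of the Workbench job above.
    # The flags (-i, -r, -o, --Illumina) are assumptions; consult the
    # Atlas2 documentation for the exact option names.
    import subprocess

    subprocess.check_call([
        "ruby", "Atlas-SNP2.rb",
        "-i", "sample.sorted.bam",   # input BAM alignment
        "-r", "hg19.fa",             # reference FASTA
        "-o", "sample.snp",          # output file
        "--Illumina",                # platform model (454 / Illumina / SOLiD)
    ])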
The tool produces two output files, an LFF file and a Variant Call Format (VCF) [13] file, which are stored under the "Files" section of the database specified in the Output Targets box. The LFF format, used to store variants and annotations, is adapted from the LDAS upload format (http://www.genboree.org/java-bin/showHelp.jsp?topic=lffFileFormat). Either file can be downloaded by selecting it and clicking the download file option.
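Because VCF is a tab-delimited text format, the Atlas2 output can be consumed directly by downstream scripts. A minimal sketch of reading the eight fixed VCF columns (the file name is a placeholder):

    # Minimal reader for the fixed columns of a VCF file produced by Atlas2.
    # "sample.vcf" is a placeholder file name.
    with open("sample.vcf") as vcf:
        for line in vcf:
            if line.startswith("#"):       # skip meta-information and header
                continue
            fields = line.rstrip("\n").split("\t")
            chrom, pos, vid, ref, alt, qual, filt, info = fields[:8]
            if filt in ("PASS", "."):      # keep variants passing the filters
                print(chrom, pos, ref, alt, qual)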
The Genboree system allows integration with third-party tools
Cloud deployment can produce integration "silos", in which extending analysis pipelines or adding analysis steps beyond those offered as a service is hard to accomplish. To overcome this problem, the Genboree system provides application programming interfaces for programmatic access to all data and tools. Data are also accessible in formats that can be readily fed into a variety of ancillary tools. These interfaces and data-format compatibilities enable mixing and matching of the tools required for specific steps, such as visualization in genome browsers including the UCSC Genome Browser, invocation of pipelines such as Galaxy [14], and integration with custom or third-party variant analysis and annotation tools such as ANNOVAR [15]. As described next, we successfully tested all three types of integration.
Visualizing variants with genome browsers
Genboree browser
The variant calls can be readily viewed in the Genboree genome browser. Within the browser, variants are visualized by selecting the appropriate database; the Genboree browser supports viewing variants from multiple samples simultaneously.
UCSC genome browser
The variants called by Atlas2 Genboree can be exported directly to the UCSC Genome Browser [16] for further viewing, annotation and analysis. Variants are exported by converting the variants file into a BigBed format file (http://genome.ucsc.edu/goldenPath/help/bigBed.html) via the cloud file-conversion functionality.
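The same conversion can be reproduced locally with UCSC's bedToBigBed utility, assuming the variants have first been exported in BED format; the file names below are placeholders.

    # Local equivalent of the BED -> BigBed conversion using UCSC's
    # bedToBigBed utility; file names are placeholders.  bedToBigBed
    # requires the BED file to be sorted by chromosome and start position.
    import subprocess

    subprocess.check_call([
        "bedToBigBed",
        "variants.bed",       # variants exported in BED format
        "hg19.chrom.sizes",   # chromosome sizes for the target assembly
        "variants.bb",        # BigBed output, ready for a UCSC custom track
    ])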
Integration with Galaxy
As an initial trial, we uploaded the raw VCF file downloaded from Genboree, without post-processing, onto Galaxy and converted it into a multiple alignment format (MAF) custom track using the VCF-to-MAF custom track function under Graph/Display data.
Post-processing with third party variant annotation tools
The VCF file downloaded from Genboree was annotated and filtered using ANNOVAR. ANNOVAR categorizes variants as intronic, exonic, splicing, non-coding RNA, 5' untranslated region, 3' untranslated region, upstream, downstream or intergenic. Exonic variants are further categorized into synonymous, nonsynonymous, stop-gain (gain of stop function), stop-loss (loss of stop function), and frameshift or non-frameshift changes caused by insertions, deletions or block substitutions. ANNOVAR can also be used to filter out variants found in dbSNP.
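A minimal sketch of this post-processing step, using ANNOVAR's standard conversion, gene-annotation and dbSNP-filtering commands (file names, genome build and dbSNP version are placeholders):

    # Sketch of annotating and filtering a Genboree-produced VCF with ANNOVAR.
    # File names, genome build and dbSNP version are placeholders; the
    # corresponding humandb/ databases must be downloaded beforehand.
    import subprocess

    # 1) Convert the VCF into ANNOVAR's input format.
    with open("sample.avinput", "w") as out:
        subprocess.check_call(
            ["convert2annovar.pl", "-format", "vcf4", "sample.vcf"], stdout=out)

    # 2) Gene-based annotation (intronic, exonic, splicing, UTR, ...).
    subprocess.check_call(
        ["annotate_variation.pl", "-geneanno", "-buildver", "hg19",
         "sample.avinput", "humandb/"])

    # 3) Filter out variants already present in dbSNP.
    subprocess.check_call(
        ["annotate_variation.pl", "-filter", "-dbtype", "snp132",
         "-buildver", "hg19", "sample.avinput", "humandb/"])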
Enabling the Atlas2 personal genomic analysis pipeline via Amazon Web Services (Atlas2 Amazon)
Amazon Web Services (AWS) provides virtualized computational infrastructure on demand. AWS can be tailored to provide scalable and flexible solutions for application hosting, web applications and high-performance computing. We used the Amazon Elastic Compute Cloud (EC2) and the Amazon Simple Storage Service (S3) to enable Atlas2 on Amazon. EC2 allows users to lease a wide variety of instance types, which differ in the number of compute cores and the amount of memory (http://aws.amazon.com/ec2/#instance). Amazon S3 is a persistent data-storage service designed to be highly scalable with low latency. Both the EC2 and S3 services can be managed from the AWS Management Console.
Our Atlas2 cloud pipeline on AWS was designed from the ground up specifically for personal genome analysis. The web user interface, developed in Java using the Spring Framework, provides access to the Atlas2 Suite on the machine image; this user-friendly interface can be used to submit jobs and monitor worker nodes (Figure 3). The application runs on Apache Tomcat (version 5.5.35) and can be accessed through port 8080. The backend code on the Atlas2 machine image is optimized to analyze data efficiently and to ease the addition of new tools to the pipeline in the context of genome analysis. The backend is written in Python (version 2.7.2); Fabric (version 1.4.1), the Amazon EC2 API Tools (version 1.4.3 2011-05-15) and s3cmd (version 1.0.0) are integral parts of it. Fabric is used to execute commands on the worker nodes, the Amazon EC2 API Tools to start, terminate and monitor the status of worker instances, and s3cmd to interact with S3. Figure 3 provides an overview of the Atlas2 Amazon pipeline.
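As a simplified illustration of this division of labor, the sketch below uses Fabric's run() to stage data with s3cmd and execute an analysis command on a worker node; the host name, bucket names, file paths and Atlas-SNP2 flags are hypothetical placeholders.

    # Simplified sketch of the backend's remote-execution pattern (Fabric 1.x).
    # Host name, bucket names, file paths and analysis flags are placeholders.
    from fabric.api import env, run

    # Point Fabric at a worker node (the backend discovers these hosts via
    # the Amazon EC2 API Tools).
    env.host_string = "ec2-user@ec2-xx-xx-xx-xx.compute-1.amazonaws.com"
    env.key_filename = "my-keypair.pem"

    # Stage the input BAM from S3 onto the worker using s3cmd, run the
    # analysis, and push the results back to the output bucket; all three
    # commands execute on the worker node, not the head node.
    run("s3cmd get s3://my-input-bucket/sample.bam /data/sample.bam")
    run("ruby Atlas-SNP2.rb -i /data/sample.bam -r /data/hg19.fa -o /data/out.snp")
    run("s3cmd put /data/out.snp s3://my-output-bucket/")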
To access the Atlas2 cloud pipeline, the user first registers for an AWS account at http://aws.amazon.com/ and signs up for the EC2 and S3 services. The user then starts an instance from the public Atlas2 machine image (ami-2ee23847), which can be found under the community Amazon Machine Image (AMI) tab. Before starting the instance, the user needs to modify the Security Groups settings to allow HTTP access on port 8080; the web page can then be reached via the instance's public DNS, which is listed in the AWS Management Console. Because the master instance acts only as a portal for accessing and monitoring jobs running on AWS, it can be a "t1.micro" instance. The advantage of a "t1.micro" instance is its low cost: at the time of writing, every newly registered AWS user receives 750 hours of free runtime per month for a year; afterwards the rate is $0.02/hour.
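The security-group change and instance launch can also be performed with the Amazon EC2 API Tools from a local machine, as sketched below (the security-group and key-pair names are placeholders):

    # Command-line equivalent of the console setup steps described above,
    # using the Amazon EC2 API Tools; group and key-pair names are placeholders.
    import subprocess

    # Open port 8080 in the security group so the web UI is reachable.
    subprocess.check_call(
        ["ec2-authorize", "default", "-P", "tcp", "-p", "8080"])

    # Launch the master instance from the public Atlas2 AMI as a t1.micro.
    subprocess.check_call(
        ["ec2-run-instances", "ami-2ee23847", "-t", "t1.micro", "-k", "my-keypair"])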
Once the web page is reachable, the user must create an account before accessing the pipeline. Because the pipeline maintains accounts, a single instance can support multiple users, and users do not have to enter their AWS credentials each time they submit a job; the credentials are needed to start additional instances and to access user data on S3. To submit a job, users must provide the name of the S3 folder containing the input files and the reference FASTA needed for analysis, the maximum number of parallel EC2 instances to run, a file listing the input files to be processed, the reference file name, the sequencing platform, and the analysis to be performed. Figure 4A shows a screenshot of the job submission page. Currently Atlas2 Amazon expects the user to upload the data onto S3, which can be done from the S3 tab of the AWS Management Console. Alternatively, users may take advantage of the AWS Import/Export option, whereby the user ships a portable storage device to Amazon, which securely transfers the data onto S3. This option is extremely useful for uploading large amounts of data to S3 when network bandwidth is a bottleneck.
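The console upload is point-and-click; the equivalent with s3cmd, which the backend itself uses, is sketched below (bucket and file names are placeholders, and s3cmd --configure must be run once beforehand to store the AWS credentials):

    # Upload input data to S3 with s3cmd; bucket and file names are placeholders.
    import subprocess

    subprocess.check_call(["s3cmd", "mb", "s3://my-input-bucket"])   # create bucket
    subprocess.check_call(["s3cmd", "put", "sample.bam", "s3://my-input-bucket/"])
    subprocess.check_call(["s3cmd", "put", "hg19.fa", "s3://my-input-bucket/"])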
The monitoring page shows information about each worker node: each row represents one worker instance and lists its instance ID, public DNS, node status, job status, start time, end time, input bucket and output bucket. The instance ID and public DNS can be used to access the worker node. Node status has two possible states, indicating whether the worker node is running or has been terminated. Job status has five possible states: "started" means the worker node has been instantiated; "setup" means the head node is preparing the worker instance; "downloading reference" means the worker instance is downloading the reference FASTA file; "variant calling" means the worker node is running the analysis; and "termination" means there are no more jobs in the queue and the worker node is being terminated. Figure 4B shows a screenshot of the monitoring page.
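For reference, the node and job states reported on the monitoring page can be summarized as follows:

    # The node and job states shown on the monitoring page, as described above.
    NODE_STATES = ("running", "terminated")
    JOB_STATES = {
        "started": "worker node has been instantiated",
        "setup": "head node is preparing the worker instance",
        "downloading reference": "worker is fetching the reference FASTA file",
        "variant calling": "worker is running the analysis",
        "termination": "queue is empty; worker node is being terminated",
    }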