Volume 13 Supplement 6
Atlas2 Cloud: a framework for personal genome analysis in the cloud
© Evani et al.; licensee BioMed Central Ltd. 2012
Published: 26 October 2012
Until recently, sequencing has primarily been carried out in large genome centers, which have invested heavily in developing the computational infrastructure that enables genomic sequence analysis. Recent advances in next generation sequencing (NGS) have led to a wide dissemination of sequencing technologies and data to highly diverse research groups. Clinical sequencing is expected to become part of diagnostic routines shortly. However, limited access to computational infrastructure and high quality bioinformatic tools, together with the demand for personnel skilled in data analysis and interpretation, remains a serious bottleneck. Cloud computing and Software-as-a-Service (SaaS) technologies can help address these issues.
We successfully enabled the Atlas2 Cloud pipeline for personal genome analysis on two different cloud service platforms: a community cloud via the Genboree Workbench, and a commercial cloud via Amazon Web Services using a Software-as-a-Service model. We report a case study of personal genome analysis using our Atlas2 Genboree pipeline. We also outline a detailed cost structure for running Atlas2 Amazon on whole exome capture data, providing cost projections in terms of storage, compute and I/O when running Atlas2 Amazon on a large data set.
We find that providing a web interface and an optimized pipeline clearly facilitates usage of cloud computing for personal genome analysis, but for it to be routinely used for large scale projects there needs to be a paradigm shift in the way we develop tools, in standard operating procedures, and in funding mechanisms.
The revolutionary development of massively parallel DNA sequencing has enabled identification of biomedically relevant genomic variants via whole genome and exome resequencing. Information relevant for personalized medicine, such as assessment of longitudinal disease risks and personalized treatment, is now within reach.
In a few very recent personal genomic studies, results have directly led to targeted treatment and dramatic improvement in the patient's quality of life. These examples are paving the way for genomic sequencing to become a routine diagnostic procedure and to enable personalized medicine.
Currently, analysis of sequencing data on a genomic scale requires bioinformatic expertise and access to extensive computational resources, presenting a significant barrier. Most cutting-edge genome analysis applications [5, 6] are still limited to a command line interface and require at least moderate informatics expertise to operate. In addition, large scale genomic data analysis requires routine access to a high performance compute cluster. Such requirements are entirely unsuitable for the operational models of smaller research/diagnostic laboratories due to the excessive investment requirements in computing infrastructure and personnel.
Deploying genomic analysis as Software as a Service (SaaS) within a cloud computing framework offers a solution to these problems. The concept behind cloud computing is to outsource computation to third-party servers or clusters at a remote location. This allows small laboratories to take advantage of external computational resources without having to maintain an in-house compute cluster. The software as a service model removes the upfront investment and any delays associated with building local computing infrastructure. Earlier solutions such as CloudBurst and Crossbow have attempted to tackle the very specific problem of mapping short read data and assembling large genomes using the scalability offered by the map-reduce framework deployed on top of a compute cluster. While this is useful, users still need considerable bioinformatics skill and familiarity with cluster infrastructure to undertake such an analysis. Other solutions such as CloudMan from the Galaxy Project provide a user interface and remove the need for users to have informatics experience, but are not specifically designed for personal genome analysis.
To this end, we integrated our variant analysis pipeline, the Atlas2 Suite, onto a "local cloud" using the Genboree Workbench (http://www.genboree.org) and onto a "commercial cloud" via Amazon Web Services (http://aws.amazon.com). We performed a case study using the Atlas2 Genboree pipeline as a proof of concept to demonstrate the potential of personal genome analysis on the cloud. We also processed two whole exome capture samples using our Atlas2 Amazon pipeline to outline the cost of running the analysis on Amazon. Our cloud analysis pipeline on Genboree has a web browser-based drag and drop interface, which allows users to interact with the software through their browser at any location and makes it practical for non-bioinformaticians to use the software. The cloud pipeline is actively maintained by our team, which also removes the need for users to update the software.
Deploying the Atlas2 personal genomic analysis pipeline via the Genboree Workbench (Atlas2 Genboree)
The Atlas2 Suite is a variant detection software package optimized for variant discovery in exome capture data on all three next generation sequencing platforms (Roche 454, Illumina and SOLiD). The suite consists of Atlas-SNP2 for calling Single Nucleotide Polymorphisms (SNPs) and Atlas-Indel2 for calling short INsertions and DELetions (INDELs) (http://www.hgsc.bcm.tmc.edu/cascade-tech-software-ti.hgsc). These tools have been available for command line usage and have been applied to a number of large scale projects, including the International 1000 Genomes Project, The Cancer Genome Atlas Project (TCGA), and follow-up resequencing in the context of disease genome wide association studies.
Genboree Data Selector
Uploading data onto Atlas2 Genboree
Atlas2 Genboree accepts Binary sequence Alignment/Mapping format (BAM) files as input. Files are uploaded onto Genboree by dragging the destination database from the Data Selector to the Output Targets box and selecting "transfer files" under the data tab in the menu. A prompt window allows users to select an input BAM file from their local computer and upload it to the cloud servers. A 24 GB BAM file took approximately one hour to upload on a 50 Mb/sec bandwidth connection.
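The upload figure above can be sanity-checked with simple arithmetic. The following is a minimal sketch, assuming that "Mb/sec" means megabits per second, that decimal units apply (1 GB = 1000 MB), and that protocol overhead is ignored; the function name is ours and is not part of Genboree.

```python
def upload_time_hours(file_size_gb: float, bandwidth_mbit_per_s: float) -> float:
    """Ideal transfer time in hours, ignoring protocol overhead and congestion."""
    if bandwidth_mbit_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    # Assumption: decimal units, so 1 GB = 1000 MB = 8000 megabits.
    size_megabits = file_size_gb * 1000 * 8
    seconds = size_megabits / bandwidth_mbit_per_s
    return seconds / 3600


if __name__ == "__main__":
    # 24 GB over a 50 Mb/sec link, the case reported in the text.
    print(f"{upload_time_hours(24, 50):.2f} h")  # prints "1.07 h"
```

Under these assumptions the ideal transfer time is slightly over one hour, which is consistent with the approximate upload time reported above.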
The Atlas2 Suite may be run by simply assigning the desired input and output and selecting the appropriate tool (Figure 2A). Atlas2 Genboree allows users to specify parameter cutoffs in the job parameter-setting window (Figure 2B). Here one can choose from the three different sequencing platforms and tune the parameters.
The tool produces two output files, an LFF file and a Variant Call Format (VCF) file, which are stored under the files section of the database specified in the output target box. The LFF format is adapted from the LDAS upload format used to store variants and annotations (http://www.genboree.org/java-bin/showHelp.jsp?topic=lffFileFormat). Both files can be downloaded by selecting the specific file and clicking on the download file option.
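For readers who want to inspect a downloaded VCF file programmatically, the following is a minimal sketch of a reader for the fixed, tab-separated VCF columns. It is not part of the Atlas2 Suite; the names `Variant`, `parse_vcf` and `is_indel` are illustrative, and symbolic alleles such as `<DEL>` are not handled.

```python
from typing import Iterable, Iterator, NamedTuple


class Variant(NamedTuple):
    chrom: str
    pos: int      # 1-based position, as defined by the VCF specification
    ref: str
    alts: tuple   # one or more alternate alleles
    qual: str     # kept as text because "." (missing) is allowed


def parse_vcf(lines: Iterable[str]) -> Iterator[Variant]:
    """Yield one Variant per data line, skipping headers and blank lines."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 8:
            raise ValueError(f"expected at least 8 columns, got {len(fields)}")
        chrom, pos, _id, ref, alt, qual = fields[:6]
        yield Variant(chrom, int(pos), ref, tuple(alt.split(",")), qual)


def is_indel(variant: Variant) -> bool:
    """True if any alternate allele differs in length from the reference."""
    return any(len(a) != len(variant.ref) for a in variant.alts)


if __name__ == "__main__":
    example = [
        "##fileformat=VCFv4.1\n",
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
        "chr2\t100\t.\tA\tG\t50\tPASS\t.\n",
        "chr19\t200\t.\tAT\tA\t.\tPASS\t.\n",
    ]
    for v in parse_vcf(example):
        print(v.chrom, v.pos, "INDEL" if is_indel(v) else "SNP")
```

The example records are made up for illustration and do not come from the case study.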
Genboree system allows integration with third party tools
Cloud deployment may produce "silos" of integration, where extending analysis pipelines and adding analysis steps beyond those offered as a service may be hard to accomplish. To overcome this problem, the Genboree system provides application programming interfaces for programmatic access to all the data and tools. Data is also accessible in formats that can be readily fed into a variety of ancillary tools. The interfaces and data format compatibilities enable mixing and matching of the tools required in specific steps, such as visualization in various genome browsers including the UCSC genome browser, invocation of pipelines such as Galaxy, and integration with custom or third-party variant analysis and annotation tools such as ANNOVAR. As described next, we successfully tested all three types of integration.
Visualizing variants with genome browsers
The variant calls can be readily viewed in the Genboree genome browser. Once in the browser, variants can be visualized by selecting the appropriate database. The Genboree browser supports viewing variants from multiple samples simultaneously.
UCSC genome browser
The variants called by Atlas2 Genboree can be directly exported to the UCSC genome browser for further viewing, annotation and analysis. The variants are exported by converting our variants file into a BigBed format file (http://genome.ucsc.edu/goldenPath/help/bigBed.html) via the cloud file conversion functionality.
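BigBed files are built from sorted BED files, so a conversion of this kind needs to map VCF coordinates to BED coordinates. The sketch below illustrates that mapping only; it is not the Genboree converter, and the function name and the choice of feature name are our assumptions. VCF positions are 1-based, whereas BED intervals are 0-based and half-open.

```python
from typing import Iterable, List


def vcf_to_bed(lines: Iterable[str]) -> List[str]:
    """Convert VCF data lines to sorted 4-column BED lines.

    Each interval covers the reference allele: start = POS - 1 and
    end = start + len(REF).
    """
    rows = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        chrom, pos, var_id, ref, alt = line.rstrip("\n").split("\t")[:5]
        start = int(pos) - 1
        end = start + len(ref)
        # Assumption: use the ID column as the feature name, else "REF>ALT".
        name = var_id if var_id != "." else f"{ref}>{alt}"
        rows.append((chrom, start, end, name))
    # BigBed creation expects input sorted by chromosome, then start position.
    rows.sort(key=lambda r: (r[0], r[1]))
    return ["\t".join(map(str, r)) for r in rows]


if __name__ == "__main__":
    example = [
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
        "chr2\t100\t.\tA\tG\t50\tPASS\t.\n",
        "chr19\t200\trs1\tAT\tA\t.\tPASS\t.\n",
    ]
    print("\n".join(vcf_to_bed(example)))
```

A sorted BED file produced this way could then be passed, together with a chromosome sizes file, to the UCSC `bedToBigBed` utility.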
Integration with Galaxy
In an initial trial, we uploaded the raw VCF file downloaded from Genboree to Galaxy without post-processing and converted it into a multiple alignment format (MAF) custom track using the VCF to MAF custom track function under Graph/Display data.
Post-processing with third party variant annotation tools
The VCF file downloaded from Genboree was annotated and filtered using ANNOVAR. ANNOVAR categorizes variants into intronic, exonic, splicing, non-coding RNA, 5' untranslated region, 3' untranslated region, upstream, downstream and intergenic. The exonic variants are further categorized into synonymous, nonsynonymous, stop gain (gain of stop function), stop lost (loss of stop function), and frameshift or non-frameshift changes caused by insertions, deletions or block substitutions. ANNOVAR can also be used to filter out variants found in dbSNP.
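The categories above can be summarized with a short script once annotation has run. The following is a minimal sketch, assuming a tab-separated annotation file whose first column holds the region category (our understanding of ANNOVAR's gene-based `.variant_function` output); the function name and the example rows are illustrative only.

```python
from collections import Counter
from typing import Iterable


def tally_regions(lines: Iterable[str]) -> Counter:
    """Count variants per region category (first tab-separated column)."""
    counts: Counter = Counter()
    for line in lines:
        if not line.strip():
            continue
        region = line.split("\t", 1)[0].strip()
        counts[region] += 1
    return counts


if __name__ == "__main__":
    # Made-up rows in the assumed layout: category, gene, then variant columns.
    example = [
        "exonic\tGENE1\tchr2\t100\t100\tA\tG\n",
        "intronic\tGENE2\tchr2\t250\t250\tC\tT\n",
        "exonic\tGENE3\tchr19\t300\t300\tG\tA\n",
    ]
    for region, n in tally_regions(example).most_common():
        print(region, n)
```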
Enabling the Atlas2 personal genomic analysis pipeline via Amazon Web Services (Atlas2 Amazon)
Amazon Web Services (AWS) provides virtualized computational infrastructure on demand. AWS can be tailored to provide scalable and flexible solutions for application hosting, web applications and high performance computing. We used the Amazon Elastic Compute Cloud (EC2) and Amazon Simple Storage Service (S3) to enable Atlas2 on Amazon. EC2 allows users to lease a wide variety of instances, which differ in compute capacity and memory (http://aws.amazon.com/ec2/#instance). Amazon S3 is a persistent data storage service offered by AWS that is designed to be highly scalable and to have low latency. Both the EC2 and S3 services can be managed from the AWS Management Console.
To access the Atlas2 cloud pipeline, the user first registers for an AWS account at http://aws.amazon.com/ and then signs up for the EC2 and S3 services. The user then starts an instance using the public Atlas2 machine image (ami-2ee23847), which can be found in the community Amazon Machine Image (AMI) tab. Before starting the instance, the user needs to change the Security Groups setting to enable HTTP access on port 8080; the web page can then be reached through the public DNS of the instance, which is listed in the AWS Management Console. Since the master instance only acts as a portal to access and monitor the jobs running on AWS, it can be a "t1.micro" instance. The advantage of "t1.micro" instances is that they are cheap: at the time of writing, every newly registered AWS user receives 750 hours of free runtime every month for a year, after which the price is $0.02/hour.
The monitoring page shows information about each worker node; each row represents one worker instance. For each worker instance the following information is shown: instance id, public DNS, node status, job status, start time, end time, input bucket and output bucket. The instance id and the public DNS can be used to access the worker node. Node status has two possible states: the worker node is either running or has been terminated. Job status has five possible states: "started" means the worker node has been instantiated; "setup" means the head node is preparing the worker instance; "downloading reference" means the worker instance is downloading the reference FASTA file; "variant calling" means the worker node is running the analysis; and "termination" means there are no more jobs in the queue and the worker node is being terminated. Figure 4B shows a screenshot of the monitoring page.
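The pricing of the master instance described above reduces to a small calculation. The sketch below uses only the figures quoted in the text (750 free hours per month during the first year, $0.02/hour otherwise) as defaults; the function and parameter names are ours, and current AWS pricing may differ.

```python
def master_instance_cost(
    hours_used: float,
    within_free_year: bool = True,
    free_hours_per_month: float = 750.0,
    rate_per_hour: float = 0.02,
) -> float:
    """Monthly cost in USD of keeping a "t1.micro" master instance running."""
    if hours_used < 0:
        raise ValueError("hours_used must be non-negative")
    billable = hours_used
    if within_free_year:
        # Only hours beyond the monthly free allowance are billed.
        billable = max(0.0, hours_used - free_hours_per_month)
    return billable * rate_per_hour


if __name__ == "__main__":
    hours_in_31_day_month = 31 * 24  # 744 hours, below the 750-hour allowance
    print(master_instance_cost(hours_in_31_day_month))  # prints 0.0
    print(f"{master_instance_cost(hours_in_31_day_month, within_free_year=False):.2f}")  # prints 14.88
```

Because a 31-day month has 744 hours, a single master instance running continuously stays within the free allowance during the first year.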
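The fields and states of the monitoring page can be modeled as a small data structure. The sketch below is illustrative and is not taken from the Atlas2 Amazon source code: the class and field names are ours, and the strictly linear progression between job states is a simplifying assumption (a worker that takes several jobs from the queue would presumably repeat the variant calling state).

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class NodeStatus(Enum):
    RUNNING = "running"
    TERMINATED = "terminated"


class JobStatus(Enum):
    STARTED = "started"                              # worker node instantiated
    SETUP = "setup"                                  # head node prepares the worker
    DOWNLOADING_REFERENCE = "downloading reference"  # fetching the reference FASTA
    VARIANT_CALLING = "variant calling"              # analysis is running
    TERMINATION = "termination"                      # queue empty, node shutting down


@dataclass
class WorkerRow:
    """One row of the monitoring page."""
    instance_id: str
    public_dns: str
    node_status: NodeStatus
    job_status: JobStatus
    start_time: str
    end_time: Optional[str]
    input_bucket: str
    output_bucket: str


_ORDER = list(JobStatus)  # Enum members iterate in definition order


def next_status(current: JobStatus) -> Optional[JobStatus]:
    """Return the following job state, or None after termination."""
    i = _ORDER.index(current)
    return _ORDER[i + 1] if i + 1 < len(_ORDER) else None


if __name__ == "__main__":
    state: Optional[JobStatus] = JobStatus.STARTED
    while state is not None:
        print(state.value)
        state = next_status(state)
```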
Applying Atlas2 Genboree to a personal genome case study
We tested Atlas2 Genboree by analyzing a recently published personal whole genome sequencing data set. We examined the resource usage metrics and the reproducibility of variant analysis, and examined the challenges related to integrating the multiple tools required for variant detection, visualization, and analysis.
Description of the personal genome data set
Bainbridge et al. employed the SOLiD 4 next-generation sequencing platform to sequence the complete genomes of a 14-year-old fraternal twin pair, one female (patientX) and one male (patientY), diagnosed with dopa (3,4-dihydroxyphenylalanine)-responsive dystonia (DRD). DRD is a genetically heterogeneous and clinically complex movement disorder with parkinsonian features that is usually treated with L-dopa. After six heterozygous autosomal mutations were identified in three genes, a new clinical intervention was prescribed that dramatically improved the quality of life of both twins.
Table: Resource usage (Chr 2/Chr 19). Summarizes the amount of computation and time required to get the data (Chr 2 and Chr 19) onto Genboree and to run through the variant calling steps. Recovered column headers: Size of BAM file (GB); Time to upload (min); Atlas2 runtime (min); Atlas2 memory usage (MB).
Table: Summarizes the total number of raw variants found in chromosomes 2 and 19 of the two patients.
Table: The following three genes were found to contain two or more predicted amino acid altering heterozygous mutations in both patients.
Applying Atlas2 Amazon to whole exome capture data
Table: Summarizes the cost of running Atlas-INDEL2 on whole exome capture SOLiD and Illumina BAMs using Atlas2 Amazon. Recovered column headers: Time to upload (hours); Size of BAM (GB); Total cost (USD).
Table: Summarizes the cost projections of analyzing 1, 3, 10, 50, 100 and 1000 BAMs using Atlas2 Amazon. Recovered column headers: No. of BAMs; Compute time (hours); Total cost (USD).
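A projection of this kind can be expressed as a simple linear cost model over compute, storage and data transfer. The sketch below is not the calculation used for the table: the function name, the parameters, and the rounding of partial compute hours up to whole hours are our assumptions, and every rate in the usage example is a hypothetical placeholder rather than an AWS price or a value from this study.

```python
import math
from typing import Dict


def project_cost(
    n_bams: int,
    bam_size_gb: float,
    compute_hours_per_bam: float,
    compute_rate_per_hour: float,
    storage_rate_per_gb_month: float,
    storage_months: float,
    transfer_rate_per_gb: float,
) -> Dict[str, float]:
    """Linear cost projection in USD, split by component."""
    # Assumption: each BAM runs on its own worker and partial hours are
    # billed as full hours.
    billed_hours = n_bams * math.ceil(compute_hours_per_bam)
    compute = billed_hours * compute_rate_per_hour
    storage = n_bams * bam_size_gb * storage_rate_per_gb_month * storage_months
    transfer = n_bams * bam_size_gb * transfer_rate_per_gb
    return {
        "compute": compute,
        "storage": storage,
        "transfer": transfer,
        "total": compute + storage + transfer,
    }


if __name__ == "__main__":
    # Sample sizes from the text; all sizes and rates are hypothetical.
    for n in (1, 3, 10, 50, 100, 1000):
        cost = project_cost(
            n_bams=n,
            bam_size_gb=5.0,
            compute_hours_per_bam=1.5,
            compute_rate_per_hour=1.0,
            storage_rate_per_gb_month=0.1,
            storage_months=1.0,
            transfer_rate_per_gb=0.0,
        )
        print(n, f"{cost['total']:.2f}")
```

In a model of this form, compute and transfer are one-time charges, while the storage term grows with the retention period as well as the number of BAMs.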
If personal genomic studies are to become a routine part of personalized diagnostics and medical management that is accessible to small research and clinical laboratories, advanced bioinformatic analysis must be made accessible in terms of both computational resources and usability. We have demonstrated the suitability of deploying existing analysis tools onto a cloud resource to address these issues, and demonstrated its utility by duplicating a real-world case study of clinical significance. We also outlined the cost of running the Atlas2 Amazon pipeline on SOLiD and Illumina whole exome capture samples and made cost projections for running the analysis on much larger sample sizes.
These analyses show that Atlas2 on Genboree and Amazon provides a feasible and practical solution for personal genome variant analysis on the cloud by outsourcing the computational resources and expertise needed to perform such analysis. By removing these barriers, Atlas2 on the cloud enables non-bioinformaticians at small research labs to perform this analysis without the need to invest in expensive compute clusters. It is our hope that various pipelines will produce output that is cross-compatible, so as to enable and facilitate the creation of customized personal genomic analysis.
While the present Atlas2 Amazon architecture and AWS cost structure are certainly a viable solution for small scale personal genome analysis, the long-run storage cost can make it prohibitive for large scale analysis. With the growing number of competitors in the cloud computing space, we believe the cost of storage and compute will eventually come down, and harnessing distributed computing frameworks such as map-reduce will make the cloud attractive for large scale analysis. There are other serious considerations, such as data security, which is of utmost importance, especially in a clinical setting. The burden of data security lies with both developers and end users, and until such issues are resolved they pose a serious hindrance to clinical use. Other, more minor issues include the network-bandwidth bottleneck, but this is a one-time problem: once the data is uploaded onto the cloud, it can be used for multiple analyses. Once these challenges have been addressed, we believe that genome analysis on the cloud will become a valuable resource, enabling both large and small scale clinical analysis by a variety of diverse research groups.
Availability and requirements
The Atlas2-Cloud machine image is public and can be instantiated from the AWS management console by searching for the Amazon machine image ID ami-ec469c85. Because machine image IDs are not permanent and may change when we update the machine image, a more reliable way to find the Atlas2 image is to search for "atlas2" in the community AMI tab. The Atlas2-Amazon backend source code is released under the BSD license and is available for download at http://sourceforge.net/projects/atlas2cloud/. More detailed instructions and a tutorial on how to access the pipeline can be found on our Sourceforge page.
Abbreviations
AWS: Amazon Web Services
EC2: Amazon Elastic Compute Cloud
S3: Amazon Simple Storage Service
EBS: Amazon Elastic Block Storage
ECU: Elastic Compute Unit
AMI: Amazon Machine Image
SNV: Single Nucleotide Variants
INDEL: Insertions and Deletions
BAM: Binary sequence Alignment/Map format
VCF: Variant Call Format
YRI: Yoruba in Ibadan, Nigeria
GBR: British from England and Scotland
Based on “Enabling Atlas2 personal genome analysis on the cloud”, by Uday S Evani, Danny Challis, Jin Yu, Andrew R Jackson, Sameer Paithankar, Matthew N Bainbridge, Cristian Coarfa, Aleksandar Milosavljevic and Fuli Yu, which appeared in Genomic Signal Processing and Statistics (GENSIPS), 2011 IEEE International Workshop on. © 2011 IEEE.
USE, DC, JY, MNB, AJ, PP and FY were supported by the National Human Genome Research Institute, National Institutes of Health, under grants 5U54HG003273 and 1U01HG005211-0109. AM, SP, ARJ, and CC were supported by NIH grants R01HG004009 and U01DA025956. We would like to thank Walker Hale for his input during the design of Atlas2 Amazon.
This article has been published as part of BMC Genomics Volume 13 Supplement 6, 2012: Selected articles from the IEEE International Workshop on Genomic Signal Processing and Statistics (GENSIPS) 2011. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcgenomics/supplements/13/S6.
1. Tucker T, Marra M, Friedman JM: Massively parallel sequencing: the next big thing in genetic medicine. Am J Hum Genet. 2009, 85 (2): 142-154. 10.1016/j.ajhg.2009.06.022.
2. Ng SB, Turner EH, Robertson PD, Flygare SD, Bigham AW, Lee C, Shaffer T, Wong M, Bhattacharjee A, Eichler EE, et al: Targeted capture and massively parallel sequencing of 12 human exomes. Nature. 2009, 461 (7261): 272-276. 10.1038/nature08250.
3. Ashley EA, Butte AJ, Wheeler MT, Chen R, Klein TE, Dewey FE, Dudley JT, Ormond KE, Pavlovic A, Morgan AA, et al: Clinical assessment incorporating a personal genome. Lancet. 2010, 375 (9725): 1525-1535. 10.1016/S0140-6736(10)60452-7.
4. Bainbridge MN, Wiszniewski W, Murdock DR, Friedman J, Gonzaga-Jauregui C, Newsham I, Reid JG, Fink JK, Morgan MB, Gingras MC, et al: Whole-genome sequencing for optimized patient management. Sci Transl Med. 2011, 3 (87): 87re83.
5. McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, Garimella K, Altshuler D, Gabriel S, Daly M, et al: The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 2010, 20 (9): 1297-1303. 10.1101/gr.107524.110.
6. Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R: The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009, 25 (16): 2078-2079. 10.1093/bioinformatics/btp352.
7. Schatz MC: CloudBurst: highly sensitive read mapping with MapReduce. Bioinformatics. 2009, 25 (11): 1363-1369. 10.1093/bioinformatics/btp236.
8. Langmead B, Schatz MC, Lin J, Pop M, Salzberg SL: Searching for SNPs with cloud computing. Genome Biol. 2009, 10 (11): R134. 10.1186/gb-2009-10-11-r134.
9. Afgan E, Baker D, Coraor N, Chapman B, Nekrutenko A, Taylor J: Galaxy CloudMan: delivering cloud compute clusters. BMC Bioinformatics. 2010, 11 (Suppl 12): S4. 10.1186/1471-2105-11-S12-S4.
10. Challis D, Yu J, Evani US, Jackson AR, Paithankar S, Coarfa C, Milosavljevic A, Gibbs RA, Yu F: An integrative variant analysis suite for whole exome next-generation sequencing data. BMC Bioinformatics. 2012, 13 (1): 8. 10.1186/1471-2105-13-8.
11. Siva N: 1000 Genomes project. Nat Biotechnol. 2008, 26 (3): 256.
12. Karolchik D, Hinrichs AS, Kent WJ: The UCSC Genome Browser. Curr Protoc Bioinformatics. 2009, Chapter 1: Unit 1.4.
13. Danecek P, Auton A, Abecasis G, Albers CA, Banks E, Depristo MA, Handsaker RE, Lunter G, Marth GT, Sherry ST, et al: The variant call format and VCFtools. Bioinformatics. 2011, 27 (15): 2156-2158. 10.1093/bioinformatics/btr330.
14. Blankenberg D, Von Kuster G, Coraor N, Ananda G, Lazarus R, Mangan M, Nekrutenko A, Taylor J: Galaxy: a web-based genome analysis tool for experimentalists. Curr Protoc Mol Biol. 2010, Chapter 19: Unit 19.10.1-21.
15. Wang K, Li M, Hakonarson H: ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Res. 2010, 38 (16): e164. 10.1093/nar/gkq603.
16. Fujita PA, Rhead B, Zweig AS, Hinrichs AS, Karolchik D, Cline MS, Goldman M, Barber GP, Clawson H, Coelho A, et al: The UCSC Genome Browser database: update 2011. Nucleic Acids Res. 2011, 39 (Database issue): D876-882.
17. Evani US, Challis D, Yu J, Jackson AR, Paithankar S, Bainbridge MN, Coarfa C, Milosavljevic A, Yu F: Enabling Atlas2 personal genome analysis on the cloud. Genomic Signal Processing and Statistics (GENSIPS), 2011 IEEE International Workshop on: 4-6 December 2011. 2011, 117-120. 10.1109/GENSiPS.2011.6169458.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.