Cloud or On Prem? In Genomics, It’s Good to Have Both
A hybrid cloud architecture we recently tested for genomic sequence data analysis can unlock the cloud’s power for life sciences.
We recently ran a test with friends from Pure Storage, Microsoft Azure and Illumina, the genomics tech giant, to see whether a hybrid cloud setup combining the three companies' products could meaningfully shrink time to result in genomic analysis by optimizing the data management architecture and processes life-sciences organizations use. As it turned out, not only did the hybrid cloud architecture deliver results faster, it also made it possible to process many more samples in parallel.
The results were a big deal for several reasons. For one, the more samples you can analyze, the more patients you can diagnose and the more new life-saving or quality-of-life-improving drugs you can discover. Additionally, this could drive more labs to use cloud services for genomics, something many of them have been reluctant or unable to do, opting instead for on-premises systems.
In a common on-prem setup, a typical lab has two to five specialized "secondary analysis" computers in its facilities (usually made by Illumina), each able to process one sample at a time. A single virtual cloud instance of such a machine can likewise process one sample at a time, but you can spin up far more than five cloud instances to run simultaneously, which is why cloud services are essential for doing this sort of work at scale.
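To put rough numbers on that, here's a minimal back-of-the-envelope sketch in Python. The one-hour-per-sample figure is an illustrative assumption of ours, not a vendor benchmark:

```python
import math

def batch_hours(num_samples: int, num_machines: int, hours_per_sample: float) -> float:
    """Wall-clock time when each machine processes one sample at a time."""
    return math.ceil(num_samples / num_machines) * hours_per_sample

# Illustrative assumption: secondary analysis takes ~1 hour per sample.
print(batch_hours(50, 5, 1.0))   # 10.0 -- a 5-machine on-prem lab takes 10 hours
print(batch_hours(50, 50, 1.0))  # 1.0  -- 50 parallel cloud instances take 1 hour
```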
Our test setup ran 50 Azure NP20s VMs in a Microsoft data center in San Antonio, each analyzing a sample stored on a Pure flash array in an Equinix Metal data center in Dallas. The two sites were linked via a private network connection between Metal and an Azure ExpressRoute cloud onramp in the Equinix facility. For comparison, we tested three alternative architectures:
- Storing and processing sample data solely on physical Illumina machines on premises.
- Using the same Illumina machines to process sample data stored on a Pure array, all on the same premises.
- Using the specialized Azure cloud VMs to process data stored locally on the VMs’ own storage.
Of the four architectures, the hybrid cloud setup enabled the greatest scale and speed at the lowest overall cost.
Many labs that work with genomic data today would like access to the scale cloud services enable but don't have it. There are several reasons for that, including data privacy and data sovereignty regulations, cloud storage costs (the volumes of data involved are enormous) and the long time it takes to transfer each sample's raw data to the cloud.
The hybrid architecture we tested addresses all of those concerns. By storing data on a dedicated Pure array hosted and managed by Equinix Metal, labs retain full control over the data without the headache of managing and hosting the storage infrastructure themselves. They can provision these arrays in any of many global Metal locations, so they can comply with whatever data sovereignty laws they're subject to. And because all those locations provide direct, private network access to the major cloud providers' onramps, labs can extend their genomic analysis capacity using cloud services regardless of where in the world they are.
NGS Secondary Analysis: A Bottleneck
The predominant process for diagnosis and drug discovery via genetic sequencing today involves three stages of analysis. (The process is referred to in the field as Next-Generation Sequencing, or NGS.) In the primary stage, the raw signals a sequencer reads from DNA or RNA samples are converted into data representing short nucleotide sequences, or genetic "snippets." In the secondary stage, the snippets are assembled into an entire genome using a reference genome as a model. The resulting sequence is then examined side by side with the reference to identify variants, segments where the two differ. The variants, some innocuous and others potentially carrying bad news for the patient, are what researchers study in the tertiary stage.
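To make the variant idea concrete, here's a toy Python sketch. This is emphatically not DRAGEN's algorithm, which aligns millions of short reads and handles insertions, deletions and quality scores; it only illustrates the concept of comparing an assembled sequence against a reference:

```python
# Toy illustration: find single-nucleotide variants by comparing an
# assembled sequence against a reference, position by position.
reference = "ACGTACGTAC"
assembled = "ACGTTCGTAC"

variants = [
    (pos, ref_base, alt_base)
    for pos, (ref_base, alt_base) in enumerate(zip(reference, assembled))
    if ref_base != alt_base
]
print(variants)  # [(4, 'A', 'T')] -- the reference has A where the sample has T
```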
Illumina sells solutions for sequencing and for all three analysis stages, of which secondary analysis is by far the most demanding of compute and storage resources. A single run by an Illumina sequencer (the machine that generates the raw data for analysis) can produce 128 sequences. With only a handful of machines for secondary analysis on a typical lab’s premises, each capable of analyzing one sample at a time, this stage presents a major bottleneck.
Hybrid Cloud Supercharges Secondary Analysis
Our hybrid cloud test ran secondary analysis on primary analysis output stored on a Pure FlashBlade appliance at Equinix. The secondary analysis was done by Illumina's DRAGEN software running on Azure NP20s VMs, which are powered by AMD CPUs and Xilinx FPGA accelerators.
During the test, an on-premises DRAGEN server stored a sequencer's output data on a FlashBlade in the same location. That data was then replicated to the FlashBlade in the Equinix data center in Dallas. This replication step is where the bulk of the time savings was gained, thanks to Pure's array-level replication technology, which is much faster than the traditional process of moving data to cloud storage over FTP. It took 2.4 minutes to transfer 32GB of files containing a single genome sample to the Pure array in Dallas, while moving the same files over FTP took 90 minutes. That's 37.5 times faster. Not bad!
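A quick sanity check of those numbers (the effective throughput figures are our arithmetic, derived from the measurements above):

```python
# Back-of-the-envelope check of the replication measurements above.
sample_mb = 32 * 1024          # 32GB sample, in MB
replication_s = 2.4 * 60       # array-level replication: 2.4 minutes
ftp_s = 90 * 60                # FTP transfer: 90 minutes

print(f"replication: {sample_mb / replication_s:.0f} MB/s")  # ~228 MB/s
print(f"FTP:         {sample_mb / ftp_s:.0f} MB/s")          # ~6 MB/s
print(f"speedup:     {ftp_s / replication_s:.1f}x")          # 37.5x
```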
We then had 50 of the same Azure VMs run secondary analysis in parallel on 50 samples stored on the Pure array at Equinix, which took 60 minutes total. The cloud VMs accessed the samples using the NFSv3 protocol. Round-trip network latency on the Azure ExpressRoute link between Equinix Metal in Dallas and Microsoft's cloud data center in San Antonio was 8 milliseconds. Latency on the FlashBlade was just under 3 milliseconds with two cloud VMs running analysis, rising to 3.6 milliseconds with all 50. The 10 Gbps ExpressRoute connection we used became saturated once the test scaled to 32 VMs, causing IO requests to queue up, which in turn caused the slight increase in latency. Over a higher-bandwidth connection that didn't saturate (say, a 100 Gbps one), we estimate analyzing the 50 samples in parallel would take 54 minutes instead of 60.
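The saturation point implies a rough per-VM bandwidth demand. The figures below are our estimate, assuming each VM's storage traffic is roughly constant, not a measurement:

```python
# Rough bandwidth math behind the saturation observation above.
link_gbps = 10
vms_at_saturation = 32          # the link filled up once 32 VMs were running
total_vms = 50

per_vm_gbps = link_gbps / vms_at_saturation   # ~0.31 Gbps per VM
demand_gbps = per_vm_gbps * total_vms         # ~15.6 Gbps aggregate

print(f"per-VM demand: {per_vm_gbps * 1000:.0f} Mbps")
print(f"50-VM demand:  ~{demand_gbps:.1f} Gbps "
      f"(queues on a 10 Gbps link, fits easily on 100 Gbps)")
```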
Such a dramatic increase in the number of samples a lab can process per unit of time, combined with a dramatic decrease in the cost of storing genomic data (compared to the cost of cloud storage), makes a compelling argument for a hybrid cloud architecture over a purely on-prem or purely cloud-based one.
Another possible addition to the hybrid setup could reduce storage costs even further. As part of our testing, we compared the performance of on-prem DRAGEN servers processing data stored on an on-prem FlashBlade array via NFSv3 with their performance processing data stored on their own local storage. A single DRAGEN machine took a few minutes longer to run secondary analysis on FlashBlade than on its local storage, but when two DRAGEN servers ran analysis in parallel, there was almost no time difference between the two approaches.
FlashBlade takes the cake, however, when primary analysis data is written directly to it rather than being copied from elsewhere. Because it supports a variety of protocols, a single array can store data in separate partitions for primary, secondary and tertiary analysis. This way, there’s no need to transfer primary analysis output data to a DRAGEN server’s local storage before secondary analysis can be done—something that in our test took 86 minutes. Labs could store data on FlashBlade arrays at their own facilities, and when they needed the scale that’s possible with DRAGEN cloud VMs, they could temporarily spin up a FlashBlade on Equinix Metal, copy the data to it, run the analysis in the cloud and then spin the storage system on Metal down, paying only for the time they used it.
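In sketch form, that burst workflow might look like the Python below. Every helper here is a hypothetical placeholder (there is no such Pure, Equinix or Azure SDK call); the point is the shape of the lifecycle: provision, replicate, analyze, tear down.

```python
# Sketch of the burst-to-cloud workflow described above. All helpers are
# hypothetical stubs standing in for real provisioning and analysis steps.

def provision_flashblade(location: str) -> str:
    return f"flashblade@{location}"  # placeholder: temporary array on Equinix Metal

def replicate(samples: list[str], target: str) -> None:
    print(f"replicating {len(samples)} samples to {target}")  # array-level replication

def run_dragen_vms(target: str, count: int) -> list[str]:
    return [f"variants_{i}.vcf" for i in range(count)]  # placeholder: parallel cloud VMs

def deprovision(target: str) -> None:
    print(f"spinning down {target}")  # stop paying once the burst is done

def burst_secondary_analysis(samples: list[str], metal_location: str = "dallas") -> list[str]:
    array = provision_flashblade(metal_location)
    try:
        replicate(samples, target=array)            # fast array-level copy, not FTP
        return run_dragen_vms(array, count=len(samples))  # VMs mount the array via NFS
    finally:
        deprovision(array)                          # tear down even if analysis fails

print(burst_secondary_analysis([f"sample_{i}" for i in range(50)]))
```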
Unlocking Access to the Cloud
The approach suggested above would also help organizations with stringent data privacy requirements. They could keep clinical data on premises while replicating sequencing data and input files to a FlashBlade at a Metal location for cloud-scale secondary analysis.
Enabling researchers to retain full control of the data they work with is crucial. Privacy regulations are often the main reason life-sciences organizations don't use cloud services for genomics work. Cloud adoption in the genomics industry varies from country to country: more organizations use services like Illumina's cloud-based software in the US and Canada, for example, than in some European countries with stricter rules for where and how patient data can be stored and processed. Additionally, for cloud users, compliance with regulations like HIPAA in the US or GDPR in Europe requires paying for compliance-specific cloud services, further increasing costs. Finally, many organizations in the space simply don't have the DevOps and infrastructure operations skills needed to use cloud services effectively.
While some of that technical expertise would still be required to operate a hybrid cloud environment like the one we tested in Texas, such an environment would make compliance easier by leaving full control of the data in the customer's hands, lower operating costs by avoiding cloud storage, simplify infrastructure management by relying on storage managed by Equinix Metal and, most importantly, still leverage the scale and might of public clouds to supercharge life-sciences organizations' ability to diagnose patients and discover new drugs.
If you’d like to learn more about this test, you will find more details about the setup and the results in this whitepaper.