HPC in the Cloud Research Roundup
Our HPC cloud research stories are hand-selected from leading science centers, prominent journals and relevant conference proceedings. The top item this week addresses the question: what if it were possible to cheaply and easily test the suitability of moving to a cloud platform – a virtual “try it before you buy it”? We also explore the reliability of HPC cloud, take another pass at GPU virtualization, and evaluate I/O performance in Amazon’s EC2 cloud.
Novel Cloud Evaluation Project Receives Google Award
What if it were possible to predict the suitability of cloud resources for a given application? This smart idea is the basis of a research project led by University of Texas at Dallas professor Dr. Lawrence Chung. The researcher and his SilverLining team from the Erik Jonsson School of Engineering and Computer Science have already caught the attention of Web giant Google. Earlier this month, Dr. Chung’s team (and six other worthy recipients) received the first-ever Google App Engine Research Award.
The projects, each of which received $60,000 in Google App Engine credits, were selected for their intellectual excellence, innovation and expected benefit to society.
The SilverLining team starts with the premise that the “initial purchase, re-purchase, and operation of computing equipment that has become unsustainable and, hence, is becoming an increasingly great burden on the US economy.”
Countless organizations are interested in the benefits of cloud computing, but testing a new cloud system can be costly and time-consuming. Chung and his colleagues propose that this complex process can be simulated on one system.
Chung explains: “We play with numbers and do not need the real software and machines. Using this approach, we can see the behavior of the cloud very quickly and inexpensively.”
The project seeks to determine the feasibility of predicting: 1. whether an operational system can migrate to a cloud, while making everyone happy, and 2. the performance and scalability of the system after or even before it is actually built.
The researchers have run initial simulations and benchmarks, but to verify their work, they require access to a large-scale cloud-based infrastructure. This is very similar to comparing a virtual model with a physical model, but as is typically the case, the “physical model” requires some capital outlay.
“Before we use the simulator further, we want to make sure that the results we obtain from simulators are going to be meaningful,” Chung said.
With the Google award, the SilverLining team now has access to a full-scale cloud infrastructure, enabling them to run their experiments and compare the results to their simulations to see if they hold up.
The other Google App Engine Award recipients are from the California Institute of Technology, University of Bristol, Massachusetts Institute of Technology, Carnegie Mellon University, University of Washington and Arizona State University.
Making HPC Cloud Computing More Reliable
A team of computer scientists from Louisiana Tech University has contributed to the growing body of HPC cloud research, specifically as it relates to the reliability of cloud computing resources. Their paper, A Reliability Model for Cloud Computing for High Performance Computing Applications, was published in the book, Euro-Par 2012: Parallel Processing Workshops.
Cloud computing and virtualization allow resources to be used more efficiently. Public cloud resources are available on-demand and don’t require an expensive capital expenditure. But an increase in the number of software and hardware components brings a corresponding rise in server failures. The researchers assert that it’s important for service providers to understand the failure behavior of a cloud system, so they can better manage the resources. Much of their research applies specifically to the running of HPC applications on the cloud.
In the paper, the researchers “propose a reliability model for a cloud computing system that considers software, application, virtual machine, hypervisor, and hardware failures as well as correlation of failures within the software and hardware.”
They conclude failures caused by dependencies create a less reliable system, and as the failure rate of the system increases, the mean time to failure decreases. Not surprisingly, they also find that an increase in the number of nodes decreases the reliability of the system.
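The relationship between node count, failure rate and mean time to failure can be illustrated with a short sketch. This is not the model from the paper, which also accounts for correlated failures across software and hardware layers. It is a textbook simplification that assumes every node fails independently at a constant rate (an exponential failure model) and that an HPC job fails as soon as any one of its nodes fails. The function names and the example failure rate are hypothetical.

```python
import math


def system_failure_rate(node_failure_rate: float, nodes: int) -> float:
    """Failure rate of a job that needs all nodes alive (series system).

    Assumes independent nodes with a constant per-hour failure rate,
    so the individual rates simply add up.
    """
    return node_failure_rate * nodes


def system_reliability(node_failure_rate: float, nodes: int, hours: float) -> float:
    """Probability that no node fails during a run of the given length."""
    return math.exp(-system_failure_rate(node_failure_rate, nodes) * hours)


def system_mttf(node_failure_rate: float, nodes: int) -> float:
    """Mean time to failure in hours: the inverse of the system failure rate."""
    return 1.0 / system_failure_rate(node_failure_rate, nodes)


if __name__ == "__main__":
    rate = 1e-4  # hypothetical: one failure per 10,000 node-hours
    for n in (1, 10, 100, 1000):
        print(n, system_mttf(rate, n), system_reliability(rate, n, hours=24.0))
```

Under these assumptions, a hypothetical node that fails once every 10,000 hours yields a 100-node system with a mean time to failure of about 100 hours, which matches the direction of the paper's findings: more nodes and higher failure rates both reduce reliability. Correlated failures, which the paper models, would make the picture worse than this independent-failure estimate suggests.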
GPU Virtualization using PCI Direct Pass-Through
The technical computing space has seen several trends develop over the past decade, among them server virtualization, cloud computing and GPU computing. It’s clear that GPGPU computing has a role to play in HPC systems. Can these trends be combined?
A research team from Chonbuk National University in South Korea has published a paper in the periodical Applied Mechanics and Materials proposing exactly this. They investigate a method of GPU virtualization that exploits the GPU in a virtualized cloud computing environment.
The researchers claim their approach is different from previous work, which mostly reimplemented GPU programming APIs and virtual device drivers. Past research focused on sharing the GPU among virtual machines, which increased virtualization overhead. The paper describes an alternate method: the use of PCI direct pass-through.
“In our approach, bypassing virtual machine monitor layer with negligible overhead, the mechanism can achieve similar computation performance to bare-metal system and is transparent to the GPU programming APIs,” the authors write.
Analysis of I/O Performance on AWS High I/O Platform
The HPC community is still exploring the potential of the cloud paradigm to discern the most suitable use cases. The pay-per-use basis of compute and storage resources is an attractive draw for researchers, but so is the illusion of limitless resources to tackle large-scale scientific workloads.
In the most recent edition of the Journal of Grid Computing, computer scientists from the Department of Electronics and Systems at the University of A Coruña in Spain evaluate the I/O storage subsystem on the Amazon EC2 platform, specifically the High I/O instance type, to determine its suitability for I/O-intensive applications. The High I/O instance type, released in July 2012, is backed by SSD and also provides high levels of CPU, memory and network performance.
The study looked at the low-level cloud storage devices available in Amazon EC2, ephemeral disks and Elastic Block Store (EBS) volumes, both on local and distributed file systems. It also assessed several I/O interfaces, notably POSIX, MPI-IO and HDF5, that are commonly employed by scientific workloads. The scalability of a representative parallel I/O code was also analyzed based on performance and cost.
As the results show, cloud storage devices have different performance characteristics and usage constraints. “Our comprehensive evaluation can help scientists to increase significantly (up to several times) the performance of I/O-intensive applications in Amazon EC2 cloud,” the researchers state. “An example of optimal configuration that can maximize I/O performance in this cloud is the use of a RAID 0 of 2 ephemeral disks, TCP with 9,000 bytes MTU, NFS async and MPI-IO on the High I/O instance type, which provides ephemeral disks backed by Solid State Drive (SSD) technology.”
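One element of the quoted configuration, the 9,000-byte MTU, can be motivated with simple arithmetic. The sketch below is not taken from the paper. It assumes standard Ethernet framing overhead (38 bytes per frame on the wire, counting preamble, header, frame check sequence and inter-frame gap) and IPv4 plus TCP headers without options (40 bytes), and it ignores VLAN tags and TCP options. The function names are hypothetical.

```python
IP_TCP_HEADERS = 40     # assumed: IPv4 (20 bytes) + TCP (20 bytes), no options
ETHERNET_OVERHEAD = 38  # assumed: preamble 8 + header 14 + FCS 4 + inter-frame gap 12


def payload_efficiency(mtu: int) -> float:
    """Fraction of bytes on the wire that carry application data."""
    return (mtu - IP_TCP_HEADERS) / (mtu + ETHERNET_OVERHEAD)


def frames_needed(nbytes: int, mtu: int) -> int:
    """Number of full-size frames required to send nbytes of application data."""
    payload = mtu - IP_TCP_HEADERS
    return -(-nbytes // payload)  # ceiling division


if __name__ == "__main__":
    one_gib = 1024 ** 3
    for mtu in (1500, 9000):
        print(mtu, round(payload_efficiency(mtu), 4), frames_needed(one_gib, mtu))
```

Under these assumptions the gain in wire efficiency is modest, from roughly 95 percent at a 1,500-byte MTU to roughly 99 percent at 9,000 bytes. The larger effect is that the same data moves in about one sixth as many frames, which reduces per-packet processing on both the NFS client and server.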