HPC as a Service: Lessons Learned

By Wolfgang Gentzsch and Burak Yenier

December 10, 2012

After a fast-paced three months, round 1 of the HPC Experiment (also known as the Uber-Cloud Experiment) concluded last month, with more than 160 participating organizations and individuals from 25 countries working together in 25 international teams. In this article we present their main findings, the challenges they faced, and the lessons they learned.

The aim of the Uber-Cloud Experiment is to explore the end-to-end process of accessing remote computing resources in HPC centers and in HPC clouds as well as to study and overcome the potential roadblocks.

The experiment kicked off in July 2012 and brought together four categories of participants: the industry end-users with their applications, the software providers, the computing and storage resource providers, and the experts. We set up each end-user project by first selecting an end-user and the corresponding software provider, assigning an HPC/CAE expert, and matching a suitable resource provider to complete the team. Each team’s goal was to complete the project and to chart a way around the hurdles it identified.

End-users can achieve many benefits by gaining access to compute resources beyond their current internal resources, such as workstations. Arguably the two most important are agility, gained by speeding up product design cycles through shorter simulation run times, and superior quality, achieved by simulating more sophisticated geometries or physics or by running many more iterations to look for the best product design.

Tangible benefits like these make HPC and more specifically HPC-as-a-Service (HPCaaS) very attractive. But how far are we from an ideal HPCaaS or HPC in the cloud model?

Honestly, at this point, we don’t know. However, in the course of this experiment, following each team and monitoring its challenges and progress, we’ve collected some excellent insight into these roadblocks and how our 25 teams have tackled them.

The main approach for this experiment is to look at the end-user’s project and select the resources, software and expertise that match its requirements.

During the three months of the experiment, we were able to build 25 teams, each with a project proposed by an end-user. The teams included: Team Anchor Bolt, Team Resonance, Team Radiofrequency, Team Supersonic, Team Liquid-Gas, Team Wing-Flow, Team Ship-Hull, Team Cement-Flows, Team Sprinkler, Team Space Capsule, Team Car Acoustics, Team Dosimetry, Team Weathermen, Team Wind Turbine, Team Combustion, Team Blood Flow, Team Turbo-Machinery, Team Gas Bubbles, Team Side impact, Team ColombiaBio, and Team Cellphone.

The final report, available to all of our registered participants, contains the use cases of many of the teams, offering valuable insight in their own words. We look forward to future rounds of the experiment, where this accumulating knowledge will yield ever more successful projects.

We recognize that every end-user project requires a slightly different approach, a variety of software and compute resources, a certain expertise to lead the end-to-end process, and a tailored schedule. To keep the entire experiment consistent, we asked each team to follow a common roadmap. The expert assigned to each team is the guide in following this roadmap. It calls for communication with the organizers at certain points, although generally the teams are autonomous and make their own decisions.

Based on the roadmap we defined going into round 1 of the experiment, the teams followed six steps to reach their goal:

Step 1. Define the end-user project. The end-user together with the expert and software provider jointly defined the project. Based on this information, as organizers we assigned the appropriate resources to the project.

Step 2. Contact the resource provider and set up the project environment. The expert contacted the computing resource and performed activities such as assisting in software and license installations, creation of user accounts, and configuration of the project environment.

Step 3. Initiate the end-user project execution. The expert assisted the end-user with uploading the necessary data, code and configuration files to the remote resource(s). The expert then worked with the resource provider to queue the project up on the HPC system.

Step 4. Monitor the project. The expert remained engaged with the resource providers and had up-to-date information about the status of the project at all times.

Step 5. Review results with the end-user. The expert assisted the end-user in downloading the results from the resource provider’s environment and discussed the results with the end-user. If any rework or rerun was required, it was completed by executing steps 2-5 again.

Step 6. Document findings. Throughout the lifecycle of the project, the expert documented the hurdles, friction and failure points the team encountered.

Intentionally, we performed the first round of this experiment manually, that is, not via an automated service, because we believe the technology is not the challenge anymore; rather it’s the people and their processes, and that’s what we wanted to explore. We are continuously improving the roadmap to successful completion of our projects.

The teams reported the following main roadblocks and provided information on how they resolved them (or not):

  • Security and privacy, guarding the raw data, processing models and the results
  • Unpredictable costs can be a major problem in securing a budget for a given project
  • Lack of easy, intuitive self-service registration and administration
  • Incompatible software licensing models hinder adoption of Computing-as-a-Service
  • High expectations can lead to disappointing results
  • Lack of reliability and availability of resources can lead to long delays

Just like all other participants, we as the organizers treated the experiment as a learning opportunity. In our report we have also summarized what we found to be the shortcomings of the experiment as we put it together in round 1. Learning from these shortcomings, we have improved the experiment for round 2. To be specific, we discussed and provided solutions for the following shortcomings:

  • All participants are professionals with busy schedules and the experiment is not their primary job, so they could only dedicate a few hours per week to the experiment
  • Vacations delayed most of the teams’ progress, especially in the beginning (August) of the Experiment
  • Some resource providers ran into resource crunches which delayed team projects
  • Some of our projects ran into long delays since the project and the resource provider weren’t the best match possible
  • Some resource providers struggled with the installation of an application
  • Other resource providers had difficulties with providing network access through complex network connections
  • Resource providers differ in their service philosophies
  • Simply getting started was a challenge
  • A few teams struggled with figuring out which team member needs to do what and when
  • Team forming was one of the steps that took the longest; team members needed to exchange significant amounts of information about their background, capabilities, expectations, availability, and commitment levels with one another before the project could even kick off
  • Finally, manual processes are just slow; they consumed days, sometimes weeks, especially because the various technology and people resources were inherently remote, each with different priorities

We hope that our participants will extract value from the experiment and the final report. They certainly deserve to do so in return for their generous contributions, support and participation. We now look forward to round 2 of the experiment, which already has more than 250 participants, and to the learning it will bring.

If you are interested in participating in round 2 or just want to monitor its progress, you can register at http://hpcexperiment.com.  You can also go there to get the final report for round 1, which details the results and recommendations.

About the Authors

Wolfgang Gentzsch and Burak Yenier are the creators and facilitators of the Uber-Cloud Experiment. Wolfgang is an HPC veteran. Having worked in leading positions in research, academia and industry for some 30 years, Wolfgang is now an HPC consultant and the chairman of the ISC Cloud conference series for HPC and Big Data in the Cloud. Burak is the vice president of operations at CashEdge, a software-as-a-service company in Silicon Valley, which provides innovative payments and aggregation solutions to financial institutions.

