Alces Tries Flight on AWS and Life as an HPC SaaS Provider

By John Russell

June 9, 2016

Today, the community of HPC-in-the-cloud solution providers is fairly limited. Giant hyperscalers – notably Amazon (AWS), Google (GCP) and Microsoft (Azure) – remain at the core and keep expanding their HPC resources. Circling around them are HPC ‘services specialists’ plugging clients into cloud providers and striving to ease delivery of HPC-in-the-cloud to make good on the promise of improved efficiencies and cost reductions. Last week a new HPC Software as a Service (SaaS) player popped up on the AWS Marketplace – Alces Flight, a UK-based company with roots as an HPC integrator.

Predictably, the company’s HPC background and the obvious opportunity drove the initiative, explains Wil Mayers, director of research and development for Alces. “Organizations are starting to pivot from an on-premise first attitude to a much more cost-sensitive way of working and thinking. That means looking at public clouds. During this transition period we wanted to have a product that could give end users – scientists and engineers – the same sort of look and feel of an HPC cluster irrespective of the platform.”

For the last twelve months Alces has been working with AWS to tweak its offering, which remains modest – a maximum cluster deployment of around 1,000 cores (more details below) – but with plans for significant growth in compute capacity. As with many such companies, the idea is to present a front end that simplifies HPC application and workflow deployment and plays nicely with Amazon’s many powerful features. Indeed, Alces says its HPC cluster provisioning and SaaS offering already has a library of 750 scientific applications available for AWS users.

There are the big hitters you’d expect, GROMACS and NAMD, for example, “but also many others, particularly if you look at something like bioinformatics, where there are literally hundreds and hundreds of different libraries and tools and utilities that people may want to use. A gene sequencing workflow might include 10 or 20 or 30 different tools, all run in sequence,” says Mayers.

The Alces Flight offering has a typical set of provisioning and SaaS components such as a job scheduler, shared file system, and management tools. Its large library of applications and HPC familiarity are important differentiators. So too, says Mayers, is Gridware, its repository and orchestrator, described in company literature as “applications, libraries and services you need to get working, packaged with care and attention, favoring convention over configuration.”

“The Gridware project is something we host on GitHub. It’s a big library of applications (about 800) and what Gridware is designed to do is when you invoke it and run it on a cluster, it looks at the cluster, looks at the tool you want to install, and you can tell it to go find the source code for the application and dynamically compile it for the environment that you have. It’s aware of dependencies including things like the operating system,” he says. It’s even aware of the particular MPI, interconnect types, and compilers you have chosen.
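What Mayers describes amounts to dependency-aware building. Before compiling a requested tool, the system works out everything that tool needs, builds those dependencies first, and tags each build for the local OS, compiler and MPI. The Python sketch below illustrates the idea using a topological sort. The package names, the dependency table, the environment values, and the `build_plan` helper are all hypothetical illustrations. They are not Gridware's actual metadata or code.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical dependency table: package -> packages that must be built first.
DEPENDENCIES = {
    "gromacs": {"fftw", "cmake"},
    "fftw": set(),
    "cmake": set(),
    "namd": {"charm"},
    "charm": set(),
}

# Hypothetical description of the cluster the builds will target.
ENVIRONMENT = {"os": "el7", "compiler": "gcc", "mpi": "openmpi"}


def build_plan(target, deps, env):
    """Return (package, build_tag) pairs with every dependency before its dependents."""
    # Walk only the part of the graph reachable from the requested package.
    graph, stack = {}, [target]
    while stack:
        pkg = stack.pop()
        if pkg in graph:
            continue
        graph[pkg] = deps.get(pkg, set())
        stack.extend(graph[pkg])
    # Tag each build with the OS/compiler/MPI combination it is compiled for.
    tag = f"{env['os']}/{env['compiler']}/{env['mpi']}"
    return [(pkg, tag) for pkg in TopologicalSorter(graph).static_order()]


for pkg, tag in build_plan("gromacs", DEPENDENCIES, ENVIRONMENT):
    print(pkg, tag)
```

Tagging builds by environment is what allows one recipe to yield separate builds for, say, two different MPI stacks. Unrelated packages such as `namd` stay out of the plan.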

One addition made to the most recent Gridware release is “a binary option as well for using those lists of instructions, and there are compilation options for different packages. We have actually precompiled a lot of the applications for their instance types that are available on AWS,” he says. “So rather than having to compile on the fly – compilation gives you a lot of flexibility but it can also take a little bit of time – we can, on an Alces Flight cluster, go and grab the applications and use them directly from an Amazon S3 bucket that lives on AWS.”
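The binary option boils down to a lookup with a fallback. If a precompiled build exists for the node's instance type, use it; otherwise compile from source. Below is a minimal sketch. It assumes a hypothetical index keyed by package and instance type. The bucket name, the paths, and `choose_install` are placeholders, not Alces' actual S3 layout or tooling.

```python
# Hypothetical index of precompiled builds, keyed by (package, instance type).
# The bucket name and paths are placeholders, not a real Alces S3 layout.
PRECOMPILED = {
    ("gromacs", "c4.8xlarge"): "s3://example-gridware-binaries/gromacs/c4.8xlarge.tar.gz",
    ("namd", "c4.8xlarge"): "s3://example-gridware-binaries/namd/c4.8xlarge.tar.gz",
}


def choose_install(package, instance_type, index=PRECOMPILED):
    """Prefer a prebuilt binary matching this instance type; otherwise compile.

    Returns ("binary", url) when a match exists, else ("compile", package).
    """
    url = index.get((package, instance_type))
    if url is not None:
        return ("binary", url)
    return ("compile", package)


print(choose_install("gromacs", "c4.8xlarge"))  # prebuilt binary available
print(choose_install("gromacs", "m4.large"))    # no match, fall back to source
```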

There are management tools, including a GUI that allows multiple users to connect to the same cluster at the same time. There are also storage management tools, which not only link to an S3 account if a user has one but also support back ends such as Dropbox and Google Drive.

While it’s certainly early days, Alces reports already running on the order of 20 clusters on AWS around the world. Given the Euro-centricity of its customer base, complying with privacy requirements is critical, especially in the bioscience and health care sectors. During the AWS beta testing, a couple of UK National Health Service clients were working with anonymized data and running only on systems located in the UK, because strict NHS requirements dictate that health data remain on “sovereign” ground.

Even in basic research this is the case, Mayers points out. “While there’s a lot of collaborative work, they want the collaboration to be done in a controlled way. What we have found is European users usually want to stay within their region,” he says, adding that AWS’s plans to ramp up capabilities in London later this year will likely open the door to more business.

Given the cost pressures on hospitals there, for example, Mayers thinks the cloud will be very attractive. “Using AWS will sometimes mean not only can they reduce their costs by a factor of three or four, they can also get much bigger clusters and get diagnostic work done much faster.”

The opportunities seem enormous and worldwide, thinks Alces. Others think so as well. There are a few players already in the HPC-in-the-cloud market, each with its own flavor of services and value proposition; Cycle Computing and UberCloud are two that come to mind. It will be interesting to watch Alces navigate the new waters. Scaling is currently modest on the AWS Marketplace version of Flight Compute, but a higher ceiling is in the works, and users have other options to achieve higher core counts.

“We have a product today that scales to just over 1,000 cores, so that’s 32 compute nodes. That limits the market,” agrees Mayers. The reason for the constraint, he says, is that “we have a fairly straightforward storage deployment option at the moment. We are working with the Intel team for Lustre and the BeeGFS team in Germany, and that will allow us to scale above 32 nodes up to 64, and up to some hundreds of nodes within a single cluster environment.”

Mayers stresses that only the AWS Marketplace version of Flight Compute is limited to 32 nodes. The company has tested launching clusters of up to 256 nodes in the AWS spot market (around 9,000 cores).

“The Marketplace is all about instant access for users who want to ‘learn by doing’ rather than interacting with a vendor, so we’ve limited that to 32 nodes for the first release to ensure that new users get a good first-time experience,” explains Mayers. “For future Marketplace releases, we’re hoping to package up larger solutions with an appropriate choice of storage technologies to deliver scalable performance in all directions. Our roadmap has the 2016.3 version of Flight Compute available in Marketplace for September this year, but users can launch a Flight Cluster of any size with BeeGFS or Lustre today using appropriate AWS CloudFormation templates.”

Alces supports both Spot Instances and On-Demand Instances, giving users a choice. The company has also leveraged Amazon’s flexibility in other ways: “So the obvious one is automatic scaling. The job scheduler feeds into AWS and instructs the infrastructure how big it needs to be in terms of compute nodes.” The environment shrinks and expands as needed.
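Scheduler-driven autoscaling of this kind can be expressed as a simple sizing rule. The scheduler's core demand (queued plus running) is converted into a node count, which is then clamped to the cluster's limits. This is a hypothetical illustration only. The function `desired_nodes`, the per-node core count, and the default ceiling of 32 nodes are all assumptions. The ceiling is borrowed from the Marketplace limit Mayers cites. None of this is Flight's actual scaling logic.

```python
import math


def desired_nodes(queued_cores, running_cores, cores_per_node,
                  min_nodes=0, max_nodes=32):
    """Hypothetical sizing rule: enough nodes to cover current core demand.

    The result is clamped to [min_nodes, max_nodes]; the default of 32 mirrors
    the Marketplace limit quoted in the article and is only an assumption.
    """
    if cores_per_node <= 0:
        raise ValueError("cores_per_node must be positive")
    demand = queued_cores + running_cores
    needed = math.ceil(demand / cores_per_node)
    return max(min_nodes, min(max_nodes, needed))


# 100 queued cores on (assumed) 36-core nodes -> 3 nodes
print(desired_nodes(queued_cores=100, running_cores=0, cores_per_node=36))
# Demand far beyond the ceiling is capped at 32 nodes
print(desired_nodes(queued_cores=5000, running_cores=0, cores_per_node=36))
```

Scaling down follows from the same rule. When demand falls, the target count drops and idle nodes can be released.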

“We try to give people the option with Flight: when you launch a cluster, you can request it to launch on-demand, which gives you a guarantee that these instances are going to be permanent. You can also choose to launch spot instances, and Flight will allow you to enter the bid price you want. You can choose how much you want to spend. If a node does get killed because your bid is too low, the job will return to a queued state on the cluster and can be submitted when the spot price falls again to a level you’re happy with,” says Mayers.
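The spot behavior Mayers describes can be simulated in a few lines. When the spot price rises above the bid, jobs fall back to the queue. Once the price drops back to an acceptable level, they are dispatched again. This is a simplified sketch under stated assumptions. The `simulate_spot` function, the prices, and the bid are hypothetical. Node capacity is ignored, and a requeued job simply returns to the front of the queue.

```python
from collections import deque


def simulate_spot(prices, bid, jobs):
    """Simulate spot-backed jobs over a series of price ticks (hypothetical model).

    Returns a list of (price, running_jobs, queued_jobs) snapshots, one per tick.
    """
    queue = deque(jobs)  # jobs waiting for a spot node
    running = []         # jobs currently on spot nodes
    history = []
    for price in prices:
        if price > bid:
            # Spot price exceeds the bid: nodes are reclaimed and their jobs
            # return to the front of the queue in their original order.
            queue.extendleft(reversed(running))
            running.clear()
        else:
            # Price is at or below the bid: dispatch everything that is waiting.
            while queue:
                running.append(queue.popleft())
        history.append((price, tuple(running), tuple(queue)))
    return history


for snapshot in simulate_spot([0.25, 0.40, 0.28], bid=0.30, jobs=["job1", "job2"]):
    print(snapshot)
```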

How all of this cloud-based HPC capability will be used is still evolving. There are, of course, the usual workflows, such as bursting to the cloud when in-house capacity is strained. Beyond those, Mayers expects HPC in the cloud to be used in several ways. For example, it might serve as a cost-effective HPC training tool in a university setting. Temporarily parking on-premise HPC capability in the cloud while the actual on-premise infrastructure undergoes change or maintenance is another. There’s no shortage of ideas.

Currently, Alces Flight’s traditional HPC integration business remains by far the largest piece of its business, accounting for around 90 percent. Its integration business model, says Mayers, differs from many HPC integration players in that it doesn’t buy and sell equipment; instead it works with “mostly tier one” vendors and connects customers directly with them. The revenue comes from services (installation, integration, life cycle services, etc.).
