StarCluster Brings HPC to the Amazon Cloud

By Justin Riley

May 18, 2010

Setting up an HPC cluster in the cloud can be a daunting task for new users looking to utilize the cloud to run their HPC applications. Learning the ins and outs of the infrastructure as a service (IaaS) model in addition to configuring and installing a typical HPC system is not an easy task.

In order to use the cloud effectively, users need to be able to automate the process of requesting and configuring new resources and also terminate resources when they’re no longer required without losing data. These concerns can be a challenge even for advanced users, and addressing them correctly requires some level of cloud programming. In an effort to improve this situation, the Software Tools for Academics and Researchers (STAR) group at MIT has created an open-source project called StarCluster that allows anyone to create and manage their own HPC clusters hosted on Amazon’s Elastic Compute Cloud (EC2) without needing to be a cloud expert.

StarCluster Configuration

One of StarCluster’s primary goals is to be simple to use and to hide as many of the cloud computing details from users as possible. When a new user attempts to use StarCluster for the first time an example configuration file is created that is ready to be used out-of-the-box. The user simply needs to fill in the EC2 account information and optionally customize the number of machines to use before he or she is ready to start a cluster. Starting a cluster with the example configuration will launch a two-machine cluster using the cheapest instance types available on EC2. This allows users to experiment with StarCluster for the first time without dramatic up-front costs.

The group of cluster-specific settings in the configuration file is known as a “cluster template”. StarCluster supports defining multiple cluster templates which can be used when launching a cluster. For example, it’s often useful to have separate templates for different cluster sizes such as a template that defines a small two-machine cluster and another template that defines a large ten-machine cluster. These templates can be specified at runtime to allow a variety of configurations to be used when starting a cluster.
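
As a rough illustration of how multiple templates might sit side by side, the sketch below parses an INI-style configuration containing a small and a large cluster template. The section names (cluster smallcluster, cluster largecluster), the CLUSTER_SIZE and NODE_INSTANCE_TYPE keys, the instance type value, and the helper name load_cluster_templates are assumptions made for this example and are not taken from the article; the example configuration file that StarCluster generates is the authority on the actual settings.

```python
import configparser

# Hypothetical configuration text; section and key names are illustrative only.
EXAMPLE_CONFIG = """
[cluster smallcluster]
CLUSTER_SIZE = 2
NODE_INSTANCE_TYPE = m1.small

[cluster largecluster]
CLUSTER_SIZE = 10
NODE_INSTANCE_TYPE = m1.small
"""


def load_cluster_templates(text):
    """Return {template_name: settings_dict} for every [cluster NAME] section."""
    parser = configparser.ConfigParser()
    parser.optionxform = str  # keep key names in their original case
    parser.read_string(text)
    templates = {}
    for section in parser.sections():
        kind, _, name = section.partition(" ")
        if kind == "cluster" and name:
            templates[name.strip()] = dict(parser[section])
    return templates


templates = load_cluster_templates(EXAMPLE_CONFIG)
for name, settings in sorted(templates.items()):
    print(name, settings["CLUSTER_SIZE"])
```

Keeping each template in its own named section means a template can be selected by name at launch time without editing the file.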

Starting an HPC Cluster on EC2

Once the configuration file has been created, starting a cluster is as simple as running “starcluster start mynewcluster” at the command line. This command will first verify that all settings in the configuration file are valid and are likely to create a working system. Once the settings in the configuration file have been verified, the “start” command creates a new cluster based on these settings with a tag-name of “mynewcluster” on EC2.

Once the “start” command has finished, the user can log in to the “master” machine as root by running “starcluster sshmaster mynewcluster”. At this point the user has the (root) keys to the cluster just as they would with their own local resources.

StarCluster also has the ability to create multiple HPC clusters. Running the same “start” command again with a different tag-name will launch another HPC cluster in the cloud using the same settings as the previous run. If you’ve defined additional cluster templates in the configuration file these can optionally be used to specify a different group of settings to use when starting the next cluster.

Once the user has finished using a cluster, they simply pass its tag-name to StarCluster’s “stop” command to shut it down. For the “mynewcluster” example above the command would be “starcluster stop mynewcluster”. The “stop” command will shut down the entire cluster and end the billing period.
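
The start, log-in, and stop steps described in this section can be strung together in a small script. The following is a minimal sketch that only builds and prints the three command lines quoted in the article; the helper names starcluster_command and lifecycle are hypothetical, and nothing is executed, since running the commands would create billable EC2 resources.

```python
import shlex


def starcluster_command(action, tag_name):
    """Build the argument list for one of the commands quoted in the article."""
    allowed = {"start", "sshmaster", "stop"}
    if action not in allowed:
        raise ValueError("unsupported action: %r" % action)
    if not tag_name or tag_name.startswith("-"):
        raise ValueError("tag name must be a non-empty, non-option string")
    return ["starcluster", action, tag_name]


def lifecycle(tag_name):
    """Return the start, log-in, and stop commands for one cluster, in order."""
    return [starcluster_command(a, tag_name) for a in ("start", "sshmaster", "stop")]


for argv in lifecycle("mynewcluster"):
    # Printed rather than executed: running these would launch real machines.
    print(" ".join(shlex.quote(part) for part in argv))
```

To actually run a step, each argument list could be handed to subprocess.run(argv, check=True) on a machine where StarCluster is installed and configured.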

Automated HPC Cluster Configuration

StarCluster automatically configures each machine with the appropriate networking settings needed to communicate with the rest of the cluster. On top of this, StarCluster also fully configures password-less SSH communication for both the root user and a normal user on the cluster. Password-less SSH allows a user to log in remotely between machines in the cluster without using a password. This is useful when administering the machines in the cloud and is also a requirement for OpenMPI communication.

Most clusters have some form of queuing system for submitting and load-balancing computationally intensive tasks, or “jobs”, and StarCluster is no exception. Out-of-the-box, StarCluster installs and configures the open-source version of the Sun Grid Engine (SGE) queuing system for running distributed and parallel jobs on the cluster. A parallel queue is also configured by default that enables SGE to monitor and account for parallel tasks that use more than one machine in a single job.

Many parallel tasks are commonly written using the Message Passing Interface (MPI). For MPI users, StarCluster includes an SGE-aware OpenMPI installation that provides tight integration between the SGE job scheduler and MPI applications. This integration removes the need for users to specify a list of hosts to use when running an MPI job. Rather, OpenMPI will automatically fetch the host info it needs directly from SGE and begin execution. This allows all machines involved in the MPI calculation to be correctly accounted for by the queuing system.
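
To make the SGE and OpenMPI integration more concrete, the sketch below renders a job script that requests slots from a parallel environment and starts an MPI program without passing a host list. The parallel environment name orte, the program path ./my_mpi_program, the job name, and the helper name render_mpi_job are assumptions for illustration and do not come from the article; the #$ directive syntax and the NSLOTS variable are standard Grid Engine features.

```python
def render_mpi_job(job_name, program, slots, parallel_env="orte"):
    """Return the text of an SGE job script for an MPI program.

    parallel_env is an assumed name; use whatever parallel environment
    is actually configured on the cluster.
    """
    if slots < 1:
        raise ValueError("slots must be at least 1")
    lines = [
        "#!/bin/sh",
        "#$ -N %s" % job_name,                   # job name shown by the queue
        "#$ -cwd",                               # run from the submission directory
        "#$ -pe %s %d" % (parallel_env, slots),  # request slots in the parallel queue
        # No host list is passed: an SGE-aware OpenMPI obtains the hosts from SGE.
        "mpirun -np $NSLOTS %s" % program,
    ]
    return "\n".join(lines) + "\n"


script = render_mpi_job("pi_estimate", "./my_mpi_program", slots=4)
print(script)
```

The rendered text could be saved to a file and submitted with qsub; because the scheduler assigns the hosts, the same script works unchanged regardless of which machines are free.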

Sharing files between machines without manually copying files around is a requirement for most HPC systems. Typically this is done using a shared folder via the network file system (NFS). StarCluster automatically configures /home on each “worker” machine of the cluster to be NFS-shared from the “master” machine. This allows users to see their files on any machine in the cluster and also provides a globally accessible place for jobs to read input data and write their finished results.

The StarCluster Amazon Machine Image (AMI)

Amazon Machine Images are used by EC2 to load an entire operating system along with various applications, libraries, and data onto a newly requested virtual machine. Machine images are publicly available for just about any Linux distribution, Solaris, and even Microsoft Windows. New images can be created with custom software configurations by launching a new virtual machine from an existing AMI, installing your new software, and then running an AMI creation process on the machine to create a new AMI.

StarCluster comes with a publicly available custom-tailored AMI, in both 32-bit and 64-bit flavors, that contains the entire OS and software configuration needed for an HPC cluster on Amazon. The StarCluster AMI is based on Ubuntu Linux 9.10 and includes the Sun Grid Engine queuing system (open-source edition), the network file system, and OpenMPI along with common development tools and libraries to compile new software from source. The StarCluster AMI also includes a custom-compiled installation of the Automatically Tuned Linear Algebra Software (ATLAS) and Linear Algebra PACKage (LAPACK) libraries that have been optimized for the larger high-CPU instance types on EC2. For numerical Python users, the AMI contains both NumPy and SciPy installations that have been custom compiled against the optimized LAPACK/ATLAS installations. These optimized libraries provide a significant performance improvement when running linear algebra routines in the cloud.

Of course, StarCluster does not limit you to only these software installations. The StarCluster AMIs can easily be extended with your own software to create a brand-new AMI tailored for a specific need. To simplify the AMI creation process StarCluster provides a “createimage” command that will automatically create a new AMI from a running Amazon EC2 virtual machine in the cloud. This allows you to launch a single virtual machine, install your software, and easily create a new AMI from this machine. Using a new customized AMI with StarCluster is as simple as updating the configuration file with the new AMI’s identifier.

Using EBS Volumes for Persistent Storage

Amazon also provides a service called Elastic Block Store (EBS) which allows users to create virtual block storage volumes that are similar in functionality to a USB pen-drive. These volumes can be anywhere from 1GB to 1TB in size, and each volume can be attached to only one virtual machine at a time. The benefit of using these volumes is that any data written to EBS is automatically stored and persisted in the cloud even after all virtual machines have been terminated. This means the next time you start a cluster and attach the EBS volume, all of your data will be available as it was the last time you launched a cluster. Another benefit of using EBS volumes is that they’re easy to snapshot and duplicate, which allows for backing up large amounts of data in the cloud.

StarCluster has the ability to utilize Amazon’s EBS volumes to provide persistent data storage for a given cluster. To use EBS with StarCluster you must first create an EBS volume. For new users, this process is simplified by using StarCluster’s “createvolume” command. This command automates the process of creating, partitioning, and formatting a new EBS volume.

Using a new volume with StarCluster involves adding additional volume settings to the configuration file. These settings specify the volume to use and the location on the cluster’s file system to attach the volume. This file system location is then NFS-shared from the “master” machine to all “worker” machines. StarCluster does not limit you to using a single EBS volume. Multiple EBS volumes can be configured, attached, and shared on the cluster. This allows up to several terabytes of data to be stored on the cluster.

Getting Started with StarCluster

StarCluster is open-source software and can be downloaded for free from the StarCluster website at http://web.mit.edu/starcluster or from the Python Package Index (PyPI) at http://pypi.python.org/pypi/StarCluster.

UPDATE: We now have a video screencast of StarCluster in action that can be viewed here.

About the Author

Justin Riley is a software developer for the Software Tools for Academics and Researchers (STAR) group at the Massachusetts Institute of Technology (MIT). The STAR group seeks to bridge the divide between scientific research and the classroom by collaborating with faculty from MIT and other educational institutions to design software that explores core scientific research concepts. The STAR group works out of the Office of Educational Innovation and Technology (OEIT) under the Dean for Undergraduate Education (DUE) at MIT.

Justin has been developing with the Amazon cloud for the past three years and has successfully used the cloud to support the “Introduction to Modeling and Simulation” and “Intro to Parallel Programming for Multicore Machines using OpenMP and OpenMPI” courses at MIT. His work with StarCluster came directly from the need to provide a sustainable solution to the issues associated with bringing computational resources into the classroom. Justin created StarCluster to automate the process of locating, configuring, and maintaining computational resources without needing to be a 24/7 system administrator and without having to make a physical appearance to address potential hardware and software issues.
