Does Your Cluster Scale?

By Benoit Marchand

September 22, 2006

Introduction

This article discusses both the hidden and the painfully obvious scaling inefficiencies that data movement imposes on current commodity cluster and Grid computing. It also discusses the latest advances in parallel file serving technology and how these techniques can be used with current network topologies, file servers and unmodified applications to deliver throughput speed-ups on the order of 2X to 9X on certain workloads.

Commodity Clusters: The Promise and the Reality

IDC reported that in 2005 clusters represented about 50 percent of high performance and technical computing systems sales. Cluster sales have been growing very rapidly, driven by the desire to lower TCO for computationally intensive workloads.

I think there is a world market for maybe five computers.
    – Thomas Watson, Chairman of IBM, 1943

Perhaps you have more than five computers in your cluster. Using commodity components to build high-throughput clusters makes tremendous sense for delivering affordable processing power. In theory, if your application is highly parallel, the run time should shrink almost in proportion to the number of processing nodes allocated to the application workload. However, the move away from SMPs to clusters has caused the file serving capability to become an external resource.

If you were plowing a field, which would you rather use? Two strong oxen or 1024 chickens?
    – Seymour Cray, father of supercomputing

On “throughput-oriented data intensive” workloads, cluster administrators and users alike are discovering that performance not only fails to scale as more nodes are added, but the throughput actually starts to degrade. This phenomenon is perhaps best understood [at a high level] if you consider Seymour Cray's definition of a supercomputer: “a device that turns a compute-bound problem into an I/O-bound problem.”

Clearly this is the case for commodity cluster technology today. As commodity processor technology accelerates with Moore's Law, the network bandwidth and file serving capabilities are falling farther and farther behind. How can we address the I/O corollary to Amdahl's law (the slowest component governs overall performance) in a way that does not add significantly to the cost structure of a cluster? Parallel file serving technology offers an answer.

Typically on clusters today, Workload Managers (WLMs) assign jobs to nodes based on application license availability and the cluster or grid's computational node characteristics (CPU availability, memory, local disk, etc.). Consequently, jobs are assigned to cluster nodes and each cluster node then attempts to acquire the requisite data file(s) from the file system (be it NFS or a parallel file system) through the network in a haphazard manner. This results in an I/O traffic jam that throttles the efficiency and scaling of throughput applications on the cluster.

This phenomenon of poor data provisioning results in the I/O time eventually swamping the compute time as node count increases. Poor data provisioning starves the cluster, resulting in a throughput performance curve that resembles a quadratic in the performance vs. node count space — that is, run time improves as the number of nodes increases, but relatively less with each added node, up to some inflection point, then the time per job increases absolutely due to the continually increasing I/O burden becoming overwhelming.

Often these inefficiencies go unnoticed when one has not yet reached the inflection point, since the cluster CPU utilization, the network utilization and the file system all seem to be performing at acceptable levels. However, further analysis reveals that the cluster CPUs are busy waiting for data to arrive across the network from the file system and are NOT busy doing productive work.

Are you getting what you paid for?

When cluster administrators are asked whether their cluster is utilized efficiently, most would instantly and resoundingly say “YES.” Often it's the users who have experienced performance degradation (or less improvement than expected) after a hardware upgrade who look at this issue more closely.

There are a number of methods available to measure the total efficiency of your cluster on specific applications. The easiest method is explained in the following two steps:

1. First, edit your run scripts and add “/bin/time” (or an equivalent command) where applications are launched. Then run the application on your entire cluster at once, using your normal method of file serving (e.g. NFS from a NAS system or parallel file system).

/bin/time blastall -p ...

2. Second, try pre-staging your files. Run the same script on a single node (or all nodes if you wish) but first transfer all input files to local disk (on each cluster node) and designate output files to be written locally. Collect the timings and calculate the efficiency as “(user + system) / real”. Compare the results obtained with NFS (or your other file serving mechanism) to the results obtained from local disk. You will likely be surprised to discover that you have a data serving bottleneck.

real: 160.1
user:  80.3
sys:    1.6

efficiency: 51 percent
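
The same calculation can be scripted so it is easy to repeat for every run. The following is a minimal sketch, not part of any tool mentioned in this article; the function name `cpu_efficiency` is an assumption made for illustration, and the numbers are the sample timings shown above.

```python
def cpu_efficiency(real: float, user: float, system: float) -> float:
    """Fraction of wall-clock time spent on CPU work: (user + system) / real."""
    if real <= 0:
        raise ValueError("real (wall-clock) time must be positive")
    return (user + system) / real


# Sample timings from the run above, in seconds.
eff = cpu_efficiency(real=160.1, user=80.3, system=1.6)
print(f"efficiency: {eff:.0%}")
```

Running it once on the file-server timings and once on the local-disk timings gives the two figures to compare.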

Typically such tests take just a few minutes to set up and to analyze results. So with a minimum time investment on your part you can get an accurate picture of your cluster's processing efficiency on your favorite applications.

How parallel file serving technology can improve cluster efficiency

With a clear understanding of what limits cluster efficiency, let's look at ways to boost file serving capacity. Distributed file systems, parallel file systems, high bandwidth file serving devices, etc. exhibit better performance characteristics than standard NFS topologies, but do not provide a quantum leap in performance and are often prohibitively expensive to acquire. In the case of some parallel file systems, they are additionally very expensive when it comes time to expand storage capacity.

Parallel file serving technology with sophisticated error recovery and fault resilience techniques offers a method of replicating data sets ahead of applications being pushed to nodes. This is a novel and tantalizing approach to solving the data provisioning problem plaguing commodity clusters and edge grids today. However, replication as a parallel file serving technology is not a general-purpose solution, and the characteristics of an application need to be examined to determine whether this approach will significantly accelerate cluster throughput.

The ideal application profile is where the same, or substantially the same, large (300 MB to 50 GB) data set is being sent to each node of a cluster that utilizes 32 or more nodes. Examples of these types of ideal codes include Genome Searches (NCBI-BLAST), Medical Imaging, Weather Analysis, Rendering, etc.

However, many parallel applications work by splitting a problem via domain decomposition, such that each sub-task requires just a portion of the entire data set. So why replicate the whole data set on each node? There are several reasons.

First, there is always some overlap in the data set requirements between sub-tasks (boundary zones), which means that part of the data set is sent over the network many times. In such cases replicating the overlapped sections is more efficient.

Second, ordinarily each node runs multiple sub-tasks and over time each node will have utilized a significant proportion of the entire data set. In practice data sets are sent repeatedly over the network consuming bandwidth and aggravating network congestion. It makes sense then to send the data set just once to all nodes concurrently.

Third, when sub-tasks start to execute, I/O capacity is limited by the lesser of file serving and network capacity. In general, file serving delivers less than 200 MB/s on a large cluster (500 MB/s with high-end specialized file servers). But when data is replicated to the nodes, I/O capacity grows linearly with the number of nodes. So at 50 MB/s per node, a cluster of 100 nodes can generate 5 GB/s of I/O capacity, an order of magnitude or more beyond “top of the line” file servers.
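
The arithmetic behind that comparison is simple enough to check directly. The sketch below uses only the figures quoted in the paragraph above; the function name `aggregate_local_io_mb_s` is hypothetical, and the model deliberately ignores everything except per-node local disk bandwidth.

```python
def aggregate_local_io_mb_s(nodes: int, per_node_mb_s: float) -> float:
    """Total I/O capacity when every node reads its own local replica."""
    return nodes * per_node_mb_s


# 100 nodes at 50 MB/s each, as in the text.
replicated = aggregate_local_io_mb_s(100, 50.0)
print(f"replicated: {replicated / 1000:.0f} GB/s")

# Compare against the two file-server figures quoted in the text (MB/s).
for label, server_mb_s in (("typical", 200.0), ("high-end", 500.0)):
    print(f"{label} file server: {replicated / server_mb_s:.0f}x")
```

With these inputs the replicated capacity is 10 times the high-end file server figure and 25 times the typical one.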

Fourth, one has to take into account the file serving overhead which replication does not experience. The following is a real-world example of such overhead:

Recently CeBiTec, a leading bioinformatics research center located at Bielefeld University in Germany, tested the following scenario for parallel file serving performance. They provisioned 935 MB to each of 120 nodes over 1 Gbit Ethernet using a 250 MB/s file server, first with CacheFS and then with replication technology. As the results below show, replication from data residing on the file server was essentially as fast as CacheFS with a 100 percent hit rate on the local nodes and no dirty cache entries (the absolute best case for CacheFS). Compared to CacheFS with a dirty cache, or to regular NFS v3, replication was over 20 times faster. This means that even in the ideal case for file serving, where no actual data movement occurs (all data is perfectly cached on all nodes), file serving is barely better than straight replication from the original file server. This is what file serving overheads can do to degrade your cluster efficiency.

Method                            Time (mm:ss)

NFS v3                            14:50
NFS CacheFS (cache filled)         0:30  (estimated)
NFS CacheFS (cache dirty)         11:50
Replication                        0:35
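
To make the ratios in the table explicit, the short sketch below converts each mm:ss entry to seconds and divides by the replication time. The timings are copied from the table above; the helper name `to_seconds` is an assumption made for illustration.

```python
def to_seconds(mmss: str) -> int:
    """Convert an 'mm:ss' string to a number of seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


# Timings copied from the table above (the CacheFS "filled" entry is an estimate).
timings = {
    "NFS v3": "14:50",
    "NFS CacheFS (cache filled)": "0:30",
    "NFS CacheFS (cache dirty)": "11:50",
    "Replication": "0:35",
}

baseline = to_seconds(timings["Replication"])
for method, mmss in timings.items():
    secs = to_seconds(mmss)
    print(f"{method:28s} {secs:4d} s  {secs / baseline:5.1f}x replication time")
```

NFS v3 comes out at roughly 25 times the replication time and CacheFS with a dirty cache at roughly 20 times.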

Finally, few people believe replication can be made to scale on large clusters. The same research center also tested the scalability of replication. At 10 nodes, replication to every node took 45 seconds, and at 120 nodes it took just 35 seconds: better than linear scaling. While that result is somewhat unusual, nearly linear performance is typically seen. The point is that replicating to hundreds of nodes costs about the same time as replicating to a few nodes.

What has to change to use replication?

Migration to a replication process takes very little time. There is no need to change the applications, Workload Manager, file server, networking or OS environment. Simply edit your job scripts and point to the local cache where appropriate when starting applications. Users can be up and running within the first day.
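
As an illustration of what "pointing to the local cache" can look like in a job script, here is a minimal sketch. It is not the interface of any particular replication product; the function name `resolve_input` and both paths are hypothetical, and it simply prefers a locally replicated copy when one exists and falls back to the shared file system otherwise.

```python
import os


def resolve_input(shared_path: str, local_cache_dir: str) -> str:
    """Return the locally replicated copy if present, else the shared path."""
    candidate = os.path.join(local_cache_dir, os.path.basename(shared_path))
    return candidate if os.path.isfile(candidate) else shared_path


# Hypothetical paths: an NFS-served database and a node-local cache directory.
database = resolve_input("/nfs/data/nt.db", "/scratch/cache")
print(database)
```

The application is then launched with the returned path in place of the original one, so the application itself is unchanged.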

It is possible to address the scaling issue faced with data-intensive throughput workloads by employing parallel file serving technology, provided the technology supports sophisticated error recovery and fault resilience techniques and is designed for scalability to large numbers of nodes/clients. The technology should:

  • Solve synchronization, error recovery and back-up recovery issues of replication
  • Automatically pre-stage data, which means that while nodes are running a set of processes the data needed for the next set is pre-loaded in the background, thus completely eliminating file serving latencies and network bottlenecks
  • Automatically synchronize with Workload Managers to ensure that jobs only run when data is ready to be used at the nodes
  • Establish an efficient, asynchronous processing pipeline for both input and output data
  • Automatically clean up data sets that are no longer used
  • Automatically resynchronize nodes after a crash, automating complex and tedious management chores
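
The pre-staging and "run only when data is ready" requirements in the list above can be sketched on a single node as a two-stage pipeline. This is an illustrative toy under stated assumptions, not how any specific product works: `run_pipeline`, `stage` and `process` are hypothetical names, `stage` stands in for copying a data set to local disk, and `process` stands in for the application run.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List, TypeVar

D = TypeVar("D")
R = TypeVar("R")


def run_pipeline(names: Iterable[str],
                 stage: Callable[[str], D],
                 process: Callable[[D], R]) -> List[R]:
    """Process data sets in order while staging the next one in the background."""
    queue = list(names)
    results: List[R] = []
    if not queue:
        return results
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(stage, queue[0])
        for upcoming in queue[1:] + [None]:
            data = pending.result()  # the job starts only once its data is ready
            if upcoming is not None:
                pending = pool.submit(stage, upcoming)  # pre-stage the next set
            results.append(process(data))
    return results


# Toy usage: "staging" upper-cases a name, "processing" measures its length.
print(run_pipeline(["set-a", "set-bb"], str.upper, len))
```

While `process` works on one data set, the background thread is already staging the next, which is the overlap the list describes.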

With scalable parallel file serving, high performance technical computing on clusters becomes a reality for many more users.

Make everything as simple as possible, but not simpler.
    – Albert Einstein

—–

Benoit Marchand is the CEO and founder of eXludus and has been active in the field of high performance computing and distributed processing applications for over 20 years. He has held management positions at Sun Microsystems and SGI. He received an MBA from HEC in Montreal and a master's degree in computer science from the University of Waterloo.
