Report from the 18th Machine Evaluation Workshop

By Christopher Lazou

December 7, 2007

About 250 academic researchers and vendor personnel attended the 18th Machine Evaluation Workshop on Nov. 27-28, held at the Holiday Inn Hotel, Runcorn, near the Science and Technology Facilities Council (STFC) Daresbury Laboratories in the UK. This excellent workshop, in its eighteenth year, is a leading UK national event dedicated to distributed, high performance scientific computing. The principal objective is to encourage close contact between the research communities and the major vendors of mid-range computing systems, clusters, workstations, servers, interconnect, storage and software.

Most of the 31 presentations were from vendors, describing their own products, on topics such as hardware, compilers, graphics, storage and networking. They focused on cluster solutions, based on commodity chips, interconnect networks and associated file storage systems. An important component of the workshop is the availability of systems for benchmarking evaluation purposes.

There were 25 companies exhibiting, keen to promote their ready-made products, including those based on both dual- and quad-core technologies, the latter including AMD’s Barcelona and Intel’s Clovertown and Harpertown processors, plus Intel’s Itanium family. Systems based on the older dual-core AMD Opteron and Intel ‘Woodcrest’ processors had a strong presence and, along with the initial quad-core systems and various models of blade products, were on display and available for demonstrations.

Of course there are many factors that often dominate the selection of systems. For example, one trade-off is price/performance; another is the type and size of application the system is purchased for. This is pertinent, especially for clusters built from commodity chips with various strengths and weaknesses in memory bandwidth and in the chosen interconnect fabric. Electrical power, space and other infrastructure requirements are also pertinent to the decision. Total cost of ownership depends upon the weight of each of these elements.

Mike Ashworth from the STFC Daresbury Laboratories described the STFC petascale project and the work his team is doing preparing the UK academic research community for petascale computing. This is in line with the UK HPC strategy: “… the UK should aim to achieve sustained petascale performance as early as possible across a broad field of scientific applications, permitting the UK to remain internationally competitive in an increasingly broad set of high-end computing grand challenge problems.”

The STFC petascale project is trying to identify what new science can be done with 100 teraflops or 1,000 teraflops sustained performance. Does one strive for larger problems? Multi-scale? Multi-disciplinary? The STFC team is also trying to address the technical issues for extreme scaling. For example, how will existing codes scale to 100,000 processors and what code attributes are likely to impede this scaling? Is it the scaling of time with processors, time with problem size, memory with problem size, data management, including pre- and post-processing, visualization, or system robustness? And what of interconnects?
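
One way to get a feel for the scaling question is Amdahl's law. It was not mentioned in the talk, but it is a standard first-order model: if a fraction s of the run time is serial, the speed-up on p processors cannot exceed 1/(s + (1 - s)/p). The sketch below is illustrative only, and the serial fractions are hypothetical values, not measurements from the STFC project.

```python
def amdahl_speedup(serial_fraction: float, processors: int) -> float:
    """Upper bound on speed-up predicted by Amdahl's law."""
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must lie in [0, 1]")
    if processors < 1:
        raise ValueError("processors must be at least 1")
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / processors)


if __name__ == "__main__":
    # Hypothetical serial fractions, chosen only to show the trend.
    for s in (0.01, 0.001, 0.0001):
        for p in (10_000, 100_000):
            print(f"serial fraction {s:g}, {p:>7} processors: "
                  f"speed-up <= {amdahl_speedup(s, p):,.0f}")
```

Even a code that is 99 percent parallel is capped at a speed-up below 100 in this model, which is why small serial sections, I/O and pre- and post-processing matter so much at this scale.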

Mike described some work in progress on scaling and running codes on large Cray systems, IBM Power5 and Blue Gene/P, showing the team’s achievements to date. His conclusions were: Petascale computing will be available in the UK, possibly by 2011, achieved by a system with a massive increase in the number of processors and based on multicore nodes. There is an urgent need to look at scalability and other issues on O(10,000-100,000) processors. There is also a need to look at alternatives/additions to the existing programming model of serial language plus MPI.

With hundreds of thousands of processors, reliability and fault tolerance to mitigate MTBF effects are essential. Software developments for an easy-to-use environment are also indispensable. The challenge facing the industry is not only system design for scalable robustness, low power consumption, small weight and space, but also software: operating systems, compilers, tools, application porting and licensing.
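
The MTBF concern can be made concrete with a simple model. It is an assumption for illustration, not something presented at the workshop: if components fail independently with exponentially distributed lifetimes and any single failure interrupts the job, the system MTBF is the component MTBF divided by the number of components. The five-year component MTBF below is a hypothetical figure.

```python
HOURS_PER_YEAR = 24 * 365


def system_mtbf_hours(component_mtbf_hours: float, n_components: int) -> float:
    """System MTBF assuming independent, exponentially distributed failures
    where any one component failure brings the whole system down."""
    if n_components < 1:
        raise ValueError("n_components must be at least 1")
    return component_mtbf_hours / n_components


if __name__ == "__main__":
    component_mtbf = 5 * HOURS_PER_YEAR  # hypothetical: 5 years per component
    for n in (1_000, 10_000, 100_000):
        hours = system_mtbf_hours(component_mtbf, n)
        print(f"{n:>7} components: system MTBF ~ {hours:.2f} hours")
```

Under these assumptions the interval between failures shrinks in direct proportion to the component count, which is the reason fault tolerance becomes a design requirement rather than an optional extra.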

At the high end of computing, we have seen various products using new technologies (exemplars are the IBM BG/L/P/Q and Cell B.E., FPGAs, ClearSpeed and GPUs) starting from niche domains and gradually becoming mainstream. The next generation of commodity chip designs is focusing on high performance/low power consumption ratios, and is likely to use the extra silicon for heterogeneous cores on a die. Semiconductor power trends are driving future systems, but the runaway multicore growth is not going to happen without a quantum improvement in parallel software.

As in previous years, the Daresbury Benchmark results were of great interest and the centrepiece of this workshop. These consisted of a plethora of serial, rate and parallel benchmark results, compiled by Martyn Guest and Christine Kitchen, formerly of Daresbury and now at the University of Cardiff. The benchmarks were applied to many cluster and high-end systems. Particular focus was given to the products from vendors using the latest multicore chips.

While pointing to the limited value of single-processor benchmarks on such chips, Martyn focused initially on throughput (or rate) benchmarks. The so-called ChemRate benchmark suite, used to obtain these results, comprises a number of kernel benchmarks that reflect the activities of the computational chemist (matrix multiplication, matrix diagonalisation, similarity transformations, etc.) plus two application codes: the ab initio molecular electronic structure package GAMESS-UK and the molecular dynamics benchmark DL_POLY.

These rate benchmarks pointed to the improved overall performance offered by quad-core processors compared to their dual-core counterparts, although the level of performance obtained in practice remained heavily dependent on the application. Thus on dual-processor quad-core nodes, performance improvements ranging from 3X up to the full 8X speed-up over a single core were achieved. While Intel’s Harpertown processor showed considerable improvement over the earlier Clovertown processor, its performance advantage over modestly clocked AMD Barcelona processors (2.0 and 2.1 GHz) was again markedly dependent on the memory bandwidth demands of the application in question.
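
To put the 3X and 8X figures in perspective, a rate result can be expressed as throughput efficiency: the speed-up over a single core divided by the number of cores in the node. The short sketch below uses only the two end points quoted above; the function name is ours, not part of the ChemRate suite.

```python
def throughput_efficiency(speedup_over_one_core: float, cores: int) -> float:
    """Fraction of ideal throughput achieved when all cores are loaded."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return speedup_over_one_core / cores


if __name__ == "__main__":
    cores_per_node = 2 * 4  # dual-processor, quad-core node
    for speedup in (3.0, 8.0):  # the range reported in the talk
        eff = throughput_efficiency(speedup, cores_per_node)
        print(f"{speedup:.0f}X on {cores_per_node} cores: {eff:.1%} efficiency")
```

On this measure the least favourable workloads used under 40 percent of the node's nominal capability, while the most favourable used all of it.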

Martyn Guest augmented this throughput benchmark analysis from previous years by including a variety of parallel applications, measuring not only performance on a single node, but extending the analysis to include application performance on dual-core and quad-core commodity clusters. This approach attempts to address the impact of cluster architectures on performance, a more relevant metric for medium or small size university systems.

Martyn concentrated on capacity-based, modest sized clusters, built out of commodity chips, with a typical usage modality of 32 to 64 cores. The application performance comparisons were against a host of dual processor, dual-core systems from AMD (socket-F) and Intel (Woodcrest), plus the following quad-core clusters:

  • An AMD Barcelona cluster, located at the AMD Developer Center, featuring nine Supermicro nodes with AMD Opteron 2350 (2.0 GHz) processors plus Mellanox’s ConnectX, allowing benchmark runs on up to 72 cores.
  • A modest cluster of two IBM X3550 nodes at IBM’s Montpellier plant in France, featuring Harpertown (2.83 GHz) processors (1333 FSB) plus InfiniBand, allowing benchmark runs on up to 16 cores.
  • A much larger SGI Ice cluster at SGI’s Chippewa Falls, Wis. facility, featuring Intel X5365 Clovertown (3.0 GHz) nodes (1333 FSB) plus InfiniBand, allowing benchmark runs on up to 1024 cores.

Across some 70 informative PowerPoint slides of multicoloured measurements, the real drama revolved around the comparative performance of the clusters based on quad-core chips, the 2.83 GHz Harpertown and 3.0 GHz Clovertown from Intel, and how these compared with AMD’s quad-core (2.0 GHz) Barcelona. The dual-core Intel EM64T (3.0 GHz) ‘Woodcrest’ Xeon 5160 and the AMD dual-core 254/2218-F Socket (2.66 GHz) Opteron were used as a baseline.

Broadly, the results from this snapshot evaluation showed the Harpertown chip performing significantly better than Clovertown across a range of applications, and better than the AMD Barcelona in line with its higher GHz rating. For molecular dynamics problems with very low demands on memory bandwidth, Harpertown is the obvious choice. For those applications with higher memory bandwidth requirements, e.g., the DNS simulation code PDNS3D, Barcelona systems perform far better. The advice is “know your application”. Remember, too, that in the next couple of months AMD is likely to release a 2.5 GHz version of the Barcelona, so the relative orderings could well be different then.

Some truisms never change. Martyn emphasised that single-processor benchmarks often provide misleading results given the complexity of current processors and their subsequent utilisation as the building blocks of n-way cluster nodes.

Some obvious advice: If the processors are dual-core, use both cores; if quad-core, use all four, so that interactions of cache, memory and communications are accounted for in any performance measure. When more cores are added on a chip, the interconnect fabric transfer rates (bandwidth) to L3 memory need to be increased proportionately to match the extra computational power from the extra cores. Otherwise, the sharing of memory paths by the additional cores is likely to degrade the overall system performance. This argues strongly for the use of rate benchmarks rather than the more traditional single processor metrics.
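
The bandwidth argument is simple arithmetic: if the path to memory is shared equally, each active core gets the total bandwidth divided by the number of active cores. The sketch below assumes an even split and uses a hypothetical 10 GB/s per-socket figure chosen for illustration, not a measured value for any chip discussed here.

```python
def bandwidth_per_core(total_gb_per_s: float, active_cores: int) -> float:
    """Memory bandwidth available to each core under an even split."""
    if active_cores < 1:
        raise ValueError("active_cores must be at least 1")
    return total_gb_per_s / active_cores


if __name__ == "__main__":
    socket_bandwidth = 10.0  # hypothetical GB/s, for illustration only
    for cores in (1, 2, 4):
        share = bandwidth_per_core(socket_bandwidth, cores)
        print(f"{cores} active core(s): {share:.2f} GB/s each")
```

A single-core benchmark sees the whole figure, whereas a fully loaded quad-core chip leaves each core with a quarter of it, which is the effect a rate benchmark exposes.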

The rest of the workshop consisted of presentations from vendors specialising in storage systems and interconnects (such as 10 Gigabit Ethernet, InfiniBand and ConnectX), as well as tailored system solutions built on demand from commodity components. These come in various flavours, including buying a system from traditional vendors, who in turn subcontract a specialist computer integration company, e.g., ClusterVision or Streamline, to build a cluster from a customer’s favoured chips and interconnect fabric. These systems vary in size, with some of them attaining TOP500 status.

The findings at the 18th Machine Evaluation Workshop broadly confirm the Los Alamos findings [see LA-UR-07-6855, titled “Experiences in scaling scientific applications on current generation quad-core processors,” authored by K. J. Barker, D. J. Kerbyson, et al., CCS-1]. This paper compares the performance of two quad-core chips when running several Los Alamos codes. The chips are the Intel X7350 quad-core Tigerton (two dual-core dies in one package) and the AMD 8350 quad-core Opteron processor, Barcelona.

The Tigerton is clocked at 2.93 GHz, so the dual-chip module (DCM) has a theoretical peak performance of 46.9 gigaflops. The Barcelona combines four Opteron cores onto a single die. Each die contains a single integrated memory controller and uses a HyperTransport (HT) network for point-to-point connections. Each core has a private 512 KB L2 cache, and each processor has a shared 2 MB L3 cache. The shared L3 cache is new to the Opteron architecture. The new 128-bit SSE4a instructions enable each core to execute four double-precision floating-point operations per clock. Each core is clocked at 2.0 GHz, giving each chip a peak performance of 32 gigaflops. Note that the power consumption of the Tigerton is 130 watts and that of the Barcelona is 95 watts.
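
The quoted peak figures follow from clock rate × cores × floating-point operations per core per clock. The sketch below reproduces them, assuming four double-precision operations per core per clock for both chips (which is what the quoted peaks imply), and then divides by the quoted power figures to give a rough peak gigaflops-per-watt comparison. Peak numbers say nothing about sustained application performance.

```python
def peak_gflops(clock_ghz: float, cores: int, flops_per_clock: int) -> float:
    """Theoretical peak: clock rate x cores x flops per core per clock."""
    return clock_ghz * cores * flops_per_clock


# Clock, core count and power are taken from the article;
# 4 flops per core per clock is an assumption consistent with the quoted peaks.
CHIPS = {
    "Tigerton": {"clock_ghz": 2.93, "cores": 4, "flops_per_clock": 4, "watts": 130},
    "Barcelona": {"clock_ghz": 2.0, "cores": 4, "flops_per_clock": 4, "watts": 95},
}

if __name__ == "__main__":
    for name, spec in CHIPS.items():
        peak = peak_gflops(spec["clock_ghz"], spec["cores"], spec["flops_per_clock"])
        print(f"{name}: peak {peak:.1f} gigaflops, "
              f"{peak / spec['watts']:.3f} gigaflops per watt")
```

On paper the two chips are close in peak performance per watt, so the choice between them rests more on how an application uses memory than on these headline numbers.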

The Los Alamos paper reports in detail the results from their tests and concludes: “When considering the performance of a single core, where there are no memory bottlenecks, the higher clock speed of Intel’s Tigerton gives applications a measured 25-44 percent performance advantage over AMD’s Barcelona. When we look at using all of the cores in a processor, the results are more dependent on the way in which each application uses memory. Barcelona has the edge in memory bandwidth available on a single processor.”

It then goes on to say: “When examining scaling across the entire 16-core node, the results are somewhat mixed. In general, applications with small memory footprints perform better on the Tigerton and see almost perfect scaling. More memory-bandwidth-intensive applications, on the other hand, scale better on the Barcelona because of the reduced memory contention. For many of the applications, the better scaling behavior results in a higher achievable performance on the Barcelona… .”

In short, for applications as they exist today, it is important to consider the balance between compute rate and memory rate when selecting a processor from which to build a cluster. Neither the Barcelona nor the Tigerton is unambiguously faster than the other. The decision of which to use must be made on a per-application (or per-workload) basis.
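
The balance between compute rate and memory rate can be sketched with the roofline model. Neither the workshop nor the Los Alamos paper is cited here as using it; it is a standard way to reason about this trade-off. Attainable performance is the smaller of the peak rate and memory bandwidth multiplied by the application's arithmetic intensity (flops per byte moved). The peak values are those quoted in the article; the bandwidth values are hypothetical, chosen only to show how the ranking can flip.

```python
def attainable_gflops(peak_gflops: float,
                      bandwidth_gb_per_s: float,
                      flops_per_byte: float) -> float:
    """Roofline bound: limited by either compute peak or memory traffic."""
    return min(peak_gflops, bandwidth_gb_per_s * flops_per_byte)


# Peaks are from the article; bandwidths are HYPOTHETICAL illustration values.
HIGHER_PEAK = {"peak": 46.9, "bandwidth": 8.0}
HIGHER_BANDWIDTH = {"peak": 32.0, "bandwidth": 12.0}

if __name__ == "__main__":
    for intensity in (0.5, 2.0, 8.0):
        a = attainable_gflops(HIGHER_PEAK["peak"],
                              HIGHER_PEAK["bandwidth"], intensity)
        b = attainable_gflops(HIGHER_BANDWIDTH["peak"],
                              HIGHER_BANDWIDTH["bandwidth"], intensity)
        print(f"{intensity} flops/byte: higher-peak chip {a:.1f}, "
              f"higher-bandwidth chip {b:.1f} gigaflops")
```

With these assumed numbers the higher-bandwidth chip wins at low arithmetic intensity and the higher-peak chip wins at high intensity, mirroring the per-application conclusion above.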

The 18th Machine Evaluation Workshop highlighted some very positive trends in multicore chip developments. For example, quad-core processors in general are more cost-effective than dual-core processors, but when memory bandwidth is an issue, the performance starts to unravel. There is an impact from message passing latency. For large clusters, there are still some concerns. Reliability is not robust enough and availability often dips below 90 percent. Some sites need to reboot the whole system as often as twice a week. Thus, know your application and always remember the Latin adage: “Caveat Emptor.”

As the season of goodwill is upon us, I wish you all Season’s Greetings and a Peaceful and Happy New Year, 2008.

-----

Copyright (c) Christopher Lazou. December 2007. Brands and names are the property of their respective owners.
