The Leading Source for Global News and Information Covering the Ecosystem of High Productivity Computing
March 21, 2008
The era of single processor systems is over; the multi- and many-core systems world is here. If you're not ready for this change, there's an IT train wreck in your future. We're entering a phase where taking full advantage of the power of multi-core processors is critical for customers to continue to accelerate innovation and to improve their business success.
Single core and life was good
For many years, the microprocessor community has translated Moore's Law of transistor density into a direct doubling of single-threaded performance every 18 months. Applications ran faster on each new processor version, and new versions were released frequently. Performance tuning of applications required minor experimentation with compilers and tuning flags.
This was a period of high productivity for application developers, since they could concentrate on product functionality and performance and minimize the time to create, tune, test, and support computer-model-unique versions. It was fun, but it could not last forever.
Today, the era of single processor systems is over. The multi- and many-core systems world is here. If you're not ready for this change, there's an IT train wreck in your future. We're entering a phase where taking full advantage of the power of multi-core processors is critical for customers to continue to accelerate innovation and to improve their business success. Dual-core technology is now pervasive in the industry; quad-core processors are here and about to become the new standard for server nodes; and octal-core processors, already on the roadmaps, are not far off.
Applications that were not designed to take advantage of the raw compute power of the added cores may actually run slower, and this becomes more likely as the core count increases. So, even though that bright, shiny new server you just bought has more raw compute capability, your applications may run more slowly.
Dual- and quad-core processors take the heat
The real question is: how did we get here? Did we really think that Moore's Law could go on indefinitely? It used to be that with each new system, the clock speeds increased and the systems ran faster. However, as the chip manufacturers have learned, there are repercussions. More speed requires significantly more power. More power means more heat. More heat means more cooling. Today, we're at the point where, if we continue on this power, performance, heat and cooling curve, even your laptop will soon require water cooling, and after that liquid nitrogen.
Dual-core processors, and then quad-core processors, often get the blame for making it more difficult and complex to program applications. However, they also get the credit for keeping power and cooling requirements at reasonable levels. The real issue is balancing system power, cooling, I/O, memory and cache. To meet new system balance requirements for power and cooling, clock speeds have declined: the more cores per processor chip, the lower the clock frequency. Moore's Law continues, but the additional transistors are used to implement more cores and larger caches.
As a result of this, the problem has shifted from the hardware challenge of making things run faster with faster clock cycles, to the software challenge of how to use all the additional cores (raw compute power) now available on the chip to improve performance. Unfortunately, this has created coding problems for application developers. A multi-core processor can do more work than a single-core processor, so the total amount of work, in compute jobs per month, increases on multi-core-based servers. But without taking into consideration the multi-core nature of today's systems, the performance of an individual application will not increase; it is likely to run more slowly as the number of cores increases, due to the combination of lower clock rates and competition for memory bandwidth and cache.
And it's not just the applications. Sending all your communications interrupts, for example, to one core could overwhelm that core -- possibly slowing down the rest of the system. It's that system balance thing again.
Solving the application performance problem
There is no easy solution to application performance. In many cases, serial (non-parallel) applications cannot be parallelized without considerable work and time. Many HPC applications are parallel, and some are highly scalable and can run faster if more cores can be allocated to their execution. But other parallel applications are not very scalable and hit the same performance barrier as serial applications.
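One common way to reason about why some parallel applications scale and others do not is Amdahl's law. The article does not name it, so the sketch below is an illustration under that assumption, not the author's method. The parallel fractions and the clock ratio (a lower per-core clock on a higher core-count chip) are hypothetical values, not measurements.

```python
def amdahl_speedup(parallel_fraction: float, cores: int,
                   clock_ratio: float = 1.0) -> float:
    """Estimate speedup over one core running at the reference clock.

    parallel_fraction: share of the runtime that can be spread over cores.
    cores: number of cores allocated to the job.
    clock_ratio: per-core clock relative to the reference core
                 (hypothetical parameter, e.g. 0.8 for a 20% lower clock).
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    if clock_ratio <= 0.0:
        raise ValueError("clock_ratio must be positive")
    serial_time = 1.0 - parallel_fraction        # cannot be spread out
    parallel_time = parallel_fraction / cores    # shrinks with more cores
    return clock_ratio / (serial_time + parallel_time)


if __name__ == "__main__":
    # Hypothetical workloads: serial, half parallel, highly parallel.
    for fraction in (0.0, 0.5, 0.95):
        for cores in (1, 2, 4, 8):
            speedup = amdahl_speedup(fraction, cores, clock_ratio=0.8)
            print(f"parallel={fraction:.2f} cores={cores} "
                  f"speedup={speedup:.2f}")
```

In this model a fully serial code on the slower-clocked part never gets above the clock ratio, no matter how many cores are added, which matches the article's warning that such applications can run more slowly on a newer server.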
The best way to make progress is to understand how an application uses system resources. With this knowledge, both developers and users of applications can improve performance. It's important to look at resource usage at the server level, not just the processor level. Much of the available data comes from the processor developers, but to understand application performance, the complete server must be analyzed. Important resources include memory bandwidth per core, I/O bandwidth per core, network bandwidth per core, amount of memory per core, and amount of cache per core.
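The per-core view of a server described above can be sketched as a simple calculation. This is a minimal illustration: the `Server` fields and the example figures are hypothetical, not HP specifications, and an even split is a simplification, since shared resources such as cache and memory bandwidth are not actually divided evenly at run time.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Server:
    """Server-level resources (all field names are illustrative)."""
    name: str
    cores: int
    memory_gb: float
    memory_bandwidth_gbs: float
    io_bandwidth_gbs: float
    network_bandwidth_gbs: float
    cache_mb: float


def per_core_resources(server: Server) -> Dict[str, float]:
    """Divide each server-level resource evenly by the core count."""
    if server.cores < 1:
        raise ValueError("a server needs at least one core")
    return {
        "memory_gb": server.memory_gb / server.cores,
        "memory_bandwidth_gbs": server.memory_bandwidth_gbs / server.cores,
        "io_bandwidth_gbs": server.io_bandwidth_gbs / server.cores,
        "network_bandwidth_gbs": server.network_bandwidth_gbs / server.cores,
        "cache_mb": server.cache_mb / server.cores,
    }


if __name__ == "__main__":
    # Hypothetical nodes: same memory system, twice the cores.
    dual = Server("dual-core node", 2, 16.0, 20.0, 2.0, 1.0, 8.0)
    quad = Server("quad-core node", 4, 16.0, 20.0, 2.0, 1.0, 8.0)
    for server in (dual, quad):
        print(server.name, per_core_resources(server))
```

Doubling the cores without changing the rest of the server halves every per-core figure, which is the system balance issue the article keeps returning to.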
Shared caches are also a complicating factor in application performance. New x86-64 processors share cache among two or more cores. As a result, it is not possible to know the amount of cache being used by one core at any one time. Shared caches can be both friend and foe to code performance. With analysis and work, application developers can take advantage of this low latency opportunity. But many codes are tuned for some minimal amount of cache per core, and application performance will suffer if less is available. Erratic application runtimes will be one symptom.
One of the solutions to application performance is to parallelize more code. But there is a roadblock to developing multi-threaded programs: uncertainty and confusion about the rules that must be followed by both users of, and compilers for, the C and C++ languages. As a result, shared memory parallel programming is much harder to implement than it should be.
HP is leading an effort to address this issue by specifying how multi-threaded C++ programs may interact through shared memory. HP has developed a proposal for the upcoming revision to the C++ standard. This proposal supports a simple model that requires no understanding of hardware or compiler optimizations.
Job management software tools can also be used to aid performance. An application will run faster if it is intelligently scheduled onto servers that satisfy the code's resource requirements. HP works with partners such as Platform Computing, with open-source tools such as SLURM, and with its own software products such as HP-MPI, to design and implement this functionality.
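The idea of scheduling a job onto a server that satisfies its resource requirements can be sketched with a simple first-fit placement. This is not how SLURM or Platform Computing's products work internally; the `Node` and `Job` types, their fields, and the numbers are all hypothetical and exist only to illustrate the principle.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    """Free resources on one server (illustrative fields)."""
    name: str
    free_cores: int
    free_memory_gb: float
    free_mem_bandwidth_gbs: float


@dataclass(frozen=True)
class Job:
    """Per-core requirements of one job (illustrative fields)."""
    name: str
    cores: int
    memory_gb_per_core: float
    mem_bandwidth_gbs_per_core: float


def fits(job: Job, node: Node) -> bool:
    """True if the node can satisfy every requirement of the job."""
    return (
        job.cores <= node.free_cores
        and job.cores * job.memory_gb_per_core <= node.free_memory_gb
        and job.cores * job.mem_bandwidth_gbs_per_core
        <= node.free_mem_bandwidth_gbs
    )


def place(job: Job, nodes: List[Node]) -> Optional[str]:
    """First-fit: reserve resources on the first suitable node.

    Returns the node name, or None if the job must wait.
    """
    for node in nodes:
        if fits(job, node):
            node.free_cores -= job.cores
            node.free_memory_gb -= job.cores * job.memory_gb_per_core
            node.free_mem_bandwidth_gbs -= (
                job.cores * job.mem_bandwidth_gbs_per_core
            )
            return node.name
    return None


if __name__ == "__main__":
    cluster = [Node("n1", 4, 16.0, 10.0), Node("n2", 8, 32.0, 20.0)]
    for job in (Job("bandwidth-hungry", 4, 2.0, 4.0),
                Job("light", 4, 1.0, 1.0)):
        print(job.name, "->", place(job, cluster))
```

A scheduler that counted only free cores would put the bandwidth-hungry job on the first node; checking memory bandwidth as well sends it to a node that can actually feed it.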
Tools to address multi-core environments
In 2007, HP launched its Multi-core Optimization Program. This program, together with the Multi-core Toolkit, brings together developments from HP and over 15 technology partners to provide open, non-proprietary solutions that enhance multi-core performance across a variety of industry-standard HPC architectures, platforms and operating environments. By optimizing multi-core solutions, customers can maximize application performance on multi-core systems, enabling the larger simulations and additional data analysis needed to achieve their engineering, science and analytical goals.
While it's not possible to analyze every application, HP is studying and benchmarking, both independently and with its partners, a cross-section of applications whose results apply to broad application sets. For instance, we have completed characterization work in the areas of application energy, job scheduling and performance analysis. HP has collaborated with technology partners for decades, measuring and improving application performance. Our partners range from well-established industry giants in hardware and software, to TotalView Technologies, which specializes in debugging and memory analysis tools, to emerging technology players like Acumem, which provides comprehensive application performance characterization with respect to cache and memory bandwidth for multi-core systems.
Now we are extending this work to the broader industry, providing the same kind of information for multi-core systems and making it available to customers through a series of technical white papers on the toolkit website. Three examples are:
Scheduling to Overcome the Multi-Core Memory Bandwidth Bottleneck
Need more compute servers but don't have room for them? Or maybe you don't have enough electrical capacity to feed them, or enough cooling capacity? You're not alone. Multi-core processors may be the solution. While multi-core processors may solve many of the problems associated with compute cluster sprawl, they also present a new challenge: in some situations they cannot provide sufficient memory bandwidth per core to satisfy the requirements of certain HPC applications. This paper discusses methods to mitigate the effects of memory bandwidth limitations on modern multi-core processors from HP and Platform Computing.
Power Utilization vs. Application Performance on HP Servers Using Multi-core Processors – Conserving Application Energy
There are many ways to optimize high performance computing workloads. In addition to the common approaches such as single job runtime, multi-job throughput, and parallel scalability, this paper discusses optimizing for power consumption. Measurements of power versus performance for standard benchmarks and ISV applications are also provided.
Application performance characterization of dual and quad core systems using the two most popular network interconnects: Gigabit Ethernet (GigE) and InfiniBand (IB). - ACCELRYS ONETEP Benchmarks
This document contains benchmark data for ONETEP in Materials Studio 4.2 running on a range of HP servers, using Industry Standard processors running on several interconnects.
To summarize, unless you start to plan now, there is an IT train wreck at some point in your future. It's called multi-core and it's here now. HP launched its Multi-core Optimization Program to analyze and improve the performance of High Performance Computing applications on industry-standard servers running Linux and Windows. We're leading with our strengths: deep knowledge of HPC system and cluster design, deep knowledge of the major applications used in HPC, and long-term relationships with HPC technology partners and application developers. HP's Multi-core Program features a unique collaborative approach that combines HP products and technologies with those from a broad set of technology leaders and our partners to address the needs of the more complex multi-core systems coming out today and in the future.
All white papers and more information about the Multi-core toolkit are available at www.hp.com/go/multi-coretoolkit.
About the Author
Dave Field is the Manager of the High Performance Computing Solutions Engineering and Expertise group at Hewlett-Packard. Based in Richardson, Texas, this engineering group provides technical support to HP's HPC ISV partners. In addition, they characterize the performance of HP servers, software, and compute clusters in HPC configurations.