March 21, 2008
Single core and life was good
For many years, the microprocessor community has translated Moore's Law of transistor density into a direct doubling of single-threaded performance every 18 months. Applications ran faster on each new processor version, and new versions were released frequently. Performance tuning of applications required minor experimentation with compilers and tuning flags.
This was a period of high productivity for application developers, since they could concentrate on product functionality and performance and minimize the time to create, tune, test, and support computer-model-unique versions. It was fun, but it could not last forever.
Today, the era of single-processor systems is over. The multi- and many-core systems world is here. If you're not ready for this change, there's an IT train wreck in your future. We're entering a phase where taking full advantage of the power of multi-core processors is critical for customers to continue to accelerate innovation and to improve their business success. Dual-core technology is now pervasive in the industry, quad-core processors are here and about to become the new standard for server nodes, and roadmaps already point to octal-core processors.
Applications that were not designed to take advantage of the increased raw compute power of the added cores may actually run slower, and the likelihood increases as the core count rises. So even though that bright shiny new server you just bought has more raw compute capability, your applications may run more slowly.
Dual- and quad-core processors take the heat
The real question is how we got here. Did we really think that Moore's Law could go on indefinitely? It used to be that with each new system, clock speeds increased and the systems ran faster. However, as the chip manufacturers have learned, there are repercussions. More speed requires significantly more power. More power means more heat. More heat means more cooling. Today we're at the point where, if we stay on this power, performance, heat and cooling curve, even your laptop will soon need water cooling, and liquid nitrogen not long after.
Dual-core processors, and then quad-core processors, often get the blame for making it more difficult and complex to program applications. However, they also get the credit for keeping power and cooling requirements at reasonable levels. The real issue is balancing system power, cooling, I/O, memory and cache. To meet new system balance requirements for power and cooling, clock speeds have declined: the more cores per processor chip, the lower the clock frequency. Moore's Law continues, but the additional transistors are used to implement more cores and larger caches.
As a result, the problem has shifted from the hardware challenge of making things run faster with higher clock speeds to the software challenge of using all the additional cores (raw compute power) now available on the chip to improve performance. Unfortunately, this has created coding problems for application developers. A multi-core processor can do more work than a single-core processor, so the total amount of work, measured in compute jobs per month, increases on multi-core-based servers. But without taking into consideration the multi-core nature of today's systems, the performance of an individual application will not increase; it is likely to run more slowly as the number of cores increases, due to the combination of lower clock rates and competition for memory bandwidth and cache.
And it's not just the applications. Sending all your communications interrupts, for example, to one core could overwhelm that core -- possibly slowing down the rest of the system. It's that system balance thing again.
Solving the application performance problem
There is no easy solution to application performance. Serial (non-parallel) applications often cannot be parallelized without considerable work and time. Many HPC applications are parallel, and some are highly scalable and can run faster if more cores can be allocated to their execution. But other parallel applications are not very scalable and face the same performance barrier as serial applications.
The best way to make progress is to understand how an application uses system resources. With this knowledge, both developers and users of applications can improve performance. It's important to look at resource usage at the server level, not just the processor level. Much of the available data comes from the processor developers, but to understand application performance, the complete server must be analyzed. Important resources include memory bandwidth per core, I/O bandwidth per core, network bandwidth per core, amount of memory per core, and amount of cache per core.
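As an illustration of the kind of per-core measurement this implies, the following sketch (not from the article; array sizes and repeat counts are arbitrary) runs a STREAM-style triad on one thread per core and reports a rough aggregate and per-core memory bandwidth. Dedicated benchmarks such as STREAM do this far more carefully, but the idea is the same.

```cpp
// Hypothetical sketch: estimate aggregate and per-core memory bandwidth
// by running a STREAM-style triad on one thread per hardware core.
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <functional>
#include <thread>
#include <vector>

int main() {
    const unsigned cores = std::max(1u, std::thread::hardware_concurrency());
    const std::size_t n = 1u << 22;   // 4M doubles, ~32 MB per array: well beyond cache
    const int reps = 10;

    std::vector<double> checksums(cores, 0.0);
    auto worker = [n, reps](double& sink) {
        std::vector<double> a(n), b(n, 1.0), c(n, 2.0);    // private arrays per thread
        for (int r = 0; r < reps; ++r)
            for (std::size_t i = 0; i < n; ++i)
                a[i] = b[i] + 3.0 * c[i];                  // triad: dominated by memory traffic
        for (std::size_t i = 0; i < n; ++i) sink += a[i];  // keep the work observable
    };

    auto start = std::chrono::steady_clock::now();
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < cores; ++t) pool.emplace_back(worker, std::ref(checksums[t]));
    for (auto& th : pool) th.join();
    double secs = std::chrono::duration<double>(std::chrono::steady_clock::now() - start).count();

    // 3 doubles touched per iteration (read b, read c, write a), ignoring write-allocate traffic.
    double total_gb = double(cores) * reps * n * 3.0 * sizeof(double) / 1e9;
    std::printf("threads: %u  aggregate: %.1f GB/s  per core: %.1f GB/s  (checksum %.1f)\n",
                cores, total_gb / secs, total_gb / secs / cores, checksums[0]);
    return 0;
}
```

Comparing the per-core figure against the same loop run on a single thread gives a rough sense of how much memory bandwidth each core really has once all of its neighbors are busy.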
Shared caches are a further complicating factor in application performance. New x86-64 processors share cache among two or more cores, so it is not possible to know how much cache one core is using at any given time. Shared caches can be both friend and foe to code performance. With analysis and work, application developers can take advantage of this low-latency opportunity. But many codes are tuned for some minimal amount of cache per core, and application performance will suffer if less is available; erratic application runtimes are one symptom.
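One rough way to see the effect (illustrative only, not from the article) is to sweep the per-thread working set from well inside to well outside the shared cache and compare a simple pass over it with one core active against all cores active. In the sketch below, the point where the all-cores numbers deteriorate marks where the threads no longer fit in the cache together.

```cpp
// Hypothetical sketch: each thread repeatedly sums its own working set.
// Per-element time rises once the combined working sets exceed the shared cache.
// Sizes and repeat counts are arbitrary; the timing is only indicative.
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <thread>
#include <vector>

static double ns_per_element(std::size_t bytes, unsigned nthreads) {
    const std::size_t n = bytes / sizeof(double);
    const int reps = 200;
    std::vector<double> sums(nthreads, 0.0);

    auto start = std::chrono::steady_clock::now();
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < nthreads; ++t)
        pool.emplace_back([n, reps, &sums, t]() {
            std::vector<double> data(n, 1.0);               // each thread owns its working set
            double s = 0.0;
            for (int r = 0; r < reps; ++r)
                for (std::size_t i = 0; i < n; ++i) s += data[i];
            sums[t] = s;                                    // keep the work observable
        });
    for (auto& th : pool) th.join();
    double secs = std::chrono::duration<double>(std::chrono::steady_clock::now() - start).count();
    return secs * 1e9 / (double(reps) * double(n));         // ns per element, per thread
}

int main() {
    const unsigned cores = std::max(1u, std::thread::hardware_concurrency());
    for (std::size_t kb = 256; kb <= 16 * 1024; kb *= 2)
        std::printf("%6zu KB per thread: %.2f ns/elem alone, %.2f ns/elem with %u threads\n",
                    kb, ns_per_element(kb * 1024, 1), ns_per_element(kb * 1024, cores), cores);
    return 0;
}
```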
One solution to the application performance problem is to parallelize more code. But a roadblock to developing multi-threaded programs remains: uncertainty and confusion about the rules that must be followed both by users of, and by compilers for, the C and C++ languages. As a result, shared-memory parallel programming is much harder than it needs to be.
HP is leading an effort to address this issue by specifying how multi-threaded C++ programs may interact through shared memory. HP has developed a proposal for the upcoming revision to the C++ standard. This proposal supports a simple model that requires no understanding of hardware or compiler optimizations.
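As a hedged illustration of the programming style such a specification enables (written against the std::thread and std::atomic facilities that this memory-model work fed into in the later revision of the standard; the sketch is not part of HP's proposal), the example below has several threads update a shared counter without a data race and without the programmer reasoning about hardware or compiler reordering.

```cpp
// Illustrative sketch: shared-memory parallelism with well-defined behavior.
// Thread and iteration counts are arbitrary.
#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

int main() {
    std::atomic<long> total{0};          // shared counter: updates are race-free by definition
    const long per_thread = 1000000;
    const unsigned nthreads = 4;

    std::vector<std::thread> pool;
    for (unsigned t = 0; t < nthreads; ++t)
        pool.emplace_back([&total, per_thread]() {
            long local = 0;              // accumulate privately, publish once
            for (long i = 0; i < per_thread; ++i) local += 1;
            total += local;              // atomic read-modify-write, no locks needed
        });
    for (auto& th : pool) th.join();     // join synchronizes: all updates are visible below

    std::printf("total: %ld (expected %ld)\n",
                total.load(), static_cast<long>(nthreads) * per_thread);
    return 0;
}
```

Had total been a plain long, the concurrent updates would constitute a data race, which is exactly the kind of behavior a precise memory model defines as an error rather than leaving it to whatever a particular compiler and processor happen to do.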
Job management software can also help performance. An application will run faster if it is intelligently scheduled onto servers that satisfy the code's resource requirements. HP works with partners such as Platform Computing, with open-source tools such as SLURM, and with HP software products such as HP-MPI to design and implement this functionality.
Tools to address multi-core environments
In 2007, HP launched its Multi-core Optimization Program. This program, together with the Multi-core Toolkit, brings together developments from HP and over 15 technology partners to provide open, non-proprietary solutions that enhance multi-core performance across a variety of industry-standard HPC architectures, platforms and operating environments. With optimized multi-core solutions, customers can maximize application performance on multi-core systems, enabling the larger simulations and deeper data analysis necessary to achieve their engineering, science and analytical goals.
While it's not possible to analyze every application, HP, independently and with our partners, is studying and benchmarking a cross-section of applications to provide information that applies to broad application sets. For instance, we have completed characterization work in the areas of application energy, job scheduling and performance analysis. HP has collaborated with technology partners for decades, measuring and improving application performance. Our partners range from well-established industry giants in hardware and software, to TotalView Technologies, which specializes in debugging and memory analysis tools, to emerging technology players like Acumem, which provides comprehensive characterization of application performance with respect to cache and memory bandwidth on multi-core systems.
Now we are extending this work to the broader industry, providing the same kind of information for multi-core systems and making it available to customers through a series of technical white papers on the toolkit website. Three examples are:
Scheduling to Overcome the Multi-Core Memory Bandwidth Bottleneck
Need more compute servers but don't have room for them? Or maybe you don't have enough electrical capacity to feed them, or enough cooling capacity? You're not alone. Multi-core processors may be the solution. While multi-core processors may solve many of the problems associated with compute cluster sprawl, they also present a new challenge: in some situations they cannot provide sufficient memory bandwidth per core to satisfy the requirements of certain HPC applications. This paper discusses methods to mitigate the effects of memory bandwidth limitations on modern multi-core processors from HP and Platform Computing.
Power Utilization vs. Application Performance on HP Servers Using Multi-core Processors – Conserving Application Energy
There are many ways to optimize high performance computing workloads. In addition to the common approaches such as single job runtime, multi-job throughput, and parallel scalability, this paper discusses optimizing for power consumption. Measurements of power versus performance for standard benchmarks and ISV applications are also provided.
Application Performance Characterization of Dual- and Quad-Core Systems Using the Two Most Popular Network Interconnects, Gigabit Ethernet (GigE) and InfiniBand (IB) – Accelrys ONETEP Benchmarks
This document contains benchmark data for ONETEP in Materials Studio 4.2 running on a range of HP servers with industry-standard processors and several interconnects.
To summarize, unless you start to plan now, there is an IT train wreck at some point in your future. It's called multi-core and it's here now. HP launched its Multi-core Optimization Program to analyze and improve the performance of High Performance Computing applications on industry-standard servers running Linux and Windows. We're leading with our strengths: deep knowledge of HPC system and cluster design, deep knowledge of the major applications used in HPC, and long-term relationships with HPC technology partners and application developers. HP's Multi-core Program features a unique collaborative approach that combines HP products and technologies with those from a broad set of technology leaders and our partners to address the needs of the more complex multi-core systems coming out today and in the future.
All white papers and more information about the Multi-core Toolkit are available at www.hp.com/go/multi-coretoolkit.
About the Author
Dave Field is the Manager of the High Performance Computing Solutions Engineering and Expertise group at Hewlett-Packard. Based in Richardson, Texas, this engineering group provides technical support to HP's HPC ISV partners. In addition, they characterize the performance of HP servers, software, and compute clusters in HPC configurations.