April 25, 2008
A research paper exploring ways to make a popular scientific analysis code run smoothly on different types of multicore computers won a Best Paper Award at the IEEE International Parallel and Distributed Processing Symposium (IPDPS) this month.
The paper's lead author and CRD researcher, Samuel Williams, and his collaborators chose a lattice Boltzmann code to explore a broader issue: how to make the best use of multicore supercomputers. The multicore trend started recently, and the computing industry is expected to add more cores per chip to boost performance in the future. The paper described how the researchers developed a code generator that could efficiently and productively optimize a lattice Boltzmann code to deliver better performance on a new breed of supercomputers built with multicore processors.
The multicore trend is taking flight without an equally concerted effort by software developers. "The computing revolution towards massive on-chip parallelism is moving forward with relatively little concrete evidence on how best to use these technologies for real applications," Williams wrote in the paper.
The researchers settled on a lattice Boltzmann code used to model turbulence in magnetohydrodynamics simulations, which play a key role in areas of physics research ranging from star formation to magnetic fusion devices. The code, LBMHD, typically performs poorly on traditional multicore machines.
The optimization research resulted in a great improvement to the code performance -- substantially higher than any published to date. The researchers also gained insight into building effective multicore applications, compilers and other tools.
The paper, "Lattice Boltzmann Simulation Optimization on Leading Multicore Platforms," won the Best Paper Award in the application track. Jonathan Carter of NERSC, Lenny Oliker of CRD, John Shalf of NERSC and Kathy Yelick of NERSC co-authored the paper. The researchers presented their paper at the IPDPS in Miami. Yelick, who is NERSC Director, also was a keynote speaker at the symposium.
Oliker, Carter and Shalf also authored a paper that won the same award last year. The paper, "Scientific Application Performance on Candidate PetaScale Platforms," was co-authored by CRD researchers Andrew Canning, Costin Iancu, Michael Lijewski, Shoaib Kamil, Hongzhang Shan and Erich Strohmaier. Stephane Ethier from the Princeton Plasma Physics Laboratory and Tom Goodale from Louisiana State University also contributed to the work.
The researchers first looked at why the original LBMHD performs poorly on these multicore systems. Williams and his fellow researchers found that, contrary to conventional wisdom, memory bus bandwidth didn't present the biggest obstacle. Instead, lack of resources for mapping virtual memory pages, insufficient cache bandwidth, high memory latency, and/or poor functional unit scheduling did more to hamper the code's performance, Williams said.
The researchers created a code generator abstraction for LBMHD in order to optimize it for different multicore architectures. The optimization efforts included loop restructuring, code reordering, software prefetching, and explicit SIMDization. The researchers characterized their effort as akin to the "auto-tuning methodology exemplified by libraries like ATLAS and OSKI."
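The auto-tuning idea can be illustrated with a minimal sketch. This is not the researchers' code generator: it assumes a toy three-point stencil in place of LBMHD, and the names `make_blocked_stencil` and `autotune` are hypothetical. It generates several variants of one kernel that differ only in a block size, times each on the machine at hand, and keeps the fastest.

```python
import time


def make_blocked_stencil(block):
    """Generate one variant of a toy 1-D three-point averaging stencil.

    `block` is the tuning parameter: the number of points handled per
    outer-loop step. (Hypothetical stand-in for a real LBMHD kernel.)
    """
    def kernel(src):
        n = len(src)
        dst = [0.0] * n
        # Outer loop walks over blocks; inner loop updates one block.
        for start in range(1, n - 1, block):
            stop = min(start + block, n - 1)
            for i in range(start, stop):
                dst[i] = (src[i - 1] + src[i] + src[i + 1]) / 3.0
        return dst
    return kernel


def autotune(candidates, data, repeats=3):
    """Time every candidate block size and return (winner, timings).

    The best of `repeats` runs is kept for each candidate to reduce
    the effect of timing noise.
    """
    timings = {}
    for block in candidates:
        kernel = make_blocked_stencil(block)
        best = float("inf")
        for _ in range(repeats):
            t0 = time.perf_counter()
            kernel(data)
            best = min(best, time.perf_counter() - t0)
        timings[block] = best
    winner = min(timings, key=timings.get)
    return winner, timings


if __name__ == "__main__":
    data = [float(i % 7) for i in range(100_000)]
    winner, timings = autotune([8, 64, 512, 4096], data)
    print("fastest block size on this machine:", winner)
```

Every variant computes the same result, so the search only changes how fast the answer arrives. The winning parameter can differ from one processor to the next, which is why the search is rerun on each target platform rather than fixed in advance.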
The results showed a wide range of performance on different processors and pointed to bottlenecks in the hardware that prevented the code from running well. The optimization efforts also resulted in a huge gain in performance -- the optimized code ran up to 14 times faster than the original version. It also achieved sustained performance for this code that is higher than any published to date: over 50 percent of peak flops on two of the processor architectures.
Compared with other processors, the Cell processor provided the highest raw performance and power efficiency for LBMHD. The processor's design calls for direct software control of data movement between on-chip and main memory, which resulted in the impressive performance. Overall, the researchers concluded, processor designs that focus on high throughput, using sustainable memory bandwidth and a large number of simple cores, perform better than processors with complex cores that emphasize sequential performance.
They also concluded that auto-tuning would be an important tool for ensuring that numerical simulation codes will perform well on future multicore computers.
Read about the researchers' analyses of other processor architectures by checking out the paper on Williams' Web site.
Source: Lawrence Berkeley National Laboratory