Visit additional Tabor Communication Publications
October 28, 2010
As we move from multicore to manycore processors, memory bandwidth is going to become an increasingly annoying problem. For some HPC applications it already is. As pointed at in a recent HPCwire blog, a Sandia study found that certain classes of data-intensive applications actually run slower once you try to spread the computations beyond eight cores. The problem turned out to be insufficient memory bandwidth and the contention between processor for memory access.
That is certainly not the case for all applications. But beyond that, it's not always useful to focus on memory bandwidth limitations when considering how to get the most out of your processors. A recent blog post penned by TACC'ers Dan Stanzione and Tommy Minyard suggest we look at the problem somewhat differently. To being with, the authors think the whole notion of trying just to maximize core usage is somewhat misplaced. They write:
Leaving a core idle is considered "wasteful". This is not surprising, but upon careful reflection doesn't make that much sense... No one considers it a "waste" if while running a job on every core of your machine, half your memory is empty, or half your network is unused, or you are only using half the available IOPS or bandwidth to your disk drive.
Stanzione and Minyard go on to say that the real metric you should be concerned about is how much work your cluster is getting done in a given time period. So for certain workload mixes, it might make sense to let cores go idle in order to ensure the remaining cores are left with enough memory bandwidth for fast execution. Or you could mix compute-intensive applications with data-intensive ones so that both cores and memory usage can be more utilized -- assuming you have the right mix of applications to choose from.
Of course, not every HPC installation has the luxury of choosing an optimal mix of applications. What if you're stuck with running a memory-hungry application, like the Weather Research and Forecasting (WRF) code, all of the time?
The TACC authors actually came up with some interesting data points using WRF on Xeon platforms. They found that going beyond 8 cores per node yielded diminishing returns in speedup (not quite so bad as the Sandia study, which demonstrated lost performance beyond 8 cores). Using Intel Westmere CPUs they were only able to achieve a 12 percent performance improvement going from 8 to 10 cores, and just 2.7 percent when going from 10 to 12 cores.
So what do you do in this scenario? Stanzione and Minyard write:
Well, maybe it tells the WRF developers that you can do a whole lot more computation between memory accesses essentially for free on the new processors. Maybe it says you can run some not-so-memory-intensive jobs alongside your WRF jobs on those extra cores essentially for free. But perhaps the most important thing it says is that to get maximum throughput nowadays, you shouldn't assume that the best and most efficient configuration is to use every core in every socket for your job. For some kinds of programs you will, for some kinds of programs you won't... but isn't it nice to have all that extra compute power lying around for the times that you need it?
Well yes, that is nice, especially if you can afford to deploy such systems. On the other hand, the AMD folks might point out that their Opteron solutions achieve a better balance between CPU FLOPS and memory bandwidth than the Xeons. The NVIDIA folks, one assumes, would have an entirely different suggestion.
Full story at Dell Technology Center
In quieter times, sounding the bell of funding big science with big systems tends to resonate further than when ears are already burning with sour economic and national security news. For exascale's future, however, the time could be ripe to instill some sense of urgency....
In a recent solicitation, the NSF laid out needs for furthering its scientific and engineering infrastructure with new tools to go beyond top performance, Having already delivered systems like Stampede and Blue Waters, they're turning an eye to solving data-intensive challenges. We spoke with the agency's Irene Qualters and Barry Schneider about..
Large-scale, worldwide scientific initiatives rely on some cloud-based system to both coordinate efforts and manage computational efforts at peak times that cannot be contained within the combined in-house HPC resources. Last week at Google I/O, Brookhaven National Lab’s Sergey Panitkin discussed the role of the Google Compute Engine in providing computational support to ATLAS, a detector of high-energy particles at the Large Hadron Collider (LHC).
05/10/2013 | Cleversafe, Cray, DDN, NetApp, & Panasas | From Wall Street to Hollywood, drug discovery to homeland security, companies and organizations of all sizes and stripes are coming face to face with the challenges – and opportunities – afforded by Big Data. Before anyone can utilize these extraordinary data repositories, however, they must first harness and manage their data stores, and do so utilizing technologies that underscore affordability, security, and scalability.
04/15/2013 | Bull | “50% of HPC users say their largest jobs scale to 120 cores or less.” How about yours? Are your codes ready to take advantage of today’s and tomorrow’s ultra-parallel HPC systems? Download this White Paper by Analysts Intersect360 Research to see what Bull and Intel’s Center for Excellence in Parallel Programming can do for your codes.
In this demonstration of SGI DMF ZeroWatt disk solution, Dr. Eng Lim Goh, SGI CTO, discusses a function of SGI DMF software to reduce costs and power consumption in an exascale (Big Data) storage datacenter.
The Cray CS300-AC cluster supercomputer offers energy efficient, air-cooled design based on modular, industry-standard platforms featuring the latest processor and network technologies and a wide range of datacenter cooling requirements.