February 27, 2013
Reigning TOP500 champ Titan is not performing as expected. Jeff Nichols, head of Oak Ridge National Laboratory's scientific computing division, told Knoxville News that the massive supercomputer encountered technical issues that halted the final acceptance test.
This means that the DOE's Oak Ridge National Laboratory (ORNL) won't yet be taking official ownership of the $100 million machine, and payments to Cray will be put on hold.
On the bright side, the problem has been identified and both parties are working on a solution.
"We've found a few bugs that have held us back," Nichols said, "and we're doing some repair work with Cray in order to get the stability tests where we want them to be."
The problems were traced to the interconnect fabric that enables the CPU and GPU components to communicate. The CPU side of this hybrid supercomputer is operational, but applications that call on GPUs have encountered sporadic faults. ORNL is sending sections of the system back to Cray on a rotating basis for repair.
Even with these issues, Titan came close to meeting the goals for a successful acceptance test. A passing score is awarded for completing 95 percent of the jobs in the test, and the Cray supercomputer came in at 92-93 percent, only a few percentage points shy.
From what Nichols told Knoxville News, the issues sound more like a speed bump than a fatal flaw. Nichols expects final acceptance of Titan to be delayed no more than a month or two. He believes that once the connectors are repaired, the rest of the process should be a "slam dunk."
Despite recent setbacks, Titan passed initial testing in time for the November 2012 TOP500 list. This 27-petaflops (peak) Cray XK7 scored 17.59 petaflops on the Linpack benchmark, earning it bragging rights as the "world's fastest supercomputer."
The DOE's Oak Ridge National Laboratory describes Titan as "the world's most powerful supercomputer for open science with a theoretical peak performance exceeding 20 petaflops (quadrillion calculations per second)." This unprecedented level of power opens up new possibilities for ground-breaking research, including complex climate change models and sophisticated nuclear reactor simulations.