October 06, 2006
Tailored allocation management on NCSA's Tungsten cluster proves an important part of many researchers' workflows.
Once the allocations have been made and the highest-quality projects have been given set amounts of time, there are two straightforward ways of scheduling users on a supercomputer. One is egalitarian. A queuing system applies a set of rules -- based on the amount of time a particular job is going to take, how many processors are going to be used, and the like -- and puts people in line to wait their turn. The other is totalitarian. The decks are cleared for a big user, and he or she runs on a massive number of processors, perhaps the whole machine, for a long time.
Neither approach is ideal, and neither addresses more nuanced or immediate needs.
Take the case of the MILC collaboration, which studies quantum chromodynamics. In 2004, it received an allocation of four million CPU hours on NCSA's Tungsten cluster. By any reckoning, even for a collaboration that comprises researchers at nine institutions, that's a massive allocation of time.
To use those resources sensibly and efficiently requires human decisions and policies that are well tuned to the various ways that researchers use the center's systems.
"Sitting down, going to talk to the users, and figuring out what they want. It's the only way to do this" when you have a broad variety of user needs, according to John Towns, who leads NCSA's Persistent Infrastructure Directorate. "'This doesn't work for me' is the last thing you want to hear."
A powerful machine is still important, and Tungsten is certainly that. It has a peak capability of more than 15 trillion calculations per second, making it the largest computer supported by the National Science Foundation and available for open scientific research.
A popular machine is also important, and Tungsten is that, too. In September 2005, about 162 million normalized units of computing time were allocated on Tungsten -- about 20 percent of the total parceled out by NSF across the nation. User requests for Tungsten were almost double that number, far exceeding the number available. This made Tungsten the most requested and the most allocated system in NSF's arsenal in September.
"If you allocate this large and popular a resource in the traditional ways, somebody always suffers. People with large runs wait a long time in the queues or don't get to run at all because the queuing system is set up to handle a large number of smaller jobs. Or the smaller jobs get brushed aside in order to dedicate the machine to large jobs. It's a tough balance to strike," Towns says.
"Tungsten is a resource that satisfies specific needs of the user community. It's a critical part of their research workflow," NCSA Director Thom Dunning says. "That means we have to tailor allocations to suit them. We planned for this sort of approach when we installed Tungsten, and the popularity and productivity among users really showed us that it was the right way to go."
Tailored allocation management means that specific pieces of the machine are dedicated to particular users for given periods of time. These periods are planned in advance so that users know when they're going to get access, how long they're going to have it, and how much computing oomph they're going to have available. These dedicated runs give users the capability they need to complete crucial computations that must be done in a specific timeframe, that require an unusually large number of processors, or that otherwise give the queuing system fits. The dedicated runs still leave room for a traditional capacity-oriented system, with smaller jobs passing through the queue and running unimpeded.
About 40 percent of Tungsten is currently dedicated to tailored allocations.
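The split between dedicated blocks and the general queue can be sketched in a few lines of Python. This is only an illustration of the idea described above, not NCSA's scheduler: the `Reservation` type, the `queue_capacity` function, the user name, and the 1,000-processor machine size are all hypothetical, and only the 40 percent share comes from the article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reservation:
    """A block of processors set aside for one user (hypothetical model)."""
    user: str
    processors: int
    start_hour: int  # first hour of the dedicated period (inclusive)
    end_hour: int    # hour at which the block is released (exclusive)


def queue_capacity(total_processors, reservations, hour):
    """Return how many processors remain for ordinary queued jobs at `hour`."""
    reserved = sum(
        r.processors
        for r in reservations
        if r.start_hour <= hour < r.end_hour
    )
    if reserved > total_processors:
        raise ValueError("reservations exceed the size of the machine")
    return total_processors - reserved


# Assumed example: a 1,000-processor machine with 40 percent dedicated
# to one team for a week (168 hours).
reservations = [Reservation("team_a", 400, 0, 168)]
during = queue_capacity(1000, reservations, 10)   # 600 left for the queue
after = queue_capacity(1000, reservations, 168)   # 1000 once the block ends
```

Because the dedicated periods are known in advance, the capacity left for the queue at any hour is a simple lookup, which is what lets smaller jobs keep flowing while the dedicated runs proceed.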
"Nobody else is doing this this way. But it's the best way to provide access that balances individual productivity and servicing a large number of users. We want to be responsive to individual requests while still ensuring success for a broad range of people and disciplines," Towns says. "When we strike that balance, our users do special things."
What sort of users take advantage of tailored allocation management? And why? Here are some examples.
David Baker, University of Washington
David Baker and his team are in the business of refining protein structures. These structures are traditionally derived using limited experimental data or by starting from first principles and simulating the structure from scratch. This group's technique combines the two to produce much more accurate models.
A tailored account on NCSA's Tungsten cluster gave the team more than just the power they needed.
"We'd never computed on this system when we got our special allocation," Baker says, and they were still in search of the precise approach that they would use for their structure refinements. A tailored allocation is "very good for methods development. Having dedicated time over days allows you to make such rapid progress. You try different things quickly and get daily feedback. That's really, really helpful as you're trying to get on your feet."
Currently, the team has an in-house server system dedicated to conducting these sorts of protein structure refinements. It serves an entire community of researchers and is overtaxed. NCSA is configuring a portion of its Radium cluster to provide additional capacity to those researchers. It will expand their back-end capacity without any front-end change; researchers will continue to interact with the server as they always have.
The MILC collaboration
Members of the MILC collaboration are drawn to the strongest force in nature -- the force that binds quarks together into the protons and neutrons that form the nuclei of atoms. Their quantum chromodynamics calculations proceed in two steps. Ground state configurations are calculated through Monte Carlo simulations, then the group, along with many other physicists carrying out numerical studies of quantum chromodynamics, uses those to simulate and explore a wide variety of other physical attributes of the subatomic world.
The bottleneck is the Monte Carlo calculations. "Each ground state configuration is generated from the preceding one, so we cannot run jobs in parallel or start one job before the previous one ends," explains the University of California at Santa Barbara's Robert Sugar, a member of the collaboration.
"As a result, we are in a poor position to compete for time with many of the users of normal queues who can have several jobs in the queues at once," Sugar says. Without a tailored allocation on Tungsten, there would be a ripple effect throughout the field. "The Department of Energy and the National Science Foundation spend approximately $750 million per year on their experimental programs in high-energy physics. A significant fraction of that is devoted to the study of weak decays of strongly interacting particles, a primary focus of our research."
"Our calculations are needed in order to fully capitalize on the investments being made in the experiments, and our results are needed in a timely fashion in order to keep pace with experiments," he says.
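The sequential dependence Sugar describes is a general property of Markov chain Monte Carlo methods. The toy sketch below is not the collaboration's code and has nothing to do with lattice QCD itself; it samples a one-dimensional Gaussian with a simple Metropolis update, and the function name `metropolis_chain` and its parameters are illustrative assumptions. It shows only why step n+1 cannot begin until step n has finished.

```python
import math
import random


def metropolis_chain(n_steps, step_size=1.0, seed=0):
    """Toy Metropolis sampler for the density proportional to exp(-x^2 / 2)."""
    rng = random.Random(seed)
    x = 0.0
    chain = [x]
    for _ in range(n_steps):
        # The proposal is built from the current state, so this iteration
        # cannot start until the previous one has produced `x`.
        proposal = x + rng.uniform(-step_size, step_size)
        # Log of the ratio target(proposal) / target(x).
        log_ratio = (x * x - proposal * proposal) / 2.0
        # Accept with probability min(1, ratio); otherwise keep the old state.
        if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
            x = proposal
        chain.append(x)
    return chain


chain = metropolis_chain(1000)
```

Independent chains could be run side by side, but a single chain is inherently serial, which is why a workload of this shape gains little from having several jobs waiting in a queue at once.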
Joel Tohline, Louisiana State University
Every time a team of astrophysicists from Louisiana State University makes a run on Tungsten, a star is born -- a pair of them, in fact. Tidal interactions between these stars can cause material to transfer from one to the other and distort the stars' gravities, densities, sizes, and distance from one another. In some cases, the stars even merge in a spectacular and violent cosmic event. Work by this team is altering scientists' thinking on the mass ratio at which binary stars return to stability instead of coming to a catastrophic end.
When they asked for one of three tailored allocations on Tungsten, they had just received a referee report on a submission to The Astrophysical Journal. It said that "the conclusions we drew in the paper would be significantly strengthened if we could repeat one of our extended simulations using slightly different initial parameters. We knew from experience that, running on 128 processors without interruptions -- which never happens -- this simulation would require about a month to complete," says Joel Tohline, the professor at Louisiana State who led the team.
A week-long, 512-processor run on Tungsten was set up in short order, and the publication went to print. There are broader implications of working closely with users to supply the sort of time and support they need, though.
"In any given year -- or decade, for that matter -- the most interesting problems in computational astrophysics -- substitute physics, biology, etc., as you like -- are often those that push the limits of available resources. To make meaningful progress…we design our simulations each year to take advantage of available computational resources at the national centers such as NCSA," Tohline says.
"Our peers and funding agencies expect to see measurable progress on challenging and timely, relevant problems. If we invest our time performing a simulation that can be completed in a week's time on 32 processors, it is not likely to be addressing one of the most challenging problems that presently confront us. NCSA's commitment to dedicate major resources when they're needed to a single problem is in synch with this overall philosophy. It has contributed demonstrably to my group's ability to make significant contributions to our field."
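As a rough check on the numbers in Tohline's example, the month-long 128-processor estimate and the week-long 512-processor run can be compared in processor-hours. The sketch below assumes a 30-day month and that each run holds all of its processors for the full period; neither detail is stated in the article, and the helper name `processor_hours` is illustrative.

```python
HOURS_PER_DAY = 24


def processor_hours(processors, days):
    """Total processor-hours consumed by a run of the given size and length."""
    return processors * days * HOURS_PER_DAY


# Assumed 30-day month for the uninterrupted 128-processor estimate.
estimate = processor_hours(128, 30)   # 92160
# The week-long dedicated run on 512 processors.
dedicated = processor_hours(512, 7)   # 86016
```

Under these assumptions the two runs represent a similar amount of total computing. What the dedicated allocation changes is the wall-clock time: about a week rather than a month or more of interrupted queue time.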
-----
Source: NCSA