A Perfect Storm For Supercomputing

By Commentary from the High-End Crusader

April 15, 2005

The High-End Computing Revitalization Task Force came and went without much fanfare in this country, which is thoroughly depressing if you stop to think about it. In contrast, two recent and visible regressions in how the government funds supercomputing research and/or access to supercomputers have provoked outcries from both computing researchers and high-end users, some of which have been echoed in the mainstream media. Many bad things appear to be happening at the same time. Are they related at all? Do they have a common theme? Taken together, how much trouble do they leave high-end computing in?

In September 2004, not seeing a clear mission for its two centers, the NSF dissolved the Partnerships for Advanced Computational Infrastructure (PACI) program, with a phased stepdown in funding for the two PACI (supercomputer) centers and an uncertain future. The PACI centers had long since evolved away from research on supercomputer enabling technologies such as semiconductor materials, computer architecture, programming systems, and operating systems—if they were ever involved in it at all. But they did provide useful compute cycles to high-end users, and they did fund interdisciplinary teams of discipline and computational scientists: say, a group of physicists or biologists working together with computational-science experts in simulation, large-scale data management, or whatever, with significant help from various experts who worked at the centers themselves.

Then, in April 2005, DARPA officials helped to clarify some trends that hadn’t been immediately obvious from the budget-proposal numbers published earlier. The administration’s budget request for FY2006 proposed $1.3 billion for all of DoD’s basic research (6.1 research) and $4.1 billion for all of DoD’s applied research (6.2 research). In this budget, DARPA was slated to get $222 million for basic research and $2 billion for applied research. In testimony before the Senate, DARPA officials gave slightly different numbers, but readily admitted that the portion of computing-research funding going to university researchers had fallen, over the last several years, from roughly $200 million per year to $100 million per year.

We have the makings of a perfect storm: no HECRTF funding for supercomputing R&D, DARPA cutting funding for basic computing research at universities, and NSF phasing out the PACI program, with the bulk of computing researchers now forced to submit grant proposals to NSF’s Computer and Information Science and Engineering (CISE) directorate, with all that implies in terms of NSF’s computing-research priorities. NSF’s harshest critics bluntly describe—off the record—NSF’s research priorities as “anti-computer”, in the sense that NSF has limited interest in supporting R&D for future supercomputing and limited interest in providing big machines for high-end users, but a very genuine interest in ministering to computing’s “real priorities”, which—as we all know—include such things as large-scale databases, visualization software, and middleware services. Be sure you are seated before reading the next sentence.
One wholly accurate (indeed, dead-on) characterization of NSF’s attitude to high-end computing is this: Although supercomputers have become less expensive and their programming techniques have become less arcane—no doubt this explains the boom in high-productivity computing we are all experiencing—many scientists and engineers have discovered that their most challenging problems don’t require real supercomputing muscle anymore, because clusters and grids give you more compute power than (almost) anyone would want. Instead, users need help with such things as scientific instrumentation, visualization software, and big databases.

Like Iraq, computing—even high-end computing—is a very complex subject. To make sense of all this, we need to move outward slowly from a few pillars we all understand. Let’s start out with what HECRTF would have been if it had survived the short-sighted bureaucrats who gutted it. (For some reason, Congress seems to have missed this development.)

The real challenge that HECRTF (and the IHEC report and the recent NRC report) sought to address is devising—and funding in a sustained, predictable fashion—an HEC R&D initiative that successfully integrates i) basic research, ii) applied research, iii) advanced development, and iv) engineering and prototype development. Why are all these things necessary? Why must they be integrated? In what sense can they be planned? Already, we can see intertwined interests of multiple groups of computing stakeholders.

No one claims that all computing research is research on high-end computing. Your correspondent starts with HECRTF because he thinks that supercomputing is in deep trouble _and_ that this will ultimately damage computing as a whole, quite independently of the national-security and other critical applications that must be postponed because we do not now have a scalable _general_ solution to the problem of latency tolerance, which is a major problem for any current or future application with significant amounts of long-range communication.

In contrast, top-tier computing universities such as Berkeley are up in arms about the cuts in DARPA funding because they will lose large, challenging research projects, which alone can provide the intellectual excitement to attract the brightest students to computing. The tech bust has driven away students in search of vocational training, so disciplines now compete largely on their intrinsic intellectual interest. The brain drain of top young people leaving high-end computing has been going on for some time—users find high-end machines unpleasant to program; few designers believe they can create real supercomputers in a world of clusters—and it is certainly an urgent national problem—the mission agencies have time-critical computing needs, and the availability of machines adequate to tomorrow’s missions is anything but guaranteed.

Recently, a similar trend has begun to affect computing research as a whole: our top young people are leaving computing per se. The computing-research community is struggling valiantly but unsuccessfully to articulate basic research challenges that would make computing the intellectual equal of physics or biology. Computer security is certainly fashionable, but no fundamental progress will be achieved here without rebuilding the computer-systems research community, notably in operating systems. Very possibly, computer-systems research will only be revitalized after it has been explicitly tied to high-end computing.
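The latency-tolerance problem mentioned above can be made concrete with a back-of-the-envelope application of Little's Law: to sustain a given data bandwidth over a memory or network path with a given round-trip latency, a processor must keep roughly bandwidth times latency worth of independent requests in flight. The short Python sketch below uses hypothetical machine numbers (they are illustrative assumptions, not figures from this article) to show how quickly the required concurrency grows as communication becomes longer-range.

```python
# Illustrative sketch of the latency-tolerance problem via Little's Law:
# bytes in flight = bandwidth * latency.
# All machine numbers below are hypothetical, chosen only to show the
# shape of the problem, not to describe any particular system.

def required_concurrency(bandwidth_gbs: float, latency_ns: float, request_bytes: int) -> float:
    """Number of independent outstanding requests needed to sustain
    `bandwidth_gbs` GB/s when each request takes `latency_ns` ns round
    trip and moves `request_bytes` bytes."""
    bytes_in_flight = bandwidth_gbs * 1e9 * latency_ns * 1e-9  # (bytes/s) * s
    return bytes_in_flight / request_bytes

if __name__ == "__main__":
    # Hypothetical processor demanding 10 GB/s, moving 64-byte cache lines.
    for latency_ns in (100, 500, 2000):  # roughly: local DRAM vs. remote nodes
        n = required_concurrency(10.0, latency_ns, 64)
        print(f"latency {latency_ns:5d} ns -> ~{n:6.0f} outstanding 64-byte requests")
```

A commodity processor can typically track only a handful of outstanding cache misses, which is why applications with significant long-range communication need either architectural support for massive concurrency (heavy multithreading, vector memory pipelines) or restructuring to be embarrassingly local.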
The distinctions between basic research, applied research, advanced development, etc., also need a bit of thought. Why would someone who is worried about the state of high-end computing be concerned about a scarcity of basic computing research? In a technological field like computing, all research, even basic research, is goal directed in that it is relevant (or not) to some larger agenda. Basic research, most often performed in universities, is needed today to generate the new ideas and technologies that will be required to conceive and build the supercomputers, not of 2010, but of 2020. And it will be needed in 2010 for the machines of 2025, and so on.

As we (intellectually) move through the high-end computing pipeline stages of basic research, applied research, advanced development, etc., we keep the goal essentially fixed but steadily increase our requirements as to the importance and specificity of deliverables. After all, we are looking for real computers that we can use to solve problems of national importance. There has to be a healthy balance in supporting all pipeline stages, because if any stage dries up, eventually the whole pipeline halts (or turns out endless copies of yesterday’s machines). That is, there must be an integrated plan that establishes, at each point in time, how much of which kind of work should be supported in each stage of the high-end computing pipeline.

For every stage, priorities must be set, whether research priorities or development priorities. (These priorities are sets of important problems or activities; they are not a linear ranking.) Setting priorities is declaring—at regular intervals—that certain problems are important because “at present” they constitute the major roadblocks that are holding supercomputing back. As far as humanly possible, defining which problems are important must be done without specifying any desired conceptual or technological solutions. A funding program must have clearly articulated goals, must define certain problems as important, and must judge individual research programs as worthy or unworthy, but it must not constrain the solution space, e.g., by baking in the blind spots and lack of imagination of the program management office as to what the solutions are.

An integrated plan for HEC R&D sounds vanilla but is not. Indeed, there is an enormous common misconception here. Many people confuse 1) federal funding for high-end computing facilities (open or closed) and their enabling software with 2) federal funding for an integrated HEC R&D initiative. Mixed messages on this point are a great way to baffle Congress, which then confuses 1) procurement of _existing_ machines with 2) funding the (new) research and development that will be required to create (more productive) _new_ machines. Both are candidates for federal support, but neither can substitute for the other. And, since they are distinct, we must ask: which has the higher priority? Which is the true national imperative?

Note: Your correspondent _strongly_ supports DOE’s project to stand up a National Leadership Computing Facility (NLCF) at Oak Ridge. Although DARPA’s HPCS program starts in the middle with applied research and advanced development, and DOE’s NLCF project is basically procurement, these are the two primary federal activities at present that address the nation’s high-end computing crisis.
To belabor an obvious point, we can see how funding R&D differs fundamentally from funding high-end facilities by considering an entirely fictional world, disjoint from current political and budgetary constraints. In this ideal world, DARPA keeps its High-Productivity Computing Systems (HPCS) program but stands up a new program called Basic Research for High-End Computing. This program would resemble the old DARPA program that used to fund a multitude of projects in parallel computing. Using the outputs of the DARPA basic-research program as seed corn, an interagency joint program management office would decide which research ideas should be funded for applied research, which successfully applied ideas should move forward to advanced development, and which advanced-development projects should be selected for engineering and prototype development.

We need to combine federal support for _properly directed_ basic research with the IHEC plan, which basically takes care of the remaining stages of the high-end computing pipeline. (In reality, we would have multiple pipelines acting in parallel.) You can always buy a cluster, one of the less innovative machines from the ASCI family, or even one of the novel hybrid machines that are “quasi-custom”, but—from now on—you can’t procure a real supercomputer unless it makes its way through a federally supported high-end computing pipeline. (Question: What federal support was _necessary_ for you to be _able_ to buy, say, a Cray X1E?) Those who fought for HECRTF think the HEC pipelines will simply stagnate without federal support (and direction) for each of their many stages.

— What Went Wrong Where?

The perfect storm that has, for all practical purposes, kept the high-end computing pipeline from restarting—with _few_ exceptions—was jointly caused by 1) a nonexistent HECRTF program, 2) DARPA’s reduction of funding for basic computing research at universities, and 3) NSF’s aberrant worldview of what computing needs. Which battles were lost to bring us to such a sorry state?

The largest single blow to the future of high-end computing was the death by (financial) starvation of the HECRTF program, although losing vital basic research at universities is certainly a close second. Your correspondent has written at length about what happened in “Revitalizing HECRTF: A Focused Plan For High-End Computing” (HPCwire article [107765]). To review briefly, a group of academics and government scientists set out a good R&D agenda—although the enabling-technologies section was a bit of a rehash—covering all the topics that needed attention in any federal plan to fund research and development for high-end computing. This work is preserved in the proceedings of the Workshop on the Roadmap for the Revitalization of High-End Computing. Unfortunately, bureaucrats from the administration’s OSTP and OMB blocked any meaningful financial commitment, causing HECRTF to wither on the vine.

Quite apart from political and financial constraints, these bureaucrats had received, and swallowed whole, a dangerous message. In simple terms, if you believe that the problem of supercomputing—that is to say, all the research and development activities that need to be carried out to overcome the roadblocks to (current and future) supercomputing—has already been solved, then why on earth would you spend hundreds of millions of dollars to solve the problem of supercomputing? Who sold this bill of goods to OSTP and OMB? We know of retrograde forces in DoD and DOE who hold this view.
Also, some vendors have a vested commercial interest in the view—or myth—that the problem of supercomputing has been solved. But clearly the root cause of this mistaken view is an astonishing ignorance of the principles of supercomputing. The admirable NRC report “Getting Up to Speed: The Future of Supercomputing”, which was reviewed by your correspondent in “Government Must Own The Problem Of Supercomputing” (HPCwire article [109181]), is a masterful demonstration that the status quo is unsustainable: the increasing divergence between 1) the speeds at which processors can do arithmetic and 2) the speeds at which the network/memory system can supply data operands to processors will eventually force all high-performance applications to inhabit a very narrow corner of locality phase space (that is, to be embarrassingly local). The solution to this problem, as readers of this space might remember, is to invest in high-bandwidth systems for those applications that _are not_ (or _should not be_) embarrassingly local.

The DARPA story is different. The government believes that the United States is engaged in a long-term (unconventional) war against Islamic terrorism, and it has large numbers of troops stationed in harm’s way in Iraq and other places. Many of these troops have been, and are being, killed. DARPA’s calling has always been to provide the military services with the technological tools they need to carry out their respective missions. The political pressure to produce deliverables, i.e., systems the military can actually use, is enormous. There is a multitude of unsolved problems in homeland security, which has its own “DARPA”. DARPA also supplies the intelligence community, although some agencies do most of their own work in house or contract it out separately. The Silberman-Robb Commission report has raised basic issues about the need to change the culture of the intelligence community, and others have pointed to growing problems in counterintelligence. It’s a whole new world out there.

It is not surprising that DARPA is cutting spending at universities in favor of funding more classified work, often by military contractors, and more short-term projects with narrowly defined deliverables. There were also distractions. Congress perturbed utterly essential work on counterterrorism data mining, some of which was being done by academics, by raising a ruckus over the Total Information Awareness (TIA) project. (TIA had a few technical flaws and wasn’t very likely to lead to a deployable system anyway.) In short, DARPA’s political environment is, quite simply, impossible.

But has DARPA thrown the baby out with the bathwater? Yes, it has. If U.S. high-end computing tanks because too few new _good_ ideas are coming out of universities, then that is a far greater threat to national security than the absence of short-term deliverables, which admittedly are damn important. DARPA’s research agenda may not always have been letter perfect—in the good old days, Stephen Squires’ vision of a “scalable technology base” royally ignored the distinction between strongly and weakly parallel processors, and was equally deaf to the possibility of a new breed of scalable parallel vector processors—but at least it had one, and it was never laughably absurd. You can’t have an HECRTF program without some funding agency that is responsible for jump-starting basic research—specifically, research that would be beneficial to high-end computing. The future of computing has everything to do with _performance_.
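The processor/memory divergence described above can be quantified with a simple roofline-style estimate: attainable performance is the lesser of a machine's peak arithmetic rate and its memory bandwidth multiplied by the application's arithmetic intensity (useful operations per byte moved). The sketch below uses hypothetical machine parameters (they are illustrative assumptions, not figures from the NRC report) purely to show why low-locality applications end up bandwidth-bound.

```python
# A minimal roofline-style estimate of attainable performance.
# Machine numbers are hypothetical, chosen only to illustrate the
# processor/memory divergence discussed above.

def attainable_gflops(peak_gflops: float, bandwidth_gbs: float,
                      arithmetic_intensity: float) -> float:
    """Attainable performance is capped either by the peak arithmetic rate
    or by memory bandwidth times arithmetic intensity (flops per byte)."""
    return min(peak_gflops, bandwidth_gbs * arithmetic_intensity)

if __name__ == "__main__":
    peak, bw = 100.0, 10.0  # hypothetical: 100 Gflop/s peak, 10 GB/s memory bandwidth
    # Low arithmetic intensity (sparse, irregular, long-range access) on the left;
    # highly local, cache-friendly kernels on the right.
    for ai in (0.1, 0.5, 2.0, 10.0, 50.0):
        gf = attainable_gflops(peak, bw, ai)
        pct = 100.0 * gf / peak
        print(f"arithmetic intensity {ai:5.1f} flops/byte -> {gf:6.1f} Gflop/s ({pct:5.1f}% of peak)")
```

Only the kernels with high arithmetic intensity, i.e., good locality, approach peak; as peak arithmetic rates keep growing faster than bandwidth, that threshold climbs, which is precisely the “very narrow corner of locality phase space” the NRC report warns about.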
Many of the roadblocks faced today by supercomputing are roadblocks that affect all of computing, but affect supercomputing earlier and to a more significant extent.

The NSF problem is a delicate one. It is unfortunate that NSF’s ideas to stand up a (national?) “computational infrastructure” have gained some footing in Washington, since their whole reasoning presupposes that the problem of supercomputing has been solved. The NRC report has some calm words:

“[The NSF] centers have increased the scope of their activities in order to support high-speed networking and grid computing and to expand their education mission. [These] increases in scope have not been accompanied by corresponding increases in funding, so less attention is paid to supercomputing, and support for computational scientists with capability needs has been diluted”.

“It is important to repair the current situation at NSF, in which the computational-science users of supercomputing centers appear to have too little involvement in programmatic and budgetary planning. All the research communities in need of supercomputing capability have a shared responsibility to provide direction for the supercomputing infrastructure they use and to ensure that resources are available for sustaining the supercomputing ecosystems”.

“Funding for the acquisition and operation of the research computing infrastructure should be clearly separated from funding for computer and computational science and engineering research. It should compete on an equal basis with other infrastructure needs of the science and engineering disciplines. That is not now the case”.

There is nothing wrong with promoting the idea that a particle physicist working in California should be able to use high-speed networks and interoperable data archives to perform data mining on stored representations of physics experiments done at Chicago and Geneva, and many other similar things, but to claim that distributed-computing tools of this sort—which are extremely useful to some users—are the main thing that U.S. science and engineering in general need—apart from clusters and grids, of course—is to demonstrate an arrogant disconnect from the real needs of the majority of high-end users. It is like mistaking a tiny piece of the crust for the whole pie.

— Conclusion

Our problem is the fundamental ambivalence in federal policy circles about whether the problem of supercomputing has been solved. Those with even moderate knowledge of the defense and intelligence communities are aware of the many, many absolutely vital computational problems that exceed the capabilities of even the largest supercomputers available today. For example, exceptional (but unavailable) supercomputing performance and exceptional (but unavailable) programmability are jointly required to enable a fine-grained, full-airframe combined CFD and CEM simulation of an aerospace vehicle like the Joint Strike Fighter. The NRC report puts it well, speaking of the computational requirements of signals intelligence:

“The highest-priority problems are chosen [by] foreign adversaries. They do this when they choose communication methods. This single characteristic puts phenomenal demands on both the development of solutions [i.e., time to solution] and their deployment on available computing platforms [i.e., raw capability]”.
“While these specific mission-driven requirements are unique to the signals-intelligence [activity], their effect is seen across a fairly broad spectrum of mission agencies, both inside and outside the defense community. This is in contrast to computing that targets broad advances in technology and science. In [the latter] context, computations are selected more on the basis of their match to available resources and codes”.

In other words, high-end users who do _elective supercomputing_ have a far lower anxiety level than high-end users who do _emergency supercomputing_. A certain diversity in the high-end user base, coupled with ignorance in some circles of the principles of supercomputing and ignorance in other circles of the vast scope of critical national problems that, quite simply, cannot be computed on any existing machine, allows the fundamental ambivalence about the problem of supercomputing to persist.

Obviously, the combined force of 1) high-end scientific and industrial users, 2) supercomputer vendors, 3) the national labs, 4) leading academics, and 5) the national-security community has the critical mass to shake Congress to its boots, and to rouse it from its dogmatic slumber for the purposes of overturning the short-sighted recommendations of OSTP and OMB. But how many people can we bring under the tent? What common (political) platform can most of us agree to support? If some of us say nice things about grids and clusters (and there are nice things to be said), will others admit that the problem of supercomputing must be solved anew in each generation? And also that this problem cannot be solved without federal support and federal planning? In a word, how many HPC leaders are willing to campaign for the NRC report?

—

The High-End Crusader, a noted expert in high-performance computing and communications, shall remain anonymous. He alone bears responsibility for these commentaries. Replies are welcome and may be sent to HPCwire editor Tim Curns at [email protected].
