An Alternative Approach to NSF Funding of HPC

January 25, 2008

Two weeks ago, Vijay K. Agarwala, director of Research Computing and Cyberinfrastructure Information Technology Services at Penn State, sent a letter to all members of the Coalition for Academic Scientific Computation (CASC) regarding NSF funding of HPC at university-based research centers. In it, Vijay proposed an alternative strategy in which a portion of the NSF funding destined for large-scale computing at a single large center would be shifted to a number of smaller HPC systems at as many as 25 Tier 3 centers. The letter is intended to encourage members of CASC, an advocacy group for HPC and advanced computing technology, to consider some of the letter's recommendations and help shape funding priorities for the NSF. The text of the letter is provided below.

Dear Colleagues,

I would like to share a few thoughts on why the National Science Foundation (NSF) might find it meaningful to revisit the issue of how it funds cyberinfrastructure for research computations across the computing “pyramid.”

Summary of recommendations:

The science community and industry will be well served if a portion of the federal funding for large-scale computing systems is more evenly allocated rather than concentrated in a few centers. While the national centers (Tier 1 and Tier 2), with their ultra-large systems, will continue to play an important role in meeting the capacity and capability computing needs of U.S. scientists and engineers, support for a number of university-based research computation centers would help fill existing funding gaps and address many important policy objectives, such as development of skilled HPC personnel, deeper university-industry partnerships, increased adoption of HPC systems as a discovery tool by a larger number of academic researchers as well as by industry, improved industrial competitiveness, and economic revitalization. Support for 20 to 25 such university-based Tier 3 computing centers should be provided via a competitive solicitation and a merit-based review and grant process. It is estimated that a program with a $50 million annual budget could fund, over a two-year period, 20 to 25 such university-based centers at the level of $2 million to $4 million per year.

If the number of Tier 2 centers funded is kept to a total of three, the program proposed here can take the place of the last (fourth) such proposed Tier 2 center, and in the process yield greater benefit to the U.S. science and engineering community by meeting many important needs.

It is useful to note that the Major Research Instrumentation (MRI) program at NSF has a substantially different purpose than what is proposed here. In MRI, proposals for computing hardware compete with similar proposals for a range of research equipment, so the total funding allocated to computing hardware is a small proportion of the total MRI budget. Also, MRI grants are awarded to a specific group of faculty co-PIs, and the equipment is principally for their use. What is proposed in this note is intended to meet the wider computational needs of the recipient institutions.

Here are some basic facts and observations that underlie the thought points of this note:

1. There is a growing shortage of HPC professionals (computational scientists). These are scientists and engineers who are well versed in some or all of: systems and architecture, programming, algorithms and numerical methods, and relevant domain knowledge, and who can think across disciplinary boundaries to integrate modeling ideas and computational techniques from different areas. The demand in academia and industry for such skilled people exceeds what is available today, and what academia and industry together are likely to train in the near future.

U.S. undergraduate enrollment in computer science and engineering programs is also nearly flat or falling. This shortage limits the rate of adoption of large-scale computations in industry as well as in academia.

2. There is a growing gap between the size of the systems deployed by end-user industry in their research and development divisions to support in-house research computations and the kinds of systems being funded and deployed with federal money at major national centers. Some of this difference reflects what can be used cost-effectively in industry versus what pushes the frontiers of academic research. But it is also a worrisome indicator that, with very few exceptions, the largest systems in industry are an order of magnitude smaller than the largest ones at the national centers.

3. There has been a growing emphasis on system size or peak computing capacity, now measured in petaflops as compared to hundreds of teraflops two years ago. Not as much attention is being paid to sustained performance or end-to-end computational productivity. It seems to matter more where a system sits on the TOP500 list than what the system, in conjunction with high-quality staff assistance, can deliver in terms of overall productivity for the scientists using it.

A large portion of research computation across many disciplines is done using codes from independent software vendors or community codes. Many of these codes do not scale well beyond 100 to 200 processors; sometimes the point of diminishing return is reached much sooner. Only a small number of codes (and researchers) can effectively use several hundred or thousands of processors.

4. There is a growing capacity gap: most academic institutions are unable to submit highly competitive grant applications in response to Track 2 solicitations. The physical infrastructure needed to host such systems is beyond the capacity of most universities, which reduces the number of innovative ideas that can be put forward in a more competitive process.
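The diminishing-returns behavior described in point 3 is commonly modeled by Amdahl's law, which the letter does not name but which captures why codes with even a small serial fraction stop benefiting beyond a few hundred processors. A minimal sketch (illustrative only; the 5 percent serial fraction is an assumption, not a figure from the letter):

```python
# Amdahl's law: speedup on p processors for a code whose
# serial (non-parallelizable) fraction is serial_frac.
def amdahl_speedup(serial_frac: float, procs: int) -> float:
    return 1.0 / (serial_frac + (1.0 - serial_frac) / procs)

# A code that is 95% parallel plateaus well before 1,000 processors:
for p in (100, 200, 1000):
    print(p, round(amdahl_speedup(0.05, p), 1))
# prints roughly 16.8, 18.3, 19.6 -- the asymptotic limit is 1/0.05 = 20
```

Under that assumption, adding processors beyond the 100-to-200 range buys almost nothing, which is consistent with the observation that only a small number of codes can effectively use thousands of processors.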

What can targeted support for university-based Tier 3 research computation centers yield?

1. Such centers can work with a larger number of vendors to build and deploy computer systems that are more closely targeted to the specific research computations being carried out on their respective university campuses. There will be far more input from the intended local user community in system design, along with the ability to optimize all the software needed to best meet their computational needs.

Rather than putting all our energy into training students and researchers to make the best use of a few large systems at remote locations, we should recognize that the sustained excitement that comes from actively involving students, from undergraduates to postdocs, in shaping the design and operation of compute systems is an essential ingredient in attracting more people to HPC careers and expanding the much-needed workforce.

2. A more robust and vibrant HPC market will emerge with a larger number of participants, despite the relatively smaller acquisitions. There will be a substantial multiplier effect as well: funding of $2 million to $4 million from NSF will yield an equal or even larger investment by the recipient institutions. The proposed program will almost certainly spark greater and much-needed investment by campuses themselves in HPC cyberinfrastructure.

3. Compared to the 1980s and early 1990s, we have seen far fewer start-up companies focused on HPC. Commoditization and the era of cluster computing have much to do with this. But a larger number of acquisitions and participants will expand the market and encourage more innovation, newer ideas, and more start-up companies; all of this will expand the opportunity space for all participants.

4. If TeraGrid can be expanded to include 20 such Tier 3 centers, each bringing its unique compute engines to the mix, that would provide the full-scale test-bed needed to take federally funded middleware initiatives to the level they need to reach. It would also create the opportunity to confront policy and technical challenges in authentication, authorization, and resource scheduling and sharing that the HPC community has not yet had reason to face.

There is much to learn by increasing the number of participants and providers, and by making grid computing more of a reality. The networks and storage systems now in place (file systems, data silos, etc.) make it possible to minimize file movement between locations and, when movement is needed, to do it quickly and efficiently. That has clearly been a major benefit. NSF can extend the same benefits and encourage more innovation by adding 25 Tier 3 centers that will principally serve researchers on their respective campuses, make two-way migration between campuses and national resources far more common, and share only the unique capabilities of these systems with the national community through TeraGrid allocation procedures.

Thank you for your time and attention.

Vijay K. Agarwala, Director, Research Computing and Cyberinfrastructure Information Technology Services, Penn State, University Park, PA 16802, [email protected], 814 865 2162
