Two weeks ago, Vijay K. Agarwala, director of Research Computing and Cyberinfrastructure, Information Technology Services at Penn State, sent a letter to all members of the Coalition for Academic Scientific Computation (CASC) regarding NSF funding of HPC at university-based research centers. In it, Vijay proposed an alternative strategy in which a portion of the NSF funding destined for large-scale computing at a single large center would be shifted to a number of smaller HPC systems in as many as 25 Tier 3 centers. The letter is intended to encourage members of CASC, an advocacy group for HPC and advanced computing technology, to consider some of the letter’s recommendations and help shape funding priorities for the NSF. The text of the letter is provided below.
Dear Colleagues,
I would like to share a few thoughts on why the National Science Foundation (NSF) might find it meaningful to revisit the issue of how it funds cyberinfrastructure for research computations across the computing “pyramid.”
Summary of recommendations:
The science community and industry will be well served if a portion of the federal funding for large-scale computing systems is allocated more evenly rather than concentrated mostly in a few centers. While the national centers (Tier 1 and Tier 2) with their ultra-large systems will continue to have an important role in meeting the capacity and capability computing needs of U.S. scientists and engineers, support for a number of university-based research computation centers will help fill existing funding gaps and address many important policy objectives and goals: development of skilled HPC personnel, deeper university-industry partnerships, increased adoption of HPC systems as a discovery tool by a larger number of academic researchers as well as by industry, improved industrial competitiveness, and economic revitalization. Support for 20 to 25 such university-based Tier 3 computing centers should be provided via a competitive solicitation and a merit-based review and grant process. It is estimated that a program with a $50 million annual budget could fund, over a two-year period, 20 to 25 such university-based centers at the level of $2 million to $4 million per year.
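As a rough check on this arithmetic, the bounding cases are easy to work out; the exact phasing of awards over the two years is left open here, and the short calculation below is illustrative only, using just the center counts and per-center award sizes stated above:

    # Illustrative bounding cases only; center counts and award sizes as stated above.
    annual_budget = 50_000_000                 # proposed annual program budget, in dollars
    for centers in (20, 25):                   # proposed number of Tier 3 centers
        for award in (2_000_000, 4_000_000):   # proposed annual award per center
            total = centers * award
            note = "within" if total <= annual_budget else "exceeds"
            print(f"{centers} centers x ${award/1e6:.0f}M/yr = ${total/1e6:.0f}M/yr ({note} $50M)")

The lower-end combinations fit within $50 million per year, while the upper-end combinations exceed it, which is why the program envisions phasing the awards in over two years and mixing award sizes rather than funding every center at the maximum level from the start.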
If the number of Tier 2 centers funded is kept to a total of three, the program proposed here can take the place of the last (fourth) such proposed Tier 2 center, and in the process yield greater benefit to the U.S. science and engineering community by meeting many important needs.
It is useful to note that the major research instrumentation (MRI) program at NSF has a substantially different purpose than what is proposed here. In MRI, proposals for computing hardware compete with similar proposals for a range of research equipment. The total funding allocated to computing hardware is therefore a small proportion of the total MRI budget. Also, MRI grants are awarded to a specific group of faculty co-PIs, and the equipment is principally for their use. What is proposed in this note is intended to meet the wider computational needs of the recipient institutions.
Here are some basic facts and observations that underlie the thought points of this note:
1. There is a growing shortage of HPC professionals (computational scientists). These are scientists and engineers who are well versed in some or all of the following: systems and architecture, programming, algorithms and numerical methods, and domain knowledge, together with the ability to think across disciplinary boundaries and to integrate modeling ideas and computational techniques from different areas. The demand in academia and industry for such skilled human resources exceeds what is available today and what academia and industry together are likely to train in the near future.
There is also nearly flat or falling U.S. undergraduate enrollment in computer science and engineering programs. This shortage affects the rate of adoption of large-scale computation in industry as well as in academia.
2. There is a growing gap between the size of the systems that end-user industries deploy in their research and development divisions to support in-house research computations and the size of the systems being funded and deployed at the major national centers. Some of this difference reflects what can be used cost-effectively in industry versus what pushes the frontiers of academic research for the future. But it is also a worrisome indicator when, with very few exceptions, the largest systems in industry are an order of magnitude smaller than the largest ones at the national centers.
3. There has been a growing emphasis on system size or peak computing capacity, now measured in petaflops as compared to hundreds of teraflops two years ago. Far less attention is being paid to sustained performance or end-to-end computational productivity. It seems to matter more where a system sits on the TOP500 list than what the system, in conjunction with high-quality staff assistance, can deliver in terms of overall productivity for the scientists using it.
A large portion of research computation across many disciplines is done using codes from independent software vendors or community codes. Many of these codes do not scale well beyond 100 to 200 processors, and sometimes the point of diminishing returns is reached much sooner (a brief illustrative sketch follows this list). Only a small number of codes, and researchers, can effectively use several hundred or thousands of processors.
4. There is a growing capacity gap, i.e., an inability on the part of most academic institutions to submit highly competitive grant applications in response to Track 2 solicitations. The physical infrastructure needed to host such systems puts them beyond the reach of most universities, thus reducing the number of innovative ideas that can be put forward in a more competitive process.
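To make the diminishing-returns point in item 3 concrete, here is a minimal sketch using Amdahl’s law; the 0.5 percent serial fraction is an illustrative assumption rather than a measured figure for any particular code:

    # Amdahl's law: speedup(p) = 1 / (s + (1 - s)/p), where s is the serial fraction.
    # s = 0.005 (0.5 percent serial work) is an illustrative assumption only.
    s = 0.005
    for p in (64, 128, 256, 512, 1024):
        speedup = 1.0 / (s + (1.0 - s) / p)
        efficiency = speedup / p
        print(f"{p:5d} processors: speedup {speedup:6.1f}, parallel efficiency {efficiency:5.1%}")

Even with that small a serial fraction, parallel efficiency falls below one half somewhere between 128 and 256 processors, which is in line with the 100-to-200-processor ceiling that many vendor and community codes reach in practice.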
What can targeted support for university-based Tier 3 research computation centers yield?
1. Such centers can work with a larger number of vendors to build and deploy computer systems that are more closely targeted to the specific research computations being carried out on their respective university campuses. There will be far more input from the intended local user community in system design, as well as the ability to optimize all the software needed to best meet their computational needs.
Rather than putting all our energy into training students and researchers on how best to use a few large systems at remote locations, we should recognize that the sustained excitement that comes from actively involving students, from undergraduates to postdocs, in shaping the design and operation of computing systems is an essential ingredient in attracting more people to HPC careers and thus expanding the much-needed workforce committed to it.
2. A more robust and vibrant HPC market will emerge with a larger number of participants, despite the relatively smaller acquisitions. There will be a substantial multiplier effect as well; funding of $2 million to $4 million from NSF will yield an equally substantial or even larger investment by the recipient institutions. The proposed program will almost certainly spark greater and much-needed investment by campuses themselves in HPC cyberinfrastructure.
3. Compared to the 1980s and early 1990s, we have seen far fewer start-up companies focused on HPC. Commoditization and the era of cluster computing have much to do with this. But a larger number of acquisitions and participants will expand the market and encourage more innovation, newer ideas, and more start-up companies; all of this will expand the opportunity space for all participants.
4. If TeraGrid can be expanded to include 20 such Tier 3 centers, each bringing its unique compute engines to the mix, that would provide the full-scale test bed needed to take federally funded middleware initiatives to the level they need to reach. There will be an opportunity to confront policy and technical challenges in authentication, authorization, and resource scheduling and sharing that the HPC community has not yet had reason to confront.
There is much to learn by increasing the number of participants and providers, and by making grid computing more of a reality. The networks and storage systems (file systems, data silos, etc.) now in place make it possible to avoid moving large numbers of files between locations and, when movement is needed, to do so quickly and efficiently. That has clearly been a major benefit. NSF can extend the same benefits and encourage more innovation by adding 25 Tier 3 centers that will principally serve researchers on their respective campuses, make two-way migration between campus and national resources far more common, and share only the unique capabilities of these systems with the national community through TeraGrid’s allocation procedures.
Thank you for your time and attention.
Vijay K. Agarwala, Director, Research Computing and Cyberinfrastructure, Information Technology Services, Penn State, University Park, PA 16802, [email protected], 814 865 2162