During our coverage of the High Performance Computing and Communication conference in March, HPCwire conducted an interview with Douglass Post, chief scientist of the DoD High Performance Computing Modernization Program, in which he talked about the major challenges currently facing high performance computing. As the HPC community awaits DARPA's selection of the winners of the High Productivity Computer Systems (HPCS) Phase 2 competition, it may be useful to review these challenges in order to understand some of the context of the impending decision. Below is an excerpt of this interview.
Last year, Michael van de Vanter, Mary Zosel and I gave a paper at the International Conference on Software Engineering entitled “HPC needs a tool strategy.” It's available for downloading at www.hpcmo.hpc.mil/Htdocs/HPCMO_Staff/doug_post/papers/HPCNeedsAToolStrategy.pdf. In that paper, we point out that development tools are lagging far behind what's needed. The gains in computer performance are being achieved by increasing the complexity of computer architectures. This increases the challenges associated with programming codes for these machines.
In fact, we find that most code developers consider development tools for massively parallel codes as a major risk item. There are good tools, but not enough, and the tools community has too much turnover. One major issue is that there isn't a good business model for parallel tools because the market is so small and unstable. If a tool doesn't attract enough customers, the company fails and the tool vanishes. If a tool attracts a lot of customers, the company prospers (moderately) and gets bought out. Then the tool gets focused on the priorities of the purchaser, and support for the rest of the community fades out.
Examples include Intel's purchases of Pallas and of Kuck and Associates. Pallas developed and supported VAMPIR, a good tool for MPI, for most major massively parallel platforms. After Intel bought the company, support for VAMPIR on non-Intel processors waned and has by and large disappeared. The same thing happened with the very good Kuck and Associates C and C++ compilers for massively parallel computers. Only companies that make enough to stay in business, but not enough to be really prosperous, such as Etnus, which makes the parallel debugger TotalView, and CEI, which makes the visualization tool EnSight, are surviving. An exception may be the Portland Group, which seems to have carved out a niche in the Linux cluster environment.
Universities develop a lot of tools, but graduate students and post-docs have priorities other than software support for their university careers (like graduating and finding a real job). Vendors often develop tools for their machines, but those tools usually don't work on other platforms. Most major massively parallel codes have to run on many different platforms. The developers then need to learn to use many different sets of tools.
What is needed is a stable set of tools that works on all the relevant platforms, gives code developers what they need to debug their codes and optimize their performance, and gives users what they need to set up a problem, run the code and analyze the answers. Many different solutions have been discussed, but I think the only one with a realistic chance of working is for the federal computing community to fund the development of a set of tools. If the tools are developed and supported by industry, the federal government would have to subsidize the company to provide this service. It would probably also have to “own” the source code to ensure that the tool survives the company being bought by another company, and there are other complications as well.
Another concept is a “tools consortium” with participation from the vendors. There was a tools consortium several years ago, but it died due to lack of resources. At some point, no one will buy computers they cannot use because the development tools are inadequate. Thus, the vendors have a vested interest in tools. We tried to get some interest in a joint development effort by the DARPA HPCS Phase II vendors, but without much success. The bigger vendors see tools as a source of competitive advantage. As I mentioned, the Portland Group seems to be a good provider of tools for the Linux cluster vendors who don't do their own development. This could grow if the major vendors (IBM, Cray, HP, etc.) started using the Portland tools, but I haven't seen that happening yet.
Computational tools offer society a new problem solving paradigm. They have the potential to provide, for the first time, accurate predictions of complex phenomena for realistic conditions. Before computational tools became available, predictions were generally possible only for simple model problems. Computational tools can include the effects of realistic geometries, all of the materials in the problem and all of the known physical/chemical/biological effects, and address a complete system rather than just a small part of the system. Scientific and engineering computational tools offer the potential to rapidly produce optimized designs for systems, explore the limits of those designs, accelerate scientific discoveries, predict the behavior of natural systems like the weather, analyze and plan complex operations involving thousands to millions of individual entities, and analyze and organize enormous amounts of data. However, realizing this potential has many challenges.
I see at least seven major challenges in computational science and engineering, which I list below in rough order of difficulty and importance:
1. Establishing the culture shift in the scientific and engineering community to add computational tools to the suite of existing engineering design and scientific discovery tools.
Although the use of computational science and engineering is steadily increasing, it's beginning to appear that the shift from the predominant use of traditional scientific and engineering methods to a balanced use of computational and traditional methods will take a generation or more. It's an advance that is being made one tool at a time, one field at a time and one application at a time. This is partly due to conservatism and skepticism on the part of scientists and engineers who are understandably reluctant to rely on new, unproven methods when they have traditional methods that work. Even though computational methods offer the potential to enable discoveries and optimize designs much more quickly, flexibly and accurately, every engineering and scientific discipline is different, and most tools built for one community have little or no applicability to others.
Also, computational tools are often not easy to use and require considerable judgment and expertise. Generally, new tools are not “black boxes” that new users can rely on to give them accurate answers. In almost every case, it takes considerable time and experience for users to develop a facility with computational tools comparable to what they have with their present methods. Many, if not most, computational tools are not mature in the sense of having the same level of reliability as traditional methods. Maturity will come only after the remaining six issues are dealt with and each individual community has accumulated a lot of experience. Historically, this is not surprising. In the absence of catastrophic failures of an existing methodology, almost all new problem solving methodologies and technologies, and indeed all new intellectual paradigms and technological advances, have taken a generation or two to become accepted.
2. Getting sponsors to support the development of scientific and engineering computational tools that take large groups a decade or more to develop.
The development of effective computational tools takes many years (sometimes as long as 10 to 15) of effort by significant-sized teams (10 to 50 professionals), as well as success with issues No. 3 through No. 7. This represents a large, upfront investment ($3 million to $15 million per year for 10 to 15 years, or roughly $30 million to $225 million in total) before there are large payoffs. That's one reason why it's important for code development projects to emphasize incremental delivery of capability. It's a challenge to convince potential sponsors that they should make investments of this order for an unproven methodology. Although one can make “return on investment” arguments, the numbers are only estimates until one has experience with the computational tool, and that experience only comes after the investment has been made. It's the traditional “chicken and egg” problem that has bedeviled most new cultural shifts and paradigms.
Today, if one proposes, as we are doing, to spend $100 million to build a new computational tool to design military aircraft, or plan military operations, there is considerable skepticism that the tool will be worth $100 million even if the tool would save billions of dollars by reducing the technical problems normally found late in the procurement process that lead to schedule delays and expensive design modifications. I think that if we made the same kind of proposal in 2036, people would respond with, “Why would you do it any other way? Tell us something that we don't know.” But it's 2006, not 2036, and the paradigm shift hasn't happened yet. The problem with this issue is that, unless someone supports the development of a computational tool for five to 10 years or more before it becomes available for large scale use, it will never exist, no matter how large the potential value of the tool.
3. Developing accurate computer application codes.
The development of large-scale computational scientific and engineering codes is a challenging endeavor. Case studies indicate that success requires a tightly-knit, well-led, multi-disciplinary and highly competent team to work together for five to 10 years to develop the code. The tool has to provide reasonably complete treatments of all the important effects, be able to run efficiently on all the necessary computer platforms, and produce accurate solutions. In many cases, effects that have time and distance scales that differ by many orders of magnitude have to be integrated, and general algorithms for accomplishing this have not been developed. The design of a computational tool depends crucially on the details of the domain science, and few general rules exist. A code for aircraft design is very different from a code for analysis of chemical reactions. Each code development project is a highly challenging task. The record indicates that as many as 50 percent of these types of code projects fail to achieve their initial milestones and that, in some areas, as many as 33 percent fail to ever produce anything useful. Software engineering for computational science and engineering is a brand-new field and still very immature. As has been the case with other problem solving methodologies, it will take several generations of code projects for the field to mature.
4. Verifying and validating the application codes for the problems of interest.
Verification is ensuring that the code solves the equations accurately (i.e., that the code has few defects and that the mathematics in the code is correct). Validation is ensuring that the code includes treatments of all the important effects. Results from unverified and unvalidated codes are almost certainly inaccurate and misleading. They are worse than worthless, because a user who bases a decision on them will almost certainly make the wrong one. The challenge is that verification and validation methods are both incomplete and few in number. Verification usually involves running test problems and comparing the code results with the expected results. The difficulty is that there are generally only a handful of test problems with known results for sub-sections of the code, and generally none for the integrated code. Better verification techniques are urgently needed.
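One common verification technique of the kind described here is a grid-convergence test against a problem with a known analytic solution: a second-order method should see its error shrink by a factor of about four each time the mesh spacing is halved. The following is a minimal sketch of the idea (the solver and test problem are purely illustrative, not drawn from any code discussed in the interview):

```python
import math

def solve_poisson(n):
    """Second-order finite-difference solve of u'' = f on [0, 1], u(0) = u(1) = 0,
    with f chosen so the exact solution is sin(pi * x)."""
    h = 1.0 / n
    x = [i * h for i in range(n + 1)]
    rhs = [h * h * (-math.pi ** 2) * math.sin(math.pi * xi) for xi in x]
    m = n - 1  # number of interior unknowns
    # Thomas algorithm for the tridiagonal system u[i-1] - 2*u[i] + u[i+1] = rhs[i]
    cp, dp = [0.0] * m, [0.0] * m
    cp[0], dp[0] = -0.5, rhs[1] / -2.0
    for i in range(1, m):
        denom = -2.0 - cp[i - 1]
        cp[i] = 1.0 / denom
        dp[i] = (rhs[i + 1] - dp[i - 1]) / denom
    u = [0.0] * (n + 1)  # boundary values stay zero
    for i in range(m - 1, -1, -1):
        u[i + 1] = dp[i] - cp[i] * u[i + 2]
    return x, u

def max_error(n):
    """Maximum pointwise difference between the computed and exact solutions."""
    x, u = solve_poisson(n)
    return max(abs(ui - math.sin(math.pi * xi)) for xi, ui in zip(x, u))

# Halving h should cut the maximum error by about 4x for a second-order scheme.
ratio = max_error(20) / max_error(40)
print(f"observed convergence ratio: {ratio:.2f}")  # expect roughly 4
```

A ratio far from four would signal a coding defect or an inconsistent discretization, which is exactly the kind of mathematical error verification is meant to catch.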
Validation usually means comparing the calculations with experimental data for relevant experiments in the range of the problem of interest. Getting accurate data for validation is challenging. Generally, it has been difficult to find experimentalists who are interested in producing validation data. They are more interested in using experiments to make scientific discoveries. In addition, the agencies that fund experimental research are also much more interested in funding experiments to make scientific discoveries than they are in validating codes. The cost of validation should be part of the cost of developing and deploying the computational tool, yet almost no one budgets adequate funds for validation.
Verification and validation are essential if a computational tool is to be useful. Results from an inadequately verified code likely contain mathematical errors and can't be relied upon in any way. Results from a code that hasn't been validated for the application of interest will likely miss some important effect and can't be used as the basis for a decision. Verification and validation are another area that needs to become much more mature. They are beginning to receive more attention, but the challenge is large. At a higher level, if results from a computational tool are to be useful, the uncertainties in the answers are needed as well. Methodologies for determining those uncertainties are only now beginning to be worked out.
5. Continuing to improve computer performance.
It will be crucial to continue to improve the raw performance of computers above the present levels. Fortunately, computer performance is continuing the exponential growth begun in the 1950s. Today, we have computers in the 100-teraflops range and, within a few years, we will have computers in the petaflops range. Memory sizes and storage capacity also are continuing to grow exponentially. Keeping the thermal power within acceptable levels continues to be a major issue, but solutions are being found. The growth in computer power is being accomplished partially by the introduction of massive parallelization. This is usually accompanied by distributing the memory into many discrete segments. As a result, bandwidth and memory latency remain major issues. It takes longer and longer to collect and distribute data from remote memory locations. Massive parallelization has greatly increased the complexity of machine architectures.
The extent of these problems has been masked by the benchmarks used to measure computer performance. The benchmark widely used to rank computers in order of processing power is a linear algebra package, Linpack, that solves a dense system of linear equations (http://www.netlib.org/benchmark/hpl/). It basically tests the speed of the floating point arithmetic units. Performance with this benchmark determines the ranking on the Top500 list, which ranks supercomputers in terms of computing power. The problem is that the performance of most computational science and engineering codes is not measured very well by a single benchmark like Linpack. Thus, the computer vendors are in danger of optimizing for the wrong criteria.
The Linpack benchmark doesn't degrade with increasing memory latency nearly as fast as most real applications because memory access is structured for the Linpack benchmark, whereas most real problems require some random access to memory. Also, real problems require integer arithmetic, some need a lot of memory and so on. Due to memory latency, the multiplicity of different types of computing required for a real problem, etc., real codes usually fall far short of the Linpack performance. As a result, computers that are optimized to do well with Linpack are not necessarily optimized to run most scientific and engineering codes.
This has led to an effort to develop a set of benchmarks that do a better job of representing the workload of a standard set of scientific and engineering codes (e.g., HPC Challenge; http://icl.cs.utk.edu/hpcc/). However, even HPC Challenge is not fully representative. The DoD High Performance Computing Modernization Program instead measures the performance of candidate systems by running a set of 10 applications that represent its workload.
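The memory-access point can be illustrated with a toy experiment. The two loops below do identical arithmetic on identical data and produce identical sums, yet the randomly ordered walk defeats the hardware's caching and prefetching, the very machinery that Linpack's structured access exploits. (This is an illustrative sketch, not a real benchmark; in pure Python the cache effect is muted by interpreter overhead, but the principle carries over directly to compiled codes.)

```python
import random
import time

N = 1_000_000
data = [float(i % 97) for i in range(N)]  # integer-valued floats, so sums are exact
seq_order = list(range(N))
rand_order = seq_order[:]
random.shuffle(rand_order)

def gather_sum(values, order):
    """Sum values in the order given by the index list."""
    total = 0.0
    for i in order:
        total += values[i]
    return total

t0 = time.perf_counter()
s_seq = gather_sum(data, seq_order)
t1 = time.perf_counter()
s_rand = gather_sum(data, rand_order)
t2 = time.perf_counter()

# Same arithmetic, same answer -- only the memory access pattern differs.
assert s_seq == s_rand
print(f"sequential: {t1 - t0:.3f}s   random-order: {t2 - t1:.3f}s")
```

A single-number benchmark sees only the arithmetic; a workload-representative benchmark suite like HPC Challenge also probes exactly this kind of access-pattern sensitivity.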
6. Programming complex massively parallel computers.
The growth in the complexity of computer architectures due to massive parallelization is making it very challenging to develop programs for these new computers. With programs and data strewn across hundreds of thousands of distinct processors and separated memory banks, organizing the exchange of data and the order of computations requires very complex logic, a lot of specialized programming and the ability to tolerate faults. Most programs rely on MPI, a message passing library that requires fairly low-level logic and commands. Specialized debugging and performance optimization tools are needed, as are better programming tools, better memory access models, and so on. Languages that express parallelization at higher levels of abstraction are needed, but they will face the challenge of gaining wide acceptance. Developers of large code projects that take tens of person-years to build will preferentially choose languages that are mature and available on many different platforms.
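The kind of low-level bookkeeping this imposes can be sketched without MPI itself. Below, plain Python lists stand in for the ranks of a hypothetical 1-D domain decomposition: each simulated rank owns a slab of the grid plus one ghost cell at each end, and the ghost-cell copies in `exchange_halos` are the step a real code would perform with matched send/receive calls before every stencil update. (The decomposition and stencil here are illustrative assumptions, not from any specific code.)

```python
NRANKS = 4
N_LOCAL = 8  # interior cells per simulated rank

# Global grid values 0..31, split into slabs; each slab gets a ghost cell
# at each end: [left ghost, interior..., right ghost].
grids = [[float(r * N_LOCAL + i) for i in range(N_LOCAL)] for r in range(NRANKS)]
grids = [[0.0] + g + [0.0] for g in grids]

def exchange_halos(grids):
    """Copy each rank's boundary cells into its neighbors' ghost cells.
    In a real MPI code these copies would be message-passing calls."""
    for r in range(NRANKS):
        if r > 0:                        # left ghost <- left neighbor's last interior cell
            grids[r][0] = grids[r - 1][N_LOCAL]
        if r < NRANKS - 1:               # right ghost <- right neighbor's first interior cell
            grids[r][N_LOCAL + 1] = grids[r + 1][1]

def smooth(grids):
    """One Jacobi-style averaging sweep over each rank's interior cells."""
    exchange_halos(grids)
    for r in range(NRANKS):
        old = grids[r][:]
        for i in range(1, N_LOCAL + 1):
            grids[r][i] = 0.5 * (old[i - 1] + old[i + 1])

smooth(grids)
```

Even in this tiny sketch the programmer must track ownership, ghost-cell indexing and update ordering by hand; in a production code the same bookkeeping spans thousands of processors, irregular 3-D meshes and fault handling, which is what makes higher-level parallel abstractions so attractive.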
The challenge of performance optimization is heightened by the multiplicity of computer vendors, operating systems and architectures, and by the turnover in architectures and platforms every three to five years. Most large codes need to run on several different platforms at any given moment, and have to be ported to new platforms every three to five years. This is much shorter than the 20- to 30-year life of many large application codes. In addition, the codes should ideally be optimized for performance on all of these platforms. In practice, the emphasis on performance optimization gives way to the requirement to port the code to multiple platforms. Computer vendors have also pushed a large part of the challenge of getting good performance from the hardware onto the applications programmers. Code developers now not only have to develop codes with much greater domain science complexity, but also have to cope with computer architectures of greater complexity.
Part of the performance challenge is that many solution algorithms don't scale well with the number of processors. Many codes will have to be rewritten to employ algorithms that scale better. In cases where scalable algorithms don't yet exist, the challenge of inventing them awaits the code developer.
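The scaling problem has a simple quantitative face, usually stated as Amdahl's law: if a fraction s of a code's work cannot be parallelized, the speedup on p processors is at most 1 / (s + (1 - s) / p), which saturates at 1/s no matter how many processors are added. A quick illustration:

```python
def amdahl_speedup(serial_fraction, processors):
    """Ideal speedup on `processors` CPUs when `serial_fraction` of the work is serial."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / processors)

# Even 1 percent serial work caps the speedup near 100x, however large the machine.
for p in (10, 100, 1_000, 10_000):
    print(f"p = {p:6d}   speedup = {amdahl_speedup(0.01, p):7.1f}")
```

This is why an algorithm that merely has a larger serial or poorly scaling component becomes the bottleneck on massively parallel machines, and why such algorithms must eventually be replaced rather than tuned.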
7. Using the complicated computational science and engineering codes to solve problems.
Finally, the payoff for developing the computational tools and the computer comes when the production user employs those tools to solve real problems. Getting solutions to their problems is, after all, why sponsors pay for the computers and the codes. Almost none of the largest scientific and engineering codes can be treated as “black boxes.” A skilled user with deep knowledge of the problem domain is an absolute necessity. Examples abound of users who get incorrect answers from a good code in cases where a skilled user of the same code was able to get a correct answer. Interpretation of the code results is also challenging. A large, massively parallel computation may produce terabytes of data, and extracting information from such datasets is a massive challenge. Setting problems up to run is challenging as well. It can often take three to six months to set up a mesh for a complicated problem starting from a geometric description in a CAD/CAM output file.
Thus, while computational science and engineering has great potential, there are significant challenges to realize that promise.
Douglass E. Post has been developing and applying large-scale multi-physics simulations for almost 35 years. He is the Chief Scientist of the DoD High Performance Computing Modernization Program and a member of the senior technical staff of the Carnegie Mellon University Software Engineering Institute. He also leads the multi-institutional DARPA High Productivity Computing Systems Existing Code Analysis team. Doug received a Ph.D. in Physics from Stanford University in 1975. He led the tokamak modeling group at Princeton University Plasma Physics Laboratory from 1975 to 1993 and served as head of the International Thermonuclear Experimental Reactor (ITER) Joint Central Team Physics Project Unit (1988-1990) and head of the ITER Joint Central Team In-vessel Physics Group (1993-1998). More recently, he was the A-X Associate Division Leader for Simulation at Lawrence Livermore National Laboratory (1998-2000) and the Deputy X Division Leader for Simulation at Los Alamos National Laboratory (2001-2002), positions that involved leadership of major portions of the US nuclear weapons simulation program. He has published over 230 refereed papers, conference papers and books in computational, experimental and theoretical physics and software engineering, with over 5,000 citations. He is a Fellow of the American Physical Society, the American Nuclear Society, and the Institute of Electrical and Electronics Engineers. He serves as an Associate Editor-in-Chief of the joint AIP/IEEE publication Computing in Science and Engineering.