Seven Challenges of High Performance Computing

By Nicole Hemsoth

July 21, 2006

During our coverage of the High Performance Computing and Communication conference in March, HPCwire conducted an interview with Douglass Post, chief scientist of the DoD High Performance Computing Modernization Program, in which he discussed the major challenges currently facing high performance computing. As the HPC community awaits DARPA's selection of the winners of the High Productivity Computer Systems (HPCS) Phase 2 competition, it may be useful to review these challenges in order to understand some of the context of the impending decision. Below is an excerpt of this interview.

—–

Last year, Michael van de Vanter, Mary Zosel and I gave a paper at the International Conference on Software Engineering entitled “HPC needs a tool strategy.” It's available for downloading at www.hpcmo.hpc.mil/Htdocs/HPCMO_Staff/doug_post/papers/HPCNeedsAToolStrategy.pdf. In that paper, we point out that development tools are lagging far behind what's needed. The gains in computer performance are being achieved by increasing the complexity of computer architectures. This increases the challenges associated with programming codes for these machines.

In fact, we find that most code developers consider development tools for massively parallel codes as a major risk item. There are good tools, but not enough, and the tools community has too much turnover. One major issue is that there isn't a good business model for parallel tools because the market is so small and unstable. If a tool doesn't attract enough customers, the company fails and the tool vanishes. If a tool attracts a lot of customers, the company prospers (moderately) and gets bought out. Then the tool gets focused on the priorities of the purchaser, and support for the rest of the community fades out.

Examples include the purchase of Pallas and of Kuck and Associates by Intel. Pallas developed and supported VAMPIR, a good tool for MPI, for most major massively parallel platforms. After Intel bought it, support for VAMPIR on non-Intel processors waned and, by and large, disappeared. The same thing happened with the very good Kuck and Associates C and C++ compilers for massively parallel computers. Only companies that make enough to stay in business, but not enough to be really prosperous, like Etnus, which makes the parallel debugger TotalView, and CEI, which makes the visualization tool EnSight, are surviving. An exception may be the Portland Group, which seems to have carved out a niche in the Linux cluster environment.

Universities develop a lot of tools, but graduate students and post-docs have priorities other than software support for their university careers (like graduating and finding a real job). Vendors often develop tools for their machines, but those tools usually don't work on other platforms. Most major massively parallel codes have to run on many different platforms. The developers then need to learn to use many different sets of tools.

What is needed is a stable set of tools that works on all the relevant platforms, gives code developers what they need to debug their codes and optimize their performance, and gives users what they need to set up problems, run the code and analyze the answers. Many different solutions have been discussed, but I think that the only solution that has a realistic chance of working is for the federal computing community to fund the development of a set of tools. If the tools are developed and supported by industry, the federal government would have to subsidize the company to provide this service. It would also probably have to “own” the source code to ensure that the tool would survive the company being bought by another company, and there are other complications as well.

Another concept is a “tools consortium” with participation from the vendors. There was a tools consortium several years ago, but it died due to lack of resources. At some point, no one will buy computers they cannot use because the development tools are inadequate. Thus, the vendors have a vested interest in tools. We tried to get some interest in a joint development effort by the DARPA HPCS Phase II vendors, but without much success. The bigger vendors see tools as a source of competitive advantage. As I mentioned, the Portland Group seems to be a good provider of tools for the Linux cluster vendors who don't do their own development. This could grow if the major vendors (IBM, Cray, HP, etc.) started using the Portland tools, but I haven't seen that happening yet.

Computational tools offer society a new problem solving paradigm. They have the potential to provide, for the first time, accurate predictions of complex phenomena for realistic conditions. Before computational tools became available, predictions were generally possible only for simple model problems. Computational tools can include the effects of realistic geometries, all of the materials in the problem and all of the known physical/chemical/biological effects, and address a complete system rather than just a small part of the system. Scientific and engineering computational tools offer the potential to rapidly produce optimized designs for systems, explore the limits of those designs, accelerate scientific discoveries, predict the behavior of natural systems like the weather, analyze and plan complex operations involving thousands to millions of individual entities, and analyze and organize enormous amounts of data. However, realizing this potential presents many challenges.

I see at least seven major challenges in computational science and engineering, which I list below in rough order of difficulty and importance:

1. Establishing the culture shift in the scientific and engineering community to add computational tools to the suite of existing engineering design and scientific discovery tools.

Although the use of computational science and engineering is steadily increasing, it's beginning to appear that it will take a generation or more for a paradigm shift from the predominant use of traditional scientific and engineering methods to the balanced use of computational and traditional methods to occur. It's an advance that is being made one tool at a time, one field at a time and one application at a time. This is partly due to conservatism and skepticism on the part of scientists and engineers who are understandably reluctant to rely on new, unproven methods when they have the traditional methods that work. Even though computational methods offer the potential to be able to enable discoveries and optimize designs much more quickly, flexibly and accurately, every engineering and scientific discipline is different, and most tools for one community have little or no applicability for other communities.

Also, computational tools are often not easy to use and require considerable judgment and expertise. Generally new tools are not “black boxes” that new users can rely on to give them accurate answers. In almost every case, it takes considerable time and experience for users to develop a level of facility with computational tools comparable to what they have with their present methods. Many, if not most, computational tools are not mature in the sense that they have the same level of reliability as traditional methods. Maturity will come only after the remaining six issues are dealt with, and there is a lot of experience in each individual community. Historically, this is not surprising. In the absence of catastrophic failures of an existing methodology, almost all new problem solving methodologies and technologies, and indeed all new intellectual paradigms and technological advances, have taken a generation or two to become accepted.

2. Getting sponsors to support the development of scientific and engineering computational tools that take large groups a decade or more to develop.

The development of effective computational tools takes many years (sometimes as long as 10 to 15 years) of work by significant-sized teams (10 to 50 professionals), as well as success with issues No. 3 through No. 7. This represents a large, upfront investment ($3 million to $15 million per year for 10 to 15 years, or roughly $30 million to $300 million in total) before there are large payoffs. That's one reason why it's important for code development projects to emphasize incremental delivery of capability. It's a challenge to convince potential sponsors that they should make investments of this order for an unproven methodology. Although one can make “return on investment” arguments, the numbers are only estimates until one has experience with the computational tool, and that only occurs after the investment has been made. It's the traditional “chicken and egg” problem that has bedeviled most new cultural shifts and paradigms.

Today, if one proposes, as we are doing, to spend $100 million to build a new computational tool to design military aircraft, or plan military operations, there is considerable skepticism that the tool will be worth $100 million even if the tool would save billions of dollars by reducing the technical problems normally found late in the procurement process that lead to schedule delays and expensive design modifications. I think that if we made the same kind of proposal in 2036, people would respond with, “Why would you do it any other way? Tell us something that we don't know.” But it's 2006, not 2036, and the paradigm shift hasn't happened yet. The problem with this issue is that, unless someone supports the development of a computational tool for five to 10 years or more before it becomes available for large scale use, it will never exist, no matter how large the potential value of the tool.

3. Developing accurate computer application codes.

The development of large-scale computational scientific and engineering codes is a challenging endeavor. Case studies indicate that success requires a tightly-knit, well-led, multi-disciplinary and highly competent team to work together for five to 10 years to develop the code. The tool has to provide reasonably complete treatments of all the important effects, be able to run efficiently on all the necessary computer platforms, and produce accurate solutions. In many cases, effects that have time and distance scales that differ by many orders of magnitude have to be integrated, and general algorithms for accomplishing this have not been developed. The design of a computational tool depends crucially on the details of the domain science, and few general rules exist. A code for aircraft design is very different from a code for analysis of chemical reactions. Each code development project is a highly challenging task. The record indicates that as many as 50 percent of these types of code projects fail to achieve their initial milestones and that, in some areas, as many as 33 percent fail to ever produce anything useful. Software engineering for computational science and engineering is a brand-new field and still very immature. As has been the case with other problem solving methodologies, it will take several generations of code projects for the field to mature.

4. Verifying and validating the application codes for the problems of interest.

Verification is ensuring that the code solves the equations accurately (i.e., that the code has few defects and that the mathematics in the code are correct). Validation is ensuring that the code includes treatments for all the important effects. Results from unverified and unvalidated codes are almost certainly inaccurate and misleading. They are worse than worthless because the user will almost certainly make an incorrect decision if he bases it on the results. The challenge is that both verification and validation methods are incomplete and few in number. Verification usually involves running test problems and comparing the code results with the expected results. The difficulty is that there are generally only a handful of test problems for sub-sections of the code with known results, and generally none for the integrated code. Better verification techniques are urgently needed.
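
As a minimal sketch of the kind of verification test described above (illustrative only, not drawn from any specific production code), the following C program solves the 1-D heat equation with a simple explicit scheme and compares the result against the known analytic solution; the grid size, end time and pass/fail tolerance are arbitrary choices for illustration.

/* Minimal verification sketch: solve the 1-D heat equation u_t = u_xx with an
 * explicit finite-difference scheme and compare against the analytic solution
 * u(x,t) = exp(-pi^2 t) sin(pi x). Grid size and tolerance are illustrative. */
#include <math.h>
#include <stdio.h>

#define N 101                       /* grid points on [0,1] */

int main(void)
{
    const double PI = 3.14159265358979323846;
    const double t_end = 0.1;
    double dx = 1.0 / (N - 1);
    double dt = 0.4 * dx * dx;      /* satisfies the explicit stability limit */
    double u[N], unew[N];
    double t = 0.0;

    for (int i = 0; i < N; i++)     /* initial condition u(x,0) = sin(pi x) */
        u[i] = sin(PI * i * dx);

    while (t < t_end) {
        for (int i = 1; i < N - 1; i++)
            unew[i] = u[i] + dt * (u[i-1] - 2.0 * u[i] + u[i+1]) / (dx * dx);
        unew[0] = unew[N-1] = 0.0;  /* Dirichlet boundaries */
        for (int i = 0; i < N; i++)
            u[i] = unew[i];
        t += dt;
    }

    double max_err = 0.0;
    for (int i = 0; i < N; i++) {
        double exact = exp(-PI * PI * t) * sin(PI * i * dx);
        if (fabs(u[i] - exact) > max_err)
            max_err = fabs(u[i] - exact);
    }
    printf("max error vs. analytic solution: %.3e\n", max_err);
    return max_err < 1e-3 ? 0 : 1;  /* illustrative pass/fail tolerance */
}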

Validation usually means comparing the calculations with experimental data for relevant experiments in the range of the problem of interest. Getting accurate data for validation is challenging. Generally, it has been difficult to find experimentalists who are interested in producing validation data. They are more interested in using experiments to make scientific discoveries. In addition, the agencies that fund experimental research are also much more interested in funding experiments to make scientific discoveries than they are in validating codes. The cost of validation should be part of the cost of developing and deploying the computational tool, yet almost no one budgets adequate funds for validation.
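
As a minimal sketch of the validation comparison described above, assuming purely illustrative placeholder numbers rather than real measurements, the following C program checks whether simulated values fall within two standard deviations of hypothetical experimental data.

/* Minimal validation sketch: compare simulated values against (placeholder)
 * experimental measurements and their reported 1-sigma uncertainties. */
#include <math.h>
#include <stdio.h>

int main(void)
{
    const double sim[]     = { 1.02, 1.98, 3.10, 4.05 };  /* code predictions */
    const double exp_val[] = { 1.00, 2.00, 3.00, 4.00 };  /* measured values */
    const double exp_unc[] = { 0.05, 0.05, 0.08, 0.10 };  /* 1-sigma uncertainty */
    const int n = 4;
    int consistent = 0;

    for (int i = 0; i < n; i++) {
        double err = fabs(sim[i] - exp_val[i]);
        int ok = err <= 2.0 * exp_unc[i];                  /* within 2 sigma? */
        printf("point %d: |sim - exp| = %.3f, 2*sigma = %.3f -> %s\n",
               i, err, 2.0 * exp_unc[i], ok ? "consistent" : "discrepant");
        consistent += ok;
    }
    printf("%d of %d points consistent with experiment at 2 sigma\n",
           consistent, n);
    return 0;
}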

Verification and validation are essential if a computational tool is to be useful. Results from an inadequately verified code likely contain mathematical errors, and can't be relied upon in any way. Results from a code that hasn't been validated for the application of interest likely will miss some important effect, and can't be used as the basis for a decision. Verification and validation are another area that needs to become much more mature. They are beginning to receive more attention, but the challenge is large. At a higher level, if results from a computational tool are to be useful, the uncertainties in the answers are needed. Methodologies for determining the uncertainties are just now beginning to be worked out.

5. Continuing to improve computer performance.

It will be crucial to continue to improve the raw performance of computers above the present levels. Fortunately, computer performance is continuing the exponential growth begun in the 1950s. Today, we have computers in the 100 teraflops range and, within a few years, we will have computers in the petaflop range. Memory sizes and storage capacity also are continuing to grow exponentially. Keeping the thermal power within acceptable levels continues to be a major issue, but solutions are being found. The growth in computer power is being accomplished partially by the introduction of massive parallelization. This is usually accompanied by distributing the memory into many discrete segments. As a result, bandwidth and memory latency remain major issues. It takes longer and longer to collect and distribute data from remote memory locations. Massive parallelization has greatly increased the complexity of machine architectures.

The extent of these problems has been masked by the benchmarks used to measure computer performance. The benchmark widely used to rank computers in order of processing power is a linear algebra package, Linpack, that solves a dense system of linear equations (http://www.netlib.org/benchmark/hpl/). It basically tests the speed of the floating point arithmetic units. Performance with this benchmark determines the ranking on the Top500 list, which ranks supercomputers in terms of computing power. The problem is that the performance of most computational science and engineering codes is not measured very well by a single benchmark like the Linpack benchmark. Thus, the computer vendors are in danger of optimizing using the wrong criteria.

The Linpack benchmark doesn't degrade with increasing memory latency nearly as fast as most real applications because memory access is structured for the Linpack benchmark, whereas most real problems require some random access to memory. Also, real problems require integer arithmetic, some need a lot of memory and so on. Due to memory latency, the multiplicity of different types of computing required for a real problem, etc., real codes usually fall far short of the Linpack performance. As a result, computers that are optimized to do well with Linpack are not necessarily optimized to run most scientific and engineering codes.
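
The memory-latency point can be illustrated with a small experiment. The following C sketch (array size and timing harness are arbitrary choices, not a standard benchmark) performs the same floating-point summation over sequentially and randomly indexed memory; on most machines the random-access pass runs many times slower, which is exactly the behavior a Linpack-style benchmark largely avoids.

/* Minimal sketch: the same floating-point summation over sequential vs.
 * randomly indexed memory. Array size and timing harness are illustrative. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1u << 24)                     /* ~16M doubles, well beyond cache */

static double seconds(struct timespec a, struct timespec b)
{
    return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) * 1e-9;
}

int main(void)
{
    double *x = malloc(N * sizeof *x);
    size_t *idx = malloc(N * sizeof *idx);
    struct timespec t0, t1;
    double sum = 0.0;

    for (size_t i = 0; i < N; i++) { x[i] = 1.0; idx[i] = i; }
    for (size_t i = N - 1; i > 0; i--) {     /* shuffle the index array */
        size_t j = (size_t)rand() % (i + 1);
        size_t tmp = idx[i]; idx[i] = idx[j]; idx[j] = tmp;
    }

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < N; i++) sum += x[i];        /* sequential access */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("sequential: %.3f s (sum = %.0f)\n", seconds(t0, t1), sum);

    sum = 0.0;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < N; i++) sum += x[idx[i]];   /* random gather */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("random:     %.3f s (sum = %.0f)\n", seconds(t0, t1), sum);

    free(x);
    free(idx);
    return 0;
}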

This has led to an effort to develop a set of benchmarks that do a better job of representing the workload of a standard set of scientific and engineering codes (e.g., HPC Challenge; http://icl.cs.utk.edu/hpcc/). However, even the HPC Challenge is not really representative. The DoD High Performance Computing Modernization Program measures the performance of candidate systems by running a set of 10 applications that represent its workload.

6. Programming complex massively parallel computers.

The growth in the complexity of computer architecture due to massive parallelization is making it very challenging to develop programs for these new computers. With programs and data strewn across hundreds of thousands of distinct processors and separated memory banks, organizing the exchange of data and the order of computations requires very complex logic, a lot of specialized programming and the ability to tolerate faults. Most programs rely on MPI, a message passing library that requires fairly low level logic and commands. Specialized debugging and performance optimization tools are needed. Better programming tools, better memory access models, etc., are needed. Languages that express parallelization at higher levels of abstraction are needed, but they will face the challenge of gaining wide acceptance. Developers of large code projects that take tens of person-years to develop will preferentially choose languages that are mature and that are used on many different platforms.
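
As a minimal illustration of the kind of low-level MPI logic described above, the following sketch performs a nearest-neighbor halo exchange on a 1-D domain decomposition; the local array size and dummy data are illustrative, and a real code would wrap many such exchanges in far more elaborate bookkeeping.

/* Minimal MPI sketch: nearest-neighbor halo exchange on a 1-D domain
 * decomposition, the kind of explicit, low-level data movement described
 * above. Compile with mpicc; sizes and data are illustrative. */
#include <mpi.h>
#include <stdio.h>

#define NLOCAL 1000   /* interior points owned by each rank */

int main(int argc, char **argv)
{
    int rank, size;
    double u[NLOCAL + 2];                 /* +2 ghost cells */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    for (int i = 0; i < NLOCAL + 2; i++)
        u[i] = (double)rank;              /* dummy local data */

    int left  = (rank == 0)        ? MPI_PROC_NULL : rank - 1;
    int right = (rank == size - 1) ? MPI_PROC_NULL : rank + 1;

    /* Send the rightmost owned value to the right neighbor while receiving
     * the left ghost cell from the left neighbor, and vice versa. */
    MPI_Sendrecv(&u[NLOCAL], 1, MPI_DOUBLE, right, 0,
                 &u[0],      1, MPI_DOUBLE, left,  0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Sendrecv(&u[1],          1, MPI_DOUBLE, left,  1,
                 &u[NLOCAL + 1], 1, MPI_DOUBLE, right, 1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    printf("rank %d: left ghost = %.0f, right ghost = %.0f\n",
           rank, u[0], u[NLOCAL + 1]);

    MPI_Finalize();
    return 0;
}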

The challenge of performance optimization is heightened by the multiplicity of computer vendors, operating systems and architectures, and the turnover in architectures and platforms every three to five years. Most large codes need to run on several different platforms at any given moment, and have to be ported to new platforms every three to five years. This is much shorter than the 20- to 30-year life of many large application codes. In addition, the codes should ideally be optimized for performance on all of the platforms. In reality, the emphasis on performance optimization gives way to the requirement to port the code to multiple platforms. Moreover, computer vendors have pushed a large part of the challenge of getting good performance from the hardware onto the applications programmers. Code developers now not only have to develop codes with much greater domain science complexity, but also have to cope with computer architectures of greater complexity.

Part of the performance challenge is that many solution algorithms don't scale well with the number of processors. Many codes will have to be rewritten to employ algorithms that scale better. In cases where scalable algorithms don't yet exist, the challenge of inventing them awaits the code developer.
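
The scaling limitation can be made concrete with Amdahl's law: if a fraction s of the work is inherently serial, the speedup on P processors is bounded by 1 / (s + (1 - s)/P). The following small C sketch (serial fractions and processor counts chosen for illustration) tabulates how quickly even a 1 percent serial fraction caps the achievable speedup.

/* Minimal sketch of Amdahl's law: speedup on P processors when a fraction s
 * of the work is inherently serial. Values chosen for illustration only. */
#include <stdio.h>

static double amdahl(double s, double p)
{
    return 1.0 / (s + (1.0 - s) / p);
}

int main(void)
{
    const double serial[] = { 0.001, 0.01, 0.05 };
    const double procs[]  = { 64, 1024, 16384 };

    printf("%10s %10s %10s\n", "serial", "procs", "speedup");
    for (int i = 0; i < 3; i++)
        for (int j = 0; j < 3; j++)
            printf("%10.3f %10.0f %10.1f\n",
                   serial[i], procs[j], amdahl(serial[i], procs[j]));
    return 0;
}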

7. Using the complicated computational science and engineering codes to solve problems.

Finally, the payoff for developing the computational tools and the computer comes when the production user employs the computational tools to solve real problems. Getting solutions to their problems is, after all, why sponsors pay for the computers and the codes. Almost none of the largest scientific and engineering codes can be treated as “black boxes.” A skilled user with deep knowledge of the problem domain is an absolute necessity. Examples abound of users who get incorrect answers with a good code for cases where a skilled user using the same code was able to get a correct answer. Interpretation of the code results is also challenging. A large, massively parallel computation may produce terabytes of data. Extracting information from such datasets is a massive challenge. Setting problems up to run is also challenging. It can often take three to six months to set up a mesh for a complicated problem starting with a geometric description from a CAD/CAM output file.

Thus, while computational science and engineering has great potential, there are significant challenges to realizing that promise.

—–

Douglass E. Post has been developing and applying large-scale multi-physics simulations for almost 35 years. He is the Chief Scientist of the DoD High Performance Computing Modernization Program and a member of the senior technical staff of the Carnegie Mellon University Software Engineering Institute. He also leads the multi-institutional DARPA High Productivity Computing Systems Existing Code Analysis team. Doug received a Ph.D. in Physics from Stanford University in 1975. He led the tokamak modeling group at Princeton University Plasma Physics Laboratory from 1975 to 1993 and served as head of the International Thermonuclear Experimental Reactor (ITER) Joint Central Team Physics Project Unit (1988-1990), and head of the ITER Joint Central Team In-vessel Physics Group (1993-1998). More recently, he was the A-X Associate Division Leader for Simulation at Lawrence Livermore National Laboratory (1998-2000) and the Deputy X Division Leader for Simulation at the Los Alamos National Laboratory (2001-2002), positions that involved leadership of major portions of the US nuclear weapons simulation program. He has published over 230 refereed papers, conference papers and books in computational, experimental and theoretical physics and software engineering, with over 5,000 citations. He is a Fellow of the American Physical Society, the American Nuclear Society, and the Institute of Electrical and Electronics Engineers. He serves as an Associate Editor-in-Chief of the joint AIP/IEEE publication Computing in Science and Engineering.
