The evolution of 4th generation surgery tools will help spread brain surgery to the masses, altogether dispensing with neurosurgeons in small hospitals that cannot afford their high pay.
Do you feel that I am pulling your leg? I am. But so is the HPCwire editor when he claims that 4th generation programming languages will make HPC programming available to the masses. Programming — at least, programming of a large, complex code — is a specialized task that requires a specialist — a software engineer — to the same extent that brain surgery requires a specialist. You might claim that many non-specialists do write code. It is also true that most of us take care of our routine health problems. But only a fool would try brain surgery because he was successful in removing a corn from his foot. To believe that better languages will soon make software engineers redundant is to believe that Artificial General Intelligence will progress much faster than most of us expect, or to belittle the specialized skills of software engineers.
The editor expects that the masses will program clusters on their own, while software engineers will continue to be needed for programming leading-edge supercomputers. However, the difference is not between bleeding-edge supercomputers and clusters. It is between complex programming tasks and simple programming tasks. Writing a simple program with little concern for performance and not too much worry about correctness is tantamount to removing a corn. Even writing a simple program (for example, an FFT routine) that achieves close to optimal performance is tantamount to brain surgery, even if the target system is the processor that operates my laptop. Writing a moderately complex program that is bug free with high confidence and can be used to control a critical system is also tantamount to brain surgery. Finally, writing a large, complex program that more or less satisfies specifications seems to be harder than brain surgery (large software projects seem to have a higher mortality rate than brain surgery patients).
Programming is harder when the program is more complex and when constraints of high efficiency or high confidence are stricter. Performance constraints can appear on large systems and on small ones: it can be extremely hard to shoehorn a compute-intensive application into the power and memory constraints of a cell phone. Performance can matter a lot for cluster programs that are frequently used. The programmers of the MPI or ScaLAPACK libraries have good reasons to carefully tune the performance of their libraries on clusters: these libraries consume many cycles on many clusters, and improving their performance will improve the performance of many applications. While the difficulty of performance tuning relates to the complexity of the target architecture, one can well argue that a cluster is a more complex architecture than a leading-edge supercomputer, because of the more complex software environment and the less controllable behavior of commercial LAN switches.
There is no obvious reason for cluster programs to be smaller or for confidence requirements on clusters to be less stringent than for supercomputers. However, it is true that supercomputer computations are more likely to be resource constrained than cluster computations. Indeed, a program will be run on a capability platform only if it cannot execute in a reasonable time on a smaller cluster. Such programs may tax even the resources of a leading edge supercomputer. On the other hand, performance may be less critical for cluster programs that do not consume significant hardware resources. I am not sure this represents a large fraction of cluster cycles.
The editor draws a dichotomy between MPI with C or Fortran for the high priests of supercomputing and MATLAB or SQL for the masses. This dichotomy is false. The highest-performing commercial transaction systems use SQL, but SQL by itself does not make a commercial application. Such an application will use a variety of services and frameworks, and will be written in a variety of programming languages. The SQL engine itself is written, by experts, in C or a similar language.
The same holds true for scientific and engineering computations, be it on clusters or on supercomputers: whenever possible, users will use available libraries or frameworks. The libraries will be implemented in Fortran, C or similar languages, and users will use these languages for their glue code. Libraries have been used for many decades to extend the expressiveness of low level programming languages such as Fortran or C.
Computational frameworks are increasingly replacing low level programming languages as the main mechanism for expressing computations in many domains. Such frameworks can be specialized by plugging in specific methods, often written in lower level languages, and can be extended in a variety of ways. I am not sure what the difference between a well-designed computational framework (such as Cactus) and a “fourth generation language” or “domain specific language” is. Such computational frameworks are domain specific. They emphasize higher levels of abstraction, and the execution model is often interactive. Furthermore, computational frameworks are increasingly used for codes that run on the largest supercomputers.
Programming on supercomputers, like programming on any other platform, is likely to evolve toward higher level, more powerful programming languages or frameworks. Languages or frameworks that are more extensible, have more powerful type systems with better type inference, and provide support for generic programming are safer and increase productivity. Such high-level languages are likely to have specialized idioms for specific application domains. To a large extent this is already true for languages such as Java or C#, since much programming is done using powerful domain specific classes. For example, programming a GUI in Java using Swing is very different from programming a business application using Enterprise JavaBeans, and programmers specialize in one or the other. To the same extent, languages such as C# or Java, or next generation languages, can be extended with idioms for scientific computing. This has been done for Java (www.javagrande.org).
The evolution of programming language and compiler technology provides more powerful mechanisms for language extension. The extension mechanisms encompass not just predefined and pre-coded methods. Code generation can occur at run-time or, indeed, whenever new relevant information on characteristics of the computation becomes available. The user can control at various levels the implementation mechanisms for the high-level objects and their methods and even the implementation mechanisms for control structures. The Telescoping Languages project of the late Ken Kennedy and the Fortress language project at Sun show the strength of such techniques.
A common thread in these projects is that the high level language should match well the application domain — the way application specialists think. The mapping from the logic of the application to the logic of the machine may involve multiple layers of translation, and these translations cannot be fully automated. A specialist programmer is needed to guide these mappings, by implementing run-time code and libraries, by developing preprocessors and application generators or by adding implementation annotations to the core code. The distinction between the application programmer and the language implementer becomes blurred, since application programmers can modify the language and can modify its implementation. However, such a hierarchical design supports high levels of specialization, where some programmers are more focused on application logic and others are more focused on application performance.
The parallel MATLAB solutions of The MathWorks or ISC are examples of this trend. MATLAB was not developed for HPC, and would not be a viable product if uniquely targeting HPC. The goal was to provide a notation that is closer to the way scientific programmers think. In both cases, the mapping of a MATLAB code to a parallel machine is not fully automated, and the programmer has to manually parallelize the code. Parallelism is expressed using well known (low level) paradigms: message-passing (MPI) and distributed arrays and forall loops (HPF). The parallel notation becomes part of the source code, but it should be possible (and desirable) to keep it separate, as an implementation annotation, and to make sure that it does not change the program semantics.
This general approach to high-level language design, while important for HPC, is not unique to HPC. Indeed, one can well argue that designing high-level languages specifically for high performance computing is a contradiction in terms: High-level languages should match the application domain, not the architecture of the compute platform. Developing high-level languages that satisfy the needs of HPC but are less convenient to use on more modest platforms is a waste of money.
Unique to HPC is the need for low level implementation languages that can be used to write libraries and implement the high-level objects and methods so as to run efficiently on clusters and supercomputers. This implementation language would be, today, MPI with Fortran or C. What should it be tomorrow (i.e., five years from now)? Could the Partitioned Global Address Space (PGAS) languages, such as UPC, CAF and Titanium, fulfill this role? (In a nutshell, these languages provide the same SPMD model as MPI, with multiple processes each executing on its own data. However, they also provide partitioned global arrays that can be accessed by all processes. Communication occurs through access to the non-local part of a global array; simple barrier constructs are available for synchronization.)
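The PGAS model summarized in the parenthesis above can be sketched in a few lines. What follows is only an illustration using plain Python threads, not UPC, CAF or Titanium; the names `run_spmd`, `owner`, `NPROCS` and `BLOCK` are hypothetical. Each thread plays one SPMD process, owns one block of a single logical array, writes its own block, waits at a barrier, and then reads one element from its neighbor's block.

```python
import threading

NPROCS = 4          # number of SPMD processes (threads stand in for them here)
BLOCK = 3           # elements owned by each process
N = NPROCS * BLOCK  # size of the partitioned global array


def owner(i):
    """Rank that owns global index i under a simple block partition."""
    return i // BLOCK


def run_spmd():
    """Run the same code on every rank; return (array, values read remotely)."""
    global_array = [0] * N               # one logical array, partitioned by owner()
    remote_reads = [None] * NPROCS
    barrier = threading.Barrier(NPROCS)

    def body(rank):
        # Phase 1: each rank writes only the block it owns (local access).
        for i in range(rank * BLOCK, (rank + 1) * BLOCK):
            global_array[i] = rank * 100 + i
        # Barrier: no rank reads remote data until every block is written.
        barrier.wait()
        # Phase 2: read the first element of the right neighbor's block.
        # In a PGAS language this non-local access is where communication occurs.
        neighbor = (rank + 1) % NPROCS
        remote_reads[rank] = global_array[neighbor * BLOCK]

    threads = [threading.Thread(target=body, args=(r,)) for r in range(NPROCS)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return global_array, remote_reads
```

If the barrier were removed, a rank could read its neighbor's block before it was written, and the result would depend on thread timing.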
An “implementation language” (IL) for HPC should satisfy the following requirements:
1. Performance. It should be possible to achieve close to optimal performance for programs written in IL. Recent research has shown that programs written in CAF or UPC can sometimes beat the performance of MPI programs. This is very encouraging given that the compiler technology for these languages is still immature, while implementations of MPI are very mature. There are two reasons to believe that PGAS languages could lead to better performance as compared to MPI: (1) The support by supercomputers and by the interconnect technology used on clusters (Myrinet, Quadrics, InfiniBand) of direct remote memory access entails that better communication performance can be achieved using one-sided puts and gets, rather than two-sided message-passing. The design of MPI is well suited to two-sided communication, but perhaps less suited to one-sided communication. (2) A compiler can optimize communication and avoid the overhead of message-passing libraries, further reducing communication overhead. These languages do not yet offer good support for collective communications, and for parallel I/O, but these problems should be fixed within a few years.
2. Transparency. It should be possible for a programmer to predict, with reasonable accuracy, the performance of a code. The transformation done by the compiler or the run-time should not only preserve the semantics of the code, ensuring that the computation is correct, but should also “preserve” performance, i.e., should support a simple formula for translating program execution metrics into an approximate execution time. ILs are used by programmers to deal with performance issues, but if the programmer has no way of reasoning about performance trade-offs, then performance can be achieved only through an exhaustive search through all possible program versions. PGAS languages are reasonably transparent.
3. User control. The IL should provide the programmer means of controlling how critical resources are used. In particular, for HPC it is important to exercise some control on scheduling (to achieve load balancing and prevent idle time) and on communication. Load balancing and locality (communication reduction) are often algorithmic problems. Without some control on those, one cannot achieve close to optimal performance. Scheduling and communication are under user control with PGAS languages.
4. Modularity and composability. A large application will be composed of independently developed modules. The internal details of one module should not impact other modules, and one should be able to compose modules with limited knowledge of their interface. Sequential programs support only “sequential composition”: a program invokes a module, and control is transferred to that module; upon completion control is transferred back. Programmers have been warned to avoid side effects, leading to a simple interface specification. Parallel programming also requires support for “parallel composition”, or “fork-merge”: several modules execute concurrently, and then combine back into a unified parallel computation. This is essential, for example, in multiphysics simulations, where multiple physics modules work in parallel and periodically exchange information. MPI supports fork-merge via its Communicators: a group of processes can be split into independent subgroups, and then merged back. The code executed by each subgroup is totally independent of the code executed by other subgroups. UPC and CAF have not yet implemented similar concepts and hence lack good support for modularity. (The CAF community seems to be working on this problem as part of the Fortran 2008 standard effort.)
5. Backward compatibility. Code written in IL should be able to invoke libraries written using MPI or other common message passing interfaces. While this has not been a focus of CAF or UPC, there are no inherent obstacles to compatibility.
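The fork-merge pattern of requirement 4 can be sketched without an MPI installation. The model below mimics the grouping rule of `MPI_Comm_split` (ranks with the same color form one subgroup; new ranks are ordered by key, with ties broken by the original rank) in plain Python. The functions `split` and `fork_merge` and the module names `fluid` and `solid` are hypothetical and serve only as illustration; a real code would call MPI itself, and the subgroups would run concurrently rather than one after another as they do here.

```python
def split(world, color_of, key_of=lambda rank: rank):
    """Partition the ranks in `world` into subgroups, one per color.

    Mirrors the grouping rule of MPI_Comm_split: ranks with the same color land
    in the same subgroup, ordered by key (ties broken by the original rank).
    Returns {color: [world ranks in new-rank order]}.
    """
    groups = {}
    for rank in sorted(world, key=lambda r: (key_of(r), r)):
        groups.setdefault(color_of(rank), []).append(rank)
    return groups


def fork_merge(world, color_of, modules, state):
    """Fork: run each module on its own subgroup. Merge: collect in the parent group."""
    groups = split(world, color_of)
    partial = {}
    for color, members in groups.items():
        # Each module sees only its own subgroup and its own slice of the state.
        partial[color] = modules[color]([state[r] for r in members])
    # Merge: back in the parent group, the per-module results are available together.
    return groups, partial


# Hypothetical multiphysics setup: ranks 0-3 run "fluid", ranks 4-5 run "solid".
world = list(range(6))
color_of = lambda rank: "fluid" if rank < 4 else "solid"
modules = {"fluid": sum, "solid": max}   # stand-ins for real physics kernels
state = [1, 2, 3, 4, 10, 20]             # one value held by each rank

groups, partial = fork_merge(world, color_of, modules, state)
```

Neither stand-in kernel knows about the other subgroup or about the size of the parent group, which is the independence property that the Communicator mechanism is meant to give.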
There is another set of properties that I believe are important and can be supported efficiently. Their efficient support, however, is still a matter for research. The properties are:
1. Determinism. Deterministic, repeatable execution should be the default. Nondeterminism should occur only if the programmer explicitly uses nondeterministic constructs. Races and synchronization bugs are hard to detect, and are one of the major difficulties of parallel programming. The use of a global address space worsens the problem as it becomes easier to write buggy code and harder to detect the bugs.
Transactions and transactional memory are not a solution to this problem. Transactional memory provides efficient mechanisms to ensure the atomicity of transactions, but does not enforce an order between two transactions that access the same data. Transactions are a natural idiom to express the behavior of systems where concurrency is inherent in the problem specification. An online transaction system has to handle concurrent purchasing requests and has to ensure that only one passenger gets the last seat in a plane and that the seat is assigned to the same customer whose credit card was charged — hence atomicity. Transactions are not a natural idiom for most of scientific computing. It is seldom the case that we specify a computation with two conflicting noncommutative updates, where we do not care about their execution order, as long as each executes atomically. The natural idiom for scientific computing is (partial) order, not mutual exclusion. Therefore, races and nondeterminism result most often from programming bugs. The current PGAS languages do not prevent and do not detect races. I believe that race prevention is as essential to parallel programming as memory safety is to sequential programming. Furthermore, it seems plausible that races can be prevented using suitable programming languages and suitable compiler technology, without encumbering the programmer or significantly slowing down execution. We should work hard to ensure this happens, before “race exploits” become daily occurrences.
2. Global name space. A very common idiom in scientific computing is that of a global data structure (e.g., a mesh) that is used to represent the discretization of a continuous field. A simulation step may consist of applying an updating function to this field, or computing the interactions between the field and a set of particles. On a parallel machine one needs to break the structure into patches that are allocated to individual processes, but the patches are not natural objects in the problem definition. They appear only because of the mapping to a parallel system.
Similarly, in a particle computation, it may be necessary to partition the particles into chunks in order to reduce communication and synchronization. While each particle is a natural object in the problem specification, the chunks are not. In both cases, it is more convenient to specify the logic of the computation using global data structures and a global name space. It is desirable to be able to refine such a program and partition data structures without having to change the names of the variables. The name of a variable should relate to its logical role, not to its physical location. (I, therefore, speak of a global name space, not a global address space.)
In order to control communication and parallelism, the user should be able to control where data is located. But this should not require changing the names of variables. PGAS languages do provide a global name space, but support only simple, static partitions of arrays. In cases where more complex or more dynamic partitions of global data structures are needed, one needs to explicitly copy and permute data, and change the names of variables.
3. Dynamic data partitioning and dynamic control partitioning. Parallelism is expressed using two main idioms: data parallelism and control parallelism. In data parallelism, data is partitioned. Execution gets partitioned by executing statements on the site where their main operands reside. This is done, implicitly, with languages such as HPF and the “owner computes” rule, and explicitly, with forall statements and “on” clauses. In control parallelism, control is partitioned and data is moved implicitly to where it is accessed. Both forms of parallelism are useful. (As an aside: the two are identical in single-assignment languages, such as NESL.)
The use of adaptive algorithms, such as Adaptive Mesh Refinement, or multiscale algorithms, requires that partitions be dynamic, as data structures change and the amounts of storage and work associated with a patch change. Current PGAS languages do not support dynamic repartitioning of control and data any better than MPI. Such repartitioning will require explicit copying of data and the application then has to maintain the correspondence between the logical name of a variable and its physical location. Dynamic control partitioning is easy for languages such as OpenMP that use a global name space and parallel loops for parallel control. But such languages do not provide good control of locality.
Efficient support for dynamic data and control partitioning is still a research issue. Languages with limited, static partitions (such as current PGAS languages) can be implemented efficiently, but force the user to do the work. Languages that support powerful, dynamic data and control repartitioning can too easily lead to inefficient codes. One limited but well-tested and fairly powerful step toward supporting dynamic data and control partitioning is to use process virtualization. The model provided by MPI or by the PGAS languages is that of a fixed number of processes, each with its own address space, and (usually) one thread of control. Implementations associate one process with each processor (or core) and applications are written assuming a dedicated fixed set of identical processors. A suitable run-time can be used to virtualize the processes of MPI, UPC or CAF (the AMPI system is already doing this for MPI). The run-time scheduler can map multiple virtual processes (that are actually implemented as user-level threads) onto each physical processor, and can dynamically migrate the processes and change the mapping so as to balance load or reduce communication.
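The remapping step that process virtualization makes possible can be sketched as follows. This is a minimal illustration, not the algorithm used by AMPI or any particular run-time: the greedy heuristic (heaviest virtual process first, onto the currently least-loaded processor) is chosen only for brevity, and the names `block_map`, `proc_loads`, `rebalance` and the load figures are hypothetical.

```python
import heapq


def block_map(loads, nprocs):
    """Initial static mapping: virtual process v goes to processor v % nprocs."""
    mapping = [[] for _ in range(nprocs)]
    for v in range(len(loads)):
        mapping[v % nprocs].append(v)
    return mapping


def proc_loads(mapping, loads):
    """Total load currently placed on each physical processor."""
    return [sum(loads[v] for v in vps) for vps in mapping]


def rebalance(loads, nprocs):
    """Greedy remapping: heaviest virtual process onto the least-loaded processor."""
    heap = [(0, p) for p in range(nprocs)]       # (current load, processor id)
    heapq.heapify(heap)
    mapping = [[] for _ in range(nprocs)]
    for v in sorted(range(len(loads)), key=lambda v: -loads[v]):
        load, p = heapq.heappop(heap)
        mapping[p].append(v)
        heapq.heappush(heap, (load + loads[v], p))
    return mapping


# Eight virtual processes on two processors; suppose mesh refinement has made
# virtual processes 0, 2 and 4 much more expensive than the rest.
loads = [8, 1, 8, 1, 8, 1, 1, 1]
before = block_map(loads, 2)    # static mapping, fixed before refinement
after = rebalance(loads, 2)     # mapping after migrating virtual processes
```

In this example the static mapping leaves one processor with a load of 25 and the other with 4, while the remapping yields 16 and 13. The application code is unchanged: it is still written for eight processes, and only the run-time's mapping differs.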
Process virtualization greatly enhances the modularity of complex parallel codes. Consider, for example, a multiphysics code that couples two physics modules. Normally, each module runs on a dedicated set of processors. The modules execute independently a time step of their simulation, and then exchange data. Suppose that the first module executes a dynamic mesh refinement. The internal logic of this module presumably includes code for repartitioning the mesh and rebalancing the computation when the mesh is refined. But, after the refinement, this module will take longer to execute a time step, so that the global computation becomes unbalanced. It becomes necessary to steal resources from the second module in order to rebalance the computation. This other module may not have, on its own, any need for dynamic load balancing, and very few parallel programs are written so as to accommodate a run-time change in the number of processors they use. With virtual processes, each module may be written for a fixed number of (virtual) processes, while still allowing resources to be moved from one module to another in a multiphysics computation.
Similarly, consider a multiscale computation, where it may be necessary to spawn a new parallel module that refines the computation in one region, using a finer scale, more compute intensive method. With virtual processes, resources can be reallocated within a fixed partition to the spawned module.
In summary, PGAS languages may, with some needed enhancements, be quite useful as HPC implementation languages. Additional work is needed for such languages to support modern scientific codes — work that, unfortunately, does not seem to be part of the DARPA HPCS agenda.
My discussion, so far, has focused on programming languages. However, it is important to remember that programming languages are only one of many contributors to programmer productivity — not the most important one, and not very significant, in isolation. Research on the productivity of object oriented languages has shown that the use of OO languages does not contribute much to productivity, per se. Rather, OO languages contribute indirectly in that they encourage and facilitate code reuse and other useful programming techniques. It would be useful to submit newly proposed programming languages for HPC to that same test: In what way do they support more efficient software development processes?
By far, the most important contributor to software productivity is the quality and experience of the software developer. This, by itself, already suggests that “parallel programming for the masses” is misguided. One should not attempt to develop languages and tools so that Joe Schmo is able to program clusters or supercomputers. Rather, one should educate high quality software engineers who understand programming for HPC, and provide enough rewards and stability to ensure that they stay in their profession and amass experience.
Software productivity is also heavily influenced by the quality of the process used to develop the software and by the quality of the tools and environments used by the software developers. It is important to understand what best practices in the development of HPC software are, and to ensure that these practices are broadly applied. While much of the knowledge from general software development will apply, scientific computing may need different testing and validation processes, and HPC computing may need a different emphasis and a different approach to performance tuning. One can hope that the DARPA HPCS program will result in advances in this area.
HPC software developers have traditionally used programming environments and tools that lagged behind those used in commercial software development. The HPC market has been too small to justify commercial investments in high quality HPC Integrated Development Environments (IDEs), and the government has not had the vision to support such development. Eclipse, the open source IDE framework that is now broadly used for Java development, offers a promise for change. Eclipse based IDEs for Java are as good as or better than any, and the open architecture of Eclipse supports the construction of IDEs for other languages and programming models. It has become possible to have a community effort that will create a modern, high-quality IDE for HPC. This work is already happening in national labs and universities.
One major contributor to the productivity of software developers is the availability of significant compute resources, so as to shorten the edit-compile-test cycle. The limited availability of interactive HPC platforms may be one of the most significant impediments to HPC software development. One should carefully weigh the right balance between the allocation of resources to production and the allocation to development. And one should ensure that HPC software development does not remain stuck in the era of batch processing.
In summary, there is no magic wand that will make software development for clusters or supercomputers significantly easier than it is now — to the same extent that no magic wand will make brain surgery significantly easier. The technology used in brain surgery continues to improve, enabling brain surgeons to perform more complicated surgeries, and improving the prognoses of brain surgeries. To the same extent, when we think of programming languages or tools that will enhance the productivity of HPC programmers, it is not very useful to focus on “HPC programming for dummies.” Rather, one should focus on better languages and tools for the HPC experts that will enable these experts to develop more complex or better performing software for HPC platforms.
Professor Marc Snir is the head of the Computer Science Department at the University of Illinois at Urbana-Champaign. He is currently pursuing research on parallel programming languages and environments, parallel programming patterns, and performance tuning patterns. He is also involved in the DOE funded Center for Programming Models for Scalable Parallel Computing. For more biographical information visit http://www.cs.uiuc.edu/homes/snir/.