I was invited by the editors of HPCwire to address some issues of cluster management, but others have recently done an admirable job of that. In a recent interview in HPCwire, for example, Tom Quinn, Director of Government Business Development at Linux Networx, spoke of the need to accurately measure the real performance of a system, focusing on true productivity, not just raw speed.
This reminded me of my own efforts as a post-doc in the 1980s to replace MIPS or MFLOPS as a comparison measure with a more meaningful unit I called MYPS: a direct measure of how fast a VAX-like mini or early supercomputer ran MY PROGRAM. As a many-body physicist, I wanted to get real physics done, not just keep a processor chugging along. As we return, it seems, to the primitive days of “build your own” supercomputing, I am not sure we have learned the lessons of high performance computing history. And so I want to address not just cluster management, but the computational science to be achieved in a well-managed cluster-computing environment.
Twenty years of advances in high performance computing have ushered in three significant changes in the conduct of computational science. First, for the most part, we have been able to concentrate more on modeling instead of programming. Second, given the large volumes of data that are available and necessary to describe complex systems, we emphasize visualization instead of graphing. Third, individual desktop computing is “good enough” for many problems, at least in education, that once required allocations of time on a national supercomputer.
At the same time, the problems we really want to solve exhaust the combined power of the world's fastest supercomputers. As a simple example, just initializing the spatial coordinates for a molecular-level computation of a drop of water would take about a decade on the Earth Simulator! Go ahead, do the math; this can be a good exercise for students learning unit conversion. And while the problems at the forefront of computational science are as large and complex as ever, the replacement of large-scale stable systems with build-your-own clusters suggests we may be backpedaling on the advances that made computational platforms usable by computational scientists themselves.
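For readers who want to assign the exercise, the back-of-the-envelope arithmetic can be sketched in a few lines of Python. The drop size, the operations charged per molecule, and the Earth Simulator's roughly 40 TFLOPS peak are all round-number assumptions, so treat the answer as order-of-magnitude only:

```python
# Rough estimate: time just to initialize the coordinates of every
# molecule in a drop of water on the Earth Simulator.
# All figures below are round-number assumptions.

AVOGADRO = 6.022e23           # molecules per mole
DROP_GRAMS = 0.05             # a ~0.05 mL (0.05 g) drop of water
MOLAR_MASS_WATER = 18.0       # grams per mole

molecules = DROP_GRAMS / MOLAR_MASS_WATER * AVOGADRO   # ~1.7e21 molecules

OPS_PER_MOLECULE = 10         # assumed: set 3 coordinates plus bookkeeping
EARTH_SIM_FLOPS = 40e12       # ~40 TFLOPS peak, circa 2002

seconds = molecules * OPS_PER_MOLECULE / EARTH_SIM_FLOPS
years = seconds / 3.156e7     # seconds per year

print(f"{molecules:.1e} molecules, roughly {years:.0f} years")
```

Varying the assumed drop size or operation count moves the answer by a factor of a few in either direction, which is exactly the point of the exercise: the answer lands in years-to-decades no matter how conservatively you count.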
But what about these home-grown cluster computing platforms that will be at the heart of any advances in computational science? Twenty years ago, frustrated by the lack of access to true supercomputing, computational physicists such as Mal Kalos at NYU and Norman Christ at Columbia doubled as computer scientists, designing and building their own computing environments because they didn't trust the marketplace to provide them the power they needed to advance the science. One memorable comment by Kalos summed up the zeitgeist of that era: “The temptation to retire from physics and become a computer scientist is strong,” he mused. “Why try to perform hard calculations when, for less work and more money, one can simply talk about them?”
By the early 1990s, access to national and even state supercomputing centers – along with significant advances in trustworthy compilers that harnessed a significant fraction of the computing power – allowed many of us to concentrate on being scientists. Yet now, in the age of commodity computing and reconfigurable clusters, it seems we have cycled back to an “if you build it, it will hum” approach to computational science. While we were computing, someone convinced administrators and funding agencies that only a select few needed and were using the high performance computing centers, and that in a time of budget cuts, most scientists should now be able to “get by” with a self-assembled and self-managed cluster. North Carolina dismantled its state supercomputing center in favor of a yet-to-be-realized promise of a state-wide grid of campus-based clusters. In the meantime, at least in the opinion of many, less science is being done.
Many years ago, Plato posed the dilemma of the philosopher king thusly: “Inasmuch as philosophers only are able to grasp the eternal and unchangeable, and those who wander in the region of the many and variable are not philosophers, I must ask you which of the two classes should be the rulers of our State? And how can we rightly answer that question?”
More recently a similar debate has been raging in the computational science literature as to the appropriate education of a new generation of computational biologists, physicists, engineers, and chemists. Is it better for well-grounded biologists to learn a little math, or for well-prepared mathematicians to learn a little biology? Is it better to teach a physicist enough about clusters and their management to get some real science done, or should we be trying to teach a clusterist just enough physics to be dangerous?
Ultimately, the crucial question of computational science is: how do you know whether the computation is to be believed? We ask whether the computation is verified – are the results reproducible? – and whether it is validated – is good science being computed? In order to take advantage of various configurations of processors and networks, the physical problems to be studied usually need to be approximated or expressed in different ways, yet many of these “uncontrolled approximations” degrade the science in the process of improving system “performance.” Isn't this the real question Quinn raised about productivity? Computational scientists want machines that work, without changing the science too much to get them to work well. Measuring the performance and productivity of a cluster or grid of clusters still comes down to the quality of the science that the system enables.
When I talk to other physicists using home-grown clusters, the conversation invariably descends into lamentations on the poor quality of compilers, which keeps us from maximizing performance AND productivity. Most of my colleagues have returned to writing their own MPI versions of their codes for both computation and data handling (sorry, OpenMP just isn't “there” yet). It's back to being programmers instead of modelers. And what about my colleagues who have managed to get their codes to run on the clusters? They tend to spend their time at meetings discussing tricks to minimize network latency in a cluster, rather than insights into the physics coming off of that cluster. Grid computing over national networks compounds the communication-to-computation latency and magnifies the challenges of resource allocation. This is progress?
One program funded by the National Science Foundation through the ATE program is looking at ways of training the technicians needed to set up and manage clusters through a consortium of community colleges (see http://highperformancecomputing.org), but its members are realizing that “cluster management” requires some understanding of the science in the codes to be run. While new tools – BCCD from Paul Gray at the University of Northern Iowa, along with earlier tools such as OSCAR and ROCKS – make it easier to set up clusters, to manage them, and to provide a moderate level of system security, even the simplest real science codes that run on clusters exceed the science and math preparation of the technician. Exemplary templates are being developed to make it easier to explore cluster computing paradigms in real science applications; some of these are accessible at the new Pathway project of the National Science Digital Library, the Computational Science Education Reference Desk (http://cserd.nsdl.org). These approaches can help.
And what of the issue of how to educate the current and future generations of scientists and engineers, starting at the undergraduate level, to use these resources effectively? Unless we are satisfied to let education lag research by ten or more years, we need to start looking at ways of making the computational experience part of the science training. Life and physical scientists need to be able to communicate with computer scientists in such a way that their collaboration gives rise to a productive computing infrastructure for the science to advance. This is one area that needs new ideas, collective effort, software development, and – I would expect – considerable funding. A renewed effort should be undertaken to develop computational science problem-solving environments that make the underlying computing as transparent as possible, while allowing direct performance monitoring of cluster and grid resources for validation and verification purposes.
One example of a content domain that has recognized the need to “let scientists be scientists” is computational chemistry, where there are well-tested applications such as Gaussian and GAMESS. The Computational Chemistry Grid Project (https://www.gridchem.org/project/faq.htm) has taken on the task of porting important applications to a cluster/grid environment for the community, and this has enabled chemists to stay chemists. Physics and biology have yet to develop a similar set of common applications, and so we seem to be left taking care of our own code and cluster management.
George Santayana is often quoted as saying that those who fail to learn from history are doomed to repeat it. Another George (Bernard Shaw) said that one thing we have learned from history is that we have learned nothing from history. Perhaps by the time the history of high performance computing is written computer scientists and computational scientists will have found a way to work with each other and to learn from each other to the advancement of both. We can always hope.
Dr. Robert M. Panoff is founder and Executive Director of The Shodor Education Foundation, Inc., a non-profit education and research corporation dedicated to the reform and improvement of mathematics and science education through the appropriate incorporation of computational and communication technologies.
Dr. Panoff has been a consultant at several national laboratories and is a frequent presenter at NSF-sponsored workshops on visualization, supercomputing, and networking. He has served on the advisory panel for the Applications of Advanced Technology program at NSF, and is a founding partner of the NSF-affiliated Corporate and Foundation Alliance.
Dr. Panoff received his B.S. in physics from the University of Notre Dame and his M.A. and Ph.D. in theoretical physics from Washington University in St. Louis, undertaking both pre- and postdoctoral work at the Courant Institute of Mathematical Sciences at New York University.