Readers add their opinions on an article by Dr. Bob Panoff that appeared last week in HPCwire [http://news.taborcommunications.com/msgget.jsp?mid=363815&xsl=story.xsl]. Dr. Panoff responds to the comments as well.
Editor,
The April 8 edition of HPCWire carried two items I found interesting when considered as part of a larger benchmarking issue. The first, regarding a new supercomputer, said, in part, "HP announced that the U.S. Department of Energy's new state-of-the-art research supercomputer has completed a two-year acceptance process by the agency's Pacific Northwest National Laboratory (PNNL)." Another said, in part, "What our students are learning, however, just from these three examples, is that the measurement of 'high performance' is highly contextual, and the same machine configuration and assumptions for one exemplar completely break down for the others."
Students should be made aware that a "two-year acceptance process" is not the ideal for most HPC shops (a Government agency can't pay its vendor until a system passes "Acceptance") and that benchmark codes should be not only comprehensive in their performance assessment but also inexpensive to run. Nothing in Dr. Panoff's article implied that students' work would lead to excessive benchmarking procedures, but there was no mention of the economics of the benchmarking process. The HPC world is nothing if not competitive.
Gary Wohl
NOAA, National Weather Service
———-
Hello,
I just read the article on HPCwire that mentions the parallel applications you assign to your students as homework, and I have a few comments:
– All the applications you mention have a large compute-to-communicate ratio. Even the N-body problem, when N is large enough, becomes communication friendly. This is not always the case, and I see application kernels where the ratio is almost constant. For example, when using spectral methods, the FFT computation requires transposing the whole dataset: communication is O(N) while computation is O(N log N). I believe the transposition of a distributed 2D array is a good, simple illustration (see the first sketch after this list).
– The only thing embarrassing about Embarrassingly Parallel applications is that they do not always parallelize all that well, e.g. ray tracing. Ray tracing is often mentioned in the EP category, but when the work per pixel is not uniform it raises some interesting load-balancing problems. Often, when considering ray tracing as EP, one implicitly assumes that the database has been replicated on each thread. Now consider a model that changes dynamically... And what if the model is so large that each thread holds only a portion of it?
– Also, I am surprised you mention the master-slave model for parallel programming. I did not think anybody considered this approach anymore. My experience is that, as you increase the number of threads, the "master" becomes a serious bottleneck, with all the "slaves" spin-waiting for it (the second sketch after this list shows one masterless alternative). Free the slaves, get rid of the masters 😉
– Finally, you seem to consider only a cluster architecture. There are large shared-memory systems available, and the applications will look different (better, I think) on an SMP architecture.
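To make the transpose example concrete, here is a minimal sketch of a distributed 2D transpose, assuming MPI, a square n x n array distributed by complete rows, and n divisible by the number of ranks. The function and buffer names are hypothetical, not from the article; this is one way to do it, not the only way.

/* Minimal sketch of a distributed 2D transpose with MPI (hypothetical
 * names; assumes a square n x n array, row-distributed, n % nprocs == 0). */
#include <mpi.h>
#include <stdlib.h>

void distributed_transpose(double *local, double *out, int n, MPI_Comm comm)
{
    int rank, nprocs;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &nprocs);

    int rows = n / nprocs;            /* rows owned by this rank         */
    int blk  = rows * rows;           /* elements per rank-to-rank block */
    double *sendbuf = malloc((size_t)rows * n * sizeof(double));
    double *recvbuf = malloc((size_t)rows * n * sizeof(double));

    /* Pack: block p holds the local columns that will belong to rank p. */
    for (int p = 0; p < nprocs; p++)
        for (int i = 0; i < rows; i++)
            for (int j = 0; j < rows; j++)
                sendbuf[p * blk + i * rows + j] = local[i * n + p * rows + j];

    /* One collective moves every local element exactly once: the
     * communication volume is O(N) for a dataset of N = n*n elements,
     * while an FFT over the same data costs O(N log N) flops. */
    MPI_Alltoall(sendbuf, blk, MPI_DOUBLE, recvbuf, blk, MPI_DOUBLE, comm);

    /* Unpack with a local transpose of each received block. */
    for (int p = 0; p < nprocs; p++)
        for (int i = 0; i < rows; i++)
            for (int j = 0; j < rows; j++)
                out[j * n + p * rows + i] = recvbuf[p * blk + i * rows + j];

    free(sendbuf);
    free(recvbuf);
}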
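And to show what "freeing the slaves" can look like for the non-uniform ray-tracing workload above, here is a minimal shared-memory sketch, assuming POSIX threads and C11 atomics, with a hypothetical render_tile() standing in for the real per-tile work. Every worker claims the next tile with one atomic fetch-add, so uneven tiles balance themselves and there is no master to spin-wait on.

/* Minimal sketch of masterless dynamic load balancing for a renderer
 * (hypothetical render_tile(); assumes POSIX threads and C11 atomics). */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define NUM_TILES   256
#define NUM_THREADS 8

static atomic_int next_tile = 0;   /* shared work counter, no master   */

static void render_tile(int tile)  /* stand-in for the real per-tile work */
{
    (void)tile;                    /* ...trace rays for this tile...   */
}

static void *worker(void *arg)
{
    (void)arg;
    for (;;) {
        int tile = atomic_fetch_add(&next_tile, 1);
        if (tile >= NUM_TILES)
            break;                 /* all tiles claimed: exit cleanly  */
        render_tile(tile);
    }
    return NULL;
}

int main(void)
{
    pthread_t threads[NUM_THREADS];
    for (int t = 0; t < NUM_THREADS; t++)
        pthread_create(&threads[t], NULL, worker, NULL);
    for (int t = 0; t < NUM_THREADS; t++)
        pthread_join(threads[t], NULL);
    printf("rendered %d tiles\n", NUM_TILES);
    return 0;
}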
Best regards,
Jean-Pierre Panziera
Principal Engineer, Silicon Graphics Inc.
———-
First, I agree with many of your comments, and appreciate the time you took to reply.
I only considered the cluster examples since I was commenting in large part on the continuing spate of efforts to define HPC in terms of clusters and the Grid. From my perspective, the shared-memory architecture has proven to be more productive in the main over the years, and I look forward to those machines competing for more attention. I am actually a holder of one of the early Cray Gigaflop awards... I had forgotten about that whole Gigaflop awards effort in putting together my thoughts, but perhaps my subconscious was trying to remind me. Perhaps my article was really a cry to return to those good old days when vendors welcomed the challenge of having their machines pushed to the limits by real codes.
As for the set of examples I gave, my listing was part of the challenge: as I wrote in the piece, I wanted to solicit from the community a richer set of examples than those we are using now. The only way to define “richer” was to expose what we are using now, so folks will send in their better examples. From some of the mail I received, I think I succeeded in stirring the pot a bit.
Lastly, the compute-to-communication ratio is not large for N-body problems unless one switches algorithms as N gets large. For a direct force calculation, the exchange of data between processors dwarfs the computation time as N gets large. Again, this has been kept as a simple example for pedagogical reasons.
With best wishes,
Bob Panoff
———-
Thanks for your answer,
I appreciate what you are saying about shared memory architectures and as you know we (SGI) provide such systems with hundreds of procs.
Regarding the N-body problem, let me explain why I believe it becomes communication friendly as N increases when using the direct force calculation. Consider P threads, where each thread "owns" M = N/P bodies. The M forces a thread needs to compute involve M*(N-1) elementary contributions, that is, roughly N^2/P. Each thread receives the data for the N*(P-1)/P bodies owned by the (P-1) other threads. Thus, for a given thread, the compute/communicate ratio is O(N). You can add one level of sophistication using the reciprocity principle (action/reaction): you then halve the computation but must send back the force contributions, doubling the communication. Overall this divides the compute/communicate ratio by 4, but it still remains O(N). The favorable compute/communicate ratio makes this kind of computation a good candidate for clusters and FPGAs (a minimal sketch follows).
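Here is a minimal sketch of one such direct-force step, assuming MPI, unit masses with a softened force, positions packed x,y,z, and N divisible by the number of ranks; the names are hypothetical. The comments count the two terms in the ratio above.

/* Minimal sketch of one direct-force N-body step with MPI. Per step each
 * rank receives the ~N positions owned by the other ranks but computes
 * M*(N-1) ~= N^2/P pairwise interactions, so compute/communicate ~ N/P,
 * i.e. O(N) at fixed P. */
#include <mpi.h>
#include <math.h>

void force_step(const double *mypos,  /* 3*M local positions          */
                double *allpos,       /* 3*N gathered positions (out) */
                double *force,        /* 3*M accumulated forces (out) */
                int N, int M,         /* M = N / nprocs               */
                MPI_Comm comm)
{
    int rank;
    MPI_Comm_rank(comm, &rank);

    /* Communicate: one collective gathers all N positions. */
    MPI_Allgather(mypos, 3 * M, MPI_DOUBLE,
                  allpos, 3 * M, MPI_DOUBLE, comm);

    /* Compute: M*(N-1) elementary contributions on this rank. */
    const double eps2 = 1e-9;            /* softening, avoids r = 0      */
    for (int i = 0; i < M; i++) {
        int gi = rank * M + i;           /* global index of local body i */
        double fx = 0.0, fy = 0.0, fz = 0.0;
        for (int j = 0; j < N; j++) {
            if (j == gi) continue;       /* skip self-interaction        */
            double dx = allpos[3*j]   - mypos[3*i];
            double dy = allpos[3*j+1] - mypos[3*i+1];
            double dz = allpos[3*j+2] - mypos[3*i+2];
            double r2 = dx*dx + dy*dy + dz*dz + eps2;
            double w  = 1.0 / (r2 * sqrt(r2));   /* unit masses, G = 1   */
            fx += w * dx; fy += w * dy; fz += w * dz;
        }
        force[3*i] = fx; force[3*i+1] = fy; force[3*i+2] = fz;
    }
}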
I might have misunderstood what you are doing since I do not see how it is possible that “the exchange of data between processors dwarfs the computation time as N gets large.”
Best regards,
Jean-Pierre Panziera, SGI
———-
Dear Dr. Panoff,
The following comments are not claimed to represent the views of SAIC or those of its customers.
Many thanks for your recent excellent note in HPCWire ([1]) on practical performance analysis and tuning.
As a computational performance analysis and tuning professional, everything you said rang true to me. I was particularly impressed with the emphasis you put on the end-to-end systems engineering context.
Here are some distillations that have continued to inform my reasoning about performance.
1. The production hydrodynamics codes on which I worked typically use ~1% of peak. In the load-store architectures on which these codes ran (clusters of SGI Origin 2000, SGI Origin 3000, Compaq Alpha, IBM SP2, and commodity- and custom-Intel units), the processors were data-starved because memory-to-register transfers took at least 10 times as long as it took to operate on a datum, once that datum was in a register. 10% of peak is therefore the best attainable performance on codes that have to move data to and from general memory; performance is all downhill from there. Until this problem is solved in hardware, 10% of peak will remain usable peak. (The first sketch after this list illustrates the arithmetic.)
2. Real codes rarely scale on tree-structured networks.
3. The performance of real codes, including the performance of at least one well known Monte Carlo code, is often enough sensitive to problem type, not just problem size.
4. Platform architecture matters deeply, but the syntax and semantics of high-level languages badly underdetermine the structure and dynamics of the machine codes on these platforms.
5. Anyone can implement an inefficient data structure that can outwit even the smartest compiler optimization. (The second sketch after this list is one example.)
6. Once a legacy code grows to more than about 50,000 source lines of code, it becomes institutionally impossible to get the resources to restructure it for performance, no matter how poor the code is admitted to be.
7. In any sufficiently large cluster, hardware is free (compared to the cost of the labor lost in trying to use it efficiently).
8. There is no substitute for hardware-level performance monitoring tools, but most vendors provide little to no support for them.
9. “Linux” is the name of at least 10 *different* operating systems, each with its own performance trades.
10. Optimizing performance by altering process management on a platform is the beginning of a re-design of the operating system. At the end of that effort, you will know less than you need to about operating systems, and if you're exceptionally lucky, your code will run no slower than it did when you started.
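A small illustration of item 1, as a sketch only with hypothetical sizes: a STREAM-style triad does 2 flops per iteration but needs 2 loads and 1 store to general memory, so when a memory access costs roughly 10 times an arithmetic operation, the loop runs at a small fraction of peak no matter how well it is scheduled.

/* STREAM-style triad: 2 flops vs. 3 memory transfers per iteration,
 * so a ~10x memory-to-ALU cost gap caps the loop well below peak. */
#include <stddef.h>

void triad(double *a, const double *b, const double *c,
           double s, size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[i] = b[i] + s * c[i];   /* 2 flops vs. 2 loads + 1 store */
}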
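And a small illustration of item 5, again a sketch with hypothetical types: the same reduction over a pointer-chasing linked list and over a contiguous array. No optimization flag can turn the first loop into the second; each step of the list walk is a dependent load the compiler cannot hoist, reliably prefetch, or vectorize.

/* The data structure, not the compiler, decides whether the loop
 * streams or stalls. */
#include <stddef.h>

struct node { double val; struct node *next; };

double sum_list(const struct node *p)       /* cache-hostile: every step */
{                                           /* is a dependent load       */
    double s = 0.0;
    for (; p != NULL; p = p->next)
        s += p->val;
    return s;
}

double sum_array(const double *v, size_t n) /* cache-friendly: streams,  */
{                                           /* prefetches, vectorizes    */
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += v[i];
    return s;
}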
For several years, I taught a performance analysis and tuning recipe in various parallel computing workshops. Although the recipe worked if followed, I found it was all but impossible to get the code developers (typically physicists with no software engineering background) to adopt it.
Jack
—
[1] R. Panoff, "Benching the Benchmarks: Measuring Performance in HPC …," HPCWire, M363815.
Jack K. Horner
SAIC
P. O. Box 3827
Santa Fe, NM 87501
Voice: 505-455-3804
Fax: 505-455-0382
email: [email protected]
Have comments on an HPCwire article? Please direct comments/questions to HPCwire editor Tim Curns at [email protected].