A recent article (http://news.tgc.com/msgget.jsp?mid=328047 [M328047]) by Christopher Lazou published in HPCwire espoused the virtues of the new HPCC benchmarking suite being developed at the University of Tennessee, which is intended to evaluate high performance computing systems by stressing their workload capacities above and beyond the relatively simple computational kernel used in the Linpack benchmark. The HPCC benchmarks are much more applicable to the typical workload of a modern supercomputer, where many applications do not spend their time in optimized linear algebra routines but instead require high memory and interconnect bandwidth, and may use other computational kernels such as Fast Fourier Transforms. Whether the end result is a series of individual awards for each benchmark or an overall winner of a computing decathlon is less important than the ability to gain insight into the relative strengths and weaknesses of different systems, and to determine whether there are areas in which many systems offer similar performance or whether some manufacturers' technology is far in advance of that of their competitors.
It is in this latter area that the article contained serious flaws, in that the results presented on the HPCC website were taken out of context and, depending on the benchmark in question, either the highest or lowest three numbers were taken as the best performers in that category without any consideration of the methods employed by the submitting vendor or institution to improve performance. This suite of benchmarks is intended to evaluate the performance of real computer systems, and the level of information provided on the website is sufficient to allow the reader to come to reasonable conclusions about the applicability of the benchmark results to real-life situations. For example, it may be justifiable for a small number of processors on a shared memory machine to be given over to operating system tasks, as might be usual practice in a production environment, but the submission of results for a large machine which made use of only a small number of processors in order to flatter the embarrassingly parallel per-processor scores would show up in the data, allowing the reader to judge the usefulness of these numbers in assessing computer systems.
As an example, let us consider the highest performers for the embarrassingly parallel DGEMM benchmark, which is intended to demonstrate the ability of individual processors to perform linear algebra in parallel without the need for communication between processes. Each process running the benchmark carries out its own individual matrix multiplication, the results are collected, and an average gigaflop rating is reported. The highest number on the website as of February 18th, 2005 is for a 32-processor NEC SX7, which has a published figure of approximately 140 gigaflops, whilst some way behind are an IBM pSeries 655 machine with just under 18 gigaflops and a Cray X1 with 10.9 gigaflops. These figures are, however, misleading: the popup balloon which appears over the column heading on the website and the recent article both refer to “per-CPU” performance, but the figures are per-process numbers, and both the NEC and IBM machines were running multi-threaded versions of the code. The actual per-CPU numbers for the NEC and IBM machines should be 8.8 and 4.5 gigaflops respectively, leaving the Cray as the best performer, and it is quite clear that the reported results for the two multi-threaded runs are much higher than the peak capabilities of the processors on which they ran.
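To make the arithmetic behind these corrected figures explicit, a per-CPU number can be recovered simply by dividing the reported per-process result by the number of threads each process was given, both of which are listed with each submission on the website. The small C sketch below illustrates this normalisation using the figures quoted above; the thread counts shown are assumptions consistent with the quoted per-process and per-CPU numbers (16 threads per NEC process, 4 per IBM process), not values I can vouch for independently.

    #include <stdio.h>

    /* Normalise per-process DGEMM figures to per-CPU figures by dividing
     * by the number of threads each MPI process used.  The thread counts
     * are assumptions consistent with the numbers quoted in the text,
     * not values taken directly from the HPCC submissions. */
    int main(void)
    {
        struct { const char *system; double per_process_gflops; int threads; } r[] = {
            { "NEC SX7",         140.0, 16 },  /* assumed 16 threads per process */
            { "IBM pSeries 655",  18.0,  4 },  /* assumed 4 threads per process  */
            { "Cray X1",          10.9,  1 },  /* single-threaded processes      */
        };

        for (int i = 0; i < 3; i++)
            printf("%-16s per-process %6.1f GF -> per-CPU %5.1f GF\n",
                   r[i].system, r[i].per_process_gflops,
                   r[i].per_process_gflops / r[i].threads);
        return 0;
    }

Run on these inputs, the division reproduces the 8.8 and 4.5 gigaflop per-CPU figures given above and leaves the Cray result unchanged.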
A similar analysis of the stream triad benchmarks also reveals anomalies, again involving the NEC and IBM machines, with the NEC reporting a benchmark figure of 492 gigabytes per second and the IBM a figure of 7.7 gigabytes per second; in the case of the NEC, however, this is again the result for a single process with 16 threads. Although the IBM uses 4 threads per process, it might have been possible to achieve similar performance without executing a multi-threaded version of the code: leaving 3 out of 4 processors idle would allow the remaining processor to use a much higher proportion of the available bus bandwidth between memory and processor while increasing the per-process figure. The highest per-processor performance in this category is still an NEC machine, which is able to achieve a bandwidth of over 28 gigabytes per second, almost twice that of its nearest competitor.
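The same normalisation applies to the stream triad figure as a rough check: if the 492 gigabytes per second reported for that single NEC process is shared across its 16 threads, the average works out at 492 / 16, or roughly 31 gigabytes per second per CPU, which is in the same range as the per-processor NEC figure of just over 28 gigabytes per second mentioned above rather than more than an order of magnitude beyond it.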
The earlier article also gives an erroneous explanation of the NEC's 492 gigabytes per second bandwidth, attributing it to the absence of an interconnect. Not only is this irrelevant, but it opens up the possibility of re-evaluating several benchmarks along similar lines, such as the highest random-ring performer across an interconnect or the best scalar-processor performance on the numerically intensive kernels. The small number of benchmarks in this suite is ideal for a reasoned analysis, and the level of information provided allows individuals to carry out their own comparisons and should allow some degree of repeatability of results.
The article concludes with the statement that “the difference in performance between a vector and a scalar system can be up to a factor of 60” and, while it may be true that the most efficient vector system can outperform a large scalar system on some of the benchmarks, the factor of 60 difference is not supported by the data so far submitted as part of the HPC Challenge. All of the information needed to make this analysis is available on the HPCC benchmark website, including the number of MPI_COMM_WORLD processes, the number of threads per process, and the compiler options used to generate the executables. The HPCC benchmark suite is still at version 0.8 beta, and any dissemination of erroneous information at this early stage could cause unnecessary damage to this excellent and laudable project.
Neil Stringfellow,
Benchmarking and Development Team
CSCS – Swiss National Supercomputer Center
via Cantonale
6928 Manno
Switzerland
Views expressed in this e-mail do not necessarily reflect the opinions of CSCS.