Hard Questions While Waiting For The HPCS Downselect

By the High-End Crusader

May 5, 2006

Proposals from Cray Inc., IBM, and Sun Microsystems for Phase III awards from DARPA's High-Productivity Computing Systems (HPCS) program are due as these words are being written. To differing degrees, the vendors are understandably coy about precisely what their proposals will contain. In a fiercely competitive environment where perception management is key, a loosely guarded detail might open up one of the proposals to successful sniping by a rival vendor.

But an innovative petaflops system architecture that, in the most optimistic case, might reinvigorate the stagnant field of high-end computing is hardly in the same category as a stealth fighter aircraft or a classified graph algorithm. First, taxpayers have at least some right to know how their tax dollars are being spent. Second, one of the critical questions in postmodern computing is this: Is there some form of thread migration, in shared-memory and distributed-memory architectures, respectively, that effectively solves—or fails to solve—the dynamic load-balancing problem? This question deserves to be answered in full view of the entire HPC community. While it may be naive to expect a final resolution of this perennial question in short order, we can demand more openness from vendors as the evidence accumulates.

After the awards are made, it is likely that the government will engage in some perception management of its own, perhaps arguing that the innovative system architectures of the two winners (people expect two winners) are prospectively likely to be as good as, or even better than, the forthcoming Japanese 10-petaflops heterogeneous-processing machine, which appears to this observer to be architecturally quite impressive. Notwithstanding the vendors' undisputed proprietary rights, your correspondent would like to preemptively _shame_ each Phase III winner into coming clean soon after the awards are made.
He asks each successful PI, at such time, to explain the broad outlines—with at least some details—of his company's system architecture. Obviously, there are major differences in the degree to which the various vendors have been coy up to now, but in all three cases one can reasonably ask: What are your aspirations for this machine? What do you really think it can do? What are you really going to put into it?

— A Problem And Its Solution

A significant source of performance and programmability shortfalls in current "high-end" system architectures is inter-application and intra-application diversity. Since we can safely leave the embarrassingly _local_ (not "parallel"!) applications to clusters and their ilk, the applications we consider have the following characteristics: 1) they compute over globally dispersed data structures; 2) they engage in long-range communication, sometimes frequently; 3) they contain a dynamic heterogeneous mix of algorithm characteristics due to dynamic variation of a) the amount, kind, and granularity of the application's parallelism, and b) the amount, kind, and granularity of the application's locality; and 4) in both classical and postmodern problems, there are frequently rather serious load-balancing problems that generally require flexible, dynamic load-balancing mechanisms to resolve.

We can easily imagine an ideal heterogeneous system architecture that would gracefully deal with all these forms of application diversity.
Such a machine would have the following characteristics: 1) it would have high global system bandwidth—an expensive, critical, finite system resource—a) to permit tolerating the latency of long-range communication, etc., and b) to ease the programming burden by mitigating the performance nonuniformity of memory accessing; 2) it would have a wide variety of _combined_ compile-time, runtime, and hardware parallelism mechanisms that would adapt to dynamic variation of the amount, kind, and granularity of an application's parallelism; 3) it would have a wide variety of _combined_ compile-time, runtime, and hardware locality mechanisms that would adapt to dynamic variation in the amount, kind, and granularity of an application's locality; and 4) it would include—as mandated by your correspondent's architectural tradition—various work queues of fine-grained and medium-grained parallel activities (threads, chores, threadlets, etc.) that would be dynamically self-scheduled by various instances of various "virtual processors" (VPs).

This notion of VP is fully generic. An individual VP could be, for example, 1) a multithreaded-processor instruction-stream processing resource, 2) a vector processor, 3) a RISC superscalar processor, or 4) _any_ processing element with an assignable (program counter, frame pointer) tuple, which specifies the data (or execution) context. Work queues, i.e., unordered collections of ready-to-execute continuations, were originally popularized by dataflow folk and later enthusiastically recommended by (massively) multithreaded folk. Perhaps vector processors—which come from an entirely different architectural tradition—should ease up on static scheduling by their scalar units, and try a little dynamic self-scheduling instead. Just a suggestion.

Note: A vector threadlet can be much smaller than a full vector thread.
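To make the work-queue idea concrete, here is a minimal sketch (in Python, with invented task names and a toy round-robin scheduler of this sketch's own devising, not any vendor's mechanism) of the work-stealing discipline such self-scheduled queues typically use: each virtual processor pops work from the bottom of its own deque and, when idle, steals from the top of someone else's.

```python
import random
from collections import deque

def run_with_work_stealing(root_tasks, num_workers=4, seed=0):
    """Toy simulation of work stealing over per-worker deques.

    Each task is a (name, children) pair; executing a task pushes its
    children onto the executing worker's own deque.  Workers pop from
    the bottom of their own deque (LIFO) and steal from the top of a
    random nonempty victim's deque (FIFO)."""
    rng = random.Random(seed)
    queues = [deque() for _ in range(num_workers)]
    queues[0].extend(root_tasks)          # all work starts on worker 0
    executed = []
    while any(queues):
        for w in range(num_workers):
            if queues[w]:
                name, children = queues[w].pop()        # own work: bottom end
            else:
                victims = [v for v in range(num_workers) if queues[v]]
                if not victims:
                    continue
                name, children = queues[rng.choice(victims)].popleft()  # steal: top end
            executed.append(name)
            queues[w].extend(children)    # spawned work stays local
    return executed
```

The LIFO/FIFO asymmetry is the key design choice: a worker reuses its own freshest (cache-warm) work, while thieves take the oldest, typically largest, tasks.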
In any case, work queues in shared-memory systems are known to allow flexible, dynamic load balancing by using the _work-stealing_ technique popularized by Blumofe in the MIT Cilk project. Ah, but any cost-effective system architecture must find some way to use all these parallelism, locality, and load-balancing mechanisms to achieve high performance without over-consuming the precious, finite global system bandwidth resource.

— Heterogeneous Parallelism and Heterogeneous Locality

Much of the misunderstanding of parallelism and locality comes from the confusion between 1) parallelism and locality as _attributes_ of programs and computations, and 2) parallelism and locality as various architectural (and software) _mechanisms_ that allow us to extract, generate, shape, and exploit parallelism and locality, respectively (although sometimes—to really confuse us—parallelism mechanisms are used to exploit locality). Compounding this misunderstanding is a heap of partially obsolete and potentially misleading terms such as "instruction-level parallelism" (ILP), "data-level parallelism" (DLP), and "thread-level parallelism" (TLP), not to mention such hoary favorites as "temporal locality" and "spatial locality". While there is no hope of escaping this muddled terminology, we might at least try to clean it up.

For parallelism, massive abstraction simplifies the discussion. Consider the flow-dependence graph of a computation, where the graph's vertices are elementary operations. We can take this as the finest-granularity view of the parallelism in the computation. Absence of a direct (or transitive) flow dependence between two operations means that it is permissible to execute these two operations in parallel. Similarly, grouping subgraphs of the flow-dependence graph into supervertices of a new dependence graph allows us to view this parallelism at coarser granularity.
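The dependence-graph view can be quantified. In this sketch (assumptions: unit-cost operations, an acyclic graph, and hypothetical operation names), total work divided by the critical-path length bounds the speedup available from _any_ parallelism mechanism, whatever label we later attach to it:

```python
def work_and_span(deps):
    """Work, span, and average parallelism of a flow-dependence graph.

    `deps` maps each operation to the list of operations it depends on.
    Work = number of operations; span = length of the longest dependence
    chain.  work/span is the parallelism available to *any* mechanism,
    before anyone decides to call it ILP, DLP, or TLP."""
    depth = {}
    def chain(op):
        # longest dependence chain ending at `op`, memoized
        if op not in depth:
            depth[op] = 1 + max((chain(d) for d in deps[op]), default=0)
        return depth[op]
    span = max(chain(op) for op in deps)
    work = len(deps)
    return work, span, work / span
```

For a seven-node reduction tree (four leaves, two partial sums, one final sum), work is 7 and span is 3: no mechanism of any kind can run it more than 7/3 times faster than serial execution.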
At this level of abstraction, reasonable people agree there is no way to identify any portion of this parallelism as ILP, DLP, or TLP—or any other kind of P— because all the parallelism specified in this dependence graph is _fungible_.  This simply means that it could very well be exploited by any one of several architectural parallelism mechanisms.  If you find this paradoxical, think of a parallelizing compiler transforming TLP into ILP or vice versa.  For better or worse, “instruction-level parallelism” is the term we use to denote exploiting parallelism by conventional RISC superscalar pipelining with dynamic instruction scheduling and superscalar instruction issue.  Code so executed is called _serial code_ (or single-threaded scalar code).  Of course, even if all the instructions in a thread are totally ordered by dependences, we can still execute them “in parallel” by overlapping pipeline stages of different instructions, where the dependences are now between pipeline stages. Alas, since instructions can specify long-latency operations (e.g., loads), and since RISC superscalar processors do _very_ advanced pipelining, it helps if the operations in the thread are as unordered as possible.  If not, we may find it essentially impossible to fill the instruction-issue slots.  Does serial code exist?  Practically speaking, it exists whenever other forms of extracting and exploiting parallelism are more expensive—for whatever reason (perhaps because we are unwilling to reprogram a legacy application). Single-thread performance on a RISC superscalar processor is often quite respectable when parallelism is not an option.  “Vector-level parallelism” is the term we _should_ use to denote exploiting parallelism by conventional or hypermodern vector processors.  Code so executed is called _vectorized code_.  
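As an illustration (a hedged sketch with invented function names, not vendor code), contrast a loop whose iterations are fully independent—ideal vector fodder, written here as a single NumPy expression—with a loop whose exit is decided by the data itself, which no vector unit can issue as one long operation:

```python
import numpy as np

def saxpy(a, x, y):
    """Vectorizable: every element of the result is independent, so the
    whole loop maps onto wide vector units (here, one NumPy expression)."""
    return a * x + y

def data_dependent_walk(values, limit):
    """Threadable but not vectorizable: the loop exit depends on the data
    seen so far (a data-dependent loop exit), so the iterations cannot be
    marshaled into one long vector operation."""
    total = 0.0
    count = 0
    for v in values:
        total += v
        count += 1
        if total > limit:      # exit decided by the data itself
            break
    return count
```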
Precisely what code is vectorizable is a long story but, roughly speaking, such code 1) contains large groups of independent homogeneous operations that can be performed in parallel on vast arrays of data, and 2) has limited branch intensity and/or irregularity of control flow, which might very well overwhelm the compiler and lead to diminishing performance returns. Modern vector processors have spatially parallel execution units and spatially parallel datapaths. When vectorization "succeeds", in several senses of that word, it is by far the most cost-effective architectural mechanism for exploiting parallelism.

Finally, "thread-level parallelism" is the term we use to denote exploiting parallelism by conventional or hypermodern (massively) multithreaded processors. Code so executed is called _multithreaded code_. What makes code threadable but not vectorizable? Classic answers include: 1) frequent conditional branches, and short-vector or even scalar arithmetic; 2) data-dependent loop exits, recursive function invocations, and code selection based on data values; 3) non-linear and pointer-based data structures in data-intensive applications, such as postmodern graph applications; and 4) unpredictable fine-grained synchronization. But maybe a better question is what code _should_ be threaded?

— A Model of Heterogeneity

The ultimate payoff of a heterogeneous architecture is the ability to construct a hybrid system architecture that combines characteristics of both 1) parallel von Neumann machines, and 2) parallel non-von Neumann machines, which are basically (meta)dataflow machines. Any reader who understands that facts can change and opinions can evolve is referred to your correspondent's first attempt to articulate the subtle distinction between von Neumann and non-von Neumann machines: "Asynchronous Heterogeneity: The Next Big Thing in HEC", HPCwire, August 12, 2005 [M450881]. An early example of such a hybrid architecture is the MIT *T machine.
What does this mean in the current context? Broadly speaking, vector (and serial) threads _necessarily_ have high thread state and thus are suitable for some von Neumann computing tasks, while multithreaded scalar threads _can_ have arbitrarily low thread state and thus can _become_ suitable for other (possibly complementary) non-von Neumann computing tasks. This is not to criticize threaded von Neumann computing, but merely to show that threading is more general purpose—it can do both!

There is no question that applications present with mixtures of vector parallelism and scalar parallelism and that this often dictates whether vectors or threading is more appropriate. But an entirely different question is whether high-state threads or low-state threads _as such_ are more appropriate. Thread state matters so much because of 1) the frequent need for lightweight synchronization, and other issues related to dynamic self-scheduling; and 2) the desire to use inexpensive mechanisms for migrating low-state threads to exploit novel forms of locality. Already, to take one topical example, new attributed-graph algorithms in health care, proteomics, counterterrorism, bioinformatics, etc., are showing us that there are entirely novel styles of computation that must be addressed.

For locality, in contrast to parallelism, a more concrete (indeed, almost architecture-dependent) abstraction simplifies the discussion. One useful distinction is this: 1) there are temporal relationships among data operations, and 2) there are spatial relationships among data items and also among executing threads and (thread-relevant) data items. What we really want is an efficient general-purpose mechanism to leverage instances of the first kind of relationship into instances of the second kind, of which more later.

Intuitively, our model of (system) locality starts by abstracting the rich complexity of the _intraprocessor_ memory hierarchy into a one-level store.
Consider the flow-dependence graph of a computation, where—for simplicity—the graph's vertices are elementary operations of precisely two types: either loads or arithmetic operations. Each vertex is labeled with its type. Since marshaling operations into threads strongly affects locality, we view this computation as resulting from the execution of some thread (serial, vector, or threaded). Obviously, both the loads and the arithmetic operations are _producers_ of data values, while the arithmetic operations are also _consumers_. We include a separate dependence edge every time a value produced by one operation is (nondestructively) consumed, i.e., used as an operand, by another operation. For even more simplicity, we will shortly elide some data-producing operations from this dependence graph.

Our performance metric here is the average amortized cost of producing a data value. If a produced value is reused, i.e., consumed several times, we amortize its production cost over the number of its uses. We can _attribute_ this labeled graph by specifying the data-production cost of each data-producing operation. To do this, we need an architectural model. The simplest model is that the thread runs on a processor with a local store (e.g., the register file plus the D-cache) and is connected through a pipe to a memory. Qualitatively, we say that each load produces a data value at _high_ cost, while each arithmetic operation produces a data value at _low_ cost. Ultimately, we must assign precise quantitative values to these data-production costs, and design locality mechanisms to reduce them whenever possible.

To keep the model simple, we elide all (low-cost) "local reads", i.e., all loads satisfied from the D-cache—as well as all (implicit) reads from the register file—so only (expensive) loads and (cheap) arithmetics are retained in the elided flow-dependence graph.
Reads from nonarchitected pipeline registers are included as part of the cost of an arithmetic operation. Of course, in any real-world implementation, we would have to reinstate all the (implicit and explicit) elided vertices—and worry about each and every data-production cost. Here, we are merely engaged in conceptual exposition.

In any execution schedule permitted by the flow-dependence graph, at each moment of time, there will be a set of produced values whose _final_ consumption lies in the future. This is the _live set_ of the thread. Since we are trying to minimize communication (i.e., wire) cost, things work best if the thread's live set fits into the thread's state, which is just the size of the thread's effective local store (possibly a small fraction of the processor's local store).

Two observations lead to an important punchline. 1) Temporal relationships among high-cost and low-cost data operations are determined by the topology of the _labeled_ dependence graph (after all, each 'load' vertex has a given out-degree and there is a given ratio of loads to arithmetics). 2) Spatial relationships among data items—and among executing threads and (thread-relevant) data items—can strongly influence the data-production costs of the _attributed_ dependence graph (for example, a thread sitting at the _center_ of a cluster of useful data items reduces the average cost of performing a load). The data-production cost attributes add _metric_ information to what is otherwise (largely) a _topological_ structure.

Note: The data-production costs of elided data-producing operations are generally controlled by minimizing wire distances within the processor's microarchitecture—for example, by segmenting register files, caches, and functional-unit groups. Bill Dally's paper on the Imagine stream processor explains all this in great detail. We have simply chosen to abstract from microarchitecture locality to focus on _system-architecture_ locality.
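The cost model above can be sketched in a few lines (a conceptual toy: the cost constants—100 per load, 1 per arithmetic—and the operation format are invented for illustration, not measured latencies). Walking the elided flow-dependence graph in schedule order, we amortize each value's production cost over its uses and track the live set as it grows and shrinks:

```python
def amortized_cost_and_peak_live_set(ops, load_cost=100, arith_cost=1):
    """Average amortized data-production cost and peak live set.

    `ops` is a schedule: a list of (name, kind, operands) with kind
    'load' or 'arith' and operands naming earlier producers.  A value's
    production cost is amortized over its number of uses; the live set
    holds values whose final consumption still lies in the future."""
    uses = {name: 0 for name, _, _ in ops}
    last_use = {}
    for i, (_, _, operands) in enumerate(ops):
        for o in operands:
            uses[o] += 1
            last_use[o] = i
    live, peak, total = set(), 0, 0.0
    for i, (name, kind, operands) in enumerate(ops):
        cost = load_cost if kind == "load" else arith_cost
        total += cost / max(uses[name], 1)   # amortize over reuses
        for o in operands:
            if last_use[o] == i:             # final consumption: value dies
                live.discard(o)
        if uses[name] > 0:                   # never-used values are not live
            live.add(name)
        peak = max(peak, len(live))
    return total / len(ops), peak
```

A load whose value is consumed twice costs 50 per use under these placeholder numbers; an arithmetic result costs 1. Reducing either the load count or the load cost (the latter is what locality mechanisms are for) drives the average down.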
Quite naturally, the architectural mechanisms for exploiting locality fall into two classes: 1) _intraprocessor_ locality mechanisms optimized to exploit dependence graphs with high "internal locality", i.e., those with significant reuse of loaded data values and/or significant local production of data values; and 2) _extraprocessor_ locality mechanisms optimized to exploit dependence graphs with low internal locality, i.e., those with little or no data reuse and little or no local production of data values. What "optimization" of the second type of dependence graph is possible? The latter dependence graphs _may_ have potentially high "external locality", in the sense that the communication costs of their high-cost data-producing operations (viz., their expensive loads) can potentially be significantly reduced.

In class 1, threads execute on a fixed processor with a local store that can store the peak amount of the thread's live set. In this (von Neumann) model, threads are immobile and gradually build up their thread state—in the processor's local store—to maximize their ratio of arithmetic operations to communication operations (the thread's "arithmetic intensity"). Obviously, to be properly exploited, a massive live set inherent in a thread's flow-dependence graph requires an equally massive local store. Dependence graphs with high internal locality _and_ small live sets do exist, but they tend to be the exception rather than the rule. An immobile thread that gradually accumulates thread state is the essence of von Neumann computing. Contrariwise, a thread denied thread-state accumulation by algorithm constraints may seek performance by engaging in non-von Neumann computing.

In class 2, how a thread—which has been marshaled from some flow-dependence graph—is intended to behave depends on whether 1) the application is control and data decomposable, 2) the application's data is distributable, or 3) neither.
The first attribute is true of _localizable_ applications, i.e., the ones we ceded at the beginning of this piece to clusters and their ilk as undemanding applications not requiring a true supercomputer architect's full attention. For a strong proponent of heterogeneity, this may sound a tad sectarian, but clusters remain very cost effective for easily localizable applications.

The problem is that, for demanding applications, most serious people have lost hope that, given a moderate amount of programmer effort, the compiler will automatically do a good job of allocating and distributing data to suit the computation and minimize communication. In a word, you can't assume that your thread will automatically run in a "data-friendly" environment, where the bulk of the data you need is close by. Is there a more modest goal that might be achievable with modest programmer effort and a reasonable amount of automatic help from the compiler and the runtime system? Yes, any time you can _physically_ group together data that _logically_ belong together. This is not manually fragmenting data structures and designing and coordinating threads to compute individual fragments. Rather, it is, at least in part, simply the "degree 0" of affinity scheduling.

In non-von Neumann computing, when we encounter (or manufacture) a dependence graph with low internal locality, we should ask: Can I shape and decompose this dependence graph into fine-grained parallel activities, i.e., low-state threads, with the property that each low-state thread would perform well if it were put into a local storage environment containing, say, at least 75% of the data the thread needed? If so, we should further ask: Can I program this application so that, for each low-state thread, at least 75% of the relevant data becomes concentrated _around_ some location in the computer's virtual address space, before the thread executes?
(Since we only care about average cost, this "concentration" can take many forms; for example, we might shape the distribution function of thread-relevant data into a crude approximation to the Dirac delta function.) If so, and if thread migration is dirt cheap, we can freely launch countless low-state threads, from many other (possibly high-state) threads, and have them migrate to these small regions of data concentration.

If the bulk of the loads can be satisfied from a small neighborhood—in virtual address space—of the low-state thread's new location, and if we can get low-state threads to migrate cheaply to essentially arbitrary virtual locations, we have clearly minimized the average cost of that thread's loads. Obviously, however, the virtual-address metric space cannot be _independent_ of the physical-location metric space, since—in the general case—cost equals wire distance along the global system interconnection network.

Agile, low-state threads 1) are freely spawned with minimal overhead by many types of thread, 2) migrate cheaply to prestaged local concentrations of data, and 3) engage in lightweight synchronization as often as they please. Deployment of these agile threads, and programmed construction of their data environments, is the essence of non-von Neumann computing.

Note: In a homogeneous multithreaded multiprocessor with logically flat shared memory, this problem does not arise. Take one of the famously unpartitionable graphs in postmodern graph computing. There is no need to partition or distribute or rebalance this graph because, logically speaking, every portion of the graph is at the same distance from every thread. Threads don't need exotic migration machinery because they gain no performance from migrating! With an unlimited budget, you could build a rather large homogeneous multithreaded multiprocessor. You could do the same thing if you were willing to cut corners on the system's global hardware bandwidth.
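The payoff of migration can be sketched with a toy cost model (entirely illustrative: the addresses are invented, and a 1-D virtual address space stands in for wire distance on the interconnect). A low-state thread that hops to the median of its thread-relevant addresses minimizes its average load cost under an absolute-distance metric:

```python
def average_load_cost(thread_pos, data_addrs):
    """Average cost of a load, modeled as distance in a 1-D virtual
    address space (a stand-in for wire distance on the interconnect)."""
    return sum(abs(a - thread_pos) for a in data_addrs) / len(data_addrs)

def migrate_to_data(data_addrs):
    """Destination for a migrating low-state thread: for an L1
    (absolute-distance) metric, the median of the thread-relevant
    addresses minimizes the average load cost."""
    s = sorted(data_addrs)
    return s[len(s) // 2]
```

A thread whose data is clustered far from its spawn point wins big; a thread whose data is spread uniformly gains nothing from migrating, which is the flat-shared-memory case described in the note above.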
But the future of computing demands that we back off—at least for a generation—from arbitrarily large flat shared memories. We must, for a while, become more architecturally modest; we must say, "I can't guarantee sufficient performance with parallelism alone or locality alone, if I truly want to scale; rather, I need to exploit (almost) every scrap of parallelism, of any kind, and (almost) every scrap of locality, of any kind, that I can dig out of this application". It is this new-found architectural modesty that is the driving force behind heterogeneous processing.

Note: Programmability is simply the desire not to be forced to dig out all this parallelism, and all this locality, with your teeth! A decent programming model, grounded in a decent programming language, is simply one that allows the programmer, the compiler, and the runtime system to extract almost all the parallelism, and almost all the locality, without forcing the programmer to do hard labor. That being said, performance tuning may require a larger contribution from the programmer.

— Heterogeneous Systems vs. Heterogeneous Hardware

If the first hard question was, how well can we load balance in shared-memory and distributed-memory system architectures, respectively?, the second hard question is this: Even if a vendor provides heterogeneous hardware to deal with different forms of parallelism and—far less commonly—different forms of locality, how committed is the vendor to the massive system-software "architecture" research—in new compiler technology, in the design of new runtime systems, and in getting Unix to live on what is essentially a new planet—that is required if heterogeneous processing is to achieve even a small fraction of its true potential? More bluntly, what must be added to even an impressive collection of heterogeneous system hardware to produce a genuine heterogeneous system architecture? Is this asking for too much?
Isn't it at least conceivable that certain vendors—at least those with actual system architectures in their HPCS proposals—are already stretched to their limits (technical, financial, commercial, ideological, or whatever)? Would additional government pressure from HPCS on vendors at this time—asking for more innovation or more-radical system design—really be productive, or would it be counterproductive? These are delicate questions.

But just as we have all scrutinized the intentions of a potential wife (or husband), so may we scrutinize the intentions of any vendor who is asking for a large chunk of public money. Notionally, your correspondent can imagine at least two possible vendor attitudes: 1) We fully agree that a thorough exploitation of heterogeneous processing is the nation's _only_ way out of the high-end crisis; we offer the best design we can produce at this time given current constraints of human and financial resources; we have considered—and postponed—more expensive designs; and 2) While the notion of heterogeneous processing is intriguing, and might lead to some interesting bragging rights down the road, we are not totally sure that the heterogeneous architecture proposed by our academic partners will fit well with our traditional business model, i.e., the traditional way we make our money and the traditional "high-end" brand we currently promote.

Not to put too fine a point on it, sometimes a suitor does not fully understand his own intentions, much less those of his intended match. And, throughout history, some suitors have been cads.

From a crusader's point of view, it makes sense to articulate the full vision of heterogeneous processing. After all, no system architecture, and no government program, is cast in concrete. At present, HPCS is the only high-end computing program our government has.
We can't reduce our support for this utterly vital national program just because—as several reputable persons have suggested—it has watered down its original high expectations, indeed, its original mandatory requirements, for an HPCS machine. Should it be the case that giant steps by U.S. vendors at this time exceed our current capabilities, let us at least make sure that we are marching in the right direction.

— Conclusion

Heterogeneous processing, towards which your correspondent is radically biased, is turning out to be a big tent. It is not the only future of computing, but it must be a large part of any possible future. The drivers of heterogeneous processing are inter-application and intra-application diversity; the big tent is just the fact that we are incrementally widening our diversity net. In fact, some innovative _homogeneous_ machines are targeted at important biomedical and national-security challenges.

Still, choices abound. What are the natural limits of homogeneous processing? Are homogeneous systems in some sense essentially niche machines? Do shared memory and distributed memory have profound effects on our ability to load balance? How is this (alleged) difference related to programmer mediation of data distribution? Why are designers comfortable with parallelism diversity but willfully obtuse when it comes to locality diversity? Did the regrettable PIM mania destroy the credibility of non-von Neumann computing? How much should one pay for a genuine heterogeneous system architecture? Since it is a sensitive subject, your correspondent has not rehashed arguments he made in late September showing that low-state threads deserve a _distinct_ intraprocessor memory hierarchy (essentially to lower the low-state thread-migration overhead). Is a polymorphic intraprocessor memory hierarchy even conceivable? Is it implementable?

The mission agencies have computational problems they need solved yesterday.
Crusaders—especially multithreaded crusaders—dream of killer apps that will follow postmodern graph processing, and novel styles of computation that will define the future paradigms of high-end computing. Computational biomedicine and computational national security are converging towards massive data-intensive applications where very limited (conventional) locality trashes all (conventional) locality mechanisms. This is the proper driver of non-von Neumann computing.

—

The High-End Crusader, a noted expert in high-performance computing and communications, shall remain anonymous. He alone bears responsibility for these commentaries. Replies are welcome and may be sent to HPCwire editor Michael Feldman at [email protected]


By John Russell

Hyperion Forecast – Headwinds in 2020 Won’t Stifle Cloud HPC Adoption or Arm’s Rise

June 30, 2020

The semiannual taking of HPC’s pulse by Hyperion Research – late fall at SC and early summer at ISC – is a much-watched indicator of things come. This yea Read more…

By John Russell

Racism and HPC: a Special Podcast

June 29, 2020

Promoting greater diversity in HPC is a much-discussed goal and ostensibly a long-sought goal in HPC. Yet it seems clear HPC is far from achieving this goal. Re Read more…

Top500 Trends: Movement on Top, but Record Low Turnover

June 25, 2020

The 55th installment of the Top500 list saw strong activity in the leadership segment with four new systems in the top ten and a crowning achievement from the f Read more…

By Tiffany Trader

Supercomputer Modeling Tests How COVID-19 Spreads in Grocery Stores

April 8, 2020

In the COVID-19 era, many people are treating simple activities like getting gas or groceries with caution as they try to heed social distancing mandates and protect their own health. Still, significant uncertainty surrounds the relative risk of different activities, and conflicting information is prevalent. A team of Finnish researchers set out to address some of these uncertainties by... Read more…

By Oliver Peckham

[email protected] Turns Its Massive Crowdsourced Computer Network Against COVID-19

March 16, 2020

For gamers, fighting against a global crisis is usually pure fantasy – but now, it’s looking more like a reality. As supercomputers around the world spin up Read more…

By Oliver Peckham

[email protected] Rallies a Legion of Computers Against the Coronavirus

March 24, 2020

Last week, we highlighted [email protected], a massive, crowdsourced computer network that has turned its resources against the coronavirus pandemic sweeping the globe – but [email protected] isn’t the only game in town. The internet is buzzing with crowdsourced computing... Read more…

By Oliver Peckham

Supercomputer Simulations Reveal the Fate of the Neanderthals

May 25, 2020

For hundreds of thousands of years, neanderthals roamed the planet, eventually (almost 50,000 years ago) giving way to homo sapiens, which quickly became the do Read more…

By Oliver Peckham

DoE Expands on Role of COVID-19 Supercomputing Consortium

March 25, 2020

After announcing the launch of the COVID-19 High Performance Computing Consortium on Sunday, the Department of Energy yesterday provided more details on its sco Read more…

By John Russell

Honeywell’s Big Bet on Trapped Ion Quantum Computing

April 7, 2020

Honeywell doesn’t spring to mind when thinking of quantum computing pioneers, but a decade ago the high-tech conglomerate better known for its control systems waded deliberately into the then calmer quantum computing (QC) waters. Fast forward to March when Honeywell announced plans to introduce an ion trap-based quantum computer whose ‘performance’ would... Read more…

By John Russell

Neocortex Will Be First-of-Its-Kind 800,000-Core AI Supercomputer

June 9, 2020

Pittsburgh Supercomputing Center (PSC - a joint research organization of Carnegie Mellon University and the University of Pittsburgh) has won a $5 million award Read more…

By Tiffany Trader

Global Supercomputing Is Mobilizing Against COVID-19

March 12, 2020

Tech has been taking some heavy losses from the coronavirus pandemic. Global supply chains have been disrupted, virtually every major tech conference taking place over the next few months has been canceled... Read more…

By Oliver Peckham

Leading Solution Providers


10nm, 7nm, 5nm…. Should the Chip Nanometer Metric Be Replaced?

June 1, 2020

The biggest cool factor in server chips is the nanometer. AMD beating Intel to a CPU built on a 7nm process node* – with 5nm and 3nm on the way – has been i Read more…

By Doug Black

Nvidia’s Ampere A100 GPU: Up to 2.5X the HPC, 20X the AI

May 14, 2020

Nvidia's first Ampere-based graphics card, the A100 GPU, packs a whopping 54 billion transistors on 826mm2 of silicon, making it the world's largest seven-nanom Read more…

By Tiffany Trader

‘Billion Molecules Against COVID-19’ Challenge to Launch with Massive Supercomputing Support

April 22, 2020

Around the world, supercomputing centers have spun up and opened their doors for COVID-19 research in what may be the most unified supercomputing effort in hist Read more…

By Oliver Peckham

Australian Researchers Break All-Time Internet Speed Record

May 26, 2020

If you’ve been stuck at home for the last few months, you’ve probably become more attuned to the quality (or lack thereof) of your internet connection. Even Read more…

By Oliver Peckham

15 Slides on Programming Aurora and Exascale Systems

May 7, 2020

Sometime in 2021, Aurora, the first planned U.S. exascale system, is scheduled to be fired up at Argonne National Laboratory. Cray (now HPE) and Intel are the k Read more…

By John Russell

Summit Supercomputer is Already Making its Mark on Science

September 20, 2018

Summit, now the fastest supercomputer in the world, is quickly making its mark in science – five of the six finalists just announced for the prestigious 2018 Read more…

By John Russell

TACC Supercomputers Run Simulations Illuminating COVID-19, DNA Replication

March 19, 2020

As supercomputers around the world spin up to combat the coronavirus, the Texas Advanced Computing Center (TACC) is announcing results that may help to illumina Read more…

By Staff report

$100B Plan Submitted for Massive Remake and Expansion of NSF

May 27, 2020

Legislation to reshape, expand - and rename - the National Science Foundation has been submitted in both the U.S. House and Senate. The proposal, which seems to Read more…

By John Russell

  • arrow
  • Click Here for More Headlines
  • arrow
Do NOT follow this link or you will be banned from the site!
Share This