Hard Questions While Waiting For The HPCS Downselect

By the High-End Crusader

May 5, 2006

Proposals from Cray Inc., IBM, and Sun Microsystems for Phase III awards from DARPA’s High-Productivity Computing Systems (HPCS) program are due as these words are being written. To differing degrees, the vendors are understandably coy about precisely what their proposals will contain. In a fiercely competitive environment where perception management is key, a loosely guarded detail might open up one of the proposals to successful sniping by a rival vendor. But an innovative petaflops system architecture that, in the most optimistic case, might reinvigorate the stagnant field of high-end computing is hardly in the same category as a stealth fighter aircraft or a classified graph algorithm.

First, taxpayers have at least some right to know how their tax dollars are being spent. Second, one of the critical questions in postmodern computing is this: Is there some form of thread migration, in shared-memory and distributed-memory architectures, respectively, that effectively solves—or fails to solve—the dynamic load-balancing problem? This question deserves to be answered in full view of the entire HPC community. While it may be naive to expect a final resolution of this perennial question in short order, we can demand more openness from vendors as the evidence accumulates.

After the awards are made, it is likely that the government will engage in some perception management of its own, perhaps arguing that the innovative system architectures of the two winners (people expect two winners) are prospectively likely to be as good as, or even better than, the forthcoming Japanese 10-PFs/s heterogeneous-processing machine, which appears to this observer to be architecturally quite impressive. Notwithstanding the vendors’ undisputed proprietary rights, your correspondent would like to preemptively _shame_ each Phase III winner into coming clean soon after the awards are made.
He asks each successful PI, at such time, to explain the broad outlines—with at least some details—of his company’s system architecture. Obviously, there are major differences in the degree to which the various vendors have been coy up to now, but in all three cases one can reasonably ask: What are your aspirations for this machine? What do you really think it can do? What are you really going to put into it?

— A Problem And Its Solution

A significant source of performance and programmability shortfalls in current “high-end” system architectures is inter-application and intra-application diversity. Since we can safely leave the embarrassingly _local_ (not “parallel”!) applications to clusters and their ilk, the applications we consider have the following characteristics: 1) they compute over globally dispersed data structures; 2) they engage in long-range communication, sometimes frequently; 3) they contain a dynamic heterogeneous mix of algorithm characteristics due to dynamic variation of a) the amount, kind, and granularity of the application’s parallelism, and b) the amount, kind, and granularity of the application’s locality; and 4) in both classical and postmodern problems, there are frequently rather serious load-balancing problems that generally require flexible, dynamic load-balancing mechanisms to resolve.

We can easily imagine an ideal heterogeneous system architecture that would gracefully deal with all these forms of application diversity.
Such a machine would have the following characteristics: 1) it would have high global system bandwidth—an expensive, critical, finite system resource—a) to permit tolerating the latency of long-range communication, etc., and b) to ease the programming burden by mitigating the performance nonuniformity of memory accessing; 2) it would have a wide variety of _combined_ compile-time, runtime, and hardware parallelism mechanisms that would adapt to dynamic variation of the amount, kind, and granularity of an application’s parallelism; 3) it would have a wide variety of _combined_ compile-time, runtime, and hardware locality mechanisms that would adapt to dynamic variation in the amount, kind, and granularity of an application’s locality; and 4) it would include—as mandated by your correspondent’s architectural tradition—various work queues of fine-grained and medium-grained parallel activities (threads, chores, threadlets, etc.) that would be dynamically self-scheduled by various instances of various “virtual processors” (VPs).

This notion of VP is fully generic. An individual VP could be, for example, 1) a multithreaded-processor instruction-stream processing resource, 2) a vector processor, 3) a RISC superscalar processor, or 4) _any_ processing element with an assignable tuple, i.e., program counter plus frame pointer, which specifies the data (or execution) context.

Work queues, i.e., unordered collections of ready-to-execute continuations, were originally popularized by dataflow folk and later enthusiastically recommended by (massively) multithreaded folk. Perhaps vector processors—which come from an entirely different architectural tradition—should ease up on static scheduling by their scalar units, and try a little dynamic self-scheduling instead. Just a suggestion. Note: A vector threadlet can be much smaller than a full vector thread.
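The work-queue discipline just described can be sketched in a few lines of Python. This is an illustrative toy, not any vendor's design: the function name, the (code, frame) encoding of a continuation, and the worker count are all invented for the example. Several "virtual processors" drain a shared queue of ready-to-execute continuations, each continuation being roughly the program-counter-plus-frame-pointer tuple described above.

```python
import queue
import threading

def run_on_virtual_processors(activities, num_vps=4):
    """Drain a shared work queue with num_vps self-scheduling 'VPs'.

    Each activity is a continuation: a (code, frame) pair, where 'code'
    plays the role of a program counter and 'frame' its data context.
    """
    work = queue.Queue()
    for act in activities:
        work.put(act)
    results = []
    lock = threading.Lock()

    def vp_loop():
        while True:
            try:
                code, frame = work.get_nowait()  # grab the next continuation
            except queue.Empty:
                return                           # queue drained: VP retires
            value = code(frame)                  # execute in its data context
            with lock:
                results.append(value)

    vps = [threading.Thread(target=vp_loop) for _ in range(num_vps)]
    for t in vps:
        t.start()
    for t in vps:
        t.join()
    return results

# Example: 100 threadlets, each squaring its frame's payload.
out = run_on_virtual_processors([(lambda f: f * f, i) for i in range(100)])
assert sorted(out) == [i * i for i in range(100)]
```

The point of the sketch is that no activity is statically assigned to a VP; whichever VP becomes idle pulls the next ready continuation, which is exactly the dynamic self-scheduling the text is arguing for.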
In any case, work queues in shared-memory systems are known to allow flexible, dynamic load balancing by using the _work-stealing_ technique popularized by Blumofe in the MIT Cilk project. Ah, but any cost-effective system architecture must find some way to use all these parallelism, locality, and load-balancing mechanisms to achieve high performance without over-consuming the precious, finite global system bandwidth resource.

— Heterogeneous Parallelism and Heterogeneous Locality

Much of the misunderstanding of parallelism and locality comes from the confusion between 1) parallelism and locality as _attributes_ of programs and computations, and 2) parallelism and locality as various architectural (and software) _mechanisms_ that allow us to extract, generate, shape, and exploit parallelism and locality, respectively (although sometimes—to really confuse us—parallelism mechanisms are used to exploit locality). Compounding this misunderstanding is a heap of partially obsolete and potentially misleading terms such as “instruction-level parallelism” (ILP), “data-level parallelism” (DLP), and “thread-level parallelism” (TLP), not to mention such hoary favorites as “temporal locality” and “spatial locality”. While there is no hope of escaping this muddled terminology, we might at least try to clean it up.

For parallelism, massive abstraction simplifies the discussion. Consider the flow-dependence graph of a computation, where the graph’s vertices are elementary operations. We can take this as the finest-granularity view of the parallelism in the computation. Absence of a direct (or transitive) flow dependence between two operations means that it is permissible to execute these two operations in parallel. Similarly, grouping subgraphs of the flow-dependence graph into supervertices of a new dependence graph allows us to view this parallelism at coarser granularity.
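The permissibility criterion just stated is mechanical: two operations may execute in parallel exactly when neither reaches the other in the flow-dependence DAG. A minimal sketch, with a hypothetical adjacency-list encoding of the graph:

```python
# Sketch: operations are vertices of a flow-dependence DAG; edges run
# from producer to consumer. Two operations are parallel iff neither
# transitively depends on the other -- regardless of whether a machine
# would exploit that parallelism as ILP, DLP, or TLP.

def reaches(graph, src, dst):
    """Depth-first search: is there a dependence path from src to dst?"""
    stack, seen = [src], set()
    while stack:
        v = stack.pop()
        if v == dst:
            return True
        if v not in seen:
            seen.add(v)
            stack.extend(graph.get(v, ()))
    return False

def may_run_in_parallel(graph, a, b):
    return not reaches(graph, a, b) and not reaches(graph, b, a)

# a -> b -> d and a -> c -> d: b and c are mutually independent.
deps = {"a": ["b", "c"], "b": ["d"], "c": ["d"]}
assert may_run_in_parallel(deps, "b", "c")
assert not may_run_in_parallel(deps, "a", "d")
```

Note that nothing in the test labels the parallelism between b and c as any particular "kind" of parallelism; that label is supplied only by the mechanism that later exploits it, which is the sense in which the parallelism is fungible.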
At this level of abstraction, reasonable people agree there is no way to identify any portion of this parallelism as ILP, DLP, or TLP—or any other kind of P—because all the parallelism specified in this dependence graph is _fungible_. This simply means that it could very well be exploited by any one of several architectural parallelism mechanisms. If you find this paradoxical, think of a parallelizing compiler transforming TLP into ILP or vice versa.

For better or worse, “instruction-level parallelism” is the term we use to denote exploiting parallelism by conventional RISC superscalar pipelining with dynamic instruction scheduling and superscalar instruction issue. Code so executed is called _serial code_ (or single-threaded scalar code). Of course, even if all the instructions in a thread are totally ordered by dependences, we can still execute them “in parallel” by overlapping pipeline stages of different instructions, where the dependences are now between pipeline stages. Alas, since instructions can specify long-latency operations (e.g., loads), and since RISC superscalar processors do _very_ advanced pipelining, it helps if the operations in the thread are as unordered as possible. If not, we may find it essentially impossible to fill the instruction-issue slots.

Does serial code exist? Practically speaking, it exists whenever other forms of extracting and exploiting parallelism are more expensive—for whatever reason (perhaps because we are unwilling to reprogram a legacy application). Single-thread performance on a RISC superscalar processor is often quite respectable when parallelism is not an option.

“Vector-level parallelism” is the term we _should_ use to denote exploiting parallelism by conventional or hypermodern vector processors. Code so executed is called _vectorized code_.
Precisely what code is vectorizable is a long story but, roughly speaking, such code 1) contains large groups of independent homogeneous operations that can be performed in parallel on vast arrays of data, and 2) has limited branch intensity and/or irregularity of control flow, which might very well overwhelm the compiler and lead to diminishing performance returns. Modern vector processors have spatially parallel execution units and spatially parallel datapaths. When vectorization “succeeds”, in several senses of that word, it is by far the most cost-effective architectural mechanism for exploiting parallelism.

Finally, “thread-level parallelism” is the term we use to denote exploiting parallelism by conventional or hypermodern (massively) multithreaded processors. Code so executed is called _multithreaded code_. What makes code threadable but not vectorizable? Classic answers include: 1) frequent conditional branches, and short-vector or even scalar arithmetic; 2) data-dependent loop exits, recursive function invocations, and code selection based on data values; 3) non-linear and pointer-based data structures in data-intensive applications, such as postmodern graph applications; and 4) unpredictable fine-grained synchronization. But maybe a better question is what code _should_ be threaded?

— A Model of Heterogeneity

The ultimate payoff of a heterogeneous architecture is the ability to construct a hybrid system architecture that combines characteristics of both 1) parallel von Neumann machines, and 2) parallel non-von Neumann machines, which are basically (meta)dataflow machines. Any reader who understands that facts can change and opinions can evolve is referred to your correspondent’s first attempt to articulate the subtle distinction between von Neumann and non-von Neumann machines: “Asynchronous Heterogeneity: The Next Big Thing in HEC”, HPCwire, August 12, 2005 [M450881]. An early example of such a hybrid architecture is the MIT *T machine.
What does this mean in the current context? Broadly speaking, vector (and serial) threads _necessarily_ have high thread state and thus are suitable for some von Neumann computing tasks, while multithreaded scalar threads _can_ have arbitrarily low thread state and thus can _become_ suitable for other (possibly complementary) non-von Neumann computing tasks. This is not to criticize threaded von Neumann computing, but merely to show that threading is more general purpose—it can do both!

There is no question that applications present with mixtures of vector parallelism and scalar parallelism and that this often dictates whether vectors or threading is more appropriate. But an entirely different question is whether high-state threads or low-state threads _as such_ are more appropriate. Thread state matters so much because of 1) the frequent need for lightweight synchronization, and other issues related to dynamic self-scheduling; and 2) the desire to use inexpensive mechanisms for migrating low-state threads to exploit novel forms of locality. Already, to take one topical example, new attributed-graph algorithms in health care, proteomics, counterterrorism, bioinformatics, etc., are showing us that there are entirely novel styles of computation that must be addressed.

For locality, in contrast to parallelism, a more concrete (indeed, almost architecture-dependent) abstraction simplifies the discussion. One useful distinction is this: 1) there are temporal relationships among data operations, and 2) there are spatial relationships among data items and also among executing threads and (thread-relevant) data items. What we really want is an efficient general-purpose mechanism to leverage instances of the first kind of spatial relationship into instances of the second kind, of which more later. Intuitively, our model of (system) locality starts by abstracting the rich complexity of the _intraprocessor_ memory hierarchy into a one-level store.
Consider the flow-dependence graph of a computation, where—for simplicity—the graph’s vertices are elementary operations of precisely two types: either loads or arithmetic operations. Each vertex is labeled with its type. Since marshaling operations into threads strongly affects locality, we view this computation as resulting from the execution of some thread (serial, vector, or threaded). Obviously, both the loads and the arithmetic operations are _producers_ of data values, while the arithmetic operations are also _consumers_. We include a separate dependence edge every time a value produced by one operation is (nondestructively) consumed, i.e., used as an operand, by another operation. For even more simplicity, we will shortly elide some data-producing operations from this dependence graph.

Our performance metric here is the average amortized cost of producing a data value. If a produced value is reused, i.e., consumed several times, we amortize its production cost over the number of its uses. We can _attribute_ this labeled graph by specifying the data-production cost of each data-producing operation. To do this, we need an architectural model. The simplest model is that the thread runs on a processor with a local store (e.g., the register file plus the D-cache) and is connected through a pipe to a memory. Qualitatively, we say that each load produces a data value at _high_ cost, while each arithmetic operation produces a data value at _low_ cost. Ultimately, we must assign precise quantitative values to these data-production costs, and design locality mechanisms to reduce them whenever possible.

To keep the model simple, we elide all (low-cost) “local reads”, i.e., all loads satisfied from the D-cache—as well as all (implicit) reads from the register file—so only (expensive) loads and (cheap) arithmetics are retained in the elided flow-dependence graph.
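The amortized-cost metric just defined can be computed directly from the elided, attributed graph. A minimal sketch follows; the cost numbers are illustrative stand-ins, not measured values, and the dictionary encoding of the graph is invented for the example:

```python
# Sketch of the average amortized data-production cost. Vertices are
# labeled 'load' or 'arith'; each producer's cost is amortized over the
# number of times its value is consumed. Costs are made up (loads are
# qualitatively expensive, arithmetics qualitatively cheap).
COST = {"load": 100.0, "arith": 1.0}

def avg_amortized_cost(labels, uses):
    """labels: vertex -> 'load' | 'arith'; uses: vertex -> times consumed."""
    total = 0.0
    for v, kind in labels.items():
        n_uses = max(uses.get(v, 1), 1)      # unused values amortize over 1
        total += COST[kind] / n_uses
    return total / len(labels)

# One load whose value is consumed 4 times, plus 4 cheap arithmetics:
labels = {"x": "load", "a1": "arith", "a2": "arith",
          "a3": "arith", "a4": "arith"}
uses = {"x": 4, "a1": 1, "a2": 1, "a3": 1, "a4": 1}
# Reuse drops the load's amortized cost from 100 to 25.
assert avg_amortized_cost(labels, uses) == (25.0 + 4 * 1.0) / 5
```

The example makes the locality argument quantitative in miniature: reusing a loaded value four times cuts its amortized cost by a factor of four, which is precisely what intraprocessor locality mechanisms are for.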
Reads from nonarchitected pipeline registers are included as part of the cost of an arithmetic operation. Of course, in any real-world implementation, we would have to reinstate all the (implicit and explicit) elided vertices—and worry about each and every data-production cost. Here, we are merely engaged in conceptual exposition.

In any execution schedule permitted by the flow-dependence graph, at each moment of time, there will be a set of produced values whose _final_ consumption lies in the future. This is the _live set_ of the thread. Since we are trying to minimize communication (i.e., wire) cost, things work best if the thread’s live set fits into the thread’s state, which is just the size of the thread’s effective local store (possibly a small fraction of the processor’s local store).

Two observations lead to an important punchline. 1) Temporal relationships among high-cost and low-cost data operations are determined by the topology of the _labeled_ dependence graph (after all, each ‘load’ vertex has a given out-degree and there is a given ratio of loads to arithmetics). 2) Spatial relationships among data items—and among executing threads and (thread-relevant) data items—can strongly influence the data-production costs of the _attributed_ dependence graph (for example, a thread sitting at the _center_ of a cluster of useful data items reduces the average cost of performing a load). The data-production cost attributes add _metric_ information to what is otherwise (largely) a _topological_ structure.

Note: The data-production costs of elided data-producing operations are generally controlled by minimizing wire distances within the processor’s microarchitecture—for example, by segmenting register files, caches, and functional-unit groups. Bill Dally’s paper on the Imagine stream processor explains all this in great detail. We have simply chosen to abstract from microarchitecture locality to focus on _system-architecture_ locality.
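The live set described above is easy to compute for any particular linear schedule: at each step, count the values already produced whose final consumption still lies in the future. A sketch, with an invented encoding in which each step is a (produced value, consumed values) pair; one modeling choice to flag is that a step's result and its operands are counted as simultaneously live during that step:

```python
# Sketch: peak live-set size of a linear execution schedule. A value is
# live from its production until its final consumption; the peak live
# set is what must fit into the thread's state (its effective local
# store) if communication cost is to stay low.

def peak_live_set(schedule):
    last_use = {}
    for t, (_, consumed) in enumerate(schedule):
        for v in consumed:
            last_use[v] = t                   # remember final consumption
    live, peak = set(), 0
    for t, (produced, consumed) in enumerate(schedule):
        live.add(produced)
        peak = max(peak, len(live))           # result + operands coexist
        for v in consumed:
            if last_use[v] == t:
                live.discard(v)               # final use: value dies here
    return peak

# x = load; y = load; s = x + y; r = s * s
# At the third step x, y, and s are all live at once, so the peak is 3.
assert peak_live_set([("x", []), ("y", []),
                      ("s", ["x", "y"]), ("r", ["s"])]) == 3
```

A thread whose peak live set exceeds its effective local store must spill values back through the pipe to memory, which is exactly the communication cost the model is trying to expose.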
Quite naturally, the architectural mechanisms for exploiting locality fall into two classes: 1) _intraprocessor_ locality mechanisms optimized to exploit dependence graphs with high “internal locality”, i.e., those with significant reuse of loaded data values and/or significant local production of data values; and 2) _extraprocessor_ locality mechanisms optimized to exploit dependence graphs with low internal locality, i.e., those with little or no data reuse and little or no local production of data values. What “optimization” of the second type of dependence graph is possible? The latter dependence graphs _may_ have potentially high “external locality”, in the sense that the communication costs of their high-cost data-producing operations (viz., their expensive loads) can potentially be significantly reduced.

In class 1, threads execute on a fixed processor with a local store that can hold the peak amount of the thread’s live set. In this (von Neumann) model, threads are immobile and gradually build up their thread state—in the processor’s local store—to maximize their ratio of arithmetic operations to communication operations (the thread’s “arithmetic intensity”). Obviously, to be properly exploited, a massive live set inherent in a thread’s flow-dependence graph requires an equally massive local store. Dependence graphs with high internal locality _and_ small live sets do exist, but they tend to be the exception rather than the rule. An immobile thread that gradually accumulates thread state is the essence of von Neumann computing. Contrariwise, a thread denied thread-state accumulation by algorithm constraints may seek performance by engaging in non-von Neumann computing.

In class 2, how a thread—which has been marshaled from some flow-dependence graph—is intended to behave depends on whether 1) the application is control and data decomposable, 2) the application’s data is distributable, or 3) neither.
The first attribute is true of _localizable_ applications, i.e., the ones we ceded at the beginning of this piece to clusters and their ilk as undemanding applications not requiring a true supercomputer architect’s full attention. For a strong proponent of heterogeneity, this may sound a tad sectarian, but clusters remain very cost effective for easily localizable applications.

The problem is that, for demanding applications, most serious people have lost hope that, given a moderate amount of programmer effort, the compiler will automatically do a good job of allocating and distributing data to suit the computation and minimize communication. In a word, you can’t assume that your thread will automatically run in a “data-friendly” environment, where the bulk of the data you need is close by. Is there a more modest goal that might be achievable with modest programmer effort and a reasonable amount of automatic help from the compiler and the runtime system? Yes, any time you can _physically_ group together data that _logically_ belong together. This is not manually fragmenting data structures and designing and coordinating threads to compute individual fragments. Rather, it is, at least in part, simply the “degree 0” of affinity scheduling.

In non-von Neumann computing, when we encounter (or manufacture) a dependence graph with low internal locality, we should ask: Can I shape and decompose this dependence graph into fine-grained parallel activities, i.e., low-state threads, with the property that each low-state thread would perform well if it were put into a local storage environment containing, say, at least 75% of the data the thread needed? If so, we should further ask: Can I program this application so that, for each low-state thread, at least 75% of the relevant data becomes concentrated _around_ some location in the computer’s virtual address space, before the thread executes?
(Since we only care about average cost, this “concentration” can take many forms; for example, we might shape the distribution function of thread-relevant data into a crude approximation to the Dirac delta function.) If so, and if thread migration is dirt cheap, we can freely launch countless low-state threads, from many other (possibly high-state) threads, and have them migrate to these small regions of data concentration.

If the bulk of the loads can be satisfied from a small neighborhood—in virtual address space—of the low-state thread’s new location, and if we can get low-state threads to migrate cheaply to essentially arbitrary virtual locations, we have clearly minimized the average cost of that thread’s loads. Obviously, however, the virtual-address metric space cannot be _independent_ of the physical-location metric space, since—in the general case—cost equals wire distance along the global system interconnection network.

Agile, low-state threads 1) are freely spawned with minimal overhead by many types of thread, 2) migrate cheaply to prestaged local concentrations of data, and 3) engage in lightweight synchronization as often as they please. Deployment of these agile threads, and programmed construction of their data environments, is the essence of non-von Neumann computing.

Note: In a homogeneous multithreaded multiprocessor with logically flat shared memory, this problem does not arise. Take one of the famously unpartitionable graphs in postmodern graph computing. There is no need to partition or distribute or rebalance this graph because, logically speaking, every portion of the graph is at the same distance from every thread. Threads don’t need exotic migration machinery because they gain no performance from migrating! With an unlimited budget, you could build a rather large homogeneous multithreaded multiprocessor. You could do the same thing if you were willing to cut corners on the system’s global hardware bandwidth.
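When memory is not flat, the migration payoff described earlier can be illustrated with a toy model. Assume a 1-D virtual address space in which the cost of a load is simply the distance between the thread and the loaded address (a drastic simplification of wire distance on a real interconnect); then the best migration target is the median of the thread's load addresses, since the median minimizes average absolute distance. All names and numbers below are invented for the illustration:

```python
# Toy model of the migration payoff: 1-D virtual address space,
# load cost = |thread location - data address|. A low-state thread
# should migrate to the median of the addresses it will load from,
# because the median minimizes the mean absolute distance.

def best_migration_target(addresses):
    addrs = sorted(addresses)
    return addrs[len(addrs) // 2]             # (unweighted) median

def avg_load_cost(thread_location, addresses):
    return sum(abs(a - thread_location) for a in addresses) / len(addresses)

# Data prestaged near address 1000 (with one outlier); the thread
# currently sits far away at address 0.
data = [990, 995, 1000, 1005, 1010, 5000]
home = 0
target = best_migration_target(data)
assert avg_load_cost(target, data) < avg_load_cost(home, data)
```

The sketch also shows why migration must be dirt cheap to be worthwhile: the payoff is the cost gap between `home` and `target`, and a heavy-state thread would spend more moving its state than it saves on loads.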
But the future of computing demands that we back off—at least for a generation—from arbitrarily large flat shared memories. We must, for a while, become more architecturally modest; we must say, “I can’t guarantee sufficient performance with parallelism alone or locality alone, if I truly want to scale; rather, I need to exploit (almost) every scrap of parallelism, of any kind, and (almost) every scrap of locality, of any kind, that I can dig out of this application”. It is this new-found architectural modesty that is the driving force behind heterogeneous processing.

Note: Programmability is simply the desire not to be forced to dig out all this parallelism, and all this locality, with your teeth! A decent programming model, grounded in a decent programming language, is simply one that allows the programmer, the compiler, and the runtime system to extract almost all the parallelism, and almost all the locality, without forcing the programmer to do hard labor. That being said, performance tuning may require a larger contribution from the programmer.

— Heterogeneous Systems vs. Heterogeneous Hardware

If the first hard question was, how well can we load balance in shared-memory and distributed-memory system architectures, respectively?, the second hard question is this: Even if a vendor provides heterogeneous hardware to deal with different forms of parallelism and—far less commonly—different forms of locality, how committed is the vendor to the massive system-software “architecture” research—in new compiler technology, in the design of new runtime systems, and in getting Unix to live on what is essentially a new planet—that is required if heterogeneous processing is to achieve even a small fraction of its true potential? More bluntly, what must be added to even an impressive collection of heterogeneous system hardware to produce a genuine heterogeneous system architecture? Is this asking for too much?
Isn’t it at least conceivable that certain vendors—at least those with actual system architectures in their HPCS proposals—are already stretched to their limits (technical, financial, commercial, ideological, or whatever)? Would additional government pressure from HPCS on vendors at this time—asking for more innovation or more-radical system design—really be productive, or would it be counterproductive? These are delicate questions. But just as we have all scrutinized the intentions of a potential wife (or husband), so may we scrutinize the intentions of any vendor who is asking for a large chunk of public money.

Notionally, your correspondent can imagine at least two possible vendor attitudes: 1) We fully agree that a thorough exploitation of heterogeneous processing is the nation’s _only_ way out of the high-end crisis; we offer the best design we can produce at this time given current constraints of human and financial resources; we have considered—and postponed—more expensive designs; and 2) While the notion of heterogeneous processing is intriguing, and might lead to some interesting bragging rights down the road, we are not totally sure that the heterogeneous architecture proposed by our academic partners will fit well with our traditional business model, i.e., the traditional way we make our money and the traditional “high-end” brand we currently promote.

Not to put too fine a point on it, sometimes a suitor does not fully understand his own intentions, much less those of his intended match. And, throughout history, some suitors have been cads. From a crusader’s point of view, it makes sense to articulate the full vision of heterogeneous processing. After all, no system architecture, and no government program, is cast in concrete. At present, HPCS is the only high-end computing program our government has.
We can’t reduce our support for this utterly vital national program just because—as several reputable persons have suggested—it has watered down its original high expectations, indeed, its original mandatory requirements, for an HPCS machine. Should it be the case that giant steps by U.S. vendors at this time exceed our current capabilities, let us at least make sure that we are marching in the right direction.

— Conclusion

Heterogeneous processing, towards which your correspondent is radically biased, is turning out to be a big tent. It is not the only future of computing, but it must be a large part of any possible future. The drivers of heterogeneous processing are inter-application and intra-application diversity; the big tent is just the fact that we are incrementally widening our diversity net. In fact, some innovative _homogeneous_ machines are targeted at important biomedical and national-security challenges.

Still, choices abound. What are the natural limits of homogeneous processing? Are homogeneous systems in some sense essentially niche machines? Do shared memory and distributed memory have profound effects on our ability to load balance? How is this (alleged) difference related to programmer mediation of data distribution? Why are designers comfortable with parallelism diversity but willfully obtuse when it comes to locality diversity? Did the regrettable PIM mania destroy the credibility of non-von Neumann computing? How much should one pay for a genuine heterogeneous system architecture?

Since it is a sensitive subject, your correspondent has not rehashed arguments he made in late September showing that low-state threads deserve a _distinct_ intraprocessor memory hierarchy (essentially to lower the low-state thread-migration overhead). Is a polymorphic intraprocessor memory hierarchy even conceivable? Is it implementable? The mission agencies have computational problems they need solved yesterday.
Crusaders—especially multithreaded crusaders—dream of killer apps that will follow postmodern graph processing, and novel styles of computation that will define the future paradigms of high-end computing. Computational biomedicine and computational national security are converging towards massive data-intensive applications where very limited (conventional) locality trashes all (conventional) locality mechanisms. This is the proper driver of non-von Neumann computing.

—

The High-End Crusader, a noted expert in high-performance computing and communications, shall remain anonymous. He alone bears responsibility for these commentaries. Replies are welcome and may be sent to HPCwire editor Michael Feldman at [email protected]


August 7, 2019

Fusion energy is the Holy Grail of the energy world: low-radioactivity, low-waste, zero-carbon, high-output nuclear power that can run on hydrogen or lithium. T Read more…

By Oliver Peckham

Using AI to Solve One of the Most Prevailing Problems in CFD

October 17, 2019

How can artificial intelligence (AI) and high-performance computing (HPC) solve mesh generation, one of the most commonly referenced problems in computational engineering? A new study has set out to answer this question and create an industry-first AI-mesh application... Read more…

By James Sharpe

Cray Wins NNSA-Livermore ‘El Capitan’ Exascale Contract

August 13, 2019

Cray has won the bid to build the first exascale supercomputer for the National Nuclear Security Administration (NNSA) and Lawrence Livermore National Laborator Read more…

By Tiffany Trader

DARPA Looks to Propel Parallelism

September 4, 2019

As Moore’s law runs out of steam, new programming approaches are being pursued with the goal of greater hardware performance with less coding. The Defense Advanced Projects Research Agency is launching a new programming effort aimed at leveraging the benefits of massive distributed parallelism with less sweat. Read more…

By George Leopold

AMD Launches Epyc Rome, First 7nm CPU

August 8, 2019

From a gala event at the Palace of Fine Arts in San Francisco yesterday (Aug. 7), AMD launched its second-generation Epyc Rome x86 chips, based on its 7nm proce Read more…

By Tiffany Trader

D-Wave’s Path to 5000 Qubits; Google’s Quantum Supremacy Claim

September 24, 2019

On the heels of IBM’s quantum news last week come two more quantum items. D-Wave Systems today announced the name of its forthcoming 5000-qubit system, Advantage (yes the name choice isn’t serendipity), at its user conference being held this week in Newport, RI. Read more…

By John Russell

Ayar Labs to Demo Photonics Chiplet in FPGA Package at Hot Chips

August 19, 2019

Silicon startup Ayar Labs continues to gain momentum with its DARPA-backed optical chiplet technology that puts advanced electronics and optics on the same chip Read more…

By Tiffany Trader

Crystal Ball Gazing: IBM’s Vision for the Future of Computing

October 14, 2019

Dario Gil, IBM’s relatively new director of research, painted a intriguing portrait of the future of computing along with a rough idea of how IBM thinks we’ Read more…

By John Russell

Leading Solution Providers

ISC 2019 Virtual Booth Video Tour

CRAY
CRAY
DDN
DDN
DELL EMC
DELL EMC
GOOGLE
GOOGLE
ONE STOP SYSTEMS
ONE STOP SYSTEMS
PANASAS
PANASAS
VERNE GLOBAL
VERNE GLOBAL

Intel Confirms Retreat on Omni-Path

August 1, 2019

Intel Corp.’s plans to make a big splash in the network fabric market for linking HPC and other workloads has apparently belly-flopped. The chipmaker confirmed to us the outlines of an earlier report by the website CRN that it has jettisoned plans for a second-generation version of its Omni-Path interconnect... Read more…

By Staff report

Kubernetes, Containers and HPC

September 19, 2019

Software containers and Kubernetes are important tools for building, deploying, running and managing modern enterprise applications at scale and delivering enterprise software faster and more reliably to the end user — while using resources more efficiently and reducing costs. Read more…

By Daniel Gruber, Burak Yenier and Wolfgang Gentzsch, UberCloud

Dell Ramps Up HPC Testing of AMD Rome Processors

October 21, 2019

Dell Technologies is wading deeper into the AMD-based systems market with a growing evaluation program for the latest Epyc (Rome) microprocessors from AMD. In a Read more…

By John Russell

Rise of NIH’s Biowulf Mirrors the Rise of Computational Biology

July 29, 2019

The story of NIH’s supercomputer Biowulf is fascinating, important, and in many ways representative of the transformation of life sciences and biomedical res Read more…

By John Russell

Intel Debuts Pohoiki Beach, Its 8M Neuron Neuromorphic Development System

July 17, 2019

Neuromorphic computing has received less fanfare of late than quantum computing whose mystery has captured public attention and which seems to have generated mo Read more…

By John Russell

Xilinx vs. Intel: FPGA Market Leaders Launch Server Accelerator Cards

August 6, 2019

The two FPGA market leaders, Intel and Xilinx, both announced new accelerator cards this week designed to handle specialized, compute-intensive workloads and un Read more…

By Doug Black

When Dense Matrix Representations Beat Sparse

September 9, 2019

In our world filled with unintended consequences, it turns out that saving memory space to help deal with GPU limitations, knowing it introduces performance pen Read more…

By James Reinders

With the Help of HPC, Astronomers Prepare to Deflect a Real Asteroid

September 26, 2019

For years, NASA has been running simulations of asteroid impacts to understand the risks (and likelihoods) of asteroids colliding with Earth. Now, NASA and the European Space Agency (ESA) are preparing for the next, crucial step in planetary defense against asteroid impacts: physically deflecting a real asteroid. Read more…

By Oliver Peckham

  • arrow
  • Click Here for More Headlines
  • arrow
Do NOT follow this link or you will be banned from the site!
Share This