Hard Questions While Waiting For The HPCS Downselect

By the High-End Crusader

May 5, 2006

Proposals from Cray Inc., IBM, and Sun Microsystems for Phase III awards from DARPA’s High-Productivity Computing Systems (HPCS) program are due as these words are being written. To differing degrees, the vendors are understandably coy about precisely what their proposals will contain. In a fiercely competitive environment where perception management is key, a loosely guarded detail might open up one of the proposals to successful sniping by a rival vendor. But an innovative petaflops system architecture that, in the most optimistic case, might reinvigorate the stagnant field of high-end computing is hardly in the same category as a stealth fighter aircraft or a classified graph algorithm. First, taxpayers have at least some right to know how their tax dollars are being spent. Second, one of the critical questions in postmodern computing is this: Is there some form of thread migration, in shared-memory and distributed-memory architectures, respectively, that effectively solves—or fails to solve—the dynamic load-balancing problem? This question deserves to be answered in full view of the entire HPC community. While it may be naive to expect a final resolution of this perennial question in short order, we can demand more openness from vendors as the evidence accumulates.

After the awards are made, it is likely that the government will engage in some perception management of its own, perhaps arguing that the innovative system architectures of the two winners (people expect two winners) are prospectively likely to be as good as, or even better than, the forthcoming Japanese 10-petaflops heterogeneous-processing machine, which appears to this observer to be architecturally quite impressive.

Notwithstanding the vendors’ undisputed proprietary rights, your correspondent would like to preemptively _shame_ each Phase III winner into coming clean soon after the awards are made. He asks each successful PI, at such time, to explain the broad outlines—with at least some details—of his company’s system architecture. Obviously, there are major differences in the degree to which the various vendors have been coy up to now, but in all three cases one can reasonably ask: What are your aspirations for this machine? What do you really think it can do? What are you really going to put into it?

— A Problem And Its Solution

A significant source of performance and programmability shortfalls in current “high-end” system architectures is inter-application and intra-application diversity. Since we can safely leave the embarrassingly _local_ (not “parallel”!) applications to clusters and their ilk, the applications we consider have the following characteristics: 1) they compute over globally dispersed data structures; 2) they engage in long-range communication, sometimes frequently; 3) they contain a dynamic heterogeneous mix of algorithm characteristics due to dynamic variation of a) the amount, kind, and granularity of the application’s parallelism, and b) the amount, kind, and granularity of the application’s locality; and 4) in both classical and postmodern problems, there are frequently rather serious load-balancing problems that generally require flexible, dynamic load-balancing mechanisms to resolve.

We can easily imagine an ideal heterogeneous system architecture that would gracefully deal with all these forms of application diversity.
Such a machine would have the following characteristics: 1) it would have high global system bandwidth—an expensive, critical, finite system resource—a) to permit tolerating the latency of long-range communication, etc., and b) to ease the programming burden by mitigating the performance nonuniformity of memory accessing; 2) it would have a wide variety of _combined_ compile-time, runtime, and hardware parallelism mechanisms that would adapt to dynamic variation of the amount, kind, and granularity of an application’s parallelism; 3) it would have a wide variety of _combined_ compile-time, runtime, and hardware locality mechanisms that would adapt to dynamic variation in the amount, kind, and granularity of an application’s locality; and 4) it would include—as mandated by your correspondent’s architectural tradition—various work queues of fine-grained and medium-grained parallel activities (threads, chores, threadlets, etc.) that would be dynamically self-scheduled by various instances of various “virtual processors” (VPs).

This notion of VP is fully generic. An individual VP could be, for example, 1) a multithreaded-processor instruction-stream processing resource, 2) a vector processor, 3) a RISC superscalar processor, or 4) _any_ processing element with an assignable <PC, FP> tuple, i.e., program counter plus frame pointer, which specifies the data (or execution) context.

Work queues, i.e., unordered collections of ready-to-execute continuations, were originally popularized by dataflow folk and later enthusiastically recommended by (massively) multithreaded folk. Perhaps vector processors—which come from an entirely different architectural tradition—should ease up on static scheduling by their scalar units, and try a little dynamic self-scheduling instead. Just a suggestion. Note: A vector threadlet can be much smaller than a full vector thread.

In any case, work queues in shared-memory systems are known to allow flexible, dynamic load balancing by using the _work-stealing_ technique popularized by Blumofe in the MIT Cilk project. Ah, but any cost-effective system architecture must find some way to use all these parallelism, locality, and load-balancing mechanisms to achieve high performance without over-consuming the precious, finite global system bandwidth resource.

— Heterogeneous Parallelism and Heterogeneous Locality

Much of the misunderstanding of parallelism and locality comes from the confusion between 1) parallelism and locality as _attributes_ of programs and computations, and 2) parallelism and locality as various architectural (and software) _mechanisms_ that allow us to extract, generate, shape, and exploit parallelism and locality, respectively (although sometimes—to really confuse us—parallelism mechanisms are used to exploit locality). Compounding this misunderstanding is a heap of partially obsolete and potentially misleading terms such as “instruction-level parallelism” (ILP), “data-level parallelism” (DLP), and “thread-level parallelism” (TLP), not to mention such hoary favorites as “temporal locality” and “spatial locality”. While there is no hope of escaping this muddled terminology, we might at least try to clean it up.

For parallelism, massive abstraction simplifies the discussion. Consider the flow-dependence graph of a computation, where the graph’s vertices are elementary operations. We can take this as the finest-granularity view of the parallelism in the computation.
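By way of illustration, here is a minimal Python sketch of such a flow-dependence graph (the four-operation example is invented for exposition), with parallelism read off as the absence of a dependence path between two vertices:

```python
# Toy flow-dependence graph for the straight-line computation
#   t1 = a + b;  t2 = c * d;  t3 = t1 - t2;  t4 = t1 * 2
# Vertices are elementary operations; an edge u -> v means that
# v consumes a value produced by u.
from itertools import combinations

edges = {            # producer -> consumers
    "t1": ["t3", "t4"],
    "t2": ["t3"],
    "t3": [],
    "t4": [],
}

def depends_on(v, u, graph):
    """True if v directly or transitively consumes a value produced by u."""
    frontier, seen = list(graph[u]), set()
    while frontier:
        w = frontier.pop()
        if w == v:
            return True
        if w not in seen:
            seen.add(w)
            frontier.extend(graph[w])
    return False

# Two operations may execute in parallel iff neither transitively
# depends on the other.
for u, v in combinations(edges, 2):
    if not depends_on(v, u, edges) and not depends_on(u, v, edges):
        print(f"{u} and {v} may run in parallel")
```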
Absence of a direct (or transitive) flow dependence between two operations means that it is permissible to execute these two operations in parallel. Similarly, grouping subgraphs of the flow-dependence graph into supervertices of a new dependence graph allows us to view this parallelism at coarser granularity. At this level of abstraction, reasonable people agree there is no way to identify any portion of this parallelism as ILP, DLP, or TLP—or any other kind of P—because all the parallelism specified in this dependence graph is _fungible_. This simply means that it could very well be exploited by any one of several architectural parallelism mechanisms. If you find this paradoxical, think of a parallelizing compiler transforming TLP into ILP or vice versa.

For better or worse, “instruction-level parallelism” is the term we use to denote exploiting parallelism by conventional RISC superscalar pipelining with dynamic instruction scheduling and superscalar instruction issue. Code so executed is called _serial code_ (or single-threaded scalar code). Of course, even if all the instructions in a thread are totally ordered by dependences, we can still execute them “in parallel” by overlapping pipeline stages of different instructions, where the dependences are now between pipeline stages. Alas, since instructions can specify long-latency operations (e.g., loads), and since RISC superscalar processors do _very_ advanced pipelining, it helps if the operations in the thread are as unordered as possible. If not, we may find it essentially impossible to fill the instruction-issue slots.

Does serial code exist? Practically speaking, it exists whenever other forms of extracting and exploiting parallelism are more expensive—for whatever reason (perhaps because we are unwilling to reprogram a legacy application). Single-thread performance on a RISC superscalar processor is often quite respectable when parallelism is not an option.

“Vector-level parallelism” is the term we _should_ use to denote exploiting parallelism by conventional or hypermodern vector processors. Code so executed is called _vectorized code_. Precisely what code is vectorizable is a long story but, roughly speaking, such code 1) contains large groups of independent homogeneous operations that can be performed in parallel on vast arrays of data, and 2) has limited branch intensity and/or irregularity of control flow, which might very well overwhelm the compiler and lead to diminishing performance returns. Modern vector processors have spatially parallel execution units and spatially parallel datapaths. When vectorization “succeeds”, in several senses of that word, it is by far the most cost-effective architectural mechanism for exploiting parallelism.

Finally, “thread-level parallelism” is the term we use to denote exploiting parallelism by conventional or hypermodern (massively) multithreaded processors. Code so executed is called _multithreaded code_. What makes code threadable but not vectorizable? Classic answers include: 1) frequent conditional branches, and short-vector or even scalar arithmetic; 2) data-dependent loop exits, recursive function invocations, and code selection based on data values; 3) non-linear and pointer-based data structures in data-intensive applications, such as postmodern graph applications; and 4) unpredictable fine-grained synchronization. But maybe a better question is what code _should_ be threaded?
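To make the vectorizable/threadable contrast concrete, here is a small sketch (the loops are invented for exposition, with Python standing in for whatever source a real vectorizing or threading compiler would see):

```python
# Classic vector material: long runs of independent, homogeneous arithmetic
# over dense arrays, with no data-dependent control flow. Each iteration is
# independent of every other, so the whole loop maps onto vector operations.
def triad(a, b, c, alpha):
    for i in range(len(a)):
        a[i] = b[i] + alpha * c[i]

# Threadable but not vectorizable: pointer chasing over an irregular graph,
# data-dependent loop exits, and value-dependent branches. Many independent
# traversals like this could each become a low-state thread(let), but there
# are no useful vectors to be found.
def reachable(adjacency, start, wanted):
    stack, seen = [start], {start}
    while stack:                     # data-dependent loop exit
        v = stack.pop()
        if v == wanted:              # branch on a data value
            return True
        for w in adjacency[v]:       # pointer-based, irregular accesses
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return False
```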
— A Model of Heterogeneity

The ultimate payoff of a heterogeneous architecture is the ability to construct a hybrid system architecture that combines characteristics of both 1) parallel von Neumann machines, and 2) parallel non-von Neumann machines, which are basically (meta)dataflow machines. Any reader who understands that facts can change and opinions can evolve is referred to your correspondent’s first attempt to articulate the subtle distinction between von Neumann and non-von Neumann machines: “Asynchronous Heterogeneity: The Next Big Thing in HEC”, HPCwire, August 12, 2005 [M450881]. An early example of such a hybrid architecture is the MIT *T machine.

What does this mean in the current context? Broadly speaking, vector (and serial) threads _necessarily_ have high thread state and thus are suitable for some von Neumann computing tasks, while multithreaded scalar threads _can_ have arbitrarily low thread state and thus can _become_ suitable for other (possibly complementary) non-von Neumann computing tasks. This is not to criticize threaded von Neumann computing, but merely to show that threading is more general purpose: it can do both!

There is no question that applications present with mixtures of vector parallelism and scalar parallelism and that this often dictates whether vectors or threading is more appropriate. But an entirely different question is whether high-state threads or low-state threads _as such_ are more appropriate. Thread state matters so much because of 1) the frequent need for lightweight synchronization, and other issues related to dynamic self-scheduling; and 2) the desire to use inexpensive mechanisms for migrating low-state threads to exploit novel forms of locality. Already, to take one topical example, new attributed-graph algorithms in health care, proteomics, counterterrorism, bioinformatics, etc., etc., are showing us that there are entirely novel styles of computation that must be addressed.

For locality, in contrast to parallelism, a more concrete (indeed, almost architecture-dependent) abstraction simplifies the discussion. One useful distinction is this: 1) there are temporal relationships among data operations, and 2) there are spatial relationships among data items and also among executing threads and (thread-relevant) data items. What we really want is an efficient general-purpose mechanism to leverage instances of the first kind of spatial relationship into instances of the second kind, of which more later.

Intuitively, our model of (system) locality starts by abstracting the rich complexity of the _intraprocessor_ memory hierarchy into a one-level store. Consider the flow-dependence graph of a computation, where—for simplicity—the graph’s vertices are elementary operations of precisely two types: either loads or arithmetic operations. Each vertex is labeled with its type. Since marshaling operations into threads strongly affects locality, we view this computation as resulting from the execution of some thread (serial, vector, or threaded). Obviously, both the loads and the arithmetic operations are _producers_ of data values, while the arithmetic operations are also _consumers_. We include a separate dependence edge every time a value produced by one operation is (nondestructively) consumed, i.e., used as an operand, by another operation. For even more simplicity, we will shortly elide some data-producing operations from this dependence graph.
Our performance metric here is the average amortized cost of producing a data value. If a produced value is reused, i.e., consumed several times, we amortize its production cost over the number of its uses. We can _attribute_ this labeled graph by specifying the data-production cost of each data-producing operation. To do this, we need an architectural model. The simplest model is that the thread runs on a processor with a local store (e.g., the register file plus the D-cache) and is connected through a pipe to a memory. Qualitatively, we say that each load produces a data value at _high_ cost, while each arithmetic operation produces a data value at _low_ cost. Ultimately, we must assign precise quantitative values to these data-production costs, and design locality mechanisms to reduce them whenever possible.

To keep the model simple, we elide all (low-cost) “local reads”, i.e., all loads satisfied from the D-cache—as well as all (implicit) reads from the register file—so only (expensive) loads and (cheap) arithmetics are retained in the elided flow-dependence graph. Reads from nonarchitected pipeline registers are included as part of the cost of an arithmetic operation. Of course, in any real-world implementation, we would have to reinstate all the (implicit and explicit) elided vertices—and worry about each and every data-production cost. Here, we are merely engaged in conceptual exposition.

In any execution schedule permitted by the flow-dependence graph, at each moment of time, there will be a set of produced values whose _final_ consumption lies in the future. This is the _live set_ of the thread. Since we are trying to minimize communication (i.e., wire) cost, things work best if the thread’s live set fits into the thread’s state, which is just the size of the thread’s effective local store (possibly a small fraction of the processor’s local store).

Two observations lead to an important punchline. 1) Temporal relationships among high-cost and low-cost data operations are determined by the topology of the _labeled_ dependence graph (after all, each ‘load’ vertex has a given out-degree and there is a given ratio of loads to arithmetics). 2) Spatial relationships among data items—and among executing threads and (thread-relevant) data items—can strongly influence the data-production costs of the _attributed_ dependence graph (for example, a thread sitting at the _center_ of a cluster of useful data items reduces the average cost of performing a load). The data-production cost attributes add _metric_ information to what is otherwise (largely) a _topological_ structure.

Note: The data-production costs of elided data-producing operations are generally controlled by minimizing wire distances within the processor’s microarchitecture—for example, by segmenting register files, caches, and functional-unit groups. Bill Dally’s paper on the Imagine stream processor explains all this in great detail. We have simply chosen to abstract from microarchitecture locality to focus on _system-architecture_ locality.
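As a minimal, hedged sketch of this cost model (the tiny graph, the 100-to-1 cost ratio, and the treatment of the final result are all assumptions made for exposition):

```python
# Average amortized cost of producing a data value over an elided
# flow-dependence graph. Vertices are labeled 'load' (high cost) or
# 'arith' (low cost); an edge u -> v means v consumes u's value, and
# reuse spreads the producer's cost over all of its uses.
LOAD_COST, ARITH_COST = 100.0, 1.0        # assumed relative costs

ops = {                                   # op: (kind, consumers of its value)
    "x":  ("load",  ["t1", "t2"]),        # loaded value consumed twice
    "y":  ("load",  ["t1"]),
    "t1": ("arith", ["t2"]),
    "t2": ("arith", []),                  # final result, treated as used once
}

def avg_amortized_cost(ops):
    total_cost = total_uses = 0.0
    for kind, consumers in ops.values():
        total_cost += LOAD_COST if kind == "load" else ARITH_COST
        total_uses += max(len(consumers), 1)
    # Each use is charged (producer cost / number of uses), so the average
    # over all uses collapses to total cost divided by total uses.
    return total_cost / total_uses

print(avg_amortized_cost(ops))            # more reuse, or cheaper loads thanks
                                          # to better locality, drives this down
```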
Quite naturally, the architectural mechanisms for exploiting locality fall into two classes: 1) _intraprocessor_ locality mechanisms optimized to exploit dependence graphs with high “internal locality”, i.e., those with significant reuse of loaded data values and/or significant local production of data values; and 2) _extraprocessor_ locality mechanisms optimized to exploit dependence graphs with low internal locality, i.e., those with little or no data reuse and little or no local production of data values. What “optimization” of the second type of dependence graph is possible? The latter dependence graphs _may_ have potentially high “external locality”, in the sense that the communication costs of their high-cost data-producing operations (viz., their expensive loads) can potentially be significantly reduced.

In class 1, threads execute on a fixed processor with a local store that can hold the peak amount of the thread’s live set. In this (von Neumann) model, threads are immobile and gradually build up their thread state—in the processor’s local store—to maximize their ratio of arithmetic operations to communication operations (the thread’s “arithmetic intensity”). Obviously, to be properly exploited, a massive live set inherent in a thread’s flow-dependence graph requires an equally massive local store. Dependence graphs with high internal locality _and_ small live sets do exist, but they tend to be the exception rather than the rule. An immobile thread that gradually accumulates thread state is the essence of von Neumann computing. Contrariwise, a thread denied thread-state accumulation by algorithm constraints may seek performance by engaging in non-von Neumann computing.

In class 2, how a thread—which has been marshaled from some flow-dependence graph—is intended to behave depends on whether 1) the application is control and data decomposable, 2) the application’s data is distributable, or 3) neither. The first attribute is true of _localizable_ applications, i.e., the ones we ceded at the beginning of this piece to clusters and their ilk as undemanding applications not requiring a true supercomputer architect’s full attention. For a strong proponent of heterogeneity, this may sound a tad sectarian, but clusters remain very cost effective for easily localizable applications.

The problem is that, for demanding applications, most serious people have lost hope that, given a moderate amount of programmer effort, the compiler will automatically do a good job of allocating and distributing data to suit the computation and minimize communication. In a word, you cannot assume that your thread will automatically run in a “data-friendly” environment, where the bulk of the data you need is close by.

Is there a more modest goal that might be achievable with modest programmer effort and a reasonable amount of automatic help from the compiler and the runtime system? Yes, any time you can _physically_ group together data that _logically_ belong together. This is not manually fragmenting data structures and designing and coordinating threads to compute individual fragments. Rather, it is, at least in part, simply the “degree 0” of affinity scheduling.
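A minimal sketch of that degree-0 idea (the BFS-based packing policy is merely a stand-in chosen for exposition, not a recommendation of any particular reordering):

```python
# Physically grouping data that logically belong together: relabel graph
# vertices so that a vertex and its neighbors tend to land in the same
# contiguous block of the (virtual) address space. A plain BFS ordering
# stands in for whatever packing policy a runtime might actually use.
from collections import deque

def bfs_order(adjacency, root):
    """Vertices of root's component in BFS order; neighbors end up nearby."""
    order, seen, queue = [], {root}, deque([root])
    while queue:
        v = queue.popleft()
        order.append(v)
        for w in adjacency[v]:
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return order

def pack(adjacency, root):
    """Map each vertex to its position in the packed layout (its new 'address')."""
    return {v: i for i, v in enumerate(bfs_order(adjacency, root))}

# A threadlet working on vertex v now finds most of v's neighborhood within
# a small window of addresses around pack(...)[v].
```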
In non-von Neumann computing, when we encounter (or manufacture) a dependence graph with low internal locality, we should ask: Can I shape and decompose this dependence graph into fine-grained parallel activities, i.e., low-state threads, with the property that each low-state thread would perform well if it were put into a local storage environment containing, say, at least 75% of the data the thread needed? If so, we should further ask: Can I program this application so that, for each low-state thread, at least 75% of the relevant data becomes concentrated _around_ some location in the computer’s virtual address space, before the thread executes? (Since we only care about average cost, this “concentration” can take many forms; for example, we might shape the distribution function of thread-relevant data into a crude approximation to the Dirac delta function.) If so, and if thread migration is dirt cheap, we can freely launch countless low-state threads, from many other (possibly high-state) threads, and have them migrate to these small regions of data concentration.

If the bulk of the loads can be satisfied from a small neighborhood—in virtual address space—of the low-state thread’s new location, and if we can get low-state threads to migrate cheaply to essentially arbitrary virtual locations, we have clearly minimized the average cost of that thread’s loads. Obviously, however, the virtual-address metric space cannot be _independent_ of the physical-location metric space, since—in the general case—cost equals wire distance along the global system interconnection network.

Agile, low-state threads 1) are freely spawned with minimal overhead by many types of thread, 2) migrate cheaply to prestaged local concentrations of data, and 3) engage in lightweight synchronization as often as they please. Deployment of these agile threads, and programmed construction of their data environments, is the essence of non-von Neumann computing. (A toy sketch of the migrate-to-the-data decision appears below.)

Note: In a homogeneous multithreaded multiprocessor with logically flat shared memory, this problem does not arise. Take one of the famously unpartitionable graphs in postmodern graph computing. There is no need to partition or distribute or rebalance this graph because, logically speaking, every portion of the graph is at the same distance from every thread. Threads don’t need exotic migration machinery because they gain no performance from migrating! With an unlimited budget, you could build a rather large homogeneous multithreaded multiprocessor. You could do the same thing if you were willing to cut corners on the system’s global hardware bandwidth.

But the future of computing demands that we back off—at least for a generation—from arbitrarily large flat shared memories. We must, for a while, become more architecturally modest; we must say, “I can’t guarantee sufficient performance with parallelism alone or locality alone, if I truly want to scale; rather, I need to exploit (almost) every scrap of parallelism, of any kind, and (almost) every scrap of locality, of any kind, that I can dig out of this application”. It is this new-found architectural modesty that is the driving force behind heterogeneous processing.

Note: Programmability is simply the desire not to be forced to dig out all this parallelism, and all this locality, with your teeth!
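Here is that toy sketch of the launch-and-migrate decision (the region size, the 75% threshold, and the idea of deciding at spawn time are assumptions made for exposition, not anything a vendor has proposed):

```python
# Launch-and-migrate decision for a low-state threadlet: find the region of
# virtual address space that already holds most of the threadlet's relevant
# data, and migrate the threadlet there if the concentration is high enough.
from collections import Counter

REGION_BITS = 20                 # assumed: 1 MiB regions of virtual address space
CONCENTRATION_THRESHOLD = 0.75   # the "at least 75% of the data" rule of thumb

def best_region(addresses):
    """Return (region_id, fraction of the relevant data living in that region)."""
    counts = Counter(addr >> REGION_BITS for addr in addresses)
    region, hits = counts.most_common(1)[0]
    return region, hits / len(addresses)

def place_threadlet(relevant_addresses):
    region, concentration = best_region(relevant_addresses)
    if concentration >= CONCENTRATION_THRESHOLD:
        return ("migrate", region)       # run next to the prestaged data
    return ("stay", None)                # too scattered: pull the data instead

# Usage: the spawning (possibly high-state) thread calls place_threadlet()
# with the addresses the threadlet will touch; a runtime with cheap thread
# migration would then enqueue the threadlet near the chosen region.
```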
A decent programming model, grounded in a decent programming language, is simply one that allows the programmer, the compiler, and the runtime system to extract almost all the parallelism, and almost all the locality, without forcing the programmer to do hard labor. That being said, performance tuning may require a larger contribution from the programmer.

— Heterogeneous Systems vs. Heterogeneous Hardware

If the first hard question was how well we can load balance in shared-memory and distributed-memory system architectures, respectively, the second hard question is this: Even if a vendor provides heterogeneous hardware to deal with different forms of parallelism and—far less commonly—different forms of locality, how committed is the vendor to the massive system-software “architecture” research—in new compiler technology, in the design of new runtime systems, and in getting Unix to live on what is essentially a new planet—that is required if heterogeneous processing is to achieve even a small fraction of its true potential? More bluntly, what must be added to even an impressive collection of heterogeneous system hardware to produce a genuine heterogeneous system architecture?

Is this asking for too much? Isn’t it at least conceivable that certain vendors—at least those with actual system architectures in their HPCS proposals—are already stretched to their limits (technical, financial, commercial, ideological, or whatever)? Would additional government pressure from HPCS on vendors at this time—asking for more innovation or more-radical system design—really be productive, or would it be counterproductive?

These are delicate questions. But just as we have all scrutinized the intentions of a potential wife (or husband), so may we scrutinize the intentions of any vendor who is asking for a large chunk of public money. Notionally, your correspondent can imagine at least two possible vendor attitudes: 1) We fully agree that a thorough exploitation of heterogeneous processing is the nation’s _only_ way out of the high-end crisis; we offer the best design we can produce at this time given current constraints of human and financial resources; we have considered—and postponed—more expensive designs; and 2) While the notion of heterogeneous processing is intriguing, and might lead to some interesting bragging rights down the road, we are not totally sure that the heterogeneous architecture proposed by our academic partners will fit well with our traditional business model, i.e., the traditional way we make our money and the traditional “high-end” brand we currently promote.

Not to put too fine a point on it, sometimes a suitor does not fully understand his own intentions, much less those of his intended match. And, throughout history, some suitors have been cads.

From a crusader’s point of view, it makes sense to articulate the full vision of heterogeneous processing. After all, no system architecture, and no government program, is cast in concrete. At present, HPCS is the only high-end computing program our government has. We can’t reduce our support for this utterly vital national program just because—as several reputable persons have suggested—it has watered down its original high expectations, indeed, its original mandatory requirements, for an HPCS machine. Should it be the case that giant steps by U.S. vendors at this time exceed our current capabilities, let us at least make sure that we are marching in the right direction.
— Conclusion

Heterogeneous processing, towards which your correspondent is radically biased, is turning out to be a big tent. It is not the only future of computing, but it must be a large part of any possible future. The drivers of heterogeneous processing are inter-application and intra-application diversity; the big tent is just the fact that we are incrementally widening our diversity net. In fact, some innovative _homogeneous_ machines are targeted at important biomedical and national-security challenges.

Still, choices abound. What are the natural limits of homogeneous processing? Are homogeneous systems in some sense essentially niche machines? Do shared memory and distributed memory have profound effects on our ability to load balance? How is this (alleged) difference related to programmer mediation of data distribution? Why are designers comfortable with parallelism diversity but willfully obtuse when it comes to locality diversity? Did the regrettable PIM mania destroy the credibility of non-von Neumann computing? How much should one pay for a genuine heterogeneous system architecture?

Since it is a sensitive subject, your correspondent has not rehashed arguments he made in late September showing that low-state threads deserve a _distinct_ intraprocessor memory hierarchy (essentially to lower the low-state thread-migration overhead). Is a polymorphic intraprocessor memory hierarchy even conceivable? Is it implementable?

The mission agencies have computational problems they need solved yesterday. Crusaders—especially multithreaded crusaders—dream of killer apps that will follow postmodern graph processing, and novel styles of computation that will define the future paradigms of high-end computing. Computational biomedicine and computational national security are converging towards massive data-intensive applications where very limited (conventional) locality trashes all (conventional) locality mechanisms. This is the proper driver of non-von Neumann computing.

—

The High-End Crusader, a noted expert in high-performance computing and communications, shall remain anonymous. He alone bears responsibility for these commentaries. Replies are welcome and may be sent to HPCwire editor Michael Feldman at [email protected].
