Compilers and More: MPI+X

By Michael Wolfe

July 16, 2014

At ISC’14, there was intense and continuing interest in the choice of a standard approach for programming the next generation HPC systems. While not guaranteed, many of these systems are likely to be large clusters of nodes with multicore CPUs and some sort of attached accelerators. A standard programming approach is necessary to convince developers, and particularly ISVs, to start adoption now in preparation for this coming generation of systems. John Barr raised the same question in an article at Scientific Computing World from a more philosophical point of view. Here I address this question from a deeper technical perspective.

HPC programming is currently dominated by either a flat model with MPI across nodes as well as cores within a node, or a hybrid model with MPI across the nodes and OpenMP shared memory parallelism across the cores in a node. The advantage of flat MPI is a simpler programming model, only one level of parallelism and only one API. The disadvantage is it doesn’t take advantage of the shared data across the ranks on the same node, requiring message and buffer management across all ranks. MPI+OpenMP roughly inverts those advantages and disadvantages.

The reason MPI and MPI+OpenMP have worked so well over the past 20 years now is that most HPC systems are roughly isomorphic, with some differences in instruction set, node topology, and performance profiles. The system is a network of nodes, the nodes have one or more processors, the processors have one or more cores. The cores on a node share virtual and physical memory, with hardware cache coherence to make shared memory programming relatively safe. There are some outliers, like the big SGI shared memory systems, which have some programming model and performance advantages for certain applications.

John Barr suggests that the options for programming next-generation HPC systems will be MPI+X, where X is one of: OpenMP (if accelerators don’t persist), OpenACC, OpenMP 4+ with device constructs, OpenCL, or CUDA. John is probably right, but I want to add a caveat. We don’t know what HPC systems will look like in a decade. We don’t even know what the HPC systems of the near future will look like. The familiar (I refrain from “simple”) MPI+OpenMP programming model that has served us so well is unlikely to apply to the newest designs five or ten years from now, because the network of homogeneous nodes of homogeneous cores with uniform shared memory is going to change. HPC is at another inflection point, where the drivers are both economic and technological.

Technology is driving designs towards slower cores with more parallelism to make up the performance slack. Parallelism will come in three forms: classical multicore or multiprocessor parallel processes, threads, or tasks; classical vector operations, sometimes recast as SIMD operations; and multithreading used to overlap long latency operations, mostly to tolerate long memory latencies. If your system has enough native bandwidth, multithreading can allow an appropriately organized application to achieve close to peak performance as long as there is another thread ready to execute when one thread stalls on a cache miss. Bandwidth increase is easier to achieve than latency reduction, though it’s still not free.

The economics of HPC leads us to use the best, lowest-cost commodity parts whenever possible. The economic drivers in computing will be mobile and embedded, such as smart phones and tablets. High end workstations will still be important, but vendors are learning that growth is in mobile, and we can expect the biggest investments for new computing parts there. Luckily for us, those parts need increasing performance and continue to require lower power and low cost, so even if painful, a shift from workstation parts to mobile parts may be beneficial.

But the important point is that we don’t know exactly what these systems will look like, and there’s likely to be quite a bit of variability. Some will look like today’s systems, though with higher core counts on each processor chip. Some will have low power chips designed for or inspired by the embedded space; think Blue Gene L/P/Q, Intel Atom/Quark, or ARM. The ratios of parallelism on a node vs. across nodes will differ.

Many will have accelerators of some sort: GPUs (NVIDIA, AMD), manycore (Intel Xeon Phi), or other (TI DSP). The amount and style of parallelism for each kind of accelerator differs.

Memory on today’s accelerators is physically and logically separate from the host memory; that will not always remain so. There is movement to share the same address space between accelerator and host, which allows a kind of shared memory. However, there will still be a performance penalty for accessing the more distant memory, either using remote access or software managed distributed shared memory, like the classical Treadmarks system. In particular, today’s accelerators are designed with a high bandwidth memory; if the host memory doesn’t provide the same bandwidth, the performance will suffer. The same can be true if the host and accelerator share physical memory. Today’s laptops have an integrated graphics unit, often on the processor chip itself; the IGU uses the system memory for the graphics buffer. This is sufficient for most laptop uses (email, presentations), but for interactive games and other challenging graphical applications, you want a discrete GPU with its own memory. Upcoming memory subsystems, such as the Hybrid Memory Cube, may change the balance allowing high bandwidth shared memory, if we’re willing to pay the cost.

So the challenge proposed by John Barr isn’t just what language we should use to write programs, but how will we write programs that will give us satisfactory performance across the variety of systems available today, tomorrow, and over the next decade. Will any of the options he presented suffice? Let’s discuss each of them.

MPI + OpenMP 3.1

OpenMP 3.1 was a minor update to OpenMP 3.0, released in 2011 after three years of development. For nodes with small numbers of homogeneous cores and uniform memory access, OpenMP serves quite well. It’s hard to argue that it needs to be replaced, unless you are using a higher level language than C, C++ or Fortran. However, even those languages are starting to adopt native parallel constructs. Fortran has long had array assignments and a forall construct, expressing vector-style or SIMD parallelism, and now has a do concurrent construct, subsuming much of the behavior of the OpenMP parallel do There is ongoing discussion in the C and C++ language committees about adding OpenMP-style threading or Cilk-style tasking to those languages. Language standards committees move slowly, thankfully, but depending on how these discussions play out, the need for OpenMP directives may diminish in the future.

Advantages: OpenMP 3.1 has the advantage of being relatively high level, and is quite mature. It is supported across essentially all multicore and multiprocessor systems, giving good functional and performance portability, though not necessarily scalability.

Disadvantages: Anyone just starting to port an existing program to a multicore will either really like the simplicity of OpenMP parallel loops, or will really dislike the necessity of annotating each and every parallel loop in the program. The OpenMP 3.1 model of a system is somewhat dated; systems being built now and in the future are no longer so homogeneous. OpenMP is really dependent on long-lived parallel threads, which pervade the execution model and are exposed in the programming model. While OpenMP now has tasks, the tasking model was grafted (quite elegantly, in my opinion) onto the existing OpenMP threading model.

Portability: As mentioned, OpenMP is supported on essentially all multiprocessor and multicore systems used in HPC today, and provides pretty good performance portability across the homogeneous shared memory (low core-count) multicore environment for which it was designed. However, OpenMP 3.1 has no support for heterogeneous or accelerated systems. There is an impressive research project at Purdue University to implement OpenMP on NVIDIA GPUs with some significant compiler and runtime effort. They have produced a prototype compiler and have some impressive results on benchmark programs. However, it’s still in the research stage, and it suffers from trying to hide or virtualize some of the most performance sensitive issues of accelerator programming, such as data movement between host and accelerator memories and the performance of the synchronizations required by OpenMP.

Future: OpenMP 4 adds many new features, including some support for heterogeneous systems and more control to take advantage of data and compute locality. The language is becoming more complex, to adapt to and allow control over a more complex world of parallel systems. Eventually, meaning a decade or more from now, the base languages may support enough native parallelism to obviate the need for OpenMP directives.

MPI + CUDA

CUDA is an explicit programming model for GPU-style accelerators. It exposes many of the architectural features of a GPU, such as the 3D grid and 3D thread block, the software-managed cache (CUDA shared memory), the SIMD-style execution of adjacent threads in the same thread block, the separate data memory on the GPU, and so on. Used the right way, this is a strength of the model and language, since it gives a dedicated developer access to all the power of the GPU. However, it does require significant investment when porting to the GPU. In addition, when moving from one generation of GPU to the next, highly tuned CUDA kernels will often have to be retuned for the newer hardware, increasing code maintenance cost.

Moreover, CUDA today is a single-vendor (NVIDIA) solution. PGI has developed a CUDA-X86 solution, but its performance is not always as good as the corresponding native code. One can envision a CUDA implementation for other GPUs or other devices, but there are no hints of such a development.

Advantages: The CUDA model is relatively mature, over five years old now. It gives control to the programmer over all the relevant features of the GPU. It comes with debugging and performance analysis tools, and is well-supported by NVIDIA. The language and programming model evolve as the capabilities of the GPU evolve. Recently, CUDA added Unified Memory, which makes porting easier by simplifying the memory management and allowing automatic data movement between the host and device memories. CUDA Fortran provides the same programming model for Fortran programmers, with the advantages of using the higher level language features available in Fortran.

Disadvantages: It’s an explicit programming model. Converting an existing application in C, C++ or Fortran to CUDA is a significant rewrite, although such rewrites allow and encourage one to replace inappropriate algorithms and data structures with others more suited to the parallel world.

Portability: CUDA is currently limited to NVIDIA GPUs. There have been efforts to port CUDA programs to parallel execution on CPUs; the CUDA-X86 product uses some of the same techniques. These efforts have some successes, but highly parallel CUDA programs don’t always map all that well to the limited multicore and SIMD parallelism available on CPUs. Moreover, a kernel tuned for one generation of NVIDIA GPU (experimenting with thread block size and shape, unrolling loops) will often have to be retuned for the next generation.

Future: NVIDIA will continue to support and develop CUDA. CUDA will continue to play an important role in GPU programming, especially where the investment pays off handsomely, such as highly tuned libraries or very performance sensitive application kernels.

MPI + OpenCL

OpenCL is essentially the same programming model for GPU-style accelerators as CUDA, but with a lower level platform layer. It exposes the same features of the GPU, but is not NVIDIA-specific. It is also supported on many targets, including DSPs and even x86 and ARM processors. Like CUDA, it gives complete control to the programmer. Unlike CUDA, there is no OpenCL compiler for the host code; the platform API is completely library-based, making it look very verbose, especially for small programs. The difference becomes clear when looking at how a kernel is launched in CUDA and OpenCL. In CUDA C, a kernel launch looks like a procedure call to the kernel, with additional syntax to specify the launch configuration (grid and block sizes):

      kernel<<< gridsize, blocksize >>>( arg0, arg1, ... );

The CUDA compiler translates this into the series of runtime calls to manage the arguments and the actual launch. Since OpenCL doesn’t have such a compiler, a kernel launch becomes a series of explicit runtime calls: one for each argument, and one for the launch:

      clSetKernelArg( kernel_handle, 0, sizeof(arg0), &arg1 );
      clSetKernelArg( kernel_handle, 1, sizeof(arg1), &arg1 );
      ...
      clEnqueueNDRangeKernel( queue_handle, kernel_handle, 1, NULL,
                              &globalsize, &localsize, 0, NULL, NULL );

Other than that, OpenCL programming is potentially just as powerful as CUDA programming.

Given the wider range of target devices implementing OpenCL, there has been quite a discussion about OpenCL and performance portability for the past several years. Use your favorite search engine for the phrase OpenCL performance portability for a sample. A recent article on HPCwire advocates quite well in favor of OpenCL for heterogeneous computing, in part because it accommodates all the available architectures, separately or together. I have to agree with almost everything the article says. As for performance portability, this article states: True, we may need to write a new version of our kernel to get the best performance on Architecture A, but isn’t this what we actually want? The most telling word in that sentence is kernel, singular. In the embedded world, one developer is quite likely to be focused on a single kernel, such as some video processing, or audio encoding, or the like. In that scenario, with only one kernel, then you can afford to tune and retune for each architecture. However, if you have tens of thousands or millions of lines of code to support over the multi-decade lifetime of your application, this is not a productive path.

Advantages: The kernels are written in the same explicit programming model as CUDA, and OpenCL is supported across a wide range of devices. As with CUDA, the OpenCL API continues to evolve, adding features aimed at simplifying programming and supporting larger applications.

Disadvantages: It doesn’t have a standard set of tools across all devices, so debugging and profiling support is sometimes spotty. The language is evolving, as mentioned, but evolves slowly, as any committee-designed language will.

Portability: Since the programming model is lower-level, your kernels have to explicitly take advantage of the features of the target architecture. That means your kernels have to be retuned or rewritten for different targets.

Future: The OpenCL model is more aimed at the embedded or mobile market, where the application writer doesn’t necessarily have details about the target device. Features like dynamic, just-in-time compiling are important in that arena. In HPC, having to recompile every kernel on every node of your cluster every time you run your program quickly sounds like overhead you’d rather avoid. However, I predict that like CUDA, OpenCL can play an important role in programming for GPUs and other accelerators, especially where the investment will pay off handsomely.

MPI + OpenACC

OpenACC is a directive-based programming model targeting a CPU+accelerator system, intentionally similar in many ways to OpenMP. It is designed to do for accelerator programming systems what OpenMP does for multicore systems. It hides or virtualizes those features of the system that can be managed automatically by the system without performance penalty, and exposes those features that must be managed by the programmer. For instance, the programming and execution models expose the presence of separate memories on a host+GPU system, and requires the user to manage data movement between the host and device memories for device-resident data. However, OpenACC virtualizes the variable names, using the same name for the CPU and device copies of the data, resolving them depending on where the name is used, and will even work with no overhead when the device shares memory with the host. The original intent was to eventually merge OpenACC into OpenMP, but the two groups decided on different approaches. In full disclosure, this author is a key member of the OpenACC language committee, and formerly participated in the OpenMP language committee.

OpenACC was designed to focus on parallel algorithms that perform well on the accelerators of interest. Accelerators today need a high degree of multicore parallelism as well as vector or SIMD parallelism. They don’t work very efficiently on scalar code, or on scalar tasks; scalar code will always be more efficient on a high-speed host CPU. This is the primary motivation for a programming model that supports closely coupled CPU+Accelerator systems, which will likely be most efficient on maximizing performance across complete applications.

Advantages: OpenACC has demonstrated support for multiple devices, and there is some initial evidence for performance portability across device types.

Disadvantages: OpenACC is still relatively young, and there are as yet no available open-source implementations of the full language. As yet, there are no implementations that target the host multicore as a device, so it currently must be combined with OpenMP. Some critical features are still under development, and there are some differences in how features are implemented by different compilers, making portability across vendors an issue.

Portability: OpenACC can virtualize many aspects of the architecture, so programs should work well across divergent targets. At the NVIDIA GPU Technology Conference in March, I presented results showing the performance of the SPEC ACCEL OpenACC benchmarks running on NVIDIA Kepler and AMD Radeon GPUs, providing initial evidence of performance portability across devices.

Future: OpenACC version 2.0 includes some important new features, such as defined support for separate compilation, unstructured data lifetimes, and more asynchronous operations; work proceeds on additional critical and important features. There are quite a few large applications being ported to GPUs with OpenACC, and this experience feeds back to the language design and implementations. Expect a great deal of exciting work on OpenACC in the next few years, including commercial and open-source implementations across all HPC target accelerators, as well as host CPUs.

MPI + OpenMP 4.0 Target Directives

One may ask why OpenACC looks so different from OpenMP? Why not just implement OpenMP on GPUs? This is a good question and an important one. I tried to answer this with an articleseveral years ago. Current GPUs, in particular, have some limitations that we should avoid and some capabilities that we should exploit when programming for the best parallel performance.

The OpenMP ARB released version 4.0 of the specification last July, a major revision just two years after releasing 3.1. Among many new features are the device constructs. Some of these constructs mirror those in OpenACC, such as data management with the OpenMP target data and target update constructs. The OpenMP committee chose to add a new level of parallelism to the classical OpenMP threads: teams. In OpenMP 3.1, a parallel region was executed by a team of threads, and parallel loop iterations or tasks were work-shared across the threads in a team. With the OpenMP 4.0 device constructs, the user can now create a league of teams of threads. The addition of teams was to accommodate devices like GPUs which do not support an efficient global barrier synchronization across all the threads in the system. This addition has two unfortunate side effects. First, porting an existing OpenMP program to a device like a GPU is not as easy as simply adding a target directive around the parallel loops. As with OpenACC, the programmer will have to manage the data traffic to the device, but now the programmer has to add new types of parallelism as well. Second, some devices, such as the manycore Intel Xeon Phi Coprocessor, don’t need and probably don’t want this extra level of parallelism; OpenMP can be implemented natively on that target. This means that tuning for one target may differ significantly from tuning for another target, making performance portability a real challenge. I fear that the OpenMP committee has made some unfortunate decisions that will be hard to fix.

Intel supports many of the OpenMP 4 device constructs, and presumably will have a conformant implementation for the Xeon Phi Coprocessor. Texas Instruments has also announced plans to support OpenMP 4 on some ARM+DSP heterogeneous SoCs. The new Convey MX computer has support for hybrid OpenMP that matches OpenMP 4 features. Cray has announced plans to deliver OpenMP 4 for their systems as well, including Xeon Phi and Kepler accelerators, though they don’t promise performance portability. The other OpenMP vendor members have yet to announce support for the device constructs.

Advantages: For those targets that can implement full OpenMP, the OpenMP 4 device constructs look like a simpler method to port an existing OpenMP program. There is still the requirement to manage data transfers, but that’s the case on any of these programming models.

Disadvantages: For targets that aren’t designed to implement OpenMP, it’s not clear whether good performance can be delivered. No prototype implementations of the teams construct were available. Performance portability is going to be a challenge, though, as with OpenCL, language portability is still a big step in the right direction. Parts of the current OpenMP 4 device constructs are not well specified and need more careful definition.

Portability: It’s too early to say for certain, but I fear the OpenMP device constructs are too prescriptive to support performance portability across the wide variety of accelerator architectures being used now and designed for the future. If a programmer has to write programs differently for each target, then the language loses many of its advantages.

Future: OpenMP 4.0 is still relatively new and has many new features beyond the device constructs. It will be some time before we have several mature complete implementations. The OpenMP ARB seems committed to more rapid exploration and adoption of new features across the spectrum of parallel programming. The device constructs will be expanded or contracted as necessary to address the targets of interest. Only if the OpenMP device constructs result in good performance on the host as well as DSPs, manycores, and GPUs, will it become the language of choice.

Any Others?

John Barr left off two other heterogeneous programming languages that have been bandied about, important if only because of the large company promoting each. Google Renderscript uses a kernels computation model and managed memory allocations (like OpenCL kernels and buffers). Unlike OpenCL, Renderscript is optimized for kernels that access graphics-like data structures: rectangular arrays of structs. It’s pretty hard to get the kind of random indexed addressing that is available in OpenCL or CUDA. The runtime can launch kernels on any appropriate available device, such as a CPU or GPU, moving data as required to make it accessible on that device.

Microsoft C++AMP (Accelerated Massive Parallelism) is a pretty sophisticated C++-specific solution, aiming at the sophisticated C++ programmer (or perhaps all C++ programmers are sophisticated), using C++ templates and lambda expressions for data and compute. It introduces a very rich parallel_for_each function that takes a computation domain and kernel function as arguments, and runs that function across the parallel domain. Like Renderscript and OpenCL, the runtime manages the memory movement; it can also decide whether to run a kernel on the host or the GPU. Currently it is available only from Microsoft, though there are one or two experimental implementations from outside Redmond.

Conclusions

As John Barr argues, the tools and programming methodologies for upcoming HPC systems will change as the architectures evolve. Resolving to a high level portable heterogeneous programming strategy will attract more users early in this evolution. Before MPI, there were several message passing libraries, some of which were vendor-specific. Settling on MPI allowed applications to port across cluster architectures without rewrites, and allowed vendors to focus on innovation at the architecture and implementation level. It also limited innovation on the message passing libraries. Before OpenMP, there were several sets of high level directives and low-level threading libraries for programming multiprocessor workstations and cluster nodes. Settling on OpenMP similarly allowed applications to port across multiprocessor and eventually multicore architectures without rewrites, allowing vendors to focus on innovation at the architecture and implementation level. It also limited innovation at the programming language level.

To program heterogeneous systems, we want to settle on a high level programming strategy. It should be as target agnostic as possible. In particular, it must not require a user to write a different program for different targets, and is must not require a user to write a different program for heterogeneous systems from homogeneous systems. We’re going to always require low-level or explicit programming as well, and CUDA and OpenCL will fit this bill nicely for appropriate devices. At the high level, the obvious possibilities for X are OpenACC or OpenMP; the eventual choice will be made by HPC users. If the OpenMP device constructs mature to better support performance portable programming across the wide variety of devices, it may become the best option. If OpenACC demonstrates portability and more generality, it should be the language of choice. There’s still hope that the two specifications will converge. Users will eventually decide based on how well each language supports the variety of targets in terms of ease of use and features, but most importantly, performance and performance portability.

Subscribe to HPCwire's Weekly Update!

Be the most informed person in the room! Stay ahead of the tech trends with industry updates delivered to you every week!

Nvidia’s New Blackwell GPU Can Train AI Models with Trillions of Parameters

March 18, 2024

Nvidia's latest and fastest GPU, code-named Blackwell, is here and will underpin the company's AI plans this year. The chip offers performance improvements from its predecessors, including the red-hot H100 and A100 GPUs. Read more…

Nvidia Showcases Quantum Cloud, Expanding Quantum Portfolio at GTC24

March 18, 2024

Nvidia’s barrage of quantum news at GTC24 this week includes new products, signature collaborations, and a new Nvidia Quantum Cloud for quantum developers. While Nvidia may not spring to mind when thinking of the quant Read more…

2024 Winter Classic: Meet the HPE Mentors

March 18, 2024

The latest installment of the 2024 Winter Classic Studio Update Show features our interview with the HPE mentor team who introduced our student teams to the joys (and potential sorrows) of the HPL (LINPACK) and accompany Read more…

Houston We Have a Solution: Addressing the HPC and Tech Talent Gap

March 15, 2024

Generations of Houstonian teachers, counselors, and parents have either worked in the aerospace industry or know people who do - the prospect of entering the field was normalized for boys in 1969 when the Apollo 11 missi Read more…

Apple Buys DarwinAI Deepening its AI Push According to Report

March 14, 2024

Apple has purchased Canadian AI startup DarwinAI according to a Bloomberg report today. Apparently the deal was done early this year but still hasn’t been publicly announced according to the report. Apple is preparing Read more…

Survey of Rapid Training Methods for Neural Networks

March 14, 2024

Artificial neural networks are computing systems with interconnected layers that process and learn from data. During training, neural networks utilize optimization algorithms to iteratively refine their parameters until Read more…

Nvidia’s New Blackwell GPU Can Train AI Models with Trillions of Parameters

March 18, 2024

Nvidia's latest and fastest GPU, code-named Blackwell, is here and will underpin the company's AI plans this year. The chip offers performance improvements from Read more…

Nvidia Showcases Quantum Cloud, Expanding Quantum Portfolio at GTC24

March 18, 2024

Nvidia’s barrage of quantum news at GTC24 this week includes new products, signature collaborations, and a new Nvidia Quantum Cloud for quantum developers. Wh Read more…

Houston We Have a Solution: Addressing the HPC and Tech Talent Gap

March 15, 2024

Generations of Houstonian teachers, counselors, and parents have either worked in the aerospace industry or know people who do - the prospect of entering the fi Read more…

Survey of Rapid Training Methods for Neural Networks

March 14, 2024

Artificial neural networks are computing systems with interconnected layers that process and learn from data. During training, neural networks utilize optimizat Read more…

PASQAL Issues Roadmap to 10,000 Qubits in 2026 and Fault Tolerance in 2028

March 13, 2024

Paris-based PASQAL, a developer of neutral atom-based quantum computers, yesterday issued a roadmap for delivering systems with 10,000 physical qubits in 2026 a Read more…

India Is an AI Powerhouse Waiting to Happen, but Challenges Await

March 12, 2024

The Indian government is pushing full speed ahead to make the country an attractive technology base, especially in the hot fields of AI and semiconductors, but Read more…

Charles Tahan Exits National Quantum Coordination Office

March 12, 2024

(March 1, 2024) My first official day at the White House Office of Science and Technology Policy (OSTP) was June 15, 2020, during the depths of the COVID-19 loc Read more…

AI Bias In the Spotlight On International Women’s Day

March 11, 2024

What impact does AI bias have on women and girls? What can people do to increase female participation in the AI field? These are some of the questions the tech Read more…

Alibaba Shuts Down its Quantum Computing Effort

November 30, 2023

In case you missed it, China’s e-commerce giant Alibaba has shut down its quantum computing research effort. It’s not entirely clear what drove the change. Read more…

Nvidia H100: Are 550,000 GPUs Enough for This Year?

August 17, 2023

The GPU Squeeze continues to place a premium on Nvidia H100 GPUs. In a recent Financial Times article, Nvidia reports that it expects to ship 550,000 of its lat Read more…

Analyst Panel Says Take the Quantum Computing Plunge Now…

November 27, 2023

Should you start exploring quantum computing? Yes, said a panel of analysts convened at Tabor Communications HPC and AI on Wall Street conference earlier this y Read more…

DoD Takes a Long View of Quantum Computing

December 19, 2023

Given the large sums tied to expensive weapon systems – think $100-million-plus per F-35 fighter – it’s easy to forget the U.S. Department of Defense is a Read more…

Shutterstock 1285747942

AMD’s Horsepower-packed MI300X GPU Beats Nvidia’s Upcoming H200

December 7, 2023

AMD and Nvidia are locked in an AI performance battle – much like the gaming GPU performance clash the companies have waged for decades. AMD has claimed it Read more…

Synopsys Eats Ansys: Does HPC Get Indigestion?

February 8, 2024

Recently, it was announced that Synopsys is buying HPC tool developer Ansys. Started in Pittsburgh, Pa., in 1970 as Swanson Analysis Systems, Inc. (SASI) by John Swanson (and eventually renamed), Ansys serves the CAE (Computer Aided Engineering)/multiphysics engineering simulation market. Read more…

Intel’s Server and PC Chip Development Will Blur After 2025

January 15, 2024

Intel's dealing with much more than chip rivals breathing down its neck; it is simultaneously integrating a bevy of new technologies such as chiplets, artificia Read more…

Baidu Exits Quantum, Closely Following Alibaba’s Earlier Move

January 5, 2024

Reuters reported this week that Baidu, China’s giant e-commerce and services provider, is exiting the quantum computing development arena. Reuters reported � Read more…

Leading Solution Providers

Contributors

Choosing the Right GPU for LLM Inference and Training

December 11, 2023

Accelerating the training and inference processes of deep learning models is crucial for unleashing their true potential and NVIDIA GPUs have emerged as a game- Read more…

Training of 1-Trillion Parameter Scientific AI Begins

November 13, 2023

A US national lab has started training a massive AI brain that could ultimately become the must-have computing resource for scientific researchers. Argonne N Read more…

Shutterstock 1179408610

Google Addresses the Mysteries of Its Hypercomputer 

December 28, 2023

When Google launched its Hypercomputer earlier this month (December 2023), the first reaction was, "Say what?" It turns out that the Hypercomputer is Google's t Read more…

Comparing NVIDIA A100 and NVIDIA L40S: Which GPU is Ideal for AI and Graphics-Intensive Workloads?

October 30, 2023

With long lead times for the NVIDIA H100 and A100 GPUs, many organizations are looking at the new NVIDIA L40S GPU, which it’s a new GPU optimized for AI and g Read more…

AMD MI3000A

How AMD May Get Across the CUDA Moat

October 5, 2023

When discussing GenAI, the term "GPU" almost always enters the conversation and the topic often moves toward performance and access. Interestingly, the word "GPU" is assumed to mean "Nvidia" products. (As an aside, the popular Nvidia hardware used in GenAI are not technically... Read more…

Shutterstock 1606064203

Meta’s Zuckerberg Puts Its AI Future in the Hands of 600,000 GPUs

January 25, 2024

In under two minutes, Meta's CEO, Mark Zuckerberg, laid out the company's AI plans, which included a plan to build an artificial intelligence system with the eq Read more…

Google Introduces ‘Hypercomputer’ to Its AI Infrastructure

December 11, 2023

Google ran out of monikers to describe its new AI system released on December 7. Supercomputer perhaps wasn't an apt description, so it settled on Hypercomputer Read more…

China Is All In on a RISC-V Future

January 8, 2024

The state of RISC-V in China was discussed in a recent report released by the Jamestown Foundation, a Washington, D.C.-based think tank. The report, entitled "E Read more…

  • arrow
  • Click Here for More Headlines
  • arrow
HPCwire