Compilers and More: Hardware/Software Codesign

By Michael Wolfe

November 2, 2010

Recently, I was invited to participate in a workshop, sponsored by Sandia National Labs, to discuss how codesign (that’s co-design, not code-sign) fits into the landscape of high performance computing. There is a growing feeling that merely taking the latest processor offerings from Intel, AMD or IBM will not get us to exascale in a reasonable time frame, cost budget, and power constraint. One avenue to explore is designing and building more specialized systems, aimed at the types of problems seen in HPC, or at least at the problems seen in some important subset of HPC. Such a strategy loses the advantages we’ve enjoyed over the past two decades of commoditization in HPC: lower cost, latest technology, tools we can also use on our laptops. However, a more special purpose design may be wise, or necessary – HPC is too small a market to really interest the big CPU vendors. Consider that last year somewhere around 170 million laptops were sold, whereas the sum of all processors (chips, not cores) in last June’s TOP500 list is about 1.4 million, less than 1 percent.

Some will surely point out that there’s some customization in most HPC system designs. Recent Cray systems may use commodity AMD or Intel processors, but they have custom, high bandwidth, low latency messaging hardware, and many HPC system designs have special cooling to handle the high heat density.

Yet, many feel that we need more fundamental customization, and specifically, codesign between the software and hardware to reach useful exascale. This last point, useful exascale, is often defined as exascale computing on real applications, specifically not Linpack. (One of my colleagues went so far as to suggest that the way to save HPC is to contractually ban the Linpack benchmark from any government procurement.) My particular interest here is how codesign or customization affects the software tool stack, including the OS, compiler, debugger, and other tools.

What is Codesign?

The buzzword is codesign, but it is only loosely defined. Even at this workshop, one homework question was for each participant to write up a definition, hopefully resulting in less than one definition per attendee. My definition is that codesign occurs when two or more elements of the system are designed together, trading features, costs, advantages and disadvantages of each element against those of each other element. Specifically relevant is codesign of the software with the hardware.

The embedded system design community has a longer history of software/hardware codesign. For example, when designing an audio signal processor, the engineers might add a 16-bit fractional functional unit and appropriate instructions. There’s some thought that the HPC community could learn much about codesign and customization from the experience of the embedded systems industry. But the embedded community has a very different economic model. One embedded design may be replicated millions of times. Think how many copies of a cell phone chip or automotive controller chip get manufactured, relative to the number of supercomputers of any one design. Moreover, each embedded design has some very specific target application space: automobile antilock brake control, television set-top box, smart phone. The design may share some elements (many such designs include an ARM processor), but the customization need only address one of these applications.

Even if we don’t really have codesign (yet), software does affect processor design even in the commodity processor industry. AMD wouldn’t have added the 3DNow! instructions in 1998, and Intel wouldn’t have responded by adding the SSE instruction set to the Pentium III in 1999, had software (and customers using it) not demanded higher floating point compute bandwidth, something that x86 processors were not very good at before then. With the SSE2 extensions to the Pentium 4, x86 processors started making an appearance in the TOP500 list.

Distant and Recent Codesign in HPC

That’s not to say that we’ve never had codesign in high performance computing. We can go (way) back to the 1960s-1970s, to the design of the Illiac IV (I never worked on or even saw the Illiac IV, but all us proud Illini are inclined to bring it up at any opportunity). Illiac (and its contemporaries, the Control Data STAR-100 and the Texas Instruments Advanced Scientific Computer) were designed specifically to solve certain important problems of the day. The choice of memory size, bandwidth, and types of functional units were affected by the applications.

IBM has made several recent forays into codesign for HPC. They designed and delivered the Blue Gene/L, with a specially designed PowerPC processor. Rather than use the highest performance processor chip of the time, IBM started with a lower speed, lower power embedded processor and added a double-pipeline, double precision floating point unit with complex arithmetic instruction set extensions. IBM also designed the Roadrunner system at Los Alamos. It uses AMD node processors and a specially extended Cell processor (another embedded design, originally aimed at the Sony Playstation), the PowerXCell 8i, with high performance double precision. IBM’s design for DARPA’s HPCS program uses a special Interconnect Module. And we all recall IBM’s Deep Blue chess computer, which famously played and beat World Champion Garry Kasparov, with a custom chess move generator chip.

Other more special purpose systems have been designed, several to solve molecular dynamics problems. MDGRAPE-3 (Gravity Pipe) is both a specially designed processor chip to compute interatomic forces, accelerating long-range force computation, and the name of the large system using this chip, developed at RIKEN in Japan. The system has 111 nodes with over 5,000 MDGRAPE-3 chips, where each chip has 20 parallel computation pipelines. The MDGRAPE-3 system operates at petascale performance levels, but since it’s special purpose, it doesn’t work on the Linpack benchmark and so can’t be placed on the TOP500 list.

Anton is another special-purpose machine designed for certain molecular dynamics simulations, specifically for folding proteins and biological macromolecules. It does most of the force calculations in one largely fixed-function subsystem, and FFT and bond forces in another more flexible subsystem. The Anton system is designed with 512 nodes, each node being a custom ASIC with a small memory.

Though not directed at HPC, the development of programmable GPUs is indirectly related and instructive. As the standard graphics computing pipeline was developed, it was initially implemented in fixed function blocks. The pipeline has several stages, including vertex transformation, clipping, rasterization, pixel shading, and display. Interactive graphics lives under a very strict real-time requirement; it must be able to generate a new color for each pixel in the whole scene 30 or 60 times a second. As technology got faster, vendors started making parts of the GPU programmable, in particular the vertex and pixel processors. They developed shader programming languages, such as NVIDIA’s Cg, Microsoft’s HLSL and the GLSL shader language in OpenGL. The languages allow graphics programmers to exploit the features of the GPUs, which were designed to solve the problems that graphics programmers want to solve. From this background, we have the GPU programming languages CUDA, OpenCL, and now DirectCompute from Microsoft.

Future Custom and Codesign in HPC

I’ll suggest two obvious general areas for customization: those relating to processors or computing, and those relating to memory.

Memory Extensions: Messages, Communication, Memory Hierarchies

Given the prevalence of message-passing in large scale parallelism, an obvious design opportunity is a network interface designed to optimize common communication patterns. It’s unfortunate that most message-passing applications use MPI, which is implemented as a library instead of a language. There’s no way for a compiler to optimize the application to take advantage of such an interface, but an optimized MPI library would serve almost the same purpose.

Other possibilities for scalable parallelism exist. The SGI Altix UV systems allow for over a thousand cores to share a single memory address space, using standard x86 processors. SGI uses an interface at the cache coherence protocol level, and manages message and memory traffic across the system. Numascale has recently announced another product allowing construction of scalable shared address space systems, again interfacing with the cache coherence protocols.

Both these systems attempt to support a strict memory consistency model. One could also explore a system supporting a more relaxed memory coherence, such as release consistency. This would allow an application to manage the memory consistency traffic more explicitly. The advantage is the possibility to reduce memory coherence messaging (and making processes wait for those messages). The disadvantage is the possibility of getting the explicit consistency wrong. Intel is exploring this with its Single-chip Cloud Computer, which has 48 cores without full hardware cache coherence.
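To make the idea concrete, here is a minimal sketch in C11 (not tied to any of the systems above) of release/acquire synchronization, the kind of explicit consistency management a relaxed-consistency machine pushes onto the application. The function names are mine; the atomics are standard `<stdatomic.h>`.

```c
#include <stdatomic.h>
#include <pthread.h>
#include <stddef.h>

static int payload;          /* ordinary, non-atomic data         */
static atomic_int ready;     /* the one synchronized variable     */

/* Producer: write the data, then publish it with a release store. */
static void *producer(void *arg) {
    (void)arg;
    payload = 42;
    atomic_store_explicit(&ready, 1, memory_order_release);
    return NULL;
}

/* Consumer: spin with acquire loads; once the flag is seen, the
   producer's earlier write to payload is guaranteed visible. */
static void *consumer(void *out) {
    while (!atomic_load_explicit(&ready, memory_order_acquire))
        ;  /* spin */
    *(int *)out = payload;
    return NULL;
}

/* Run one producer/consumer handoff; return what the consumer saw. */
int run_handoff(void) {
    int seen = 0;
    ready = 0;
    pthread_t p, c;
    pthread_create(&c, NULL, consumer, &seen);
    pthread_create(&p, NULL, producer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return seen;
}
```

Get the pairing wrong, say a relaxed load on the consumer side, and the consumer may legally observe the flag yet read stale data: exactly the "getting the explicit consistency wrong" risk.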

We might also explore software-managed cache memories. Hardware caches are great, but highly tuned algorithms often find that the cache gets in the way. A cache will load a whole cache line (and evict some other cache line) when a load or store causes a data cache miss, in the hopes that temporal or spatial locality will benefit from having that whole line closer to the processor. If the program knows that there is no locality, it should be able to tell the hardware not to cache this load or store; in fact, some processors have memory instructions with exactly this behavior. The next step is to have a small local memory with the speed of a level 1 cache, but under program control. The Cray-2 had a 128KB local memory at each processor, and the NVIDIA Tesla shared memory can be thought of as a software data cache.
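That cache-bypass hint exists today on x86 in the form of non-temporal stores. A sketch using the SSE `_mm_stream_ps` intrinsic, assuming an x86 target, 16-byte-aligned pointers, and a length that is a multiple of four:

```c
#include <xmmintrin.h>   /* SSE intrinsics: _mm_stream_ps, _mm_sfence */
#include <stddef.h>

/* Copy n floats (n a multiple of 4, both pointers 16-byte aligned)
   with non-temporal stores, telling the hardware not to cache dst
   and so not to evict lines the rest of the program still needs. */
void stream_copy(float *dst, const float *src, size_t n) {
    for (size_t i = 0; i < n; i += 4) {
        __m128 v = _mm_load_ps(src + i);   /* normal cached load    */
        _mm_stream_ps(dst + i, v);         /* store bypassing cache */
    }
    _mm_sfence();   /* make the streamed stores globally visible */
}
```

This is the per-instruction version of the idea; a program-controlled local memory generalizes it from single accesses to whole working sets.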

Processor Extensions: Coprocessors and Attached Processors

As we look towards exascale computing, energy becomes a serious limiting factor. Prof. Mark Horowitz and his colleagues at Stanford University have made a convincing argument that the best (and perhaps the only) way that software can reduce energy inside a processor is to execute fewer instructions. The instruction fetch, decode, dispatch, and retire logic takes so much energy that there’s no way to effectively reduce energy except to reduce the instruction count. Reducing the instruction count while doing the same amount of total work means we have to do more work per instruction. One obvious approach, currently in use for other reasons, is vector instructions. The x86 SSE instructions and the PowerPC AltiVec instructions work in this way. Consider Intel’s upcoming AVX instructions as another step in this direction. Another step would be to allow the customer to decide how wide the packed or vector instructions should be. There’s no reason that the instruction set should be defined as strictly 128-bits or 256-bits wide.

In the old days of microprocessors (1980s), processors were designed with explicit coprocessor interfaces and had coprocessor instructions. The first floating point functionality was typically added using this interface. Given the limited transistor real estate available on early microprocessors, the coprocessor interface allowed for extensions without having to modify the microprocessor itself. Some designs had the coprocessor monitoring the instruction stream, selecting the coprocessor instructions and executing them directly, while the CPU continued executing its own instruction stream. Other designs had the CPU fetch the instruction and pass appropriate instructions directly to the coprocessor through a dedicated interface. Today, with billions of transistors on each chip, a microprocessor will include not only fully pipelined floating point functional units, but multiple cores, multiple levels of cache, memory controllers, multichip interfaces (HyperTransport or QuickPath), and more. It’s not feasible for an external chip to act in such a coordinated manner with the microprocessor. The interface would have to pass through two or three levels of on-chip cache, or connect through an IO interface.

That doesn’t mean that coprocessors are out of the question. Embedded processors still offer tightly coupled coprocessor interfaces, and there is some evidence (see above) that embedded processors have a role to play in HPC. One possibility is to design with something like the old Xilinx Virtex-4 or -5 FPGAs, which included one or two PowerPC cores on board, each with a coprocessor interface. This might allow you to use different coprocessors for different applications, reprogramming the FPGA fabric as you load the application. The downside is the lower gate density and clock speed of FPGAs relative to microprocessor cores or ASICs.

Another approach is to convince an embedded systems vendor to design, implement and fabricate a specialized coprocessor with an embedded core or multicore chip. ARM processors are designed with an integrated Coprocessor Interface. ARM suppliers, such as PGI’s parent company STMicroelectronics, might be willing to help design and fabricate such chips, given enough of a market or other incentive. However, selling 10,000 or even 100,000 chips for each big installation isn’t much of a market for these vendors. I fear the only way to walk this path is to minimize the cost and risk for the chip manufacturer by raising the price (which may be too costly relative to commodity parts) or shifting the design costs and risk to another party, the customer or system integrator (same argument).

In the even older days of minicomputers (1970s), small machines were augmented with attached processors. These were physically connected like an IO device, able to read and write the system memory, but programmed separately. One of the first attached processors was the Floating Point Systems AP-120B, often attached to a Digital PDP-11 or VAX. An attached processor allows a high performance subsystem, optimized for computation but lacking the full functionality of a modern operating system, to be connected to a more general purpose system that provides it. The customer gets full functionality and high performance, though at the increased cost of managing the interface between the two subsystems – the trick for the vendor is to minimize that cost. The most recent such device was the ClearSpeed accelerator.

Today’s GPU computing falls into the attached processor camp. Programming an NVIDIA or ATI graphics card with CUDA or OpenCL looks similar in many respects to programming array processors of 30 years ago. The host connects to the GPU, allocates and moves data to the GPU memory, launches asynchronous operations on the GPU, and eventually waits until those operations complete to bring the results back. NVIDIA has done a good job minimizing the apparent software interface between the two subsystems with the CUDA language.

We could conceive of application-specific attached processors (ASAP — I like the acronym already). The costs are similar to designing a custom coprocessor. Someone (the customer or system integrator) has to design and arrange for fabrication of the ASAP, and write the software to interface to it. There are certainly specific markets where this makes sense, but it would be better for all if there were some level of standardization to share risks and costs across multiple projects.

The Convey hybrid core system functions much like an attached processor with several interesting twists. It is implemented using FPGAs, so a customer can use the standard floating point vector units, or develop an application-specific personality, essentially a custom functional unit pipeline. The accelerator unit has its own attached high bandwidth memory, but this memory is mapped into the host address space. Thus, the host can access data in the accelerator memory, and vice versa, though with a performance penalty. The system comes with a compiler to make the interface as seamless as possible.

What About the Software?

There are two aspects of software codesign: system software and applications. Let’s start with applications, which are, after all, the reason to go down the codesign or customization path. How much are developers willing to change or customize their applications given new hardware features and, presumably, higher performance? The answers are mixed. Some bleeding edge researchers are willing to do a wholesale rewrite, including developing new algorithms, for a factor of 2 (or less) improvement. Others are unwilling to change their programs much at all. The algorithms are tuned for numerical accuracy and precision, and they don’t want to (or can’t) validate a new method that might be required for a new machine. The former category includes all the CUDA and OpenCL programmers, and the latter category includes many ISV applications.

We’ve been through several generations of high performance machines, and one could make the argument that application developers will follow the path of higher performance, even if that requires program rewrites. The pipelined machines of the 1960s, vector machines in the 1970s and 1980s, multiprocessors in the 1980s and beyond, and massively parallel clusters from the 1990s to today all required rewrites to utilize the parallelism. However, the programming models used in any generation were largely portable across different machines of the same generation. Programs that vectorized for a Cray-1 would vectorize for a NEC SX or Convex C-1 or other contemporary machines. Programs using MPI for parallelism today port across a wide range of cluster designs. If we start with more customized machines, programs tuned for those custom features naturally become less portable, or at least less performance portable.

We can alleviate that pain by using a standard set of library routines, where the library is optimized for each target. This is the approach behind LAPACK and other libraries. Alternatively, if we can get the compilers to generate the right code for each target, perhaps the programs can remain truly portable. This brings us to system software. Mostly I’m interested in how all this affects the compiler. There is other important system software, debugger and operating system in particular, but other than supporting the new features in appropriate ways, there’s usually less technological difficulty.

Getting a compiler to use some new feature can be challenging. There have been some notable successes. When SSE instructions were introduced in 1999, Intel required programmers to use assembly code, or to add SSE intrinsic functions to their code. Their compilers recognized the intrinsics and turned them into the appropriate instructions, but the code was limited to those machines with those instructions (and compilers). The Portland Group was the first to use classical vectorization technology to generate SSE instructions directly from loops in the program. The same technology will allow programs to use AVX instructions without changing the source. If we’d stuck with those SSE intrinsics, using AVX instructions would require a significant rewrite.
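The loop form is what keeps the source portable. Here is a sketch of the kind of loop a vectorizer targets: compiled with the right flags (e.g., `gcc -O3 -mavx`, or `pgcc -fast`), it becomes SSE or AVX code with no source change, and the vector width is the compiler's choice rather than being frozen into intrinsics.

```c
/* A loop written for the vectorizer: no aliasing (restrict
   tells the compiler x and y don't overlap), unit stride, and
   no loop-carried dependence. */
void saxpy(int n, float a, float *restrict y, const float *restrict x) {
    for (int i = 0; i < n; i++)
        y[i] += a * x[i];
}
```

The same source vectorized for SSE in 1999 can be recompiled for AVX today; the intrinsic version of this loop would need rewriting for every new instruction width.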

Now imagine adding a functional unit to compute a weighted average of four neighboring array elements, which is essentially what the hardware texture units in GPUs do. Would a compiler require that programs express this using an intrinsic function (which is how it’s expressed in Cg and other GPU languages)? Can a compiler recognize this pattern without the intrinsic? If it could, could it also use that hardware for other operations that are similar in some respects? This would be the key to portability of the program, and generality and usefulness of the functional unit. We could end up with some number of pattern-recognizers in our compilers, with different patterns enabled for each target machine. Perhaps we could even create a compiler where a vendor or user could add patterns and replacement rules without modifying the compiler itself.
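A plain-C statement of that pattern, with hypothetical equal weights, shows what such a pattern-recognizer would have to spot inside a loop nest:

```c
/* The idiom a GPU texture unit implements in hardware: a weighted
   average of the four neighbors of grid point (i, j) in a row-major
   grid with ncols columns.  The equal 1/4 weights are illustrative;
   a texture unit interpolates with weights from the coordinates. */
float neighbor_avg(const float *u, int ncols, int i, int j) {
    return 0.25f * (u[(i - 1) * ncols + j] + u[(i + 1) * ncols + j]
                  + u[i * ncols + (j - 1)] + u[i * ncols + (j + 1)]);
}
```

A compiler that can match this body, rather than an intrinsic call, keeps the source portable to machines without the special unit.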

Software is one of the key differences between the embedded and HPC worlds. In the embedded market, it can be worthwhile to make a specific hardware addition that might only solve one problem for one application. If the market is large enough, the vendor will recoup the development cost in very high volumes, and that part will only run that one application, anyway. The cost of the software customization isn’t significant. In the HPC market, it’s rare to have a system dedicated to a single application (Anton and Deep Blue notwithstanding). Any customized addition must be useful to a wide range of applications in order to make it worthwhile for the vendor to develop and support it.

The Path to Successful Codesign

There has been a lot of exploration and some good experiences. However, while a custom system like Anton could be considered a great success for its application, it won’t affect system design in any fundamental way. It’s a single success point, not a path to success.

Success depends on providing an ecosystem that allows applications to live beyond the lifetime of any single system or even vendor. Today’s HPC systems are largely clusters of commodity microprocessor and memory parts with some customization in the network fabrics for some vendors. High level languages and MPI libraries provide the necessary ecosystem, allowing applications to move across systems with not much more than a recompile.

For exascale, we’re clearly moving in a direction where commodity microprocessors alone will not provide a solution within acceptable cost and energy limits, hence we’re going to be using coprocessors or accelerators of some sort. We’re going to want an ecosystem that provides some level of software standard interface to these accelerators. The accelerators of the day are GPUs, which are themselves commodity parts designed for another purpose.

There are many obstacles challenging successful codesign at the processor level.

  • Definition of success. A one-off machine (like Anton) only has to satisfy a single customer, and can be completely customized for the one application. This level of customization would not be profitable for a vendor at any reasonable price. Either the design has to have many customers, or the cost of the customization has to be low enough to allow single-use.
  • Application-level customization. What characteristics of an application make it amenable to hardware implementation? Clearly vector operations can be effectively implemented and used, but what other application-level features would find use in more than one application? How do we identify them? Research is lacking in this area.
  • Skill set. What skills are needed to do the custom design? Today, you’d need skills beyond what application writers know, or probably want to know. On the other hand, you want some application knowledge in order to determine what tradeoffs to make.
  • Software ecosystem. Do you want compilers to determine when to use the new feature by recognizing it in your source programs, like vectorizing compilers do today? Or do you want to use instruction-level intrinsics and assembly code, like the ETSI intrinsics used in embedded low-precision signal processing applications? Do you want your debugger to be able to read, display, and change state in your coprocessor? Does your operating system need to save and restore state between context switches? This is one area where the state of GPU computing today is lacking. The operating system does not manage the GPU, the user does.
  • Application maintenance. How much does the application need to change to use the new hardware features? This goes beyond just the expression of the feature, whether a vectorizing compiler will work or whether you have to use intrinsics. Will you be willing to recast your algorithm to take advantage of new features, like the way we optimize for locality to take advantage of cache memories today?
  • Delivery time. How much does this level of customization add to the manufacturing and delivery time of a new system? This affects what level of technology will be available for the system. Typically, custom features are one or two generations behind the fastest, densest hardware, so they have to make up that difference in architecture.
  • Knowledge reuse. Once we’ve gone through this path once, will we be able to reuse the knowledge and skills we’ve acquired in the next generation? Will the technology progression require a whole new set of skills for hardware design?

As mentioned in the workshop, embedded system designers have to address essentially the same issues regularly, though with different economic and technological constraints. It’s quite possible that those vendors could learn about HPC more quickly than the HPC vendors can learn about codesign.

About the Author

Michael Wolfe has developed compilers for over 30 years in both academia and industry, and is now a senior compiler engineer at The Portland Group, Inc. (www.pgroup.com), a wholly-owned subsidiary of STMicroelectronics, Inc. The opinions stated here are those of the author, and do not represent opinions of The Portland Group, Inc. or STMicroelectronics, Inc.
