Compilers and More: Hardware/Software Codesign

By Michael Wolfe

November 2, 2010

Recently, I was invited to participate in a workshop, sponsored by Sandia National Labs, to discuss how codesign (that’s co-design, not code-sign) fits into the landscape of high performance computing. There is a growing feeling that merely taking the latest processor offerings from Intel, AMD or IBM will not get us to exascale in a reasonable time frame, cost budget, and power constraint. One avenue to explore is designing and building more specialized systems, aimed at the types of problems seen in HPC, or at least at the problems seen in some important subset of HPC. Such a strategy loses the advantages we’ve enjoyed over the past two decades of commoditization in HPC: lower cost, latest technology, tools we can also use on our laptops. However, a more special purpose design may be wise, or necessary – HPC is too small a market to really interest the big CPU vendors. Consider that last year somewhere around 170 million laptops were sold, whereas the sum of all processors (chips, not cores) in last June’s TOP500 list is about 1.4 million, less than 1 percent.

Some will surely point out that there’s some customization in most HPC system designs. Recent Cray systems may use commodity AMD or Intel processors, but they have custom, high bandwidth, low latency messaging hardware, and many HPC system designs have special cooling to handle the high heat density.

Yet, many feel that we need more fundamental customization, and specifically, codesign between the software and hardware to reach useful exascale. This last point, useful exascale, is often defined as exascale computing on real applications, specifically not Linpack. (One of my colleagues went so far as to suggest that the way to save HPC is to contractually ban the Linpack benchmark from any government procurement.) My particular interest here is how codesign or customization affects the software tool stack, including the OS, compiler, debugger, and other tools.

What is Codesign?

The buzzword is codesign, but it is only loosely defined. Even at this workshop, one homework question was for each participant to write up a definition, hopefully resulting in less than one definition per attendee. My definition is that codesign occurs when two or more elements of the system are designed together, trading features, costs, advantages and disadvantages of each element against those of each other element. Specifically relevant is codesign of the software with the hardware.

The embedded system design community has a longer history of software/hardware codesign. For example, when designing an audio signal processor, the engineers might add a 16-bit fractional functional unit and appropriate instructions. There’s some thought that the HPC community could learn much about codesign and customization from the experience of the embedded systems industry. But the embedded community has a very different economic model. One embedded design may be replicated millions of times. Think how many copies of a cell phone chip or automotive controller chip get manufactured, relative to the number of supercomputers of any one design. Moreover, each embedded design has some very specific target application space: automobile antilock brake control, television set-top box, smart phone. The design may share some elements (many such designs include an ARM processor), but the customization need only address one of these applications.

Even if we don’t really have codesign (yet), software does affect processor design even in the commodity processor industry. AMD wouldn’t have added the 3DNow! instructions in 1998, and Intel wouldn’t have responded by adding the SSE instruction set to the Pentium III in 1999, had software (and customers using it) not demanded higher floating point compute bandwidth, something that x86 processors were not very good at before then. With the SSE2 extensions to the Pentium 4, x86 processors started making an appearance in the TOP500 list.

Distant and Recent Codesign in HPC

That’s not to say that we’ve never had codesign in high performance computing. We can go (way) back to the 1960s-1970s, to the design of the Illiac IV (I never worked on or even saw the Illiac IV, but all us proud Illini are inclined to bring it up at any opportunity). Illiac (and its contemporaries, the Control Data STAR-100 and the Texas Instruments Advanced Scientific Computer) were designed specifically to solve certain important problems of the day. The choice of memory size, bandwidth, and types of functional units were affected by the applications.

IBM has made several recent forays into codesign for HPC. They designed and delivered the Blue Gene/L, with a specially designed PowerPC processor. Rather than use the highest performance processor chip of the time, IBM started with a lower speed, lower power embedded processor and added a double-pipeline, double precision floating point unit with complex arithmetic instruction set extensions. IBM also designed the Roadrunner system at Los Alamos. It uses AMD node processors and a specially extended Cell processor (another embedded design, originally aimed at the Sony Playstation), the PowerXCell 8i, with high performance double precision. IBM’s design for DARPA’s HPCS program uses a special Interconnect Module. And we all recall IBM’s Deep Blue chess computer, which famously played and beat World Champion Garry Kasparov, with a custom chess move generator chip.

Other more special purpose systems have been designed, several to solve molecular dynamics problems. MDGRAPE-3 (Gravity Pipe) is both a specially designed processor chip to compute interatomic forces, accelerating long-range force computation, and the name of the large system using this chip, developed at RIKEN in Japan. The system has 111 nodes with over 5,000 MDGRAPE-3 chips, where each chip has 20 parallel computation pipelines. The MDGRAPE-3 system operates at petascale performance levels, but since it’s special purpose, it doesn’t work on the Linpack benchmark and so can’t be placed on the TOP500 list.

Anton is another special-purpose machine designed for certain molecular dynamics simulations, specifically for folding proteins and biological macromolecules. It does most of the force calculations in one largely fixed-function subsystem, and the FFT and bond-force calculations in another, more flexible subsystem. The Anton system is designed with 512 nodes, each node being a custom ASIC with a small memory.

While not directed at HPC, the development of programmable GPUs is interesting and indirectly related. As the standard graphics computing pipeline was developed, it was initially implemented in fixed function blocks. The pipeline has several stages, including vertex transformation, clipping, rasterization, pixel shading, and display. Interactive graphics lives under a very strict real-time requirement; it must be able to generate a new color for each pixel in the whole scene 30 or 60 times a second. As technology got faster, vendors started making parts of the GPU programmable, in particular the vertex and pixel processors. They developed shader programming languages, such as NVIDIA’s Cg, Microsoft’s HLSL and the GLSL shader language in OpenGL. These languages allow graphics programmers to exploit the features of the GPUs, which were designed to solve the problems that graphics programmers want to solve. From this background, we have the GPU programming languages CUDA, OpenCL, and now DirectCompute from Microsoft.

Future Custom and Codesign in HPC

I’ll suggest two obvious general areas for customization: those relating to processors or computing, and those relating to memory.

Memory Extensions: Messages, Communication, Memory Hierarchies

Given the prevalence of message-passing in large scale parallelism, an obvious design opportunity is a network interface designed to optimize common communication patterns. It’s unfortunate that most message-passing applications use MPI, which is implemented as a library instead of a language. There’s no way for a compiler to optimize the application to take advantage of such an interface, but an optimized MPI library would serve almost the same purpose.
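
As a concrete (and hedged) illustration, the sketch below shows a nearest-neighbor halo exchange, one of the recurring communication patterns such an interface or a target-tuned MPI library could accelerate. The buffer sizes, tags, and neighbor ranks are purely illustrative.

```c
/* Minimal sketch of a nearest-neighbor halo exchange, the kind of
 * recurring communication pattern a codesigned network interface
 * (or a target-tuned MPI library) could accelerate.  Buffer sizes,
 * tags, and neighbor ranks are illustrative. */
#include <mpi.h>

#define HALO 1024   /* illustrative halo width, in doubles */

void exchange_halos(double *left_send, double *left_recv,
                    double *right_send, double *right_recv,
                    int left, int right, MPI_Comm comm)
{
    /* Pair each send with the matching receive so the library (or the
     * hardware behind it) can overlap the two directions. */
    MPI_Sendrecv(right_send, HALO, MPI_DOUBLE, right, 0,
                 left_recv,  HALO, MPI_DOUBLE, left,  0,
                 comm, MPI_STATUS_IGNORE);
    MPI_Sendrecv(left_send,  HALO, MPI_DOUBLE, left,  1,
                 right_recv, HALO, MPI_DOUBLE, right, 1,
                 comm, MPI_STATUS_IGNORE);
}
```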

Other possibilities for scalable parallelism exist. The SGI Altix UV systems allow for over a thousand cores to share a single memory address space, using standard x86 processors. SGI uses an interface at the cache coherence protocol level, and manages message and memory traffic across the system. Numascale has recently announced another product allowing construction of scalable shared address space systems, again interfacing with the cache coherence protocols.

Both of these systems attempt to support a strict memory consistency model. One could also explore a system supporting a more relaxed memory coherence, such as release consistency. This would allow an application to manage the memory consistency traffic more explicitly. The advantage is the possibility of reducing memory coherence messaging (and the time processes spend waiting for those messages). The disadvantage is the possibility of getting the explicit consistency wrong. Intel is exploring this with its Single-chip Cloud Computer, which has 48 cores without full hardware cache coherence.
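
Setting the SCC’s actual programming interface aside, a rough software analogue of release consistency is the release/acquire discipline in C11 atomics: ordinary data accesses are left unordered, and ordering is imposed only at explicit synchronization points. The minimal, illustrative program below shows what managing consistency explicitly looks like; drop the release/acquire orderings and the consumer may legally observe stale data, which is exactly the "getting it wrong" risk.

```c
/* A software analogue of release consistency using C11 atomics:
 * plain data accesses carry no ordering, and ordering is imposed
 * only at the explicit release/acquire synchronization point.
 * Compile with: cc -std=c11 -pthread */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static double result;             /* plain (non-atomic) data      */
static atomic_int ready = 0;      /* explicit synchronization flag */

static void *producer(void *arg)
{
    result = 42.0;                                   /* plain store   */
    atomic_store_explicit(&ready, 1,
                          memory_order_release);     /* release point */
    return NULL;
}

static void *consumer(void *arg)
{
    while (!atomic_load_explicit(&ready,
                                 memory_order_acquire))  /* acquire   */
        ;                                                /* spin      */
    printf("result = %g\n", result);   /* guaranteed to print 42     */
    return NULL;
}

int main(void)
{
    pthread_t p, c;
    pthread_create(&c, NULL, consumer, NULL);
    pthread_create(&p, NULL, producer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}
```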

We might also explore software-managed cache memories. Hardware caches are great, but highly tuned algorithms often find that the cache gets in the way. A cache will load a whole cache line (and evict some other cache line) when a load or store causes a data cache miss, in the hope that temporal or spatial locality will make having that whole line closer to the processor pay off. If the program knows that there is no locality, it should be able to tell the hardware not to cache this load or store; in fact, some processors have memory instructions with exactly this behavior. The next step is to have a small local memory with the speed of a level 1 cache, but under program control. The Cray-2 had a 128KB local memory at each processor, and the NVIDIA Tesla shared memory can be thought of as a software data cache.
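
For the "don't cache this" case, x86 already offers non-temporal (streaming) stores; the sketch below uses the SSE intrinsic for them. The function name and sizes are illustrative, the destination must be 16-byte aligned, and the count is assumed divisible by four.

```c
/* Sketch: telling the hardware not to cache stores we know we will
 * not reuse, via the SSE non-temporal (streaming) store intrinsic.
 * Assumes a 16-byte aligned destination and n divisible by 4. */
#include <stddef.h>
#include <xmmintrin.h>

void stream_fill(float *dst, float value, size_t n)
{
    __m128 v = _mm_set1_ps(value);
    for (size_t i = 0; i < n; i += 4)
        _mm_stream_ps(&dst[i], v);  /* bypasses the cache hierarchy */
    _mm_sfence();                   /* make the streamed stores visible */
}
```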

Processor Extensions: Coprocessors and Attached Processors

As we look towards exascale computing, energy becomes a serious limiting factor. Prof. Mark Horowitz and his colleagues at Stanford University have made a convincing argument that the best (and perhaps the only) way that software can reduce energy inside a processor is to execute fewer instructions. The instruction fetch, decode, dispatch, and retire logic takes so much energy that there’s no way to effectively reduce energy except to reduce the instruction count. Reducing the instruction count while doing the same amount of total work means we have to do more work per instruction. One obvious approach, currently in use for other reasons, is vector instructions. The x86 SSE instructions and the PowerPC AltiVec instructions work this way. Consider Intel’s upcoming AVX instructions as another step in this direction. Another step would be to allow the customer to decide how wide the packed or vector instructions should be. There’s no reason that the instruction set should be defined as strictly 128 or 256 bits wide.
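
To make "more work per instruction" concrete, here is a minimal sketch using 128-bit SSE intrinsics: each packed add processes four single-precision elements, so a loop over n elements retires roughly a quarter of the arithmetic instructions of the scalar version, and a 256-bit AVX variant would halve that again. Alignment and the divisibility of n are assumed for brevity.

```c
/* "More work per instruction": with 128-bit SSE, each packed add
 * handles four single-precision elements, so roughly n/4 arithmetic
 * instructions instead of n.  Assumes 16-byte aligned arrays and
 * n divisible by 4. */
#include <stddef.h>
#include <xmmintrin.h>

void vadd_sse(float *c, const float *a, const float *b, size_t n)
{
    for (size_t i = 0; i < n; i += 4) {
        __m128 va = _mm_load_ps(&a[i]);
        __m128 vb = _mm_load_ps(&b[i]);
        _mm_store_ps(&c[i], _mm_add_ps(va, vb));
    }
}
```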

In the old days of microprocessors (1980s), processors were designed with explicit coprocessor interfaces and had coprocessor instructions. The first floating point functionality was typically added using this interface. Given the limited transistor real estate available on early microprocessors, the coprocessor interface allowed for extensions without having to modify the microprocessor itself. Some designs had the coprocessor monitoring the instruction stream, selecting the coprocessor instructions and executing them directly, while the CPU continued executing its own instruction stream. Other designs had the CPU fetch the instruction and pass appropriate instructions directly to the coprocessor through a dedicated interface. Today, with billions of transistors on each chip, a microprocessor will include not only fully pipelined floating point functional units, but also multiple cores, multiple levels of cache, memory controllers, multichip interfaces (HyperTransport or QuickPath), and more. It’s not feasible for an external chip to act in such a coordinated manner with the microprocessor. The interface would have to pass through two or three levels of on-chip cache, or connect through an IO interface.

That doesn’t mean that coprocessors are out of the question. Embedded processors still offer tightly coupled coprocessor interfaces, and there is some evidence (see above) that embedded processors have a role to play in HPC. One possibility is to design with something like the old Xilinx Virtex-4 or -5 FPGAs, which included one or two PowerPC cores on board, each with a coprocessor interface. This might allow you to use different coprocessors for different applications, reprogramming the FPGA fabric as you load the application. The downside is the lower gate density and clock speed of FPGAs relative to microprocessor cores or ASICs.

Another approach is to convince an embedded systems vendor to design, implement and fabricate a specialized coprocessor with an embedded core or multicore chip. ARM processors are designed with an integrated coprocessor interface. ARM suppliers, such as PGI’s parent company STMicroelectronics, might be willing to help design and fabricate such chips, given enough of a market or other incentive. However, selling 10,000 or even 100,000 chips for each big installation isn’t much of a market for these vendors. I fear the only way to walk this path is to minimize the cost and risk for the chip manufacturer, either by raising the price (which may make the parts too costly relative to commodity processors) or by shifting the design costs and risk to another party, the customer or system integrator (which runs into the same cost argument).

In the even older days of minicomputers (1970s), small machines were augmented with attached processors. These were physically connected like an IO device, able to read and write the system memory, but programmed separately. One of the first attached processors was the Floating Point Systems AP-120B, often attached to a Digital PDP-11 or VAX. An attached processor lets a high performance subsystem, one optimized for the computation but without support for all the functionality of a modern operating system, be connected to a more general purpose system that does provide that functionality. The customer gets full functionality and high performance, though at the increased cost of managing the interface between the two subsystems; the trick for the vendor is to minimize that cost. The most recent such device was the ClearSpeed accelerator.

Today’s GPU computing falls into the attached processor camp. Programming an NVIDIA or ATI graphics card with CUDA or OpenCL looks similar in many respects to programming array processors of 30 years ago. The host connects to the GPU, allocates and moves data to the GPU memory, launches asynchronous operations on the GPU, and eventually waits until those operations complete to bring the results back. NVIDIA has done a good job minimizing the apparent software interface between the two subsystems with the CUDA language.
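
A minimal CUDA host-code sketch of that sequence follows, with a placeholder kernel; the kernel, sizes, and names are illustrative rather than drawn from any particular application.

```cuda
// Minimal sketch of the attached-processor sequence in CUDA:
// allocate device memory, move data over, launch work asynchronously,
// wait for completion, and copy results back.  The kernel is a
// hypothetical placeholder.
#include <cuda_runtime.h>

__global__ void scale(float *x, float s, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= s;
}

void scale_on_gpu(float *host_x, int n)
{
    float *dev_x;
    size_t bytes = n * sizeof(float);

    cudaMalloc(&dev_x, bytes);
    cudaMemcpy(dev_x, host_x, bytes, cudaMemcpyHostToDevice);

    scale<<<(n + 255) / 256, 256>>>(dev_x, 2.0f, n);  // asynchronous launch

    cudaDeviceSynchronize();                          // wait for completion
    cudaMemcpy(host_x, dev_x, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dev_x);
}
```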

We could conceive of application-specific attached processors (ASAP — I like the acronym already). The costs are similar to designing a custom coprocessor. Someone (the customer or system integrator) has to design and arrange for fabrication of the ASAP, and write the software to interface to it. There are certainly specific markets where this makes sense, but it would be better for all if there were some level of standardization to share risks and costs across multiple projects.

The Convey hybrid core system functions much like an attached processor with several interesting twists. It is implemented using FPGAs, so a customer can use the standard floating point vector units, or develop an application-specific personality, essentially a custom functional unit pipeline. The accelerator unit has its own attached high bandwidth memory, but this memory is mapped into the host address space. Thus, the host can access data in the accelerator memory, and vice versa, though with a performance penalty. The system comes with a compiler to make the interface as seamless as possible.

What About the Software?

There are two aspects of software codesign: system software and applications. Let’s start with applications, which are, after all, the reason to go down the codesign or customization path. How much are developers willing to change or customize their applications given new hardware features and, presumably, higher performance? The answers are mixed. Some bleeding edge researchers are willing to do a wholesale rewrite, including developing new algorithms, for a factor of 2 (or less) improvement. Others are unwilling to change their programs much at all. The algorithms are tuned for numerical accuracy and precision, and they don’t want to (or can’t) validate a new method that might be required for a new machine. The former category includes all the CUDA and OpenCL programmers, and the latter category includes many ISV applications.

We’ve been through several generations of high performance machines, and one could make the argument that application developers will follow the path of higher performance, even if that requires program rewrites. The pipelined machines of the 1960s, vector machines in the 1970s and 1980s, multiprocessors in the 1980s and beyond, and massively parallel clusters from the 1990s to today all required rewrites to utilize the parallelism. However, the programming models used in any generation were largely portable across different machines of the same generation. Programs that vectorized for a Cray-1 would vectorize for a NEC SX or Convex C-1 or other contemporary machines. Programs using MPI for parallelism today port across a wide range of cluster designs. If we start with more customized machines, programs tuned for those custom features naturally become less portable, or at least less performance portable.

We can alleviate that pain by using a standard set of library routines, where the library is optimized for each target. This is the approach behind LAPACK and other libraries. Alternatively, if we can get the compilers to generate the right code for each target, perhaps the programs can remain truly portable. This brings us to system software. Mostly I’m interested in how all this affects the compiler. There is other important system software, the debugger and operating system in particular, but beyond supporting the new features in appropriate ways, it usually presents less technological difficulty.
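
The library route fits in one call: the application codes to a standard interface, and each system links in an implementation tuned for its own hardware. A sketch using the CBLAS matrix-multiply routine (the wrapper name is illustrative):

```c
/* The library approach: the application codes to a standard interface
 * (here the CBLAS matrix multiply) and links against whichever
 * implementation has been tuned for the target machine; the source is
 * unchanged from system to system. */
#include <cblas.h>

void matmul(int n, const double *a, const double *b, double *c)
{
    /* C = 1.0 * A * B + 0.0 * C, all n-by-n, row-major */
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, a, n, b, n, 0.0, c, n);
}
```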

Getting a compiler to use some new feature can be challenging. There have been some notable successes. When SSE instructions were introduced in 1999, Intel required programmers to use assembly code, or to add SSE intrinsic functions to their code. Their compilers recognized the intrinsics and turned them into the appropriate instructions, but the code was limited to those machines with those instructions (and compilers). The Portland Group was the first to use classical vectorization technology to generate SSE instructions directly from loops in the program. The same technology will allow programs to use AVX instructions without changing the source. If we’d stuck with those SSE intrinsics, using AVX instructions would require a significant rewrite.
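
The contrast is easy to see in source form: an intrinsics version, like the SSE sketch earlier, is welded to a 4-wide 128-bit register, while the plain loop below says nothing about vector width, so the same source can be compiled to SSE today and AVX tomorrow. (A minimal sketch; the function name is illustrative.)

```c
/* The portable alternative to intrinsics: a plain loop carries no
 * vector width at all, so a vectorizing compiler can map the same
 * source to 128-bit SSE or 256-bit AVX without any rewrite. */
void vadd(float * restrict c, const float * restrict a,
          const float * restrict b, int n)
{
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}
```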

Now imagine adding a functional unit to compute a weighted average of four neighboring array elements, which is essentially what the hardware texture units in GPUs do. Would a compiler require that programs express this using an intrinsic function (which is how it’s expressed in Cg and other GPU languages)? Can a compiler recognize this pattern without the intrinsic? If it could, could it also use that hardware for other operations that are similar in some respects? This would be the key to portability of the program, and generality and usefulness of the functional unit. We could end up with some number of pattern-recognizers in our compilers, with different patterns enabled for each target machine. Perhaps we could even create a compiler where a vendor or user could add patterns and replacement rules without modifying the compiler itself.
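
For reference, the source-level pattern in question is roughly the bilinear blend below; this is the shape a compiler would have to recognize without an intrinsic (the weights, bounds, and indexing are illustrative).

```c
/* The pattern a "weighted average of four neighbors" unit serves: a
 * bilinear-style blend of the four surrounding array elements, which
 * is essentially what a GPU texture unit computes in hardware. */
void blend4(float *out, const float *in, int width, int height,
            float wx, float wy)
{
    for (int j = 0; j < height - 1; j++)
        for (int i = 0; i < width - 1; i++)
            out[j * width + i] =
                (1 - wx) * (1 - wy) * in[ j      * width + i    ] +
                     wx  * (1 - wy) * in[ j      * width + i + 1] +
                (1 - wx) *      wy  * in[(j + 1) * width + i    ] +
                     wx  *      wy  * in[(j + 1) * width + i + 1];
}
```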

Software is one of the key differences between the embedded and HPC worlds. In the embedded market, it can be worthwhile to make a specific hardware addition that might only solve one problem for one application. If the market is large enough, the vendor will recoup the development cost in very high volumes, and that part will only run that one application, anyway. The cost of the software customization isn’t significant. In the HPC market, it’s rare to have a system dedicated to a single application (Anton and Deep Blue notwithstanding). Any customized addition must be useful to a wide range of applications in order to make it worthwhile for the vendor to develop and support it.

The Path to Successful Codesign

There has been a lot of exploration and some good experiences. However, while a custom system like Anton could be considered a great success for its application, it won’t affect system design in any fundamental way. It’s a single success point, not a path to success.

Success depends on providing an ecosystem that allows applications to live beyond the lifetime of any single system or even vendor. Today’s HPC systems are largely clusters of commodity microprocessor and memory parts with some customization in the network fabrics for some vendors. High level languages and MPI libraries provide the necessary ecosystem, allowing applications to move across systems with not much more than a recompile.

For exascale, we’re clearly moving in a direction where commodity microprocessors alone will not provide a solution within acceptable cost and energy limits, hence we’re going to be using coprocessors or accelerators of some sort. We’re going to want an ecosystem that provides some level of software standard interface to these accelerators. The accelerators of the day are GPUs, which are themselves commodity parts designed for another purpose.

There are many obstacles challenging successful codesign at the processor level.

  • Definition of success. A one-off machine (like Anton) only has to satisfy a single customer, and can be completely customized for the one application. This level of customization would not be profitable for a vendor at any reasonable price. Either the design has to have many customers, or the cost of the customization has to be low enough to allow single-use.
  • Application-level customization. What characteristics of an application make it amenable to hardware implementation? Clearly vector operations can be effectively implemented and used, but what other application-level features would find use in more than that one application? How do we identify them? Research is lacking in this area.
  • Skill set. What skills are needed to do the custom design? Today, you’d need skills beyond what application writers know, or probably want to know. On the other hand, you want some application knowledge in order to determine what tradeoffs to make.
  • Software ecosystem. Do you want compilers to determine when to use the new feature by recognizing it in your source programs, like vectorizing compilers do today? Or do you want to use instruction-level intrinsics and assembly code, like the ETSI intrinsics used in embedded low-precision signal processing applications? Do you want your debugger to be able to read, display, and change state in your coprocessor? Does your operating system need to save and restore state between context switches? This is one area where the state of GPU computing today is lacking: the operating system does not manage the GPU, the user does.
  • Application maintenance. How much does the application need to change to use the new hardware features? This goes beyond just the expression of the feature, whether a vectorizing compiler will work or whether you have to use intrinsics. Will you be willing to recast your algorithm to take advantage of new features, like the way we optimize for locality to take advantage of cache memories today?
  • Delivery time. How much does this level of customization add to the manufacturing and delivery time of a new system? This affects what level of technology will be available for the system. Typically, custom features are one or two generations behind the fastest, densest hardware, so they have to make up that difference in architecture.
  • Knowledge reuse. Once we’ve gone through this path once, will we be able to reuse the knowledge and skills we’ve acquired in the next generation? Will the technology progression require a whole new set of skills for hardware design?

As mentioned in the workshop, embedded system designers have to address essentially the same issues regularly, though with different economic and technological constraints. It’s quite possible that those vendors could learn about HPC more quickly than the HPC vendors can learn about codesign.

About the Author

Michael Wolfe has developed compilers for over 30 years in both academia and industry, and is now a senior compiler engineer at The Portland Group, Inc. (www.pgroup.com), a wholly-owned subsidiary of STMicroelectronics, Inc. The opinions stated here are those of the author, and do not represent opinions of The Portland Group, Inc. or STMicroelectronics, Inc.
