Compilers and More: Hardware/Software Codesign

By Michael Wolfe

November 2, 2010

Recently, I was invited to participate in a workshop, sponsored by Sandia National Labs, to discuss how codesign (that’s co-design, not code-sign) fits into the landscape of high performance computing. There is a growing feeling that merely taking the latest processor offerings from Intel, AMD or IBM will not get us to exascale in a reasonable time frame, cost budget, and power constraint. One avenue to explore is designing and building more specialized systems, aimed at the types of problems seen in HPC, or at least at the problems seen in some important subset of HPC. Such a strategy loses the advantages we’ve enjoyed over the past two decades of commoditization in HPC: lower cost, latest technology, tools we can also use on our laptops. However, a more special purpose design may be wise, or necessary – HPC is too small a market to really interest the big CPU vendors. Consider that last year somewhere around 170 million laptops were sold, whereas the sum of all processors (chips, not cores) in last June’s TOP500 list is about 1.4 million, less than 1 percent.

Some will surely point out that there’s some customization in most HPC system designs. Recent Cray systems may use commodity AMD or Intel processors, but they have custom, high bandwidth, low latency messaging hardware, and many HPC system designs have special cooling to handle the high heat density.

Yet, many feel that we need more fundamental customization, and specifically, codesign between the software and hardware to reach useful exascale. This last point, useful exascale, is often defined as exascale computing on real applications, specifically not Linpack. (One of my colleagues went so far as to suggest that the way to save HPC is to contractually ban the Linpack benchmark from any government procurement.) My particular interest here is how codesign or customization affects the software tool stack, including the OS, compiler, debugger, and other tools.

What is Codesign?

The buzzword is codesign, but it is only loosely defined. Even at this workshop, one homework question was for each participant to write up a definition, hopefully resulting in less than one definition per attendee. My definition is that codesign occurs when two or more elements of the system are designed together, trading features, costs, advantages and disadvantages of each element against those of each other element. Specifically relevant is codesign of the software with the hardware.

The embedded system design community has a longer history of software/hardware codesign. For example, when designing an audio signal processor, the engineers might add a 16-bit fractional functional unit and appropriate instructions. There’s some thought that the HPC community could learn much about codesign and customization from the experience of the embedded systems industry. But the embedded community has a very different economic model. One embedded design may be replicated millions of times. Think how many copies of a cell phone chip or automotive controller chip get manufactured, relative to the number of supercomputers of any one design. Moreover, each embedded design has some very specific target application space: automobile antilock brake control, television set-top box, smart phone. The design may share some elements (many such designs include an ARM processor), but the customization need only address one of these applications.

Even if we don’t really have codesign (yet), software does affect processor design even in the commodity processor industry. AMD wouldn’t have added the 3DNow! instructions in 1998, and Intel wouldn’t have responded by adding the SSE instruction set to the Pentium III in 1999, had software (and customers using it) not demanded higher floating point compute bandwidth, something that x86 processors were not very good at before then. With the SSE2 extensions to the Pentium 4, x86 processors started making an appearance in the TOP500 list.

Distant and Recent Codesign in HPC

That’s not to say that we’ve never had codesign in high performance computing. We can go (way) back to the 1960s-1970s, to the design of the Illiac IV (I never worked on or even saw the Illiac IV, but all us proud Illini are inclined to bring it up at any opportunity). Illiac (and its contemporaries, the Control Data STAR-100 and the Texas Instruments Advanced Scientific Computer) were designed specifically to solve certain important problems of the day. The choice of memory size, bandwidth, and functional unit types was driven by those applications.

IBM has made several recent forays into codesign for HPC. They designed and delivered the Blue Gene/L, with a specially designed PowerPC processor. Rather than use the highest performance processor chip of the time, IBM started with a lower speed, lower power embedded processor and added a double-pipeline, double precision floating point unit with complex arithmetic instruction set extensions. IBM also designed the Roadrunner system at Los Alamos. It uses AMD node processors and a specially extended Cell processor (another embedded design, originally aimed at the Sony Playstation), the PowerXCell 8i, with high performance double precision. IBM’s design for DARPA’s HPCS program uses a special Interconnect Module. And we all recall IBM’s Deep Blue chess computer, which famously played and beat World Champion Garry Kasparov, with a custom chess move generator chip.

Other more special purpose systems have been designed, several to solve molecular dynamics problems. MDGRAPE-3 (Gravity Pipe) is both a specially designed processor chip to compute interatomic forces, accelerating long-range force computation, and the name of the large system using this chip, developed at RIKEN in Japan. The system has 111 nodes with over 5,000 MDGRAPE-3 chips, where each chip has 20 parallel computation pipelines. The MDGRAPE-3 system operates at petascale performance levels, but since it’s special purpose, it doesn’t work on the Linpack benchmark and so can’t be placed on the TOP500 list.

Anton is another special-purpose machine designed for certain molecular dynamics simulations, specifically for folding proteins and biological macromolecules. It does most of the force calculations in one largely fixed-function subsystem, and the FFT and bond forces in another, more flexible subsystem. The Anton system is designed with 512 nodes, each node being a custom ASIC with a small memory.

While not directed at HPC, the development of programmable GPUs is interesting and indirectly related. As the standard graphics computing pipeline was developed, it was initially implemented in fixed function blocks. The pipeline has several stages, including vertex transformation, clipping, rasterization, pixel shading, and display. Interactive graphics lives under a very strict real-time requirement; it must be able to generate a new color for each pixel in the whole scene 30 or 60 times a second. As technology got faster, vendors started making parts of the GPU programmable, in particular the vertex and pixel processors. They developed shader programming languages, such as NVIDIA’s Cg, Microsoft’s HLSL and the GLSL shader language in OpenGL. The languages allow graphics programmers to exploit the features of the GPUs, which were designed to solve the problems that graphics programmers want to solve. From this background, we have the GPU programming languages CUDA, OpenCL, and now DirectCompute from Microsoft.

Future Custom and Codesign in HPC

I’ll suggest two obvious general areas for customization: those relating to processors or computing, and those relating to memory.

Memory Extensions: Messages, Communication, Memory Hierarchies

Given the prevalence of message-passing in large scale parallelism, an obvious design opportunity is a network interface designed to optimize common communication patterns. It’s unfortunate that most message-passing applications use MPI, which is implemented as a library instead of a language. There’s no way for a compiler to optimize the application to take advantage of such an interface, but an optimized MPI library would serve almost the same purpose.
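
As a concrete illustration, here is a minimal sketch (assuming a 1-D domain decomposition; the halo_exchange helper is written just for this example) of the kind of regular, nearest-neighbor communication pattern that a codesigned network interface, or a tuned MPI library sitting on top of one, could recognize and accelerate.

```cuda
// Sketch: a nearest-neighbor halo exchange, one of the regular communication
// patterns a codesigned network interface (or a tuned MPI library) could
// recognize and accelerate. Assumes ranks form a 1-D chain and that the
// first and last elements of u are ghost cells.
#include <mpi.h>
#include <vector>

void halo_exchange(std::vector<double>& u, MPI_Comm comm) {
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    int left  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
    int right = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;
    int n = (int)u.size();

    // Send the rightmost interior value to the right neighbor while
    // receiving the left neighbor's value into the left ghost cell.
    MPI_Sendrecv(&u[n - 2], 1, MPI_DOUBLE, right, 0,
                 &u[0],     1, MPI_DOUBLE, left,  0,
                 comm, MPI_STATUS_IGNORE);
    // The mirror-image exchange in the other direction.
    MPI_Sendrecv(&u[1],     1, MPI_DOUBLE, left,  1,
                 &u[n - 1], 1, MPI_DOUBLE, right, 1,
                 comm, MPI_STATUS_IGNORE);
}
```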

Other possibilities for scalable parallelism exist. The SGI Altix UV systems allow for over a thousand cores to share a single memory address space, using standard x86 processors. SGI uses an interface at the cache coherence protocol level, and manages message and memory traffic across the system. Numascale has recently announced another product allowing construction of scalable shared address space systems, again interfacing with the cache coherence protocols.

Both of these systems attempt to support a strict memory consistency model. One could also explore a system supporting a more relaxed consistency model, such as release consistency. This would allow an application to manage the memory consistency traffic more explicitly. The advantage is the potential to reduce memory coherence messaging (and the time processes spend waiting for those messages). The disadvantage is the possibility of getting the explicit consistency wrong. Intel is exploring this direction with its Single-chip Cloud Computer, which has 48 cores without full hardware cache coherence.
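
To make the tradeoff concrete, here is a minimal sketch of what explicitly managed consistency looks like to the programmer, using C++ acquire/release atomics on an ordinary shared-memory machine; a machine like the SCC would expose analogous publish-and-wait points through its own primitives rather than this API.

```cuda
// Sketch: explicit, relaxed consistency in software. The producer publishes
// data with a release store; the consumer synchronizes with an acquire load.
// No other memory traffic is ordered, which is the point, and also the risk.
#include <atomic>
#include <thread>
#include <cstdio>

static double result;                       // ordinary, non-atomic data
static std::atomic<bool> ready{false};      // the one explicit sync point

int main() {
    std::thread producer([] {
        result = 42.0;                                   // plain store
        ready.store(true, std::memory_order_release);    // publish
    });
    std::thread consumer([] {
        while (!ready.load(std::memory_order_acquire)) { /* spin */ }
        std::printf("result = %g\n", result);            // safe to read now
    });
    producer.join();
    consumer.join();
    return 0;
}
```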

We might also explore software-managed cache memories. Hardware caches are great, but highly tuned algorithms often find that the cache gets in the way. A cache will load a whole cache line (and evict some other cache line) when a load or store causes a data cache miss, in the hope that temporal or spatial locality will make having that whole line closer to the processor pay off. If the program knows that there is no locality, it should be able to tell the hardware not to cache this load or store; in fact, some processors have memory instructions with exactly this behavior. The next step is to have a small local memory with the speed of a level 1 cache, but under program control. The Cray-2 had a 128KB local memory at each processor, and the NVIDIA Tesla shared memory can be thought of as a software data cache.
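
A minimal CUDA sketch of the software-managed-cache idea, treating __shared__ memory as a program-controlled level-1 store for a simple three-point stencil; the kernel name, tile size, and weights are illustrative, and it assumes the block size is no larger than TILE.

```cuda
// Sketch: CUDA __shared__ memory used as a software-managed cache.
// Each block stages a tile of the input (plus a one-element halo on each
// side) into fast on-chip memory once, then every thread reads its
// neighbors from that store instead of going back to global memory.
// Assumes blockDim.x <= TILE and n > 1.
#define TILE 256

__global__ void stencil3(const float* in, float* out, int n) {
    __shared__ float tile[TILE + 2];
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    int lid = threadIdx.x + 1;

    if (gid < n) tile[lid] = in[gid];                  // cooperative load into the "cache"
    if (threadIdx.x == 0 && gid > 0)
        tile[lid - 1] = in[gid - 1];                   // left halo cell
    if (threadIdx.x == blockDim.x - 1 && gid < n - 1)
        tile[lid + 1] = in[gid + 1];                   // right halo cell
    __syncthreads();                                   // all staged data visible before use

    if (gid > 0 && gid < n - 1)
        out[gid] = 0.25f * tile[lid - 1] + 0.5f * tile[lid] + 0.25f * tile[lid + 1];
}
```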

Processor Extensions: Coprocessors and Attached Processors

As we look towards exascale computing, energy becomes a serious limiting factor. Prof. Mark Horowitz and his colleagues at Stanford University have made a convincing argument that the best (and perhaps the only) way that software can reduce energy inside a processor is to execute fewer instructions. The instruction fetch, decode, dispatch, and retire logic takes so much energy that there’s no way to effectively reduce energy except to reduce the instruction count. Reducing the instruction count while doing the same amount of total work means we have to do more work per instruction. One obvious approach, currently in use for other reasons, is vector instructions. The x86 SSE instructions and the PowerPC Altivec instructions work this way. Consider Intel’s upcoming AVX instructions as another step in this direction. Another step would be to allow the customer to decide how wide the packed or vector instructions should be. There’s no reason that the instruction set should be defined as strictly 128 or 256 bits wide.

In the old days of microprocessors (1980s), processors were designed with explicit coprocessor interfaces and had coprocessor instructions. The first floating point functionality was typically added using this interface. Given the limited transistor real estate available on early microprocessors, the coprocessor interface allowed for extensions without having to modify the microprocessor itself. Some designs had the coprocessor monitoring the instruction stream, selecting the coprocessor instructions and executing them directly, while the CPU continued executing its own instruction stream. Other designs had the CPU fetch the instruction and pass appropriate instructions directly to the coprocessor through a dedicated interface. Today, with billions of transistors on each chip, a microprocessor will include not only fully pipelined floating point functional units, but multiple cores, multiple levels of cache, memory controllers, multichip interfaces (HyperTransport or QuickPath), and more. It’s not feasible for an external chip to act in such a coordinated manner with the microprocessor. The interface would have to pass through two or three levels of on-chip cache, or connect through an IO interface.

That doesn’t mean that coprocessors are out of the question. Embedded processors still offer tightly coupled coprocessor interfaces, and there is some evidence (see above) that embedded processors have a role to play in HPC. One possibility is to design with something like the old Xilinx Virtex-4 or -5 FPGAs, which included one or two PowerPC cores on board, each with a coprocessor interface. This might allow you to use different coprocessors for different applications, reprogramming the FPGA fabric as you load the application. The downside is the lower gate density and clock speed of FPGAs relative to microprocessor cores or ASICs.

Another approach is to convince an embedded systems vendor to design, implement and fabricate a specialized coprocessor with an embedded core or multicore chip. ARM processors are designed with an integrated coprocessor interface. ARM suppliers, such as PGI’s parent company STMicroelectronics, might be willing to help design and fabricate such chips, given enough of a market or other incentive. However, selling 10,000 or even 100,000 chips for each big installation isn’t much of a market for these vendors. I fear the only way to walk this path is to minimize the cost and risk to the chip manufacturer, either by raising the price (which may make the part too costly relative to commodity chips) or by shifting the design costs and risk to another party, the customer or system integrator (who then face the same cost problem).

In the even older days of minicomputers (1970s), small machines were augmented with attached processors. These were physically connected like an IO device, able to read and write the system memory, but programmed separately. One of the first attached processors was the Floating Point Systems AP-120B, often attached to a Digital PDP-11 or VAX. An attached processor allows a high performance subsystem, optimized for the computation but not supporting all the functionality of a modern operating system, to be connected to a more general purpose system that does. The customer gets full functionality and high performance, though at the increased cost of managing the interface between the two subsystems – the trick for the vendor is to minimize that cost. The most recent such device was the ClearSpeed accelerator.

Today’s GPU computing falls into the attached processor camp. Programming an NVIDIA or ATI graphics card with CUDA or OpenCL looks similar in many respects to programming array processors of 30 years ago. The host connects to the GPU, allocates and moves data to the GPU memory, launches asynchronous operations on the GPU, and eventually waits until those operations complete to bring the results back. NVIDIA has done a good job minimizing the apparent software interface between the two subsystems with the CUDA language.
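
A minimal sketch of that host-side sequence in CUDA; the kernel here is just a placeholder for the real computation.

```cuda
// Sketch: the attached-processor pattern in CUDA host code. Allocate device
// memory, copy data over, launch asynchronous work, wait, and copy back.
#include <cuda_runtime.h>
#include <vector>

__global__ void scale(float* x, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;                 // placeholder for the real computation
}

int main() {
    const int n = 1 << 20;
    std::vector<float> h(n, 1.0f);
    float* d = nullptr;

    cudaMalloc(&d, n * sizeof(float));                                  // allocate on the GPU
    cudaMemcpy(d, h.data(), n * sizeof(float), cudaMemcpyHostToDevice); // move data over
    scale<<<(n + 255) / 256, 256>>>(d, 2.0f, n);                        // asynchronous launch
    cudaDeviceSynchronize();                                            // wait for completion
    cudaMemcpy(h.data(), d, n * sizeof(float), cudaMemcpyDeviceToHost); // bring results back
    cudaFree(d);
    return 0;
}
```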

We could conceive of application-specific attached processors (ASAP — I like the acronym already). The costs are similar to designing a custom coprocessor. Someone (the customer or system integrator) has to design and arrange for fabrication of the ASAP, and write the software to interface to it. There are certainly specific markets where this makes sense, but it would be better for all if there were some level of standardization to share risks and costs across multiple projects.

The Convey hybrid core system functions much like an attached processor with several interesting twists. It is implemented using FPGAs, so a customer can use the standard floating point vector units, or develop an application-specific personality, essentially a custom functional unit pipeline. The accelerator unit has its own attached high bandwidth memory, but this memory is mapped into the host address space. Thus, the host can access data in the accelerator memory, and vice versa, though with a performance penalty. The system comes with a compiler to make the interface as seamless as possible.

What About the Software?

There are two aspects of software codesign: system software and applications. Let’s start with applications, which are, after all, the reason to go down the codesign or customization path. How much are developers willing to change or customize their applications given new hardware features and, presumably, higher performance? The answers are mixed. Some bleeding edge researchers are willing to do a wholesale rewrite, including developing new algorithms, for a factor of 2 (or less) improvement. Others are unwilling to change their programs much at all. The algorithms are tuned for numerical accuracy and precision, and they don’t want to (or can’t) validate a new method that might be required for a new machine. The former category includes all the CUDA and OpenCL programmers, and the latter category includes many ISV applications.

We’ve been through several generations of high performance machines, and one could make the argument that application developers will follow the path of higher performance, even if that requires program rewrites. The pipelined machines of the 1960s, vector machines in the 1970s and 1980s, multiprocessors in the 1980s and beyond, and massively parallel clusters from the 1990s to today all required rewrites to utilize the parallelism. However, the programming models used in any generation were largely portable across different machines of the same generation. Programs that vectorized for a Cray-1 would vectorize for a NEC SX or Convex C-1 or other contemporary machines. Programs using MPI for parallelism today port across a wide range of cluster designs. If we start with more customized machines, programs tuned for those custom features naturally become less portable, or at least less performance portable.

We can alleviate that pain by using a standard set of library routines, where the library is optimized for each target. This is the approach behind LAPACK and other libraries. Alternatively, if we can get the compilers to generate the right code for each target, perhaps the programs can remain truly portable. This brings us to system software. Mostly I’m interested in how all this affects the compiler. There is other important system software, the debugger and operating system in particular, but beyond supporting the new features in appropriate ways, it usually presents less technological difficulty.

Getting a compiler to use some new feature can be challenging. There have been some notable successes. When SSE instructions were introduced in 1999, Intel required programmers to use assembly code, or to add SSE intrinsic functions to their code. Their compilers recognized the intrinsics and turned them into the appropriate instructions, but the code was limited to those machines with those instructions (and compilers). The Portland Group was the first to use classical vectorization technology to generate SSE instructions directly from loops in the program. The same technology will allow programs to use AVX instructions without changing the source. If we’d stuck with those SSE intrinsics, using AVX instructions would require a significant rewrite.
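
As a small illustration of the portability point (the function names are mine, and the intrinsic version assumes aligned data and a length that is a multiple of four), consider the same vector add written both ways:

```cuda
// Sketch: the same vector add written two ways. The plain loop can be
// retargeted by the compiler to SSE, AVX, or wider units; the intrinsic
// version is wired to 128-bit SSE and would need a rewrite for AVX.
#include <xmmintrin.h>   // SSE intrinsics

void add_portable(float* c, const float* a, const float* b, int n) {
    for (int i = 0; i < n; ++i)        // a vectorizer can turn this into packed adds
        c[i] = a[i] + b[i];
}

void add_sse(float* c, const float* a, const float* b, int n) {
    for (int i = 0; i < n; i += 4) {   // assumes n % 4 == 0 and 16-byte aligned data
        __m128 va = _mm_load_ps(a + i);
        __m128 vb = _mm_load_ps(b + i);
        _mm_store_ps(c + i, _mm_add_ps(va, vb));
    }
}
```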

Now imagine adding a functional unit to compute a weighted average of four neighboring array elements, which is essentially what the hardware texture units in GPUs do. Would a compiler require that programs express this using an intrinsic function (which is how it’s expressed in Cg and other GPU languages)? Can a compiler recognize this pattern without the intrinsic? If it could, could it also use that hardware for other operations that are similar in some respects? This would be the key to portability of the program, and generality and usefulness of the functional unit. We could end up with some number of pattern-recognizers in our compilers, with different patterns enabled for each target machine. Perhaps we could even create a compiler where a vendor or user could add patterns and replacement rules without modifying the compiler itself.
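
A sketch of what that might look like at the source level; the plain loop below is the pattern a compiler would have to recognize, and the commented-out blend4 intrinsic is invented here purely to illustrate the alternative, not a real API.

```cuda
// Sketch: the four-neighbor weighted average written as plain source code,
// which a pattern-matching compiler could map to a texture-like functional
// unit. The hypothetical intrinsic form is shown in the comment below.
void smooth(const float* a, float* out, int rows, int cols) {
    for (int i = 1; i < rows - 1; ++i)
        for (int j = 1; j < cols - 1; ++j)
            out[i * cols + j] = 0.25f * (a[(i - 1) * cols + j] + a[(i + 1) * cols + j] +
                                         a[i * cols + (j - 1)] + a[i * cols + (j + 1)]);
}

// The intrinsic form locks the program to hardware (and compilers) providing
// that operation; the loop above stays portable. (blend4 is invented here.)
//   out[i * cols + j] = blend4(a, i, j, cols);
```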

Software is one of the key differences between the embedded and HPC worlds. In the embedded market, it can be worthwhile to make a specific hardware addition that might only solve one problem for one application. If the market is large enough, the vendor will recoup the development cost in very high volumes, and that part will only run that one application, anyway. The cost of the software customization isn’t significant. In the HPC market, it’s rare to have a system dedicated to a single application (Anton and Deep Blue notwithstanding). Any customized addition must be useful to a wide range of applications in order to make it worthwhile for the vendor to develop and support it.

The Path to Successful Codesign

There has been a lot of exploration and some good experiences. However, while a custom system like Anton could be considered a great success for its application, it won’t affect system design in any fundamental way. It’s a single success point, not a path to success.

Success depends on providing an ecosystem that allows applications to live beyond the lifetime of any single system or even vendor. Today’s HPC systems are largely clusters of commodity microprocessor and memory parts with some customization in the network fabrics for some vendors. High level languages and MPI libraries provide the necessary ecosystem, allowing applications to move across systems with not much more than a recompile.

For exascale, we’re clearly moving in a direction where commodity microprocessors alone will not provide a solution within acceptable cost and energy limits, hence we’re going to be using coprocessors or accelerators of some sort. We’re going to want an ecosystem that provides some level of software standard interface to these accelerators. The accelerators of the day are GPUs, which are themselves commodity parts designed for another purpose.

There are many obstacles challenging successful codesign at the processor level.

  • Definition of success. A one-off machine (like Anton) only has to satisfy a single customer, and can be completely customized for the one application. This level of customization would not be profitable for a vendor at any reasonable price. Either the design has to have many customers, or the cost of the customization has to be low enough to allow single-use.
  • Application-level customization. What characteristics of an application make it amenable to hardware implementation? Clearly vector operations can be effectively implemented and used, but what other application-level features would find use in more than one application? How do we identify them? Research is lacking in this area.
  • Skill set. What skills are needed to do the custom design? Today, you’d need skills beyond what application writers know, or probably want to know. On the other hand, you want some application knowledge in order to determine what tradeoffs to make.
  • Software ecosystem. Do you want compilers to determine when to use the new feature by recognizing it in your source programs, like vectorizing compilers do today? Or do you want to use instruction-level intrinsics and assembly code, like the ETSI intrinsics used in embedded low-precision signal processing applications? Do you want your debugger to be able to read, display, and change state in your coprocessor? Does your operating system need to save and restore state between context switches? This is one area where the state of GPU computing today is lacking: the operating system does not manage the GPU; the user does.
  • Application maintenance. How much does the application need to change to use the new hardware features? This goes beyond just the expression of the feature, whether a vectorizing compiler will work or whether you have to use intrinsics. Will you be willing to recast your algorithm to take advantage of new features, like the way we optimize for locality to take advantage of cache memories today?
  • Delivery time. How much does this level of customization add to the manufacturing and delivery time of a new system? This affects what level of technology will be available for the system. Typically, custom features are one or two generations behind the fastest, densest hardware, so they have to make up that difference in architecture.
  • Knowledge reuse. Once we’ve gone through this path once, will we be able to reuse the knowledge and skills we’ve acquired in the next generation? Will the technology progression require a whole new set of skills for hardware design?

As mentioned in the workshop, embedded system designers have to address essentially the same issues regularly, though with different economic and technological constraints. It’s quite possible that those vendors could learn about HPC more quickly than the HPC vendors can learn about codesign.

About the Author

Michael Wolfe has developed compilers for over 30 years in both academia and industry, and is now a senior compiler engineer at The Portland Group, Inc. (www.pgroup.com), a wholly-owned subsidiary of STMicroelectronics, Inc. The opinions stated here are those of the author, and do not represent opinions of The Portland Group, Inc. or STMicroelectronics, Inc.
