Compilers and More: Hardware/Software Codesign

By Michael Wolfe

November 2, 2010

Recently, I was invited to participate in a workshop, sponsored by Sandia National Labs, to discuss how codesign (that’s co-design, not code-sign) fits into the landscape of high performance computing. There is a growing feeling that merely taking the latest processor offerings from Intel, AMD or IBM will not get us to exascale within a reasonable time frame, cost budget, and power envelope. One avenue to explore is designing and building more specialized systems, aimed at the types of problems seen in HPC, or at least at the problems seen in some important subset of HPC. Such a strategy loses the advantages we’ve enjoyed over the past two decades of commoditization in HPC: lower cost, latest technology, tools we can also use on our laptops. However, a more special purpose design may be wise, or necessary – HPC is too small a market to really interest the big CPU vendors. Consider that last year somewhere around 170 million laptops were sold, whereas the sum of all processors (chips, not cores) in last June’s TOP500 list is about 1.4 million, less than 1 percent.

Some will surely point out that there’s some customization in most HPC system designs. Recent Cray systems may use commodity AMD or Intel processors, but they have custom, high bandwidth, low latency messaging hardware, and many HPC system designs have special cooling to handle the high heat density.

Yet, many feel that we need more fundamental customization, and specifically, codesign between the software and hardware to reach useful exascale. This last point, useful exascale, is often defined as exascale computing on real applications, specifically not Linpack. (One of my colleagues went so far as to suggest that the way to save HPC is to contractually ban the Linpack benchmark from any government procurement.) My particular interest here is how codesign or customization affects the software tool stack, including the OS, compiler, debugger, and other tools.

What is Codesign?

The buzzword is codesign, but it is only loosely defined. Even at this workshop, one homework question was for each participant to write up a definition, hopefully resulting in less than one definition per attendee. My definition is that codesign occurs when two or more elements of the system are designed together, trading features, costs, advantages and disadvantages of each element against those of each other element. Specifically relevant is codesign of the software with the hardware.

The embedded system design community has a longer history of software/hardware codesign. For example, when designing an audio signal processor, the engineers might add a 16-bit fractional functional unit and appropriate instructions. There’s some thought that the HPC community could learn much about codesign and customization from the experience of the embedded systems industry. But the embedded community has a very different economic model. One embedded design may be replicated millions of times. Think how many copies of a cell phone chip or automotive controller chip get manufactured, relative to the number of supercomputers of any one design. Moreover, each embedded design has some very specific target application space: automobile antilock brake control, television set-top box, smart phone. The design may share some elements (many such designs include an ARM processor), but the customization need only address one of these applications.

Even if we don’t really have codesign (yet), software does affect processor design even in the commodity processor industry. AMD wouldn’t have added the 3DNow! instructions in 1998, and Intel wouldn’t have responded by adding the SSE instruction set to the Pentium III in 1999, had software (and customers using it) not demanded higher floating point compute bandwidth, something that x86 processors were not very good at before then. With the SSE2 extensions to the Pentium 4, x86 processors started making an appearance in the TOP500 list.

Distant and Recent Codesign in HPC

That’s not to say that we’ve never had codesign in high performance computing. We can go (way) back to the 1960s-1970s, to the design of the Illiac IV (I never worked on or even saw the Illiac IV, but all us proud Illini are inclined to bring it up at any opportunity). The Illiac IV and its contemporaries, the Control Data STAR-100 and the Texas Instruments Advanced Scientific Computer, were designed specifically to solve certain important problems of the day. The choices of memory size, bandwidth, and functional unit types were all shaped by the target applications.

IBM has made several recent forays into codesign for HPC. They designed and delivered the Blue Gene/L, with a specially designed PowerPC processor. Rather than use the highest performance processor chip of the time, IBM started with a lower speed, lower power embedded processor and added a double-pipeline, double precision floating point unit with complex arithmetic instruction set extensions. IBM also designed the Roadrunner system at Los Alamos. It uses AMD node processors and a specially extended Cell processor (another embedded design, originally aimed at the Sony Playstation), the PowerXCell 8i, with high performance double precision. IBM’s design for DARPA’s HPCS program uses a special Interconnect Module. And we all recall IBM’s Deep Blue chess computer, which famously played and beat World Champion Garry Kasparov, with a custom chess move generator chip.

Other more special purpose systems have been designed, several to solve molecular dynamics problems. MDGRAPE-3 (Gravity Pipe) is both a specially designed processor chip to compute interatomic forces, accelerating long-range force computation, and the name of the large system using this chip, developed at RIKEN in Japan. The system has 111 nodes with over 5,000 MDGRAPE-3 chips, where each chip has 20 parallel computation pipelines. The MDGRAPE-3 system operates at petascale performance levels, but since it’s special purpose, it can’t run the Linpack benchmark and so can’t be placed on the TOP500 list.

Anton is another special-purpose machine designed for certain molecular dynamics simulations, specifically for folding proteins and biological macromolecules. It does most of the force calculations in one largely fixed-function subsystem, and the FFT and bond forces in another, more flexible subsystem. The Anton system is designed with 512 nodes, each node being a custom ASIC with a small memory.

While not directed at HPC, the development of programmable GPUs is interesting and indirectly related. As the standard graphics computing pipeline was developed, it was initially implemented in fixed function blocks. The pipeline has several stages, including vertex transformation, clipping, rasterization, pixel shading, and display. Interactive graphics lives under a very strict real-time requirement; it must be able to generate a new color for each pixel in the whole scene 30 or 60 times a second. As technology got faster, vendors started making parts of the GPU programmable, in particular the vertex and pixel processors. They developed shader programming languages, such as NVIDIA’s Cg, Microsoft’s HLSL and the GLSL shader language in OpenGL. These languages allow graphics programmers to exploit the features of the GPUs, which were designed to solve the problems that graphics programmers want to solve. From this background, we have the GPU programming languages CUDA, OpenCL, and now DirectCompute from Microsoft.

Future Custom and Codesign in HPC

I’ll suggest two obvious general areas for customization: those relating to processors and computing, and those relating to memory.

Memory Extensions: Messages, Communication, Memory Hierarchies

Given the prevalence of message-passing in large scale parallelism, an obvious design opportunity is a network interface designed to optimize common communication patterns. It’s unfortunate that most message-passing applications use MPI, which is implemented as a library instead of a language. There’s no way for a compiler to optimize the application to take advantage of such an interface, but an optimized MPI library would serve almost the same purpose.
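
As a concrete illustration of the kind of communication pattern such an interface (or an optimized library) would target, here is a minimal sketch of a nonblocking boundary exchange between neighboring ranks using standard MPI calls; the ring topology, message sizes, and tags are illustrative assumptions rather than anything drawn from a real application.

```c
#include <mpi.h>

/* Minimal nonblocking halo exchange: each rank swaps one boundary value
   with its left and right neighbors.  An optimized MPI library (or a
   custom network interface) can recognize and accelerate this pattern
   even though the compiler only sees opaque library calls. */
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int left  = (rank - 1 + size) % size;   /* illustrative ring topology */
    int right = (rank + 1) % size;

    double send_left = rank, send_right = rank;
    double recv_left = 0.0, recv_right = 0.0;

    MPI_Request reqs[4];
    MPI_Irecv(&recv_left,  1, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Irecv(&recv_right, 1, MPI_DOUBLE, right, 1, MPI_COMM_WORLD, &reqs[1]);
    MPI_Isend(&send_left,  1, MPI_DOUBLE, left,  1, MPI_COMM_WORLD, &reqs[2]);
    MPI_Isend(&send_right, 1, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[3]);
    MPI_Waitall(4, reqs, MPI_STATUSES_IGNORE);

    MPI_Finalize();
    return 0;
}
```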

Other possibilities for scalable parallelism exist. The SGI Altix UV systems allow for over a thousand cores to share a single memory address space, using standard x86 processors. SGI uses an interface at the cache coherence protocol level, and manages message and memory traffic across the system. Numascale has recently announced another product allowing construction of scalable shared address space systems, again interfacing with the cache coherence protocols.

Both these systems attempt to support a strict memory consistency model. One could also explore a system supporting a more relaxed consistency model, such as release consistency. This would allow an application to manage the memory consistency traffic more explicitly. The advantage is the possibility of reducing memory coherence messaging (and the time processes spend waiting for those messages). The disadvantage is the possibility of getting the explicit consistency wrong. Intel is exploring this with its Single-chip Cloud Computer, which has 48 cores without full hardware cache coherence.
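
To make “managing the consistency traffic explicitly” concrete, here is a small sketch using C11 atomics and threads on a conventional shared-memory processor (not the SCC programming interface): the programmer marks the one publishing store and the matching load with release/acquire semantics, and no other access needs ordering traffic. The variable and function names are invented for illustration.

```c
#include <stdatomic.h>
#include <stdio.h>
#include <threads.h>

/* Illustrative producer/consumer with explicitly managed consistency:
   only the flag carries release/acquire ordering; the payload itself
   is an ordinary, unordered memory location. */
static int payload;              /* plain data, no per-access ordering */
static atomic_int ready = 0;     /* the one synchronizing flag */

static int producer(void *arg)
{
    (void)arg;
    payload = 42;                                         /* plain store */
    atomic_store_explicit(&ready, 1, memory_order_release);
    return 0;
}

static int consumer(void *arg)
{
    (void)arg;
    while (!atomic_load_explicit(&ready, memory_order_acquire))
        ;                                                 /* spin until published */
    printf("payload = %d\n", payload);                    /* guaranteed to see 42 */
    return 0;
}

int main(void)
{
    thrd_t p, c;
    thrd_create(&p, producer, NULL);
    thrd_create(&c, consumer, NULL);
    thrd_join(p, NULL);
    thrd_join(c, NULL);
    return 0;
}
```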

We might also explore software-managed cache memories. Hardware caches are great, but highly tuned algorithms often find that the cache gets in the way. A cache will load a whole cache line (and evict some other cache line) when a load or store causes a data cache miss, in the hope that temporal or spatial locality will make having that whole line closer to the processor pay off. If the program knows that there is no locality, it should be able to tell the hardware not to cache this load or store; in fact, some processors have memory instructions with exactly this behavior. The next step is to have a small local memory with the speed of a level 1 cache, but under program control. The Cray-2 had a 128KB local memory at each processor, and the NVIDIA Tesla shared memory can be thought of as a software data cache.
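
Here is a small sketch of the “tell the hardware not to cache this store” case, using the SSE streaming-store intrinsic to write results around the cache when the program knows the destination will not be reused soon; the function name and alignment assumptions are mine, and whether bypassing the cache actually helps depends on the access pattern.

```c
#include <stddef.h>
#include <xmmintrin.h>   /* SSE intrinsics, including the streaming store */

/* Copy a large array whose destination will not be re-read soon.
   _mm_stream_ps issues non-temporal stores that bypass the cache, so the
   copy does not evict data the rest of the program still needs.
   Assumes n is a multiple of 4 and dst is 16-byte aligned. */
void copy_no_cache(float *dst, const float *src, size_t n)
{
    for (size_t i = 0; i < n; i += 4) {
        __m128 v = _mm_loadu_ps(&src[i]);   /* normal (cached) load */
        _mm_stream_ps(&dst[i], v);          /* non-temporal store */
    }
    _mm_sfence();   /* make the streaming stores globally visible */
}
```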

Processor Extensions: Coprocessors and Attached Processors

As we look towards exascale computing, energy becomes a serious limiting factor. Prof. Mark Horowitz and his colleagues at Stanford University have made a convincing argument that the best (and perhaps the only) way that software can reduce energy inside a processor is to execute fewer instructions. The instruction fetch, decode, dispatch, and retire logic takes so much energy that there’s no way to effectively reduce energy except to reduce the instruction count. Reducing the instruction count while doing the same amount of total work means we have to do more work per instruction. One obvious approach, currently in use for other reasons, is vector instructions. The x86 SSE instructions and the PowerPC AltiVec instructions work in this way. Consider Intel’s upcoming AVX instructions as another step in this direction. Another step would be to allow the customer to decide how wide the packed or vector instructions should be. There’s no reason that the instruction set should be defined as strictly 128 or 256 bits wide.
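
To ground the work-per-instruction point, here is a minimal sketch of the same element-wise addition written as a scalar loop and with SSE intrinsics, where one packed instruction covers four elements (AVX would double that again); the function names are illustrative.

```c
#include <stddef.h>
#include <xmmintrin.h>   /* SSE: 128-bit packed single precision */

/* Scalar version: one add instruction per element. */
void add_scalar(float *c, const float *a, const float *b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

/* SSE version: one addps instruction covers four elements, so the
   instruction count (and the fetch/decode/retire energy that goes with
   it) drops by roughly 4x.  Assumes n is a multiple of 4. */
void add_sse(float *c, const float *a, const float *b, size_t n)
{
    for (size_t i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(&a[i]);
        __m128 vb = _mm_loadu_ps(&b[i]);
        _mm_storeu_ps(&c[i], _mm_add_ps(va, vb));
    }
}
```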

In the old days of microprocessors (1980s), processors were designed with explicit coprocessor interfaces and had coprocessor instructions. The first floating point functionality was typically added using this interface. Given the limited transistor real estate available on early microprocessors, the coprocessor interface allowed for extensions without having to modify the microprocessor itself. Some designs had the coprocessor monitoring the instruction stream, selecting the coprocessor instructions and executing them directly, while the CPU continued executing its own instruction stream. Other designs had the CPU fetch the instruction and pass appropriate instructions directly to the coprocessor through a dedicated interface. Today, with billions of transistors on each chip, a microprocessor will include not only fully pipelined floating point functional units, but multiple cores, multiple levels of cache, memory controllers, multichip interfaces (HyperTransport or QuickPath), and more. It’s not feasible for an external chip to act in such a coordinated manner with the microprocessor. The interface would have to pass through two or three levels of on-chip cache, or connect through an IO interface.

That doesn’t mean that coprocessors are out of the question. Embedded processors still offer tightly coupled coprocessor interfaces, and there is some evidence (see above) that embedded processors have a role to play in HPC. One possibility is to design with something like the old Xilinx Virtex-4 or -5 FPGAs, which included one or two PowerPC cores on board, each with a coprocessor interface. This might allow you to use different coprocessors for different applications, reprogramming the FPGA fabric as you load the application. The downside is the lower gate density and clock speed of FPGAs relative to microprocessor cores or ASICs.

Another approach is to convince an embedded systems vendor to design, implement and fabricate a specialized coprocessor with an embedded core or multicore chip. ARM processors are designed with an integrated Coprocessor Interface. ARM suppliers, such as PGI’s parent company STMicroelectronics, might be willing to help design and fabricate such chips, given enough of a market or other incentive. However, selling 10,000 or even 100,000 chips for each big installation isn’t much of a market for these vendors. I fear the only ways to walk this path are to minimize the cost and risk for the chip manufacturer by raising the price (which may make the parts too costly relative to commodity processors), or to shift the design costs and risk to another party, the customer or system integrator (who runs into the same cost argument).

In the even older days of minicomputers (1970s), small machines were augmented with attached processors. These were physically connected like an IO device, able to read and write the system memory, but programmed separately. One of the first attached processors was the Floating Point Systems AP-120B, often attached to a Digital PDP-11 or VAX. An attached processor allows a high performance subsystem, optimized for computation but lacking the full functionality of a modern operating system, to be connected to a more general purpose system that provides it. The customer gets full functionality and high performance, though at the increased cost of managing the interface between the two subsystems – the trick for the vendor is to minimize that cost. The most recent such device was the ClearSpeed accelerator.

Today’s GPU computing falls into the attached processor camp. Programming an NVIDIA or ATI graphics card with CUDA or OpenCL looks similar in many respects to programming array processors of 30 years ago. The host connects to the GPU, allocates and moves data to the GPU memory, launches asynchronous operations on the GPU, and eventually waits until those operations complete to bring the results back. NVIDIA has done a good job minimizing the apparent software interface between the two subsystems with the CUDA language.
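
For reference, here is roughly the host-side shape of that pattern, sketched with the CUDA runtime’s C API: allocate device memory, copy input over, do the device work, wait, and copy the results back. The kernel itself and its launch are deliberately omitted, and the function and buffer names are invented for illustration.

```c
#include <stddef.h>
#include <cuda_runtime.h>   /* CUDA runtime C API (link with -lcudart) */

/* Host-side skeleton of the attached-processor pattern: allocate device
   memory, copy input over, run work on the device, wait for it, and copy
   the results back.  Error checking and the kernel launch are omitted;
   this only shows the host/accelerator traffic described in the text. */
void run_on_gpu(const float *host_in, float *host_out, size_t n)
{
    float *dev_buf = NULL;
    cudaMalloc((void **)&dev_buf, n * sizeof(float));

    cudaMemcpy(dev_buf, host_in, n * sizeof(float), cudaMemcpyHostToDevice);

    /* ... asynchronous kernel launch(es) operating on dev_buf go here ... */

    cudaDeviceSynchronize();   /* wait for the device work to complete */

    cudaMemcpy(host_out, dev_buf, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dev_buf);
}
```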

We could conceive of application-specific attached processors (ASAP — I like the acronym already). The costs are similar to designing a custom coprocessor. Someone (the customer or system integrator) has to design and arrange for fabrication of the ASAP, and write the software to interface to it. There are certainly specific markets where this makes sense, but it would be better for all if there were some level of standardization to share risks and costs across multiple projects.

The Convey hybrid core system functions much like an attached processor with several interesting twists. It is implemented using FPGAs, so a customer can use the standard floating point vector units, or develop an application-specific personality, essentially a custom functional unit pipeline. The accelerator unit has its own attached high bandwidth memory, but this memory is mapped into the host address space. Thus, the host can access data in the accelerator memory, and vice versa, though with a performance penalty. The system comes with a compiler to make the interface as seamless as possible.

What About the Software?

There are two aspects of software codesign: system software and applications. Let’s start with applications, which are, after all, the reason to go down the codesign or customization path. How much are developers willing to change or customize their applications given new hardware features and, presumably, higher performance? The answers are mixed. Some bleeding edge researchers are willing to do a wholesale rewrite, including developing new algorithms, for a factor of 2 (or less) improvement. Others are unwilling to change their programs much at all. The algorithms are tuned for numerical accuracy and precision, and they don’t want to (or can’t) validate a new method that might be required for a new machine. The former category includes all the CUDA and OpenCL programmers, and the latter category includes many ISV applications.

We’ve been through several generations of high performance machines, and one could make the argument that application developers will follow the path of higher performance, even if that requires program rewrites. The pipelined machines of the 1960s, vector machines in the 1970s and 1980s, multiprocessors in the 1980s and beyond, and massively parallel clusters from the 1990s to today all required rewrites to utilize the parallelism. However, the programming models used in any generation were largely portable across different machines of the same generation. Programs that vectorized for a Cray-1 would vectorize for a NEC SX or Convex C-1 or other contemporary machines. Programs using MPI for parallelism today port across a wide range of cluster designs. If we start with more customized machines, programs tuned for those custom features naturally become less portable, or at least less performance portable.

We can alleviate that pain by using a standard set of library routines, where the library is optimized for each target. This is the approach behind LAPACK and other libraries. Alternatively, if we can get the compilers to generate the right code for each target, perhaps the programs can remain truly portable. This brings us to system software. Mostly I’m interested in how all this affects the compiler. There is other important system software, the debugger and operating system in particular, but beyond supporting the new features in appropriate ways, those usually present less technological difficulty.
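
Back to the library route for a moment: the call below uses the standard CBLAS interface to matrix multiply, and the same source can be linked against a reference BLAS or a vendor-tuned, machine-specific one without change, which is exactly the portability argument above. The wrapper function and its square-matrix assumption are illustrative.

```c
#include <cblas.h>   /* standard C interface to BLAS */

/* C = A*B for square n-by-n row-major matrices, via the standard dgemm
   call.  The application source does not change when the underlying BLAS
   is swapped for an implementation tuned to a particular machine. */
void matmul(int n, const double *A, const double *B, double *C)
{
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n,
                1.0, A, n,
                     B, n,
                0.0, C, n);
}
```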

Getting a compiler to use some new feature can be challenging. There have been some notable successes. When SSE instructions were introduced in 1999, Intel required programmers to use assembly code, or to add SSE intrinsic functions to their code. Their compilers recognized the intrinsics and turned them into the appropriate instructions, but the code was limited to those machines with those instructions (and compilers). The Portland Group was the first to use classical vectorization technology to generate SSE instructions directly from loops in the program. The same technology will allow programs to use AVX instructions without changing the source. If we’d stuck with those SSE intrinsics, using AVX instructions would require a significant rewrite.
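
To show what that rewrite looks like, here is the same simple scaling loop written once with SSE intrinsics and once with AVX intrinsics: every intrinsic changes when the vector width changes, whereas a plain loop handed to a vectorizing compiler needs only a recompile. The function names and the assumption that n is a multiple of the vector width are illustrative.

```c
#include <stddef.h>
#include <immintrin.h>   /* SSE and AVX intrinsics */

/* Hand-written SSE (128-bit) version: tied to 4-wide registers. */
void scale_sse(float *x, float s, size_t n)       /* n assumed multiple of 4 */
{
    __m128 vs = _mm_set1_ps(s);
    for (size_t i = 0; i < n; i += 4)
        _mm_storeu_ps(&x[i], _mm_mul_ps(_mm_loadu_ps(&x[i]), vs));
}

/* Moving to AVX (256-bit) means rewriting every intrinsic, not just
   recompiling.  Requires compiling with AVX enabled (e.g. -mavx). */
void scale_avx(float *x, float s, size_t n)       /* n assumed multiple of 8 */
{
    __m256 vs = _mm256_set1_ps(s);
    for (size_t i = 0; i < n; i += 8)
        _mm256_storeu_ps(&x[i], _mm256_mul_ps(_mm256_loadu_ps(&x[i]), vs));
}
```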

Now imagine adding a functional unit to compute a weighted average of four neighboring array elements, which is essentially what the hardware texture units in GPUs do. Would a compiler require that programs express this using an intrinsic function (which is how it’s expressed in Cg and other GPU languages)? Can a compiler recognize this pattern without the intrinsic? If it could, could it also use that hardware for other operations that are similar in some respects? This would be the key to portability of the program, and generality and usefulness of the functional unit. We could end up with some number of pattern-recognizers in our compilers, with different patterns enabled for each target machine. Perhaps we could even create a compiler where a vendor or user could add patterns and replacement rules without modifying the compiler itself.
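
The source-level pattern in question looks roughly like the loop below, a weighted average of the four neighbors of each array element written as an ordinary C loop; whether a compiler could recognize this shape and map it onto a texture-like unit without an intrinsic is exactly the open question. The equal weights and array layout are illustrative assumptions.

```c
/* 2D weighted average of the four neighbors of each interior point, the
   kind of computation a GPU texture unit performs in hardware.  The
   weights (0.25 each) and row-major layout are illustrative. */
void neighbor_average(float *out, const float *in, int nx, int ny)
{
    for (int j = 1; j < ny - 1; j++)
        for (int i = 1; i < nx - 1; i++)
            out[j * nx + i] = 0.25f * (in[(j - 1) * nx + i] +
                                       in[(j + 1) * nx + i] +
                                       in[j * nx + (i - 1)] +
                                       in[j * nx + (i + 1)]);
}
```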

Software is one of the key differences between the embedded and HPC worlds. In the embedded market, it can be worthwhile to make a specific hardware addition that might only solve one problem for one application. If the market is large enough, the vendor will recoup the development cost in very high volumes, and that part will only run that one application, anyway. The cost of the software customization isn’t significant. In the HPC market, it’s rare to have a system dedicated to a single application (Anton and Deep Blue notwithstanding). Any customized addition must be useful to a wide range of applications in order to make it worthwhile for the vendor to develop and support it.

The Path to Successful Codesign

There has been a lot of exploration and some good experiences. However, while a custom system like Anton could be considered a great success for its application, it won’t affect system design in any fundamental way. It’s a single success point, not a path to success.

Success depends on providing an ecosystem that allows applications to live beyond the lifetime of any single system or even vendor. Today’s HPC systems are largely clusters of commodity microprocessor and memory parts with some customization in the network fabrics for some vendors. High level languages and MPI libraries provide the necessary ecosystem, allowing applications to move across systems with not much more than a recompile.

For exascale, we’re clearly moving in a direction where commodity microprocessors alone will not provide a solution within acceptable cost and energy limits, hence we’re going to be using coprocessors or accelerators of some sort. We’re going to want an ecosystem that provides some level of software standard interface to these accelerators. The accelerators of the day are GPUs, which are themselves commodity parts designed for another purpose.

There are many obstacles challenging successful codesign at the processor level.

  • Definition of success. A one-off machine (like Anton) only has to satisfy a single customer, and can be completely customized for the one application. This level of customization would not be profitable for a vendor at any reasonable price. Either the design has to have many customers, or the cost of the customization has to be low enough to allow single-use.
  • Application-level customization. What characteristics of an application make it amenable to hardware implementation? Clearly vector operations can be effectively implemented and used, but what other application-level features would find use in more than one application? How do we identify them? Research is lacking in this area.
  • Skill set. What skills are needed to do the custom design? Today, you’d need skills beyond what application writers know, or probably want to know. On the other hand, you want some application knowledge in order to determine what tradeoffs to make.
  • Software ecosystem. Do you want compilers to determine when to use the new feature by recognizing it in your source programs, like vectorizing compilers do today? Or do you want to use instruction-level intrinsics and assembly code, like the ETSI intrinsics used in embedded low-precision signal processing applications? Do you want your debugger to be able to read, display, and change state in your coprocessor? Does your operating system need to save and restore state between context switches? This is one area where the state of GPU computing today is lacking. The operating system does not manage the GPU, the user does.
  • Application maintenance. How much does the application need to change to use the new hardware features? This goes beyond just the expression of the feature, whether a vectorizing compiler will work or whether you have to use intrinsics. Will you be willing to recast your algorithm to take advantage of new features, like the way we optimize for locality to take advantage of cache memories today?
  • Delivery time. How much does this level of customization add to the manufacturing and delivery time of a new system? This affects what level of technology will be available for the system. Typically, custom features are one or two generations behind the fastest, densest hardware, so they have to make up that difference in architecture.
  • Knowledge reuse. Once we’ve gone through this path once, will we be able to reuse the knowledge and skills we’ve acquired in the next generation? Will the technology progression require a whole new set of skills for hardware design?

As mentioned in the workshop, embedded system designers have to address essentially the same issues regularly, though with different economic and technological constraints. It’s quite possible that those vendors could learn about HPC more quickly than the HPC vendors can learn about codesign.

About the Author

Michael Wolfe has developed compilers for over 30 years in both academia and industry, and is now a senior compiler engineer at The Portland Group, Inc. (www.pgroup.com), a wholly-owned subsidiary of STMicroelectronics, Inc. The opinions stated here are those of the author, and do not represent opinions of The Portland Group, Inc. or STMicroelectronics, Inc.
