At the Hot Chips conference last week, Intel showcased its latest neural network processor accelerators for both AI training and inference, along with details of its hybrid chip packaging technology, Optane DC persistent memory and chiplet technology for optical I/O.
Intel’s forthcoming Nervana NNP-T, codenamed “Spring Crest,” is designed to train deep learning models at scale within a given power budget. The processor is built “with flexibility in mind, striking a balance among computing, communication and memory,” according to Intel. “While Intel Xeon Scalable processors bring AI-specific instructions and provide a foundation for AI, the NNP-T is architected from scratch, building in features and requirements needed to solve for large models, without the overhead needed to support legacy technology.”
The chip features four PCIe Gen 4 interconnects, support for four stacks of 8GB HBM2-2400 (high-bandwidth memory) devices, up to 24 tensor processing clusters, 64 lanes of SerDes functional blocks for high-speed communications, up to 119 TOPS and 60 MB of on-chip distributed memory.
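Those HBM2 figures imply substantial memory bandwidth. A back-of-the-envelope calculation, assuming the standard 1024-bit HBM2 interface per stack and reading “HBM2-2400” as 2.4 GT/s per pin (neither detail is stated in the article):

```python
# Rough peak memory bandwidth for the NNP-T's four-stack HBM2 subsystem.
# Assumptions: 1024-bit interface per HBM2 stack; "HBM2-2400" = 2.4 GT/s/pin.
STACKS = 4
PINS_PER_STACK = 1024          # bits transferred per cycle per stack
TRANSFER_RATE_GTPS = 2.4       # giga-transfers per second per pin

per_stack_gbps = PINS_PER_STACK * TRANSFER_RATE_GTPS / 8   # GB/s per stack
total_gbps = STACKS * per_stack_gbps
total_capacity_gb = STACKS * 8                             # four 8GB stacks

print(f"Per stack:  {per_stack_gbps:.1f} GB/s")   # 307.2 GB/s
print(f"Aggregate:  {total_gbps:.1f} GB/s")       # 1228.8 GB/s
print(f"Capacity:   {total_capacity_gb} GB")      # 32 GB
```

Under those assumptions, the four stacks would supply roughly 1.2 TB/s of aggregate bandwidth to feed the 24 tensor processing clusters.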
The NNP-T cores support bfloat16 arithmetic as well as FP32 precision.
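The appeal of bfloat16 is that it keeps float32’s 8-bit exponent (and thus its dynamic range) while cutting the mantissa to 7 bits, halving storage and bandwidth per value. The format can be illustrated in pure Python by truncating a float32 bit pattern to its top 16 bits (a simplification: real hardware typically rounds to nearest rather than truncating):

```python
import struct

def f32_to_bf16_bits(x: float) -> int:
    """Return the 16-bit bfloat16 pattern: the top half of the float32 bits.
    Note: this truncates; actual hardware usually rounds to nearest."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    return bits >> 16

def bf16_to_f32(bits16: int) -> float:
    """Expand a bfloat16 bit pattern back to a float32 value by
    zero-filling the discarded low 16 mantissa bits."""
    (x,) = struct.unpack("<f", struct.pack("<I", bits16 << 16))
    return x

# Same exponent range as float32, but coarse mantissa resolution:
for v in (1.0, 3.141592653589793, 1e30):
    print(v, "->", bf16_to_f32(f32_to_bf16_bits(v)))
# pi becomes 3.140625; 1e30 survives, which would overflow FP16.
```

That range-preserving property is why bfloat16 has become a common training format: gradients that would overflow or underflow in FP16 still fit, while the reduced precision is usually tolerable for deep learning workloads.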
On the deep learning inference side, Intel said the Nervana NNP-I, a.k.a. “Spring Hill,” chip delivers 4.8 tera operations per second (TOPS) per inference compute engine (ICE), and an NNP-I can accommodate up to 12 ICEs. The processor leverages Intel’s 10nm process technology with modified Xeon “Ice Lake” cores and offers flexible programmability, according to the company. “As AI becomes pervasive across every workload, having a dedicated inference accelerator that is easy to program, has short latencies, has fast code porting and includes support for all major deep learning frameworks allows companies to harness the full potential of their data as actionable insights,” the company said.
“In the heart of (the NNP-I) we have 12 inferencing cores, ICEs, they’re optimized for inferencing workloads and they can work autonomously, each one solving a network or multiple networks or running different copies of a network,” said Ofri Weschler, Intel Fellow and lead hardware architect, AI Products Group, “and they can also work collaboratively to solve bigger problems. So there are multiple options for operating this structure. They communicate through a shared coherent fabric with 24MB of hardware-managed cache that also provides provisions for the software to let the hardware know what kind of expectations it has in terms of service level priorities…and 24MB of cache to assist the data transfers.”
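The two operating modes Weschler describes can be sketched in a few lines. This is a hypothetical illustration of the scheduling concept, not Intel’s actual runtime: in “autonomous” mode, whole networks are distributed across the 12 engines; in “collaborative” mode, one large network’s layers are partitioned so all engines work on it together.

```python
from itertools import cycle

NUM_ICES = 12  # inference compute engines on one NNP-I, per Intel

def assign_autonomous(networks, num_ices=NUM_ICES):
    """'Autonomous' mode sketch: each ICE runs whole networks on its
    own; queued networks are dealt round-robin across the engines."""
    plan = {ice: [] for ice in range(num_ices)}
    for ice, net in zip(cycle(range(num_ices)), networks):
        plan[ice].append(net)
    return plan

def assign_collaborative(layers, num_ices=NUM_ICES):
    """'Collaborative' mode sketch: all ICEs cooperate on one large
    network by taking contiguous chunks of its layers."""
    chunk = -(-len(layers) // num_ices)  # ceiling division
    return [layers[i:i + chunk] for i in range(0, len(layers), chunk)]

print(assign_autonomous(["net_a", "net_b", "net_c"]))
print(assign_collaborative(list(range(24))))  # 12 chunks of 2 layers
```

In this sketch the shared 24MB cache would correspond to whatever fabric the chunks use to pass activations between engines; the real hardware manages that coherently in silicon.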
Intel also emphasized its commitment to software that supports AI frameworks used by data scientists building AI applications. The company said it has optimized source libraries such as nGraph, which supports training and inference across multiple frameworks and hardware architectures. It also offers the Intel Distribution of OpenVINO toolkit designed to optimize pretrained models and deploy neural networks for video to various hardware architectures, and it has created BigDL, Intel’s distributed DL library for Apache Spark and Hadoop clusters.
The goal, stated Naveen Rao, Intel VP/GM, AI Products Group, in a recent blog, is to ensure that when purpose-built AI hardware, such as the Intel Nervana NNPs, is introduced, it integrates with existing developer tools and libraries “to make the transition for developers and data scientists as seamless as possible.”
Intel says it will sample the NNP-T to cloud service providers this year, with broader availability in 2020. Baidu is reportedly already using the NNP-T, and Facebook is a development partner on the NNP-I inference chip.
TeraPHY is an in-package optical I/O chiplet for high-bandwidth, low-power communication developed by Ayar Labs in conjunction with Intel (see “Ayar Labs to Demo Photonics Chiplet in FPGA Package at Hot Chips”). The two companies called their demonstration last week “the industry’s first integration of monolithic in-package optics (MIPO) with a high-performance system-on-chip (SOC).” The optical I/O chiplet is co-packaged with the Intel Stratix 10 FPGA using Intel Embedded Multi-die Interconnect Bridge (EMIB) technology, “offering high-bandwidth, low-power data communication from the chip package with deterministic latency for distances up to 2 km,” Intel said, adding that the interconnect is designed for the next phase of Moore’s Law by removing the traditional performance, power and cost bottlenecks in moving data.
“To get to a future state of ‘AI everywhere,’” said Rao, “we’ll need to address the crush of data being generated and ensure enterprises are empowered to make efficient use of their data, processing it where it’s collected when it makes sense and making smarter use of their upstream resources. Data centers and the cloud need to have access to performant and scalable general purpose computing and specialized acceleration for complex AI applications. In this future vision of AI everywhere, a holistic approach is needed—from hardware to software to applications.”