A Case for PCI Express as a High-Performance Cluster Interconnect

By Vijay Meduri

January 24, 2011

In high-speed computing (HPC), there are a number of significant benefits to simplifying the processor interconnect in rack- and chassis-based servers by designing in PCI Express (PCIe). The PCI-SIG, the group responsible for the conventional PCI and the much-higher-performance PCIe standards, has released three generations of PCIe specifications over the last eight years and is fully expected to continue this progression in the future with even newer generations, from which HPC systems will continue to see newer features, faster data throughput and improved reliability.

The latest PCIe specification, Gen 3, runs at 8Gbps per serial lane, enabling a 48-lane switch to handle a whopping 96 GBytes/sec. of full duplex peer to peer traffic. Due to the widespread usage of PCI and PCIe in computing, communications and industrial applications, this interconnect technology’s ecosystem is widely deployed and its cost efficiencies as a fabric are enormous. The PCIe interconnect, in each of its generations, offers a clean, high-performance interconnect with low-latency and substantial savings in terms of cost and power. The savings are due to its ability to eliminate multiple layers of expensive switches and bridges that previously were needed to blend various standards. This article explains the key features of a PCIe fabric that now make clusters, expansion boxes and shared-I/O applications relatively easy to develop and deploy.

Figure 1 illustrates a typical topology of building out a server cluster today, in which, while the form factors may change, the basic configuration follows a similar pattern. Given the widespread availability of open-source software and off-the-shelf hardware, companies have successfully built large topologies for their internal cloud infrastructure using this architecture.


Figure 1: Typical Data Center I/O interconnect

Figure 2 illustrates a server cluster built using a native PCIe fabric. As is evident, the usage of numerous adapters and controllers is significantly reduced and this results in a tremendous reduction in power and cost of the overall platform, while delivering better performance in terms of lower latency and higher throughput.


Figure 2: PCI Express-based Server Cluster

Key Features of a PCI Express Fabric

Bandwidth and performance

The width of a PCIe port can range from one to 16 lanes. The 16-lane port configurations are primarily used in graphic applications, while the one-lane configurations are used in USB, wireless and other bridge applications. On a 16-lane port with each lane operating at Gen 3’s 8Gbps, a design can effectively achieve a data transfer rate of 15 GBytes/second and with a cut-through latency on the order of about 120ns.

Non-transparency

In a typical system, the PCIe device hierarchy is owned by a single host, and the operating system/BIOS allocates and controls all the memory resources to the devices within the system. If two PCIe hierarchies are connected, there is conflict and the system resources are allocated over one other. To avoid this problem, the system will use non-transparency, allowing address isolation between the two hosts. Each host allocates resources in the normal manner and when it discovers the PCI device with the NT port, it allocates resources as it would with any end-point based on the BAR memory setup. This address window can be used to tunnel and communicate between various hosts in the system. There is a mechanism built into the device that translates the address window on the receiving host system to a non-overlapping address space.

DMA

DMA engines are now available on PCIe switches, such as those from PLX Technology, which are quite versatile and have the standard descriptor fetch -> move data -> descriptor fetch approach with a lot of programmable pre-fetch options that allow efficient data movement directly between the memory of one host to that of another. The DMA engine in a PCIe switch serves a function similar to that of a DMA resource on a network or host-bus card, in that it moves data to and from the main memory. Figure 3 illustrates how an integrated DMA engine is used to move data, thus avoiding the need for using CPU resources in large file transfers. For low-latency messaging, the CPU can be used to directly use programmable IOs to write into the system memory of a destination host using an address window over NT.


Figure 3: DMA engine embedded in a PCIe Fabric enables efficient data transfer

Spread spectrum isolation

Most systems turn on spread spectrum on the clock sources to reduce electro-magnetic interference on certain frequencies. All PCIe devices are required support spread spectrum clock sources, as long as the all the connected PCIe devices reference the same clock source the system will work. However, with PCIe cluster topologies, multiple domains need to support different spread-spectrum clocks and, hence, the need for spread-spectrum isolation. This feature allows two PCIe ports on different spread-spectrum domains to be connected through an isolation port.

Data integrity

The PCIe specification provides extensive logging and error reporting mechanisms. The two key data-integrity features are link cyclic redundancy check (LCRC) and end-to-end cyclic redundancy check (ECRC), with LCRC protecting link-to-link packet protection and ECRC providing end-to-end CRC protection. For inter-processor communication, ECRC augmented with protocol-level support provides excellent data protection. These inherent features, coupled with the implementation-specific robustness in the data path, provide excellent data integrity in a PCIe fabric. Data protection has been fundamental to the protocol, given ever-increasing requirement in computing, storage and chip set architectures.

Lossless fabric

A PCIe link between two devices guarantees delivery of packets with a rigorous ACK/NAK acknowledgment protocol. This eliminates the need for higher-level protocol per packet acknowledgement mechanism. The protocol does not allow for any device to drop a packet in transmission without notification. The situations where this could occur are primarily silicon/hardware failures and with the right Error Correction and Parity protection schemes the probability of such occurrence is statistically insignificant.

Congestion management and scaling

There is no inherent fabric-level congestion management mechanism built into the protocol, hence the topologies where PCIe shines is in the area of small- to medium-scale clusters of up to 200 nodes. For scaling beyond these numbers of nodes, a shared I/O Ethernet controller or converged adapter could be used to connect between the mini-PCIe clusters over a converged Ethernet fabric, as shown in Figure 4. Within the PCIe cluster, congestion management could be achieved by using simple flow control schemes among the nodes.


Figure 4 : Scaling a PCIe Fabric

Shared I/O

The basic concept in shared I/O is that the same I/O device is virtually shared between the many virtual managers (VMs) on a given host. This allows more of the software overhead that is currently being done in the hypervisor to be offloaded with native support onto the hardware. The PCI-SIG workgroup involved in this has defined both the single-root IO virtualization (SR-IOV) and multi-root IO virtualization (MR-IOV) requirements in the end-points, switches and root complexes. There is more traction in the industry on SR-IOV-compatible end-points than on MR-IOV. It is now possible to share I/O controllers among multiple hosts on a PCIe fabric by several different schemes.

Software

PCIe devices have been designed such that their software discovery is backward-compatible with the legacy PCI configuration model — an approach that’s been critical to the success of the interconnect technology. This software compatibility works well within the framework on a single host, however for multiple hosts there is a layer of software needed to enable communication between the hosts. Given that many clustering applications are already available for standard APIs from Ethernet and InfiniBand, there is active, ongoing development to build similar software stacks for PCIe that leverage the existing API thus reusing existing infrastructure.

Power and cost reduction using PCIe

With the elimination of protocol adapters at every node in a cluster, a reduction in I/O interconnect power consumption of up to 70 percent can be achieved. Due to the widespread usage and ubiquity of PCIe, the cost of switches, on a per-lane basis, is significantly lower than comparable performance-level switches. The protocol provides excellent power management features that allows for system software to dynamically shut down and bring up lanes by monitoring the usage and traffic flows. Additionally, active state power management (ASPM) allows for the hardware to automatically shut down the lanes when there is no traffic.

Flexibility and Interoperability

Each of the ports on a PCIe fabric can automatically negotiate to x16, x8, x4 or x1 widths based on a protocol handshake. In addition an x16 port can be configured as 2 x8s or 4 x4s which enables the systems designer complete flexibility on how he allocates his resources and can customize based on usage. Interesting techniques for acceleration and storage open up on a PCIe fabric since the graphics adapters and flash storage cards have PCIe interfaces that can directly connect to the fabric.

In summary, there are well over 800 vendors developing various kinds of solutions around the PCIe interconnect standard’s three generations. This has led to broad, worldwide adoption of the protocol and to a large percentage of the CPUs available in the market now having PCIe ports native to the processor. Given the low-cost emphasis on PCIe since its initial launch, the total cost of building out a PCIe fabric is an order of magnitude lower than comparable high-performance interconnects. As this I/O standard has evolved and become ubiquitous, the natural progression has been to use the fabric for expansion boxes and inter-processor communication. Dual-host solutions using non transparency have been widely used in storage boxes since 2005 but now the industry is pushing the fabric to expand to larger-scale clusters. Both hardware and software enhancements now make this possible to build reasonably sized clusters with off-the-shelf components. With the evolution of the converged network and storage adapters, there is an increased emphasis to pack more functionality into them, which pushes against the power and cost thresholds when every computing node needs a heavy adapter. A simple solution would be to extend the existing PCIe interconnect on the computing nodes to address the inter-processor communication.

Vijay Meduri is vice president of engineering for PCI Express switching at PLX Technology, Sunnyvale, Calif. He can be reached at vmeduri@plxtech.com.

Subscribe to HPCwire's Weekly Update!

Be the most informed person in the room! Stay ahead of the tech trends with industy updates delivered to you every week!

Why HPC Storage Matters More Now Than Ever: Analyst Q&A

September 17, 2021

With soaring data volumes and insatiable computing driving nearly every facet of economic, social and scientific progress, data storage is seizing the spotlight. Hyperion Research analyst and noted storage expert Mark No Read more…

GigaIO Gets $14.7M in Series B Funding to Expand Its Composable Fabric Technology to Customers

September 16, 2021

Just before the COVID-19 pandemic began in March 2020, GigaIO introduced its Universal Composable Fabric technology, which allows enterprises to bring together any HPC and AI resources and integrate them with networking, Read more…

What’s New in HPC Research: Solar Power, ExaWorks, Optane & More

September 16, 2021

In this regular feature, HPCwire highlights newly published research in the high-performance computing community and related domains. From parallel programming to exascale to quantum computing, the details are here. Read more…

Cerebras Brings Its Wafer-Scale Engine AI System to the Cloud

September 16, 2021

Five months ago, when Cerebras Systems debuted its second-generation wafer-scale silicon system (CS-2), co-founder and CEO Andrew Feldman hinted of the company’s coming cloud plans, and now those plans have come to fruition. Today, Cerebras and Cirrascale Cloud Services are launching... Read more…

AI Hardware Summit: Panel on Memory Looks Forward

September 15, 2021

What will system memory look like in five years? Good question. While Monday's panel, Designing AI Super-Chips at the Speed of Memory, at the AI Hardware Summit, tackled several topics, the panelists also took a brief glimpse into the future. Unlike compute, storage and networking, which... Read more…

AWS Solution Channel

Supporting Climate Model Simulations to Accelerate Climate Science

The Amazon Sustainability Data Initiative (ASDI), AWS is donating cloud resources, technical support, and access to scalable infrastructure and fast networking providing high performance computing (HPC) solutions to support simulations of near-term climate using the National Center for Atmospheric Research (NCAR) Community Earth System Model Version 2 (CESM2) and its Whole Atmosphere Community Climate Model (WACCM). Read more…

ECMWF Opens Bologna Datacenter in Preparation for Atos Supercomputer

September 14, 2021

In January 2020, the European Centre for Medium-Range Weather Forecasts (ECMWF) – a juggernaut in the weather forecasting scene – signed a four-year, $89-million contract with European tech firm Atos to quintuple its supercomputing capacity. With the deal approaching the two-year mark, ECMWF... Read more…

GigaIO Gets $14.7M in Series B Funding to Expand Its Composable Fabric Technology to Customers

September 16, 2021

Just before the COVID-19 pandemic began in March 2020, GigaIO introduced its Universal Composable Fabric technology, which allows enterprises to bring together Read more…

Cerebras Brings Its Wafer-Scale Engine AI System to the Cloud

September 16, 2021

Five months ago, when Cerebras Systems debuted its second-generation wafer-scale silicon system (CS-2), co-founder and CEO Andrew Feldman hinted of the company’s coming cloud plans, and now those plans have come to fruition. Today, Cerebras and Cirrascale Cloud Services are launching... Read more…

AI Hardware Summit: Panel on Memory Looks Forward

September 15, 2021

What will system memory look like in five years? Good question. While Monday's panel, Designing AI Super-Chips at the Speed of Memory, at the AI Hardware Summit, tackled several topics, the panelists also took a brief glimpse into the future. Unlike compute, storage and networking, which... Read more…

ECMWF Opens Bologna Datacenter in Preparation for Atos Supercomputer

September 14, 2021

In January 2020, the European Centre for Medium-Range Weather Forecasts (ECMWF) – a juggernaut in the weather forecasting scene – signed a four-year, $89-million contract with European tech firm Atos to quintuple its supercomputing capacity. With the deal approaching the two-year mark, ECMWF... Read more…

Quantum Computer Market Headed to $830M in 2024

September 13, 2021

What is one to make of the quantum computing market? Energized (lots of funding) but still chaotic and advancing in unpredictable ways (e.g. competing qubit tec Read more…

Amazon, NCAR, SilverLining Team for Unprecedented Cloud Climate Simulations

September 10, 2021

Earth’s climate is, to put it mildly, not in a good place. In the wake of a damning report from the Intergovernmental Panel on Climate Change (IPCC), scientis Read more…

After Roadblocks and Renewals, EuroHPC Targets a Bigger, Quantum Future

September 9, 2021

The EuroHPC Joint Undertaking (JU) was formalized in 2018, beginning a new era of European supercomputing that began to bear fruit this year with the launch of several of the first EuroHPC systems. The undertaking, however, has not been without its speed bumps, and the Union faces an uphill... Read more…

How Argonne Is Preparing for Exascale in 2022

September 8, 2021

Additional details came to light on Argonne National Laboratory’s preparation for the 2022 Aurora exascale-class supercomputer, during the HPC User Forum, held virtually this week on account of pandemic. Exascale Computing Project director Doug Kothe reviewed some of the 'early exascale hardware' at Argonne, Oak Ridge and NERSC (Perlmutter), while Ti Leggett, Deputy Project Director & Deputy Director... Read more…

Ahead of ‘Dojo,’ Tesla Reveals Its Massive Precursor Supercomputer

June 22, 2021

In spring 2019, Tesla made cryptic reference to a project called Dojo, a “super-powerful training computer” for video data processing. Then, in summer 2020, Tesla CEO Elon Musk tweeted: “Tesla is developing a [neural network] training computer called Dojo to process truly vast amounts of video data. It’s a beast! … A truly useful exaflop at de facto FP32.” Read more…

Berkeley Lab Debuts Perlmutter, World’s Fastest AI Supercomputer

May 27, 2021

A ribbon-cutting ceremony held virtually at Berkeley Lab's National Energy Research Scientific Computing Center (NERSC) today marked the official launch of Perlmutter – aka NERSC-9 – the GPU-accelerated supercomputer built by HPE in partnership with Nvidia and AMD. Read more…

Esperanto, Silicon in Hand, Champions the Efficiency of Its 1,092-Core RISC-V Chip

August 27, 2021

Esperanto Technologies made waves last December when it announced ET-SoC-1, a new RISC-V-based chip aimed at machine learning that packed nearly 1,100 cores onto a package small enough to fit six times over on a single PCIe card. Now, Esperanto is back, silicon in-hand and taking aim... Read more…

Enter Dojo: Tesla Reveals Design for Modular Supercomputer & D1 Chip

August 20, 2021

Two months ago, Tesla revealed a massive GPU cluster that it said was “roughly the number five supercomputer in the world,” and which was just a precursor to Tesla’s real supercomputing moonshot: the long-rumored, little-detailed Dojo system. “We’ve been scaling our neural network training compute dramatically over the last few years,” said Milan Kovac, Tesla’s director of autopilot engineering. Read more…

CentOS Replacement Rocky Linux Is Now in GA and Under Independent Control

June 21, 2021

The Rocky Enterprise Software Foundation (RESF) is announcing the general availability of Rocky Linux, release 8.4, designed as a drop-in replacement for the soon-to-be discontinued CentOS. The GA release is launching six-and-a-half months after Red Hat deprecated its support for the widely popular, free CentOS server operating system. The Rocky Linux development effort... Read more…

Intel Completes LLVM Adoption; Will End Updates to Classic C/C++ Compilers in Future

August 10, 2021

Intel reported in a blog this week that its adoption of the open source LLVM architecture for Intel’s C/C++ compiler is complete. The transition is part of In Read more…

Google Launches TPU v4 AI Chips

May 20, 2021

Google CEO Sundar Pichai spoke for only one minute and 42 seconds about the company’s latest TPU v4 Tensor Processing Units during his keynote at the Google I Read more…

Hot Chips: Here Come the DPUs and IPUs from Arm, Nvidia and Intel

August 25, 2021

The emergence of data processing units (DPU) and infrastructure processing units (IPU) as potentially important pieces in cloud and datacenter architectures was Read more…

Leading Solution Providers

Contributors

AMD-Xilinx Deal Gains UK, EU Approvals — China’s Decision Still Pending

July 1, 2021

AMD’s planned acquisition of FPGA maker Xilinx is now in the hands of Chinese regulators after needed antitrust approvals for the $35 billion deal were receiv Read more…

HPE Wins $2B GreenLake HPC-as-a-Service Deal with NSA

September 1, 2021

In the heated, oft-contentious, government IT space, HPE has won a massive $2 billion contract to provide HPC and AI services to the United States’ National Security Agency (NSA). Following on the heels of the now-canceled $10 billion JEDI contract (reissued as JWCC) and a $10 billion... Read more…

10nm, 7nm, 5nm…. Should the Chip Nanometer Metric Be Replaced?

June 1, 2020

The biggest cool factor in server chips is the nanometer. AMD beating Intel to a CPU built on a 7nm process node* – with 5nm and 3nm on the way – has been i Read more…

Julia Update: Adoption Keeps Climbing; Is It a Python Challenger?

January 13, 2021

The rapid adoption of Julia, the open source, high level programing language with roots at MIT, shows no sign of slowing according to data from Julialang.org. I Read more…

Quantum Roundup: IBM, Rigetti, Phasecraft, Oxford QC, China, and More

July 13, 2021

IBM yesterday announced a proof for a quantum ML algorithm. A week ago, it unveiled a new topology for its quantum processors. Last Friday, the Technical Univer Read more…

Intel Launches 10nm ‘Ice Lake’ Datacenter CPU with Up to 40 Cores

April 6, 2021

The wait is over. Today Intel officially launched its 10nm datacenter CPU, the third-generation Intel Xeon Scalable processor, codenamed Ice Lake. With up to 40 Read more…

Frontier to Meet 20MW Exascale Power Target Set by DARPA in 2008

July 14, 2021

After more than a decade of planning, the United States’ first exascale computer, Frontier, is set to arrive at Oak Ridge National Laboratory (ORNL) later this year. Crossing this “1,000x” horizon required overcoming four major challenges: power demand, reliability, extreme parallelism and data movement. Read more…

Intel Unveils New Node Names; Sapphire Rapids Is Now an ‘Intel 7’ CPU

July 27, 2021

What's a preeminent chip company to do when its process node technology lags the competition by (roughly) one generation, but outmoded naming conventions make it seem like it's two nodes behind? For Intel, the response was to change how it refers to its nodes with the aim of better reflecting its positioning within the leadership semiconductor manufacturing space. Intel revealed its new node nomenclature, and... Read more…

  • arrow
  • Click Here for More Headlines
  • arrow
HPCwire