A Case for PCI Express as a High-Performance Cluster Interconnect

By Vijay Meduri

January 24, 2011

In high-speed computing (HPC), there are a number of significant benefits to simplifying the processor interconnect in rack- and chassis-based servers by designing in PCI Express (PCIe). The PCI-SIG, the group responsible for the conventional PCI and the much-higher-performance PCIe standards, has released three generations of PCIe specifications over the last eight years and is fully expected to continue this progression in the future with even newer generations, from which HPC systems will continue to see newer features, faster data throughput and improved reliability.

The latest PCIe specification, Gen 3, runs at 8Gbps per serial lane, enabling a 48-lane switch to handle a whopping 96 GBytes/sec. of full duplex peer to peer traffic. Due to the widespread usage of PCI and PCIe in computing, communications and industrial applications, this interconnect technology’s ecosystem is widely deployed and its cost efficiencies as a fabric are enormous. The PCIe interconnect, in each of its generations, offers a clean, high-performance interconnect with low-latency and substantial savings in terms of cost and power. The savings are due to its ability to eliminate multiple layers of expensive switches and bridges that previously were needed to blend various standards. This article explains the key features of a PCIe fabric that now make clusters, expansion boxes and shared-I/O applications relatively easy to develop and deploy.

Figure 1 illustrates a typical topology of building out a server cluster today, in which, while the form factors may change, the basic configuration follows a similar pattern. Given the widespread availability of open-source software and off-the-shelf hardware, companies have successfully built large topologies for their internal cloud infrastructure using this architecture.


Figure 1: Typical Data Center I/O interconnect

Figure 2 illustrates a server cluster built using a native PCIe fabric. As is evident, the usage of numerous adapters and controllers is significantly reduced and this results in a tremendous reduction in power and cost of the overall platform, while delivering better performance in terms of lower latency and higher throughput.


Figure 2: PCI Express-based Server Cluster

Key Features of a PCI Express Fabric

Bandwidth and performance

The width of a PCIe port can range from one to 16 lanes. The 16-lane port configurations are primarily used in graphic applications, while the one-lane configurations are used in USB, wireless and other bridge applications. On a 16-lane port with each lane operating at Gen 3’s 8Gbps, a design can effectively achieve a data transfer rate of 15 GBytes/second and with a cut-through latency on the order of about 120ns.

Non-transparency

In a typical system, the PCIe device hierarchy is owned by a single host, and the operating system/BIOS allocates and controls all the memory resources to the devices within the system. If two PCIe hierarchies are connected, there is conflict and the system resources are allocated over one other. To avoid this problem, the system will use non-transparency, allowing address isolation between the two hosts. Each host allocates resources in the normal manner and when it discovers the PCI device with the NT port, it allocates resources as it would with any end-point based on the BAR memory setup. This address window can be used to tunnel and communicate between various hosts in the system. There is a mechanism built into the device that translates the address window on the receiving host system to a non-overlapping address space.

DMA

DMA engines are now available on PCIe switches, such as those from PLX Technology, which are quite versatile and have the standard descriptor fetch -> move data -> descriptor fetch approach with a lot of programmable pre-fetch options that allow efficient data movement directly between the memory of one host to that of another. The DMA engine in a PCIe switch serves a function similar to that of a DMA resource on a network or host-bus card, in that it moves data to and from the main memory. Figure 3 illustrates how an integrated DMA engine is used to move data, thus avoiding the need for using CPU resources in large file transfers. For low-latency messaging, the CPU can be used to directly use programmable IOs to write into the system memory of a destination host using an address window over NT.


Figure 3: DMA engine embedded in a PCIe Fabric enables efficient data transfer

Spread spectrum isolation

Most systems turn on spread spectrum on the clock sources to reduce electro-magnetic interference on certain frequencies. All PCIe devices are required support spread spectrum clock sources, as long as the all the connected PCIe devices reference the same clock source the system will work. However, with PCIe cluster topologies, multiple domains need to support different spread-spectrum clocks and, hence, the need for spread-spectrum isolation. This feature allows two PCIe ports on different spread-spectrum domains to be connected through an isolation port.

Data integrity

The PCIe specification provides extensive logging and error reporting mechanisms. The two key data-integrity features are link cyclic redundancy check (LCRC) and end-to-end cyclic redundancy check (ECRC), with LCRC protecting link-to-link packet protection and ECRC providing end-to-end CRC protection. For inter-processor communication, ECRC augmented with protocol-level support provides excellent data protection. These inherent features, coupled with the implementation-specific robustness in the data path, provide excellent data integrity in a PCIe fabric. Data protection has been fundamental to the protocol, given ever-increasing requirement in computing, storage and chip set architectures.

Lossless fabric

A PCIe link between two devices guarantees delivery of packets with a rigorous ACK/NAK acknowledgment protocol. This eliminates the need for higher-level protocol per packet acknowledgement mechanism. The protocol does not allow for any device to drop a packet in transmission without notification. The situations where this could occur are primarily silicon/hardware failures and with the right Error Correction and Parity protection schemes the probability of such occurrence is statistically insignificant.

Congestion management and scaling

There is no inherent fabric-level congestion management mechanism built into the protocol, hence the topologies where PCIe shines is in the area of small- to medium-scale clusters of up to 200 nodes. For scaling beyond these numbers of nodes, a shared I/O Ethernet controller or converged adapter could be used to connect between the mini-PCIe clusters over a converged Ethernet fabric, as shown in Figure 4. Within the PCIe cluster, congestion management could be achieved by using simple flow control schemes among the nodes.


Figure 4 : Scaling a PCIe Fabric

Shared I/O

The basic concept in shared I/O is that the same I/O device is virtually shared between the many virtual managers (VMs) on a given host. This allows more of the software overhead that is currently being done in the hypervisor to be offloaded with native support onto the hardware. The PCI-SIG workgroup involved in this has defined both the single-root IO virtualization (SR-IOV) and multi-root IO virtualization (MR-IOV) requirements in the end-points, switches and root complexes. There is more traction in the industry on SR-IOV-compatible end-points than on MR-IOV. It is now possible to share I/O controllers among multiple hosts on a PCIe fabric by several different schemes.

Software

PCIe devices have been designed such that their software discovery is backward-compatible with the legacy PCI configuration model — an approach that’s been critical to the success of the interconnect technology. This software compatibility works well within the framework on a single host, however for multiple hosts there is a layer of software needed to enable communication between the hosts. Given that many clustering applications are already available for standard APIs from Ethernet and InfiniBand, there is active, ongoing development to build similar software stacks for PCIe that leverage the existing API thus reusing existing infrastructure.

Power and cost reduction using PCIe

With the elimination of protocol adapters at every node in a cluster, a reduction in I/O interconnect power consumption of up to 70 percent can be achieved. Due to the widespread usage and ubiquity of PCIe, the cost of switches, on a per-lane basis, is significantly lower than comparable performance-level switches. The protocol provides excellent power management features that allows for system software to dynamically shut down and bring up lanes by monitoring the usage and traffic flows. Additionally, active state power management (ASPM) allows for the hardware to automatically shut down the lanes when there is no traffic.

Flexibility and Interoperability

Each of the ports on a PCIe fabric can automatically negotiate to x16, x8, x4 or x1 widths based on a protocol handshake. In addition an x16 port can be configured as 2 x8s or 4 x4s which enables the systems designer complete flexibility on how he allocates his resources and can customize based on usage. Interesting techniques for acceleration and storage open up on a PCIe fabric since the graphics adapters and flash storage cards have PCIe interfaces that can directly connect to the fabric.

In summary, there are well over 800 vendors developing various kinds of solutions around the PCIe interconnect standard’s three generations. This has led to broad, worldwide adoption of the protocol and to a large percentage of the CPUs available in the market now having PCIe ports native to the processor. Given the low-cost emphasis on PCIe since its initial launch, the total cost of building out a PCIe fabric is an order of magnitude lower than comparable high-performance interconnects. As this I/O standard has evolved and become ubiquitous, the natural progression has been to use the fabric for expansion boxes and inter-processor communication. Dual-host solutions using non transparency have been widely used in storage boxes since 2005 but now the industry is pushing the fabric to expand to larger-scale clusters. Both hardware and software enhancements now make this possible to build reasonably sized clusters with off-the-shelf components. With the evolution of the converged network and storage adapters, there is an increased emphasis to pack more functionality into them, which pushes against the power and cost thresholds when every computing node needs a heavy adapter. A simple solution would be to extend the existing PCIe interconnect on the computing nodes to address the inter-processor communication.

Vijay Meduri is vice president of engineering for PCI Express switching at PLX Technology, Sunnyvale, Calif. He can be reached at [email protected].

Subscribe to HPCwire's Weekly Update!

Be the most informed person in the room! Stay ahead of the tech trends with industy updates delivered to you every week!

Oklahoma State University to Build New Supercomputer

August 18, 2022

Thanks to a grant from the National Science Foundation, Oklahoma State University (OSU) will be building a new supercomputer. The as-yet unnamed system will succeed OSU’s existing system, which is simply named “Pete.” “This is a big moment for OSU and the High Performance Computing Center (HPCC),” said Pratul Agarwal, assistant vice president of research cyberinfrastructure and... Read more…

DOE and ORNL Dedicate Frontier Supercomputer

August 17, 2022

“It is my privilege to welcome you to the dedication of Frontier, the supercomputer that broke the exascale barrier.” That was the introduction by Oak Ridge National Laboratory Director Thomas Zacharia, at a small, public event on August 17 to officially dedicate the supercomputer, which in May became the first system to achieve over 1.0 exaflops of 64-bit performance on the... Read more…

Tesla Bulks Up Its GPU-Powered AI Super – Is Dojo Next?

August 16, 2022

Tesla has revealed that its biggest in-house AI supercomputer – which we wrote about last year – now has a total of 7,360 A100 GPUs, a nearly 28 percent uplift from its previous total of 5,760 GPUs. That’s enough GPU oomph for a top seven spot on the Top500, although the tech company best known for its electric vehicles has not publicly benchmarked the system. If it had, it would... Read more…

Inflation Reduction Act Signed Into Law, with Major Computing Implications

August 16, 2022

For the second time in as many weeks, President Biden has signed into law a major bill with significant implications for the computing sector. The Inflation Reduction Act – which is certainly the cornerstone of Biden’s first two years in office – allocates hundreds of billions of dollars toward energy security, climate change and healthcare. Among those hundreds of billions are hundreds of millions for scientific computing. At the signing ceremony... Read more…

Glimpse into ORNL Quantum Science Center Efforts to Find the Elusive Majorana and Much More

August 16, 2022

The Quantum Science Center (QSC), headquartered at Oak Ridge National Laboratory, is one of five such centers created by the National Quantum Initiative Act in 2018 and run by the Department of Energy. They all have distinct and overlapping goals. That’s sort of the point, to bring both focus and cooperation, and a heavy dose of industry participation to advance quantum information sciences... Read more…

AWS Solution Channel

Shutterstock 1914742114

23andMe Innovates Drug and Therapeutic Discovery with HPC on AWS

23andMe Innovates Drug and Therapeutic Discovery with HPC on AWS

Genomics and biotechnology company 23andMe provides direct-to-customer genetic testing, giving customers valuable insights into their genetics. Read more…

Microsoft/NVIDIA Solution Channel

Shutterstock 1689646429

Gain a Competitive Edge using Cloud-Based, GPU-Accelerated AI KYC Recommender Systems

Financial services organizations face increased competition for customers from technologies such as FinTechs, mobile banking applications, and online payment systems. To meet this challenge, it is important for organizations to have a deep understanding of their customers. Read more…

Australian Government Unveils New Defense, Weather Supercomputers

August 15, 2022

The Australian government has been busy on the supercomputing front. In just the last two weeks, the Australian Department of Defence and the Australian Bureau of Meteorology have both revealed major supercomputing updates. The Department of Defence, for its part, unveiled a new supercomputer: Taingiwilta, named after the word for “powerful” in the language of the... Read more…

DOE and ORNL Dedicate Frontier Supercomputer

August 17, 2022

“It is my privilege to welcome you to the dedication of Frontier, the supercomputer that broke the exascale barrier.” That was the introduction by Oak Ridge National Laboratory Director Thomas Zacharia, at a small, public event on August 17 to officially dedicate the supercomputer, which in May became the first system to achieve over 1.0 exaflops of 64-bit performance on the... Read more…

Tesla Bulks Up Its GPU-Powered AI Super – Is Dojo Next?

August 16, 2022

Tesla has revealed that its biggest in-house AI supercomputer – which we wrote about last year – now has a total of 7,360 A100 GPUs, a nearly 28 percent uplift from its previous total of 5,760 GPUs. That’s enough GPU oomph for a top seven spot on the Top500, although the tech company best known for its electric vehicles has not publicly benchmarked the system. If it had, it would... Read more…

Inflation Reduction Act Signed Into Law, with Major Computing Implications

August 16, 2022

For the second time in as many weeks, President Biden has signed into law a major bill with significant implications for the computing sector. The Inflation Reduction Act – which is certainly the cornerstone of Biden’s first two years in office – allocates hundreds of billions of dollars toward energy security, climate change and healthcare. Among those hundreds of billions are hundreds of millions for scientific computing. At the signing ceremony... Read more…

Glimpse into ORNL Quantum Science Center Efforts to Find the Elusive Majorana and Much More

August 16, 2022

The Quantum Science Center (QSC), headquartered at Oak Ridge National Laboratory, is one of five such centers created by the National Quantum Initiative Act in 2018 and run by the Department of Energy. They all have distinct and overlapping goals. That’s sort of the point, to bring both focus and cooperation, and a heavy dose of industry participation to advance quantum information sciences... Read more…

Australian Government Unveils New Defense, Weather Supercomputers

August 15, 2022

The Australian government has been busy on the supercomputing front. In just the last two weeks, the Australian Department of Defence and the Australian Bureau of Meteorology have both revealed major supercomputing updates. The Department of Defence, for its part, unveiled a new supercomputer: Taingiwilta, named after the word for “powerful” in the language of the... Read more…

HPCwire Quantum Survey: First Up – IBM and Zapata – on Algorithms, Error Mitigation, More

August 15, 2022

Quantum computing technology advances so quickly that it is hard to stay current. HPCwire recently asked a handful of senior researchers and executives for their thoughts on nearer-term progress and challenges. We’ll present their responses as they trickle in through the late summer and fall. (These execs take vacations too!) This also allows us to present the respondent’s... Read more…

SC22 Unveils ACM Gordon Bell Prize Finalists

August 12, 2022

Courtesy of the schedule for the SC22 conference, we now have our first glimpse at the finalists for this year’s coveted Gordon Bell Prize. The Gordon Bell Pr Read more…

Q&A with ORNL’s Bronson Messer, an HPCwire Person to Watch in 2022

August 12, 2022

HPCwire presents our interview with Bronson Messer, distinguished scientist and director of Science at the Oak Ridge Leadership Computing Facility (OLCF), ORNL, and an HPCwire 2022 Person to Watch. Messer recaps ORNL's journey to exascale and sheds light on how all the pieces line up to support the all-important science. Also covered are the role... Read more…

Royalty-free stock illustration ID: 1919750255

Intel Says UCIe to Outpace PCIe in Speed Race

May 11, 2022

Intel has shared more details on a new interconnect that is the foundation of the company’s long-term plan for x86, Arm and RISC-V architectures to co-exist in a single chip package. The semiconductor company is taking a modular approach to chip design with the option for customers to cram computing blocks such as CPUs, GPUs and AI accelerators inside a single chip package. Read more…

The Final Frontier: US Has Its First Exascale Supercomputer

May 30, 2022

In April 2018, the U.S. Department of Energy announced plans to procure a trio of exascale supercomputers at a total cost of up to $1.8 billion dollars. Over the ensuing four years, many announcements were made, many deadlines were missed, and a pandemic threw the world into disarray. Now, at long last, HPE and Oak Ridge National Laboratory (ORNL) have announced that the first of those... Read more…

US Senate Passes CHIPS Act Temperature Check, but Challenges Linger

July 19, 2022

The U.S. Senate on Tuesday passed a major hurdle that will open up close to $52 billion in grants for the semiconductor industry to boost manufacturing, supply chain and research and development. U.S. senators voted 64-34 in favor of advancing the CHIPS Act, which sets the stage for the final consideration... Read more…

Top500: Exascale Is Officially Here with Debut of Frontier

May 30, 2022

The 59th installment of the Top500 list, issued today from ISC 2022 in Hamburg, Germany, officially marks a new era in supercomputing with the debut of the first-ever exascale system on the list. Frontier, deployed at the Department of Energy’s Oak Ridge National Laboratory, achieved 1.102 exaflops in its fastest High Performance Linpack run, which was completed... Read more…

Newly-Observed Higgs Mode Holds Promise in Quantum Computing

June 8, 2022

The first-ever appearance of a previously undetectable quantum excitation known as the axial Higgs mode – exciting in its own right – also holds promise for developing and manipulating higher temperature quantum materials... Read more…

AMD’s MI300 APUs to Power Exascale El Capitan Supercomputer

June 21, 2022

Additional details of the architecture of the exascale El Capitan supercomputer were disclosed today by Lawrence Livermore National Laboratory’s (LLNL) Terri Read more…

Exclusive Inside Look at First US Exascale Supercomputer

July 1, 2022

HPCwire takes you inside the Frontier datacenter at DOE's Oak Ridge National Laboratory (ORNL) in Oak Ridge, Tenn., for an interview with Frontier Project Direc Read more…

PsiQuantum’s Path to 1 Million Qubits

April 21, 2022

PsiQuantum, founded in 2016 by four researchers with roots at Bristol University, Stanford University, and York University, is one of a few quantum computing startups that’s kept a moderately low PR profile. (That’s if you disregard the roughly $700 million in funding it has attracted.) The main reason is PsiQuantum has eschewed the clamorous public chase for... Read more…

Leading Solution Providers

Contributors

AMD Opens Up Chip Design to the Outside for Custom Future

June 15, 2022

AMD is getting personal with chips as it sets sail to make products more to the liking of its customers. The chipmaker detailed a modular chip future in which customers can mix and match non-AMD processors in a custom chip package. "We are focused on making it easier to implement chips with more flexibility," said Mark Papermaster, chief technology officer at AMD during the analyst day meeting late last week. Read more…

Intel Reiterates Plans to Merge CPU, GPU High-performance Chip Roadmaps

May 31, 2022

Intel reiterated it is well on its way to merging its roadmap of high-performance CPUs and GPUs as it shifts over to newer manufacturing processes and packaging technologies in the coming years. The company is merging the CPU and GPU lineups into a chip (codenamed Falcon Shores) which Intel has dubbed an XPU. Falcon Shores... Read more…

Nvidia, Intel to Power Atos-Built MareNostrum 5 Supercomputer

June 16, 2022

The long-troubled, hotly anticipated MareNostrum 5 supercomputer finally has a vendor: Atos, which will be supplying a system that includes both Nvidia and Inte Read more…

India Launches Petascale ‘PARAM Ganga’ Supercomputer

March 8, 2022

Just a couple of weeks ago, the Indian government promised that it had five HPC systems in the final stages of installation and would launch nine new supercomputers this year. Now, it appears to be making good on that promise: the country’s National Supercomputing Mission (NSM) has announced the deployment of “PARAM Ganga” petascale supercomputer at Indian Institute of Technology (IIT)... Read more…

Tesla Bulks Up Its GPU-Powered AI Super – Is Dojo Next?

August 16, 2022

Tesla has revealed that its biggest in-house AI supercomputer – which we wrote about last year – now has a total of 7,360 A100 GPUs, a nearly 28 percent uplift from its previous total of 5,760 GPUs. That’s enough GPU oomph for a top seven spot on the Top500, although the tech company best known for its electric vehicles has not publicly benchmarked the system. If it had, it would... Read more…

Is Time Running Out for Compromise on America COMPETES/USICA Act?

June 22, 2022

You may recall that efforts proposed in 2020 to remake the National Science Foundation (Endless Frontier Act) have since expanded and morphed into two gigantic bills, the America COMPETES Act in the U.S. House of Representatives and the U.S. Innovation and Competition Act in the U.S. Senate. So far, efforts to reconcile the two pieces of legislation have snagged and recent reports... Read more…

Nvidia R&D Chief on How AI is Improving Chip Design

April 18, 2022

Getting a glimpse into Nvidia’s R&D has become a regular feature of the spring GTC conference with Bill Dally, chief scientist and senior vice president of research, providing an overview of Nvidia’s R&D organization and a few details on current priorities. This year, Dally focused mostly on AI tools that Nvidia is both developing and using in-house to improve... Read more…

Exascale Watch: Aurora Installation Underway, Now Open for Reservations

May 10, 2022

Installation has begun on the Aurora supercomputer, Rick Stevens (associate director of Argonne National Laboratory) revealed today during the Intel Vision event keynote taking place in Dallas, Texas, and online. Joining Intel exec Raja Koduri on stage, Stevens confirmed that the Aurora build is underway – a major development for a system that is projected to deliver more... Read more…

  • arrow
  • Click Here for More Headlines
  • arrow
HPCwire