HPCwire

Since 1986 - Covering the Fastest Computers
in the World and the People Who Run Them

Blog: From the Editor

RoCE: An Ethernet-InfiniBand Love Story


To go along with the low-latency theme of this week's High Performance Computing Linux Financial Markets confab in New York City, the InfiniBand Trade Association (IBTA) announced the release of the RDMA over Converged Ethernet standard that brings InfiniBand-like performance and efficiency into the Ethernet realm.

Abbreviated RoCE (and pronounced "Rocky"), the new standard allows the RDMA guts of InfiniBand to run over Ethernet. Basically the IBTA has taken the InfiniBand stack, left the IB transport and network layers intact, and swapped the IB link layer for Ethernet. Or as OpenFabrics Alliance Executive Director Bill Boas put it: "The only change here is that the verbs in the InfiniBand standard have been implemented over Ethernet."
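The layer swap described above can be pictured with a small sketch. This is only an illustrative model: the dictionary keys and layer labels are simplified names chosen for this example, not terms taken from the RoCE specification.

```python
# Illustrative layer model only; keys and labels are simplified
# assumptions for this sketch, not names from the RoCE specification.
INFINIBAND = {
    "api": "RDMA verbs",
    "transport": "IB transport",
    "network": "IB network",
    "link": "IB link",
}

# RoCE as the article describes it: the same stack with the
# link layer replaced by Ethernet.
ROCE = {**INFINIBAND, "link": "Ethernet"}


def changed_layers(a, b):
    """Return the layers whose implementation differs between two stacks."""
    return [layer for layer in a if a[layer] != b[layer]]


print(changed_layers(INFINIBAND, ROCE))
```

In this toy model the only entry that differs is the link layer, which is the point Boas makes about the verbs staying the same.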

This is a much simpler solution than iWARP (Internet Wide Area RDMA Protocol), which also uses RDMA, but incorporates TCP/IP into the stack. In a sense, iWARP tried to unify InfiniBand and IP, but that model has garnered limited appeal. Supporting the TCP/IP stack meant latency could only get into the 10 microsecond range. Freed of that extra processing burden, RoCE latency can approach 1-3 microsecond territory. And it can be implemented more cheaply and with less power consumption. Yes, IP support is missing, but in a closed cluster environment, you would normally just use a gateway node to talk to the outside world.

In general, RoCE is aimed at users of clustered computing setups who might otherwise have opted for InfiniBand because of its speed and agility, but who are already married to Ethernet -- either to maintain compatibility with existing storage networks and compute infrastructure or because their local datacenter already has a big investment in Ethernet technology, expertise and management tools. Mellanox has been talking about this technology for a year or so, under the moniker low-latency Ethernet.

"Essentially what you're able to do now is run close to InfiniBand-like latency over 10 Gigabit Ethernet," says Brian Sparks, IBTA marketing working group co-chair and director of marketing communications at Mellanox. "But you don't have the InfiniBand barrier and the learning curve that goes with that."

RoCE isn't quite InfiniBand-strength, though. QDR IB nets 32 Gbps and sub-microsecond latencies, while RoCE is currently limited to 10 Gig and latencies closer to single-digit microseconds. For most apps, though, 10 Gig is plenty of bandwidth (and there's a clean path to 40 and 100 Gig when Ethernet catches up). The real hurt is on the latency side.
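The article does not say where the 32 Gbps figure comes from. A rough sketch of the arithmetic, assuming the commonly cited parameters for a 4x QDR link (four lanes, 10 Gbps signaling per lane, 8b/10b encoding) and for 10 Gigabit Ethernet (one 10.3125 Gbps lane, 64b/66b encoding), looks like this. The function name and parameters are hypothetical, chosen for this example.

```python
def data_rate_gbps(lanes, signaling_gbps_per_lane, payload_bits, coded_bits):
    """Usable data rate after line-encoding overhead (hypothetical helper)."""
    return lanes * signaling_gbps_per_lane * payload_bits / coded_bits


# Assumed link parameters; these are not stated in the article.
qdr_4x = data_rate_gbps(4, 10.0, 8, 10)        # 4x QDR InfiniBand, 8b/10b
ten_gbe = data_rate_gbps(1, 10.3125, 64, 66)   # 10 Gigabit Ethernet, 64b/66b

print(f"QDR 4x: {qdr_4x:.1f} Gbps")
print(f"10GbE:  {ten_gbe:.1f} Gbps")
print(f"Ratio:  {qdr_4x / ten_gbe:.1f}x")
```

Under those assumptions the bandwidth gap is a little over three to one, which is consistent with the article's view that bandwidth matters less here than latency for most applications.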

Financial services, database warehousing, cloud computing and related virtualization apps are all potential targets of this technology. One of the tastiest low-hanging fruits for RoCE is high frequency trading (HFT), an Ethernet-based application that is all about latency. HFT is a highly lucrative class of algorithmic trading that relies far more on network performance than compute muscle. The object of the game is to turn reams of market data coming in from Ethernet-based ticker feeds into split-second arbitrage opportunities. One person I recently spoke with characterized it as "picking up a nickel in front of a freight train." RoCE seems tailor-made for this type of application.

In more traditional HPC, RoCE could have plenty of takers. Again, the real draw here is the ubiquity of the Ethernet ecosystem and the promise of near-InfiniBand performance. It's worth noting that more than half the systems on the TOP500 list are still employing Ethernet interconnects. That's because there are plenty of big cluster-based workloads (for example, data mining) that don't require obsessively tight coupling, but would still benefit from better latency than vanilla Ethernet. As HPC makes deeper inroads into the enterprise, RoCE could look to fill this role.

As of this week, RoCE is implemented in OpenFabrics Enterprise Distribution (OFED) 1.5.1. The Linux version is available today, with a Windows implementation to follow later this year. That makes it especially nice for applications already written for OFED RDMA. In these cases, there would be no need to twiddle with the code again; the apps should just auto-magically run over any RoCE fabric.

On the hardware side, you need an L2 Ethernet switch that implements IEEE DCB (Data Center Bridging, aka Converged Enhanced Ethernet) with support for priority flow control. On the compute or storage server end, you need a RoCE-capable network adapter. Expect the most enthusiastic vendors to come out with products later this year. Mellanox has already declared its intentions to offer RoCE-friendly adapters. OpenFabrics will release a software-based RoCE later in the second quarter. Soft-RoCE will make a regular 10GbE NIC act like the hardware version.

One might wonder why the IBTA and its InfiniBand-loving members decided to push an Ethernet protocol at all. If RoCE is successful, there's bound to be some cannibalization of the InfiniBand market. But that's the wrong way to think about it. First, there are no InfiniBand vendors anymore, at least not in the strict sense. All these companies -- Mellanox, Voltaire and QLogic -- offer Ethernet products of one sort or another. The market decided some time ago that IB technology would only spread so far. RoCE is another way for these vendors to reach customers they couldn't attract before. The calculation is that there's enough daylight between RoCE and InfiniBand to support the viability of both technologies.

Posted by Michael Feldman - April 22, 2010 @ 8:06 PM, Pacific Daylight Time

Discussion

There are 6 discussion items posted.

Why can't people get this right??
Submitted by BradBooth on Apr 22, 2010 @ 10:56 PM EDT


IEEE 802.1 Data Center Bridging is not "aka Converged Enhanced Ethernet". CEE or CE is a trademarked term that appears to be IBTA's way of stating IEEE 802.1 DCB, probably because RoDCB doesn't sound as cool as RoCE (rocky).

I salute the IBTA for finally realizing that Ethernet is a great transport technology. With the enhancements being introduced (aka, not widely deployed) by IEEE 802.1 DCB, technologies like Fibre Channel and InfiniBand can use Ethernet as their primary layer 1 and 2 protocol. The IEEE 802.1 DCB enhancements also improve protocols such as iSCSI and iWARP, which also benefit from the ability to use TCP/IP for routing and switching.

End users are finally being presented with the ability to deploy only an Ethernet network infrastructure and to use FCoE and iSCSI for storage, and IBoE (InfiniBand over Ethernet) and iWARP for HPC and interprocessor communications.

Post #1

iWARP provides RDMA over Ethernet - Part 1
Submitted by David Fair on Apr 23, 2010 @ 1:16 PM EDT


As chair of the Ethernet Alliance’s iWARP Working Group, I’d like to highlight some facts regarding the benefits of iWARP and why the industry chose this path several years ago. Running the Open Fabrics software stack (OFED) that supports RDMA for both InfiniBand and iWARP over Ethernet is nothing new. RoCE is the second protocol to run that stack over Ethernet. iWARP, as specified by the IETF, has been in production for years from multiple vendors.

The fundamental difference is that iWARP runs on top of TCP/IP and inherits its proven robustness. There’s no limitation on its scalability. It benefits from DCB but is in no way dependent upon it. iWARP works today with the entire Ethernet infrastructure. You don’t need to install DCB switches. In contrast, RoCE uses a newly defined transport over the Ethernet wire and is restricted to Layer 2 switching.

HPCWire states, “[RoCE] is a much simpler solution than iWARP.” RoCE is certainly no simpler for the end user, as both technologies use exactly the same software stack. In fact, arguably RoCE is more challenging for the end user because he/she needs to deploy new DCB-capable and interoperable switches and ensure that all the nodes of interest can be serviced by the L2 switching they happen to provide. If the statement is supposed to mean RoCE is simpler for silicon suppliers, why should anyone else besides silicon suppliers care even if true?

Post #2

iWARP provides RDMA over Ethernet – Part 2
Submitted by David Fair on Apr 23, 2010 @ 1:16 PM EDT


“And [RoCE] can be implemented more cheaply and with less power consumption.” Cheaper for whom? Is the claim really that RoCE NICs will be cheaper than iWARP NICs?

There certainly is no evidence that there is any significant performance advantage to RoCE. HPCWire states, “Supporting the TCP/IP stack meant latency could only get into the 10 microsecond range.” Intel and Chelsio Communications, for example, have been shipping iWARP NICs with latencies well below 10 microseconds for years. Recently, Chelsio announced an iWARP NIC with two microsecond latency: http://www.chelsio.com/pr_032210.html.

Unfortunately, a new technology has been launched upon the industry that surely would have failed the IEEE’s “distinct identity” test requiring that it deliver something substantially new. The unfortunate part is inevitable market confusion that threatens to slow deployment of RDMA in the Ethernet infrastructure.

David Fair
Chair of Ethernet Alliance iWARP Working Group

Post #3

RoCE does not require DCB
Submitted by Paul Grun on Apr 29, 2010 @ 3:19 PM EDT


A quick correction to Michael's article; RoCE does not require nor depend on DCB in any way.

Post #4

roce clear winner
Submitted by sngan on Feb 22, 2011 @ 8:03 AM EST


my long term experience clearly says:

iwarp is unstable (at least the most stuff in ofed) and sucks all the way long.

steve

Post #5


Submitted by bandmedia on May 2, 2011 @ 11:42 AM EDT


industrial flooring - Dog Treadmill

Post #6

Michael Feldman is the editor of HPCwire.
