ESnet Launches Architecture to Help Researchers Deliver on Data-Intensive Science

By Nicole Hemsoth

April 26, 2012

The U.S. Department of Energy’s Energy Sciences Network, or ESnet, provides reliable high-bandwidth network services to thousands of researchers tackling some of the most pressing scientific and engineering problems, such as finding new sources of clean energy, increasing energy efficiency, understanding climate change, developing new materials for industry and discovering the nature of our universe. To support these research endeavors, ESnet connects scientists at more than 40 DOE sites with experimental and computing facilities in the U.S. and abroad, as well as with their collaborators around the world. ESnet is managed for DOE’s Office of Science by Lawrence Berkeley National Laboratory.

As science becomes increasingly data-intensive, the ESnet staff regularly meets with scientists to better understand their future networking needs, then develops and deploys the infrastructure and services to address those requirements before they become a reality. One example of this is the Advanced Networking Initiative, a prototype 100 gigabits-per-second networking connecting the DOE Office of Science’s top supercomputing centers in California, Illinois and Tennessee, and an international peering point in New York. This 100 Gbps prototype is now being transitioned to production and will be rolled out to all other connected DOE sites in the coming year.

In order to help these research institutions fully capitalize on this growing availability of bandwidth to manage their growing data sets, ESnet is now working with the scientific community to encourage the use of a network design model called the “Science DMZ.” The Science DMZ is a specially designed local networking infrastructure aimed at speeding the delivery of scientific data. In March 2012, the National Science Foundation supported the concept by issuing a solicitation for proposals from universities to develop Science DMZs as they upgrade their local network infrastructures.

Leading the development of the Science DMZ effort at ESnet is Eli Dart, a network engineer with previous experience at Sandia National Laboratories and the National Energy Research Scientific Computing Center. In this interview conducted by Jon Bashor of Berkeley Lab, Dart answers some basic questions about the nature of the project and its principle goals.

Jon Bashor: What is the Science DMZ and where did the Science DMZ idea come from?

Eli Dart: In its purest form, it’s an element of the overall network architecture, typically a dedicated portion of a site or campus network, located as close to the network perimeter as possible, that serves only high-performance science applications. The intent of the Science DMZ is to simplify the deployment and support of high-performance and data-intensive science applications that rely on high-speed networking for success. These applications have unique network requirements that typically cannot be met by networks that are optimized for normal business operations like web browsing, procurement and financial systems, and the like. The idea itself came from two places.

The concept of a DMZ network originated in the network security space where so-called network “demilitarized zones” or DMZs are used to provide a dedicated portion of the network near the site perimeter specifically configured to support services that interact with the outside world. These services often include authoritative DNS, incoming email, outward facing websites, etc. These services usually fall under a security policy that’s different than the one for the rest of the enterprise architecture.

You can extend that notion to build a dedicated piece of the network specifically for high performance scientific applications, again located at or near the perimeter, and with hardware you know can handle these applications. The Science DMZ is not configured to handle the standard enterprise or business functions, such as email and web servers, desktop applications, and so forth. These typically need a massive security infrastructure to protect them, and the security measures required to protect business servers and desktop applications typically cause problems for high-performance applications. The Science DMZ model explicitly separates the science traffic from general-purpose network traffic, and allows appropriate security policies and enforcement mechanisms to be applied to each.

The second source for the Science DMZ concept came from working with TCP, or the Transmission Control Protocol. While most science applications that need reliable data delivery use TCP-based tools for data movement, TCP’s interpretation of packet loss can cause performance issues. TCP interprets packet loss as network congestion, and so when loss is encountered TCP dramatically reduces its sending rate – slowing the data transfer. In practice even a tiny amount of loss (much less than 1%) can be enough to reduce TCP performance by over a factor of 100.

For years people have been trying to fix TCP (with some success), but packet loss combined with high latency is a serious performance killer. It’s easier to build an infrastructure to provide loss-free IP service and to accommodate TCP rather than change it – this is what the Science DMZ model aims to accomplish.

Bashor: What makes up the Science DMZ model?

Dart: The Science DMZ itself is a portion of the network, at or near the site perimeter, which is specifically configured to support high-performance science applications. There are several key aspects to the Science DMZ.

First, it must be built with capable equipment that can handle high-rate flows without dropping packets. Typically, that means good equipment (not cheap wiring closet switches) with enough output buffer space to handle bursty high-rate long-distance TCP flows. The switches and routers need to be able to accurately account for packets (especially the ones they drop) so that packet loss can be accounted for and its cause fixed.

Second, data transfer should be done on dedicated servers – Data Transfer Nodes, or DTNs – that are designed and configured for the purpose. Their TCP stacks need to be tuned and they need access to high-speed storage. We have seen successful DTN implementations using high-speed local RAID as well as GPFS or Lustre filesystems, the parallel filesystem model is typically found at supercomputer centers.

Third, a Science DMZ needs test and measurement infrastructure, typically perfSONAR that allows you to identify any issue that may be causing performance issues. Many problems that are real performance killers are what we call “soft failures.” A soft failure causes performance degradation so that the network is not useful for data-intensive science but does not cause an outage that identifies the failing component. The only way to find these is to independently test the infrastructure to locate the problem – if perfSONAR is already deployed, this is much easier than if the first step of the process is to find and deploy a test machine and the second step is to get the site at the other end to find a spare box and deploy it.

Finally, the Science DMZ incorporates a security policy that is tailored to the science applications rather than to general-purpose business computing. You don’t need to scan 50TB of simulation output for email viruses, and you don’t run an email client on your Data Transfer Node. So, why conflate the security policies and enforcement mechanisms for the two, especially when doing so will effectively compromise the science mission? Firewalls and other security enforcement boxes are typically unable to handle the throughput needed for data-intensive science – and they essentially never support advanced science services such as virtual circuits or software-defined networking.

Bashor: Why does it matter?

Dart: The real reason all this matters is that the current and future generations of scientific instruments are producing data at a level we’ve never seen before. Based on our projections, ESnet is expected to carry over 100 petabytes of data per month by 2015. And there is the potential for stupendous scientific advancements in that data deluge. The challenge is to figure out how to get the science done without spending the bulk of your time doing data management. Scientists are physicists, chemists, biologists, geneticists and so on, but they are seldom network experts. They are scientists.

The data volumes are becoming large enough that the systems and networks are not capable of handling them if the equipment is configured to default settings or to accommodate business applications. There’s a need for an infrastructure that supports data-intensive science. That infrastructure needs to be designed for data mobility, which means you can get the data where you need it, when you need it. In some cases, the analysis code is on a system close to the data, while other times the scientist wants to analyze the data on local resources – we need to support it all. Data-intensive science is what we’re all going to be doing for the next decade or more.

Bashor: Can you describe a typical user who would benefit from having a Science DMZ?

Dart: The main benefit of the Science DMZ is that the scientist who needs to move data doesn’t have to first troubleshoot the infrastructure in order to use it. Scientists should not have to fix the network, the data transfer servers, and so forth before they can get to work.

There really isn’t a typical user, but there are some basic commonalities. One example could be data taken from a beamline at DOE’s Advanced Light Source. A data transfer node has been set up and Globus Online installed for users who need to fetch the data. Then you have the well-known Large Hadron Collider, which has several primary Tier 1 centers feeding data to the Tier 2 centers. This requires significantly more infrastructure. In both cases, you need to make sure the network is designed correctly so that data transfer tools work correctly. These fundamental principles benefit all users.

Bashor: How does ESnet play into this equation?

Dart: ESnet is the high-performance network for DOE’s Office of Science. It’s the backbone network infrastructure for the national laboratory system, supporting science at those labs. Through our 25 years of experience serving the scientific community, we have become a central repository for the expertise to support high-performance networking. So, part of our job is to be available to support scientists at the labs and their collaborators, such as researchers at universities.

The assumption is that the high-performance network infrastructure exists to support all parts of these modern scientific collaborations. The services must be consistent from end to end – from scientist to scientist – now matter where they may located and regardless of who owns the pieces of the infrastructure. For example, if scientists at the SLAC Linear Accelerator Center are sharing data with colleagues at a Max Planck Institute in Germany, the data moves from SLAC’s local network over ESnet to GEANT, the pan-European research network, then over Germany’s DFN network and onto the local network at the institute – crossing five different domains, owned and operated by five different organizations. ESnet has built partnerships with the global ecosystem of research and education networks so that if a network problem occurs, we can work collaboratively to quickly resolve it – wherever it is.

Bashor: The NSF recently cited Science DMZ as an upgrade that universities should consider as they work to enhance their overall IT infrastructure. Your thoughts on this?

Dart: We think it’s wonderful. The infrastructure that will be built with those funds will enable discoveries that otherwise would not be possible. It’s a critical investment in the scientific infrastructure of this country.

As I said, we’re all going spend the next decade or more supporting data-intensive science, so we need to get the infrastructure right. It needs to be adaptable, flexible and expandable. We can see what’s coming in the next one to three years. In some fields, the cost of generating data has fallen to almost zero. In genome sequencing, the cost per genome has fallen off a cliff. The cost of a raw megabyte of DNA sequence is now less than 10 cents. In July 2001, it was about $4,500. What this means is that we are entering a world where scientific productivity is gated on data analysis, not data generation.

In physics, new detectors will capture data in the terabyte-per-second range, with data analysis and reduction built into the detectors, so that only the data the researchers are really interested in will be kept. This is already happening at the LHC. The ATLAS detector generates about a petabyte of data a second. It’s sent through a multi-stage trigger farm where it’s reduced to about 2.5 gigabits per second coming out. Now many other science domains are getting into this same situation.

Looking 10 years out is beyond the current planning and budget outlooks – and well outside the scope of a single procurement or a single technology. This puts the work into the architecture space, not the technology or device space. We do know that everything about the data is growing exponentially, but not the funding. So we need to design a system that works well in general and is adaptable.

If you want to do capability-class science, you need to have capability-class infrastructure. You have to have the resources appropriate to get the most return on your scientific investment.

Bashor: ESnet has a number of projects to improve end-to-end network performance through testing and measurement. Can you talk about those briefly?

Dart: Performance testing and measurement is absolutely critical. If we go back to the need to accommodate TCP because packet loss is the number one enemy of data-intensive science, we have to be able to find and fix any problems quickly. Because issues can arise anywhere on the network path which can include multiple administrative domains, you need to have the means to individually test the paths, and take out or reconfigure the problem areas.

For this reason, ESnet – with Internet2 and several other collaborators – helped develop perfSONAR, an infrastructure for network performance monitoring, making it easier to solve end-to-end performance problems on paths crossing several networks. ESnet has test and measurement capabilities at every hub site and router on our network. You have to have this infrastructure in place before a problem occurs – this allows you to find and fix the problem in hours or days, not months.

Another service for improving end-to-end performance is OSCARS, ESnet’s On-Demand Secure Circuits and Advance Reservation System. OSCARS provides multi-domain, high-bandwidth virtual circuits that guarantee end-to-end network data transfer performance. With a Science DMZ, OSCARS can touch down at an institution, along with other science-specific services. This allows for capability-class services to be used without interfering with the enterprise system. The bottom line is that science opportunities have a better chance of not being missed.

Subscribe to HPCwire's Weekly Update!

Be the most informed person in the room! Stay ahead of the tech trends with industy updates delivered to you every week!

New Panasas High Performance Storage Straddles Commercial-Traditional HPC

November 13, 2018

High performance storage vendor Panasas has launched a new version of its ActiveStor product line this morning featuring what the company said is the industry’s first plug-and-play, portable parallel file system that d Read more…

By Doug Black

SC18 Student Cluster Competition – Revealing the Field

November 13, 2018

It’s November again and we’re almost ready for the kick-off of one of the greatest computer sports events in the world – the SC Student Cluster Competition. This is the twelfth time that teams of university undergr Read more…

By Dan Olds

US Leads Supercomputing with #1, #2 Systems & Petascale Arm

November 12, 2018

The 31st Supercomputing Conference (SC) - commemorating 30 years since the first Supercomputing in 1988 - kicked off in Dallas yesterday, taking over the Kay Bailey Hutchison Convention Center and much of the surrounding Read more…

By Tiffany Trader

HPE Extreme Performance Solutions

AI Can Be Scary. But Choosing the Wrong Partners Can Be Mortifying!

As you continue to dive deeper into AI, you will discover it is more than just deep learning. AI is an extremely complex set of machine learning, deep learning, reinforcement, and analytics algorithms with varying compute, storage, memory, and communications needs. Read more…

IBM Accelerated Insights

New Data Management Techniques for Intelligent Simulations

The trend in high performance supercomputer design has evolved – from providing maximum compute capability for complex scalable science applications, to capacity computing utilizing efficient, cost-effective computing power for solving a small number of large problems or a large number of small problems. Read more…

OpenACC Talks Up Summit and Community Momentum at SC18

November 12, 2018

OpenACC – the directives-based parallel programing model for optimizing applications on heterogeneous architectures – is showcasing user traction and HPC impact at SC18. Most noteworthy is that five of 13 CAAR applic Read more…

By John Russell

New Panasas High Performance Storage Straddles Commercial-Traditional HPC

November 13, 2018

High performance storage vendor Panasas has launched a new version of its ActiveStor product line this morning featuring what the company said is the industry Read more…

By Doug Black

SC18 Student Cluster Competition – Revealing the Field

November 13, 2018

It’s November again and we’re almost ready for the kick-off of one of the greatest computer sports events in the world – the SC Student Cluster Competitio Read more…

By Dan Olds

US Leads Supercomputing with #1, #2 Systems & Petascale Arm

November 12, 2018

The 31st Supercomputing Conference (SC) - commemorating 30 years since the first Supercomputing in 1988 - kicked off in Dallas yesterday, taking over the Kay Ba Read more…

By Tiffany Trader

OpenACC Talks Up Summit and Community Momentum at SC18

November 12, 2018

OpenACC – the directives-based parallel programing model for optimizing applications on heterogeneous architectures – is showcasing user traction and HPC im Read more…

By John Russell

How ASCI Revolutionized the World of High-Performance Computing and Advanced Modeling and Simulation

November 9, 2018

The 1993 Supercomputing Conference was held in Portland, Oregon. That conference and it’s show floor provided a good snapshot of the uncertainty that U.S. supercomputing was facing in the early 1990s. Many of the companies exhibiting that year would soon be gone, either bankrupt or acquired by somebody else. Read more…

By Alex R. Larzelere

At SC18: GM, Boeing, Deere, BP Talk Enterprise HPC Strategies

November 9, 2018

SC18 in Dallas (Nov.11-16) will feature an impressive series of sessions focused on the enterprise HPC deployments at some of the largest industrial companies: Read more…

By Doug Black

SC 30th Anniversary Perennials 1988-2018

November 8, 2018

Many conferences try, fewer succeed. Thirty years ago, no one knew if the first SC would also be the last. Thirty years later, we know it’s the biggest annual Read more…

By Doug Black & Tiffany Trader

CEA’s Pick of ThunderX2-based Atos System Boosts Arm

November 8, 2018

Europe’s bet on Arm took another step forward today with selection of an Atos BullSequana X1310 system by CEA’s (French Alternative Energies and Atomic Ener Read more…

By John Russell

Cray Unveils Shasta, Lands NERSC-9 Contract

October 30, 2018

Cray revealed today the details of its next-gen supercomputing architecture, Shasta, selected to be the next flagship system at NERSC. We've known of the code-name "Shasta" since the Argonne slice of the CORAL project was announced in 2015 and although the details of that plan have changed considerably, Cray didn't slow down its timeline for Shasta. Read more…

By Tiffany Trader

TACC Wins Next NSF-funded Major Supercomputer

July 30, 2018

The Texas Advanced Computing Center (TACC) has won the next NSF-funded big supercomputer beating out rivals including the National Center for Supercomputing Ap Read more…

By John Russell

IBM at Hot Chips: What’s Next for Power

August 23, 2018

With processor, memory and networking technologies all racing to fill in for an ailing Moore’s law, the era of the heterogeneous datacenter is well underway, Read more…

By Tiffany Trader

Requiem for a Phi: Knights Landing Discontinued

July 25, 2018

On Monday, Intel made public its end of life strategy for the Knights Landing "KNL" Phi product set. The announcement makes official what has already been wide Read more…

By Tiffany Trader

CERN Project Sees Orders-of-Magnitude Speedup with AI Approach

August 14, 2018

An award-winning effort at CERN has demonstrated potential to significantly change how the physics based modeling and simulation communities view machine learni Read more…

By Rob Farber

House Passes $1.275B National Quantum Initiative

September 17, 2018

Last Thursday the U.S. House of Representatives passed the National Quantum Initiative Act (NQIA) intended to accelerate quantum computing research and developm Read more…

By John Russell

Summit Supercomputer is Already Making its Mark on Science

September 20, 2018

Summit, now the fastest supercomputer in the world, is quickly making its mark in science – five of the six finalists just announced for the prestigious 2018 Read more…

By John Russell

New Deep Learning Algorithm Solves Rubik’s Cube

July 25, 2018

Solving (and attempting to solve) Rubik’s Cube has delighted millions of puzzle lovers since 1974 when the cube was invented by Hungarian sculptor and archite Read more…

By John Russell

Leading Solution Providers

TACC’s ‘Frontera’ Supercomputer Expands Horizon for Extreme-Scale Science

August 29, 2018

The National Science Foundation and the Texas Advanced Computing Center announced today that a new system, called Frontera, will overtake Stampede 2 as the fast Read more…

By Tiffany Trader

US Leads Supercomputing with #1, #2 Systems & Petascale Arm

November 12, 2018

The 31st Supercomputing Conference (SC) - commemorating 30 years since the first Supercomputing in 1988 - kicked off in Dallas yesterday, taking over the Kay Ba Read more…

By Tiffany Trader

HPE No. 1, IBM Surges, in ‘Bucking Bronco’ High Performance Server Market

September 27, 2018

Riding healthy U.S. and global economies, strong demand for AI-capable hardware and other tailwind trends, the high performance computing server market jumped 28 percent in the second quarter 2018 to $3.7 billion, up from $2.9 billion for the same period last year, according to industry analyst firm Hyperion Research. Read more…

By Doug Black

Intel Announces Cooper Lake, Advances AI Strategy

August 9, 2018

Intel's chief datacenter exec Navin Shenoy kicked off the company's Data-Centric Innovation Summit Wednesday, the day-long program devoted to Intel's datacenter Read more…

By Tiffany Trader

Germany Celebrates Launch of Two Fastest Supercomputers

September 26, 2018

The new high-performance computer SuperMUC-NG at the Leibniz Supercomputing Center (LRZ) in Garching is the fastest computer in Germany and one of the fastest i Read more…

By Tiffany Trader

Houston to Field Massive, ‘Geophysically Configured’ Cloud Supercomputer

October 11, 2018

Based on some news stories out today, one might get the impression that the next system to crack number one on the Top500 would be an industrial oil and gas mon Read more…

By Tiffany Trader

D-Wave Breaks New Ground in Quantum Simulation

July 16, 2018

Last Friday D-Wave scientists and colleagues published work in Science which they say represents the first fulfillment of Richard Feynman’s 1982 notion that Read more…

By John Russell

MLPerf – Will New Machine Learning Benchmark Help Propel AI Forward?

May 2, 2018

Let the AI benchmarking wars begin. Today, a diverse group from academia and industry – Google, Baidu, Intel, AMD, Harvard, and Stanford among them – releas Read more…

By John Russell

  • arrow
  • Click Here for More Headlines
  • arrow
Do NOT follow this link or you will be banned from the site!
Share This