ESnet Launches Architecture to Help Researchers Deliver on Data-Intensive Science

By Nicole Hemsoth

April 26, 2012

The U.S. Department of Energy’s Energy Sciences Network, or ESnet, provides reliable high-bandwidth network services to thousands of researchers tackling some of the most pressing scientific and engineering problems, such as finding new sources of clean energy, increasing energy efficiency, understanding climate change, developing new materials for industry and discovering the nature of our universe. To support these research endeavors, ESnet connects scientists at more than 40 DOE sites with experimental and computing facilities in the U.S. and abroad, as well as with their collaborators around the world. ESnet is managed for DOE’s Office of Science by Lawrence Berkeley National Laboratory.

As science becomes increasingly data-intensive, the ESnet staff regularly meets with scientists to better understand their future networking needs, then develops and deploys the infrastructure and services to address those requirements before they become a reality. One example of this is the Advanced Networking Initiative, a prototype 100 gigabits-per-second network connecting the DOE Office of Science’s top supercomputing centers in California, Illinois and Tennessee, and an international peering point in New York. This 100 Gbps prototype is now being transitioned to production and will be rolled out to all other connected DOE sites in the coming year.

In order to help these research institutions fully capitalize on this growing availability of bandwidth to manage their growing data sets, ESnet is now working with the scientific community to encourage the use of a network design model called the “Science DMZ.” The Science DMZ is a specially designed local networking infrastructure aimed at speeding the delivery of scientific data. In March 2012, the National Science Foundation supported the concept by issuing a solicitation for proposals from universities to develop Science DMZs as they upgrade their local network infrastructures.

Leading the development of the Science DMZ effort at ESnet is Eli Dart, a network engineer with previous experience at Sandia National Laboratories and the National Energy Research Scientific Computing Center. In this interview conducted by Jon Bashor of Berkeley Lab, Dart answers some basic questions about the nature of the project and its principal goals.

Jon Bashor: What is the Science DMZ and where did the Science DMZ idea come from?

Eli Dart: In its purest form, it’s an element of the overall network architecture, typically a dedicated portion of a site or campus network, located as close to the network perimeter as possible, that serves only high-performance science applications. The intent of the Science DMZ is to simplify the deployment and support of high-performance and data-intensive science applications that rely on high-speed networking for success. These applications have unique network requirements that typically cannot be met by networks that are optimized for normal business operations like web browsing, procurement and financial systems, and the like. The idea itself came from two places.

The concept of a DMZ network originated in the network security space where so-called network “demilitarized zones” or DMZs are used to provide a dedicated portion of the network near the site perimeter specifically configured to support services that interact with the outside world. These services often include authoritative DNS, incoming email, outward facing websites, etc. These services usually fall under a security policy that’s different than the one for the rest of the enterprise architecture.

You can extend that notion to build a dedicated piece of the network specifically for high performance scientific applications, again located at or near the perimeter, and with hardware you know can handle these applications. The Science DMZ is not configured to handle the standard enterprise or business functions, such as email and web servers, desktop applications, and so forth. These typically need a massive security infrastructure to protect them, and the security measures required to protect business servers and desktop applications typically cause problems for high-performance applications. The Science DMZ model explicitly separates the science traffic from general-purpose network traffic, and allows appropriate security policies and enforcement mechanisms to be applied to each.

The second source for the Science DMZ concept came from working with TCP, or the Transmission Control Protocol. While most science applications that need reliable data delivery use TCP-based tools for data movement, TCP’s interpretation of packet loss can cause performance issues. TCP interprets packet loss as network congestion, and so when loss is encountered TCP dramatically reduces its sending rate – slowing the data transfer. In practice even a tiny amount of loss (much less than 1%) can be enough to reduce TCP performance by over a factor of 100.

For years people have been trying to fix TCP (with some success), but packet loss combined with high latency is a serious performance killer. It’s easier to build an infrastructure to provide loss-free IP service and to accommodate TCP rather than change it – this is what the Science DMZ model aims to accomplish.
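The sensitivity to loss described above can be sketched with the widely cited Mathis et al. approximation, which bounds steady-state TCP throughput at roughly (MSS / RTT) × (C / √p), where p is the packet loss rate. This model is not mentioned in the interview; it is used here only as an illustration, and the line rate, segment size, loss rate and round-trip times below are assumed values, not ESnet measurements.

```python
import math


def mathis_throughput_bps(mss_bytes: float, rtt_s: float, loss_rate: float) -> float:
    """Approximate upper bound on steady-state TCP throughput (bits/s).

    Mathis et al. model: rate <= (MSS / RTT) * (C / sqrt(p)),
    with C = sqrt(3/2) for the periodic-loss case.
    """
    c = math.sqrt(3.0 / 2.0)
    return (mss_bytes * 8.0 / rtt_s) * (c / math.sqrt(loss_rate))


def slowdown_factor(line_rate_bps: float, mss_bytes: float,
                    rtt_s: float, loss_rate: float) -> float:
    """How many times slower than line rate the modeled flow runs (>= 1)."""
    achievable = min(line_rate_bps,
                     mathis_throughput_bps(mss_bytes, rtt_s, loss_rate))
    return line_rate_bps / achievable


if __name__ == "__main__":
    LINE_RATE_BPS = 10e9   # assumed 10 Gbps path
    MSS_BYTES = 1460       # assumed standard Ethernet segment size
    LOSS_RATE = 1e-4       # assumed 0.01% packet loss

    for rtt_ms in (1, 10, 50, 100):
        rtt_s = rtt_ms / 1000.0
        rate = mathis_throughput_bps(MSS_BYTES, rtt_s, LOSS_RATE)
        factor = slowdown_factor(LINE_RATE_BPS, MSS_BYTES, rtt_s, LOSS_RATE)
        print(f"RTT {rtt_ms:>3} ms: bound {rate / 1e6:8.1f} Mbps, "
              f"{factor:6.1f}x below line rate")
```

Under these assumptions, the bound on a 50 ms path is about 29 Mbps, more than 300 times below the 10 Gbps line rate, while the same loss rate on a 1 ms path costs far less. That is the combination of loss and latency the Science DMZ is designed to avoid.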

Bashor: What makes up the Science DMZ model?

Dart: The Science DMZ itself is a portion of the network, at or near the site perimeter, which is specifically configured to support high-performance science applications. There are several key aspects to the Science DMZ.

First, it must be built with capable equipment that can handle high-rate flows without dropping packets. Typically, that means good equipment (not cheap wiring closet switches) with enough output buffer space to handle bursty high-rate long-distance TCP flows. The switches and routers need to be able to accurately account for packets (especially the ones they drop) so that packet loss can be accounted for and its cause fixed.

Second, data transfer should be done on dedicated servers – Data Transfer Nodes, or DTNs – that are designed and configured for the purpose. Their TCP stacks need to be tuned and they need access to high-speed storage. We have seen successful DTN implementations using high-speed local RAID as well as GPFS or Lustre filesystems; the parallel filesystem model is typically found at supercomputer centers.

Third, a Science DMZ needs test and measurement infrastructure, typically perfSONAR, that allows you to identify anything that may be causing performance problems. Many problems that are real performance killers are what we call “soft failures.” A soft failure causes performance degradation so that the network is not useful for data-intensive science but does not cause an outage that identifies the failing component. The only way to find these is to independently test the infrastructure to locate the problem – if perfSONAR is already deployed, this is much easier than if the first step of the process is to find and deploy a test machine and the second step is to get the site at the other end to find a spare box and deploy it.

Finally, the Science DMZ incorporates a security policy that is tailored to the science applications rather than to general-purpose business computing. You don’t need to scan 50TB of simulation output for email viruses, and you don’t run an email client on your Data Transfer Node. So, why conflate the security policies and enforcement mechanisms for the two, especially when doing so will effectively compromise the science mission? Firewalls and other security enforcement boxes are typically unable to handle the throughput needed for data-intensive science – and they essentially never support advanced science services such as virtual circuits or software-defined networking.
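The buffer and TCP tuning requirements in the first two points can be estimated with the bandwidth-delay product (BDP): the amount of data a sender must keep in flight to fill a path. The sketch below is a rule-of-thumb calculation, not ESnet's sizing guidance, and the link speeds and round-trip times are assumed examples.

```python
def bandwidth_delay_product_bytes(bandwidth_bps: float, rtt_s: float) -> float:
    """Bytes that must be in flight to keep a path of this speed and RTT full."""
    return bandwidth_bps * rtt_s / 8.0


if __name__ == "__main__":
    # (link speed in Gbps, round-trip time in ms); all values are assumptions.
    scenarios = ((1, 10), (10, 50), (100, 100))
    for gbps, rtt_ms in scenarios:
        bdp = bandwidth_delay_product_bytes(gbps * 1e9, rtt_ms / 1000.0)
        print(f"{gbps:>3} Gbps, {rtt_ms:>3} ms RTT -> {bdp / 1e6:8.1f} MB in flight")
```

A DTN whose TCP socket buffers are smaller than the BDP cannot fill the path, and bursts of that order can arrive at a switch output queue, which is why small-buffer wiring closet switches tend to drop packets from long-distance flows.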

Bashor: Why does it matter?

Dart: The real reason all this matters is that the current and future generations of scientific instruments are producing data at a level we’ve never seen before. Based on our projections, ESnet is expected to carry over 100 petabytes of data per month by 2015. And there is the potential for stupendous scientific advancements in that data deluge. The challenge is to figure out how to get the science done without spending the bulk of your time doing data management. Scientists are physicists, chemists, biologists, geneticists and so on, but they are seldom network experts. They are scientists.

The data volumes are becoming large enough that the systems and networks are not capable of handling them if the equipment is configured to default settings or to accommodate business applications. There’s a need for an infrastructure that supports data-intensive science. That infrastructure needs to be designed for data mobility, which means you can get the data where you need it, when you need it. In some cases, the analysis code is on a system close to the data, while other times the scientist wants to analyze the data on local resources – we need to support it all. Data-intensive science is what we’re all going to be doing for the next decade or more.
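To put the projected monthly volume in network terms, the conversion below turns a data volume into the sustained average rate needed to carry it. It assumes decimal petabytes (10^15 bytes) and a 30-day month; both are assumptions made for illustration.

```python
def average_rate_gbps(petabytes: float, days: float = 30.0) -> float:
    """Sustained rate in Gbps needed to move `petabytes` in `days` days."""
    total_bits = petabytes * 1e15 * 8.0   # decimal petabytes, an assumption
    seconds = days * 86_400
    return total_bits / seconds / 1e9


if __name__ == "__main__":
    rate = average_rate_gbps(100)
    print(f"100 PB per month averages about {rate:.0f} Gbps, around the clock")
```

Under these assumptions, 100 petabytes per month corresponds to roughly 309 Gbps sustained, before allowing for any peaks.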

Bashor: Can you describe a typical user who would benefit from having a Science DMZ?

Dart: The main benefit of the Science DMZ is that the scientist who needs to move data doesn’t have to first troubleshoot the infrastructure in order to use it. Scientists should not have to fix the network, the data transfer servers, and so forth before they can get to work.

There really isn’t a typical user, but there are some basic commonalities. One example could be data taken from a beamline at DOE’s Advanced Light Source. A data transfer node has been set up and Globus Online installed for users who need to fetch the data. Then you have the well-known Large Hadron Collider, which has several primary Tier 1 centers feeding data to the Tier 2 centers. This requires significantly more infrastructure. In both cases, you need to make sure the network is designed correctly so that data transfer tools work correctly. These fundamental principles benefit all users.

Bashor: How does ESnet play into this equation?

Dart: ESnet is the high-performance network for DOE’s Office of Science. It’s the backbone network infrastructure for the national laboratory system, supporting science at those labs. Through our 25 years of experience serving the scientific community, we have become a central repository for the expertise to support high-performance networking. So, part of our job is to be available to support scientists at the labs and their collaborators, such as researchers at universities.

The assumption is that the high-performance network infrastructure exists to support all parts of these modern scientific collaborations. The services must be consistent from end to end – from scientist to scientist – no matter where they may be located and regardless of who owns the pieces of the infrastructure. For example, if scientists at the SLAC Linear Accelerator Center are sharing data with colleagues at a Max Planck Institute in Germany, the data moves from SLAC’s local network over ESnet to GEANT, the pan-European research network, then over Germany’s DFN network and onto the local network at the institute – crossing five different domains, owned and operated by five different organizations. ESnet has built partnerships with the global ecosystem of research and education networks so that if a network problem occurs, we can work collaboratively to quickly resolve it – wherever it is.

Bashor: The NSF recently cited Science DMZ as an upgrade that universities should consider as they work to enhance their overall IT infrastructure. Your thoughts on this?

Dart: We think it’s wonderful. The infrastructure that will be built with those funds will enable discoveries that otherwise would not be possible. It’s a critical investment in the scientific infrastructure of this country.

As I said, we’re all going to spend the next decade or more supporting data-intensive science, so we need to get the infrastructure right. It needs to be adaptable, flexible and expandable. We can see what’s coming in the next one to three years. In some fields, the cost of generating data has fallen to almost zero. In genome sequencing, the cost per genome has fallen off a cliff. The cost of a raw megabyte of DNA sequence is now less than 10 cents. In July 2001, it was about $4,500. What this means is that we are entering a world where scientific productivity is gated on data analysis, not data generation.

In physics, new detectors will capture data in the terabyte-per-second range, with data analysis and reduction built into the detectors, so that only the data the researchers are really interested in will be kept. This is already happening at the LHC. The ATLAS detector generates about a petabyte of data a second. It’s sent through a multi-stage trigger farm where it’s reduced to about 2.5 gigabits per second coming out. Now many other science domains are getting into this same situation.
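The scale of that reduction can be worked out directly from the two figures quoted above. The sketch assumes a decimal petabyte (10^15 bytes) and takes both rates at face value.

```python
def reduction_factor(input_bytes_per_s: float, output_bits_per_s: float) -> float:
    """Ratio of raw detector output to the rate that leaves the trigger farm."""
    return (input_bytes_per_s * 8.0) / output_bits_per_s


if __name__ == "__main__":
    # 1 PB/s in, 2.5 Gbps out, as quoted in the interview.
    factor = reduction_factor(1e15, 2.5e9)
    print(f"Trigger farm reduction: about {factor:,.0f} to 1")
```

On those figures the trigger farm keeps roughly one bit in every 3.2 million.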

Looking 10 years out is beyond the current planning and budget outlooks – and well outside the scope of a single procurement or a single technology. This puts the work into the architecture space, not the technology or device space. We do know that everything about the data is growing exponentially, but not the funding. So we need to design a system that works well in general and is adaptable.

If you want to do capability-class science, you need to have capability-class infrastructure. You have to have the resources appropriate to get the most return on your scientific investment.

Bashor: ESnet has a number of projects to improve end-to-end network performance through testing and measurement. Can you talk about those briefly?

Dart: Performance testing and measurement is absolutely critical. If we go back to the need to accommodate TCP because packet loss is the number one enemy of data-intensive science, we have to be able to find and fix any problems quickly. Because issues can arise anywhere on the network path, which can include multiple administrative domains, you need to have the means to individually test the paths, and take out or reconfigure the problem areas.

For this reason, ESnet – with Internet2 and several other collaborators – helped develop perfSONAR, an infrastructure for network performance monitoring, making it easier to solve end-to-end performance problems on paths crossing several networks. ESnet has test and measurement capabilities at every hub site and router on our network. You have to have this infrastructure in place before a problem occurs – this allows you to find and fix the problem in hours or days, not months.

Another service for improving end-to-end performance is OSCARS, ESnet’s On-Demand Secure Circuits and Advance Reservation System. OSCARS provides multi-domain, high-bandwidth virtual circuits that guarantee end-to-end network data transfer performance. With a Science DMZ, OSCARS can touch down at an institution, along with other science-specific services. This allows for capability-class services to be used without interfering with the enterprise system. The bottom line is that science opportunities have a better chance of not being missed.
