ESnet Launches Architecture to Help Researchers Deliver on Data-Intensive Science

By Nicole Hemsoth

April 26, 2012

The U.S. Department of Energy’s Energy Sciences Network, or ESnet, provides reliable high-bandwidth network services to thousands of researchers tackling some of the most pressing scientific and engineering problems, such as finding new sources of clean energy, increasing energy efficiency, understanding climate change, developing new materials for industry and discovering the nature of our universe. To support these research endeavors, ESnet connects scientists at more than 40 DOE sites with experimental and computing facilities in the U.S. and abroad, as well as with their collaborators around the world. ESnet is managed for DOE’s Office of Science by Lawrence Berkeley National Laboratory.

As science becomes increasingly data-intensive, the ESnet staff regularly meets with scientists to better understand their future networking needs, then develops and deploys the infrastructure and services to address those requirements before they become a reality. One example of this is the Advanced Networking Initiative, a prototype 100 gigabits-per-second network connecting the DOE Office of Science’s top supercomputing centers in California, Illinois and Tennessee, and an international peering point in New York. This 100 Gbps prototype is now being transitioned to production and will be rolled out to all other connected DOE sites in the coming year.

In order to help these research institutions fully capitalize on this growing availability of bandwidth to manage their growing data sets, ESnet is now working with the scientific community to encourage the use of a network design model called the “Science DMZ.” The Science DMZ is a specially designed local networking infrastructure aimed at speeding the delivery of scientific data. In March 2012, the National Science Foundation supported the concept by issuing a solicitation for proposals from universities to develop Science DMZs as they upgrade their local network infrastructures.

Leading the development of the Science DMZ effort at ESnet is Eli Dart, a network engineer with previous experience at Sandia National Laboratories and the National Energy Research Scientific Computing Center. In this interview conducted by Jon Bashor of Berkeley Lab, Dart answers some basic questions about the nature of the project and its principal goals.

Jon Bashor: What is the Science DMZ and where did the Science DMZ idea come from?

Eli Dart: In its purest form, it’s an element of the overall network architecture, typically a dedicated portion of a site or campus network, located as close to the network perimeter as possible, that serves only high-performance science applications. The intent of the Science DMZ is to simplify the deployment and support of high-performance and data-intensive science applications that rely on high-speed networking for success. These applications have unique network requirements that typically cannot be met by networks that are optimized for normal business operations like web browsing, procurement and financial systems, and the like. The idea itself came from two places.

The concept of a DMZ network originated in the network security space where so-called network “demilitarized zones” or DMZs are used to provide a dedicated portion of the network near the site perimeter specifically configured to support services that interact with the outside world. These services often include authoritative DNS, incoming email, outward facing websites, etc. These services usually fall under a security policy that’s different than the one for the rest of the enterprise architecture.

You can extend that notion to build a dedicated piece of the network specifically for high performance scientific applications, again located at or near the perimeter, and with hardware you know can handle these applications. The Science DMZ is not configured to handle the standard enterprise or business functions, such as email and web servers, desktop applications, and so forth. These typically need a massive security infrastructure to protect them, and the security measures required to protect business servers and desktop applications typically cause problems for high-performance applications. The Science DMZ model explicitly separates the science traffic from general-purpose network traffic, and allows appropriate security policies and enforcement mechanisms to be applied to each.

The second source for the Science DMZ concept came from working with TCP, or the Transmission Control Protocol. While most science applications that need reliable data delivery use TCP-based tools for data movement, TCP’s interpretation of packet loss can cause performance issues. TCP interprets packet loss as network congestion, and so when loss is encountered TCP dramatically reduces its sending rate – slowing the data transfer. In practice even a tiny amount of loss (much less than 1%) can be enough to reduce TCP performance by over a factor of 100.

For years people have been trying to fix TCP (with some success), but packet loss combined with high latency is a serious performance killer. It’s easier to build an infrastructure to provide loss-free IP service and to accommodate TCP rather than change it – this is what the Science DMZ model aims to accomplish.
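The interaction between loss and latency described above can be illustrated with the widely cited Mathis et al. approximation for classic loss-based TCP, which bounds steady-state throughput by (MSS / RTT) × (C / √p), where p is the packet loss rate. The sketch below is only an illustration under stated assumptions: the 1460-byte segment size, the constant C = √(3/2), and the sample round-trip times and loss rates are chosen for the example and do not come from the interview. Newer congestion control algorithms behave differently, so the numbers indicate a trend rather than a prediction.

```python
import math

# Assumed illustrative values (not from the interview): a 1460-byte MSS and
# the constant C = sqrt(3/2) used in the Mathis et al. approximation.
MSS_BYTES = 1460
MATHIS_C = math.sqrt(3 / 2)


def mathis_throughput_bps(rtt_s, loss_rate, mss_bytes=MSS_BYTES):
    """Approximate ceiling on steady-state TCP throughput, in bits per second.

    Mathis model: rate <= (MSS / RTT) * (C / sqrt(p)), for a loss rate p > 0.
    """
    if rtt_s <= 0 or loss_rate <= 0:
        raise ValueError("rtt_s and loss_rate must be positive")
    return (mss_bytes * 8 / rtt_s) * (MATHIS_C / math.sqrt(loss_rate))


if __name__ == "__main__":
    # Compare a short local path with longer cross-country style paths.
    for rtt_ms in (1, 10, 50, 100):
        for loss in (1e-7, 1e-5, 1e-3):
            mbps = mathis_throughput_bps(rtt_ms / 1000, loss) / 1e6
            print(f"RTT {rtt_ms:>3} ms, loss {loss:.0e}: ~{mbps:,.1f} Mbps")
```

Under this model the ceiling falls with the square root of the loss rate and in direct proportion to the round-trip time, which is why loss that goes unnoticed on a local network can cripple a long-distance transfer.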

Bashor: What makes up the Science DMZ model?

Dart: The Science DMZ itself is a portion of the network, at or near the site perimeter, which is specifically configured to support high-performance science applications. There are several key aspects to the Science DMZ.

First, it must be built with capable equipment that can handle high-rate flows without dropping packets. Typically, that means good equipment (not cheap wiring closet switches) with enough output buffer space to handle bursty high-rate long-distance TCP flows. The switches and routers need to be able to accurately account for packets (especially the ones they drop) so that packet loss can be accounted for and its cause fixed.

Second, data transfer should be done on dedicated servers – Data Transfer Nodes, or DTNs – that are designed and configured for the purpose. Their TCP stacks need to be tuned and they need access to high-speed storage. We have seen successful DTN implementations using high-speed local RAID as well as GPFS or Lustre filesystems; the parallel filesystem model is typically found at supercomputer centers.

Third, a Science DMZ needs test and measurement infrastructure, typically perfSONAR, that allows you to identify anything that may be causing performance problems. Many problems that are real performance killers are what we call “soft failures.” A soft failure causes performance degradation so that the network is not useful for data-intensive science but does not cause an outage that identifies the failing component. The only way to find these is to independently test the infrastructure to locate the problem – if perfSONAR is already deployed, this is much easier than if the first step of the process is to find and deploy a test machine and the second step is to get the site at the other end to find a spare box and deploy it.

Finally, the Science DMZ incorporates a security policy that is tailored to the science applications rather than to general-purpose business computing. You don’t need to scan 50TB of simulation output for email viruses, and you don’t run an email client on your Data Transfer Node. So, why conflate the security policies and enforcement mechanisms for the two, especially when doing so will effectively compromise the science mission? Firewalls and other security enforcement boxes are typically unable to handle the throughput needed for data-intensive science – and they essentially never support advanced science services such as virtual circuits or software-defined networking.
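The earlier points about output buffer space and tuned TCP stacks can be made concrete with the bandwidth-delay product (BDP): the amount of data that must be in flight to keep a path full, which is bandwidth multiplied by round-trip time. The sketch below is a rough sizing aid, not ESnet's tuning procedure; the function name and the sample link speeds and round-trip times are assumptions picked for illustration.

```python
def bdp_bytes(bandwidth_bps, rtt_s):
    """Bandwidth-delay product in bytes: data in flight needed to fill the path."""
    return bandwidth_bps * rtt_s / 8


if __name__ == "__main__":
    # Hypothetical paths: (label, link speed in bits/s, round-trip time in s).
    paths = [
        ("1 Gbps, 1 ms (local)", 1e9, 0.001),
        ("10 Gbps, 50 ms (cross-country)", 10e9, 0.050),
        ("10 Gbps, 100 ms (long-haul)", 10e9, 0.100),
        ("100 Gbps, 100 ms (long-haul)", 100e9, 0.100),
    ]
    for label, bps, rtt in paths:
        print(f"{label}: ~{bdp_bytes(bps, rtt) / 1e6:,.1f} MB in flight")
```

A single TCP flow needs a window of roughly this size to fill the path, so default socket buffers sized for local traffic can cap long-distance transfers well below line rate.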

Bashor: Why does it matter?

Dart: The real reason all this matters is that the current and future generations of scientific instruments are producing data at a level we’ve never seen before. Based on our projections, ESnet is expected to carry over 100 petabytes of data per month by 2015. And there is the potential for stupendous scientific advancements in that data deluge. The challenge is to figure out how to get the science done without spending the bulk of your time doing data management. Scientists are physicists, chemists, biologists, geneticists and so on, but they are seldom network experts. They are scientists.

The data volumes are becoming large enough that the systems and networks are not capable of handling them if the equipment is configured to default settings or to accommodate business applications. There’s a need for an infrastructure that supports data-intensive science. That infrastructure needs to be designed for data mobility, which means you can get the data where you need it, when you need it. In some cases, the analysis code is on a system close to the data, while other times the scientist wants to analyze the data on local resources – we need to support it all. Data-intensive science is what we’re all going to be doing for the next decade or more.
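To give a sense of scale for these volumes, the following back-of-the-envelope sketch converts a monthly data volume into an average sustained rate and estimates how long a single data set takes to move. It assumes decimal units (1 PB = 10^15 bytes), a 30-day month, and an ideal transfer with no protocol overhead; the function names and the 10 Gbps figure are illustrative assumptions, while the 100 PB and 50 TB figures are the ones mentioned in the interview.

```python
SECONDS_PER_DAY = 86_400


def average_rate_gbps(petabytes, days):
    """Average sustained rate, in Gbps, needed to carry a volume over a period."""
    bits = petabytes * 1e15 * 8
    return bits / (days * SECONDS_PER_DAY) / 1e9


def transfer_time_hours(terabytes, rate_gbps):
    """Ideal time, in hours, to move a data set at a constant rate."""
    bits = terabytes * 1e12 * 8
    return bits / (rate_gbps * 1e9) / 3600


if __name__ == "__main__":
    rate = average_rate_gbps(100, 30)
    print(f"100 PB in 30 days averages ~{rate:.0f} Gbps")
    hours = transfer_time_hours(50, 10)
    print(f"50 TB at a sustained 10 Gbps takes ~{hours:.1f} hours")
```

Real transfers rarely sustain the full line rate, so these figures are lower bounds on time and upper bounds on what a given link can carry.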

Bashor: Can you describe a typical user who would benefit from having a Science DMZ?

Dart: The main benefit of the Science DMZ is that the scientist who needs to move data doesn’t have to first troubleshoot the infrastructure in order to use it. Scientists should not have to fix the network, the data transfer servers, and so forth before they can get to work.

There really isn’t a typical user, but there are some basic commonalities. One example could be data taken from a beamline at DOE’s Advanced Light Source. A data transfer node has been set up and Globus Online installed for users who need to fetch the data. Then you have the well-known Large Hadron Collider, which has several primary Tier 1 centers feeding data to the Tier 2 centers. This requires significantly more infrastructure. In both cases, you need to make sure the network is designed correctly so that data transfer tools work correctly. These fundamental principles benefit all users.

Bashor: How does ESnet play into this equation?

Dart: ESnet is the high-performance network for DOE’s Office of Science. It’s the backbone network infrastructure for the national laboratory system, supporting science at those labs. Through our 25 years of experience serving the scientific community, we have become a central repository for the expertise to support high-performance networking. So, part of our job is to be available to support scientists at the labs and their collaborators, such as researchers at universities.

The assumption is that the high-performance network infrastructure exists to support all parts of these modern scientific collaborations. The services must be consistent from end to end – from scientist to scientist – no matter where they may be located and regardless of who owns the pieces of the infrastructure. For example, if scientists at the SLAC Linear Accelerator Center are sharing data with colleagues at a Max Planck Institute in Germany, the data moves from SLAC’s local network over ESnet to GEANT, the pan-European research network, then over Germany’s DFN network and onto the local network at the institute – crossing five different domains, owned and operated by five different organizations. ESnet has built partnerships with the global ecosystem of research and education networks so that if a network problem occurs, we can work collaboratively to quickly resolve it – wherever it is.

Bashor: The NSF recently cited Science DMZ as an upgrade that universities should consider as they work to enhance their overall IT infrastructure. Your thoughts on this?

Dart: We think it’s wonderful. The infrastructure that will be built with those funds will enable discoveries that otherwise would not be possible. It’s a critical investment in the scientific infrastructure of this country.

As I said, we’re all going to spend the next decade or more supporting data-intensive science, so we need to get the infrastructure right. It needs to be adaptable, flexible and expandable. We can see what’s coming in the next one to three years. In some fields, the cost of generating data has fallen to almost zero. In genome sequencing, the cost per genome has fallen off a cliff. The cost of a raw megabyte of DNA sequence is now less than 10 cents. In July 2001, it was about $4,500. What this means is that we are entering a world where scientific productivity is gated on data analysis, not data generation.

In physics, new detectors will capture data in the terabyte-per-second range, with data analysis and reduction built into the detectors, so that only the data the researchers are really interested in will be kept. This is already happening at the LHC. The ATLAS detector generates about a petabyte of data a second. It’s sent through a multi-stage trigger farm where it’s reduced to about 2.5 gigabits per second coming out. Now many other science domains are getting into this same situation.

Looking 10 years out is beyond the current planning and budget outlooks – and well outside the scope of a single procurement or a single technology. This puts the work into the architecture space, not the technology or device space. We do know that everything about the data is growing exponentially, but not the funding. So we need to design a system that works well in general and is adaptable.

If you want to do capability-class science, you need to have capability-class infrastructure. You have to have the resources appropriate to get the most return on your scientific investment.

Bashor: ESnet has a number of projects to improve end-to-end network performance through testing and measurement. Can you talk about those briefly?

Dart: Performance testing and measurement is absolutely critical. If we go back to the need to accommodate TCP because packet loss is the number one enemy of data-intensive science, we have to be able to find and fix any problems quickly. Because issues can arise anywhere on the network path, which can include multiple administrative domains, you need to have the means to individually test the paths, and take out or reconfigure the problem areas.

For this reason, ESnet – with Internet2 and several other collaborators – helped develop perfSONAR, an infrastructure for network performance monitoring, making it easier to solve end-to-end performance problems on paths crossing several networks. ESnet has test and measurement capabilities at every hub site and router on our network. You have to have this infrastructure in place before a problem occurs – this allows you to find and fix the problem in hours or days, not months.

Another service for improving end-to-end performance is OSCARS, ESnet’s On-Demand Secure Circuits and Advance Reservation System. OSCARS provides multi-domain, high-bandwidth virtual circuits that guarantee end-to-end network data transfer performance. With a Science DMZ, OSCARS can touch down at an institution, along with other science-specific services. This allows for capability-class services to be used without interfering with the enterprise system. The bottom line is that science opportunities have a better chance of not being missed.
