ESnet Launches Architecture to Help Researchers Deliver on Data-Intensive Science

By Nicole Hemsoth

April 26, 2012

The U.S. Department of Energy’s Energy Sciences Network, or ESnet, provides reliable high-bandwidth network services to thousands of researchers tackling some of the most pressing scientific and engineering problems, such as finding new sources of clean energy, increasing energy efficiency, understanding climate change, developing new materials for industry and discovering the nature of our universe. To support these research endeavors, ESnet connects scientists at more than 40 DOE sites with experimental and computing facilities in the U.S. and abroad, as well as with their collaborators around the world. ESnet is managed for DOE’s Office of Science by Lawrence Berkeley National Laboratory.

As science becomes increasingly data-intensive, the ESnet staff regularly meets with scientists to better understand their future networking needs, then develops and deploys the infrastructure and services to address those requirements before they become a reality. One example of this is the Advanced Networking Initiative, a prototype 100 gigabits-per-second network connecting the DOE Office of Science’s top supercomputing centers in California, Illinois and Tennessee, and an international peering point in New York. This 100 Gbps prototype is now being transitioned to production and will be rolled out to all other connected DOE sites in the coming year.

In order to help these research institutions fully capitalize on this growing availability of bandwidth to manage their growing data sets, ESnet is now working with the scientific community to encourage the use of a network design model called the “Science DMZ.” The Science DMZ is a specially designed local networking infrastructure aimed at speeding the delivery of scientific data. In March 2012, the National Science Foundation supported the concept by issuing a solicitation for proposals from universities to develop Science DMZs as they upgrade their local network infrastructures.

Leading the development of the Science DMZ effort at ESnet is Eli Dart, a network engineer with previous experience at Sandia National Laboratories and the National Energy Research Scientific Computing Center. In this interview conducted by Jon Bashor of Berkeley Lab, Dart answers some basic questions about the nature of the project and its principal goals.

Jon Bashor: What is the Science DMZ and where did the Science DMZ idea come from?

Eli Dart: In its purest form, it’s an element of the overall network architecture, typically a dedicated portion of a site or campus network, located as close to the network perimeter as possible, that serves only high-performance science applications. The intent of the Science DMZ is to simplify the deployment and support of high-performance and data-intensive science applications that rely on high-speed networking for success. These applications have unique network requirements that typically cannot be met by networks that are optimized for normal business operations like web browsing, procurement and financial systems, and the like. The idea itself came from two places.

The concept of a DMZ network originated in the network security space where so-called network “demilitarized zones” or DMZs are used to provide a dedicated portion of the network near the site perimeter specifically configured to support services that interact with the outside world. These services often include authoritative DNS, incoming email, outward facing websites, etc. These services usually fall under a security policy that’s different than the one for the rest of the enterprise architecture.

You can extend that notion to build a dedicated piece of the network specifically for high performance scientific applications, again located at or near the perimeter, and with hardware you know can handle these applications. The Science DMZ is not configured to handle the standard enterprise or business functions, such as email and web servers, desktop applications, and so forth. These typically need a massive security infrastructure to protect them, and the security measures required to protect business servers and desktop applications typically cause problems for high-performance applications. The Science DMZ model explicitly separates the science traffic from general-purpose network traffic, and allows appropriate security policies and enforcement mechanisms to be applied to each.

The second source for the Science DMZ concept came from working with TCP, or the Transmission Control Protocol. While most science applications that need reliable data delivery use TCP-based tools for data movement, TCP’s interpretation of packet loss can cause performance issues. TCP interprets packet loss as network congestion, and so when loss is encountered TCP dramatically reduces its sending rate – slowing the data transfer. In practice even a tiny amount of loss (much less than 1%) can be enough to reduce TCP performance by over a factor of 100.

For years people have been trying to fix TCP (with some success), but packet loss combined with high latency is a serious performance killer. It’s easier to build an infrastructure to provide loss-free IP service and to accommodate TCP rather than change it – this is what the Science DMZ model aims to accomplish.
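
The sensitivity to loss described above can be illustrated with the widely cited Mathis et al. approximation for the steady-state throughput of classic (Reno-style) TCP: rate <= C * MSS / (RTT * sqrt(p)), with C = sqrt(3/2). The sketch below is not from the interview. The function name, the 1460-byte segment size, the 80 ms round-trip time and the two loss rates are assumptions chosen only for illustration, and newer congestion-control algorithms will behave somewhat differently.

```python
import math


def mathis_throughput_bps(mss_bytes: float, rtt_s: float, loss_rate: float) -> float:
    """Approximate upper bound on Reno-style TCP throughput, in bits per second.

    Uses the Mathis et al. model: rate <= C * MSS / (RTT * sqrt(p)),
    where C = sqrt(3/2) and p is the packet loss probability.
    """
    if rtt_s <= 0:
        raise ValueError("rtt_s must be positive")
    if not 0 < loss_rate <= 1:
        raise ValueError("loss_rate must be in (0, 1]")
    c = math.sqrt(3.0 / 2.0)
    return c * mss_bytes * 8 / (rtt_s * math.sqrt(loss_rate))


if __name__ == "__main__":
    # Illustrative values only (assumed, not measured): a long-distance path.
    mss = 1460   # bytes per segment, typical for a 1500-byte MTU
    rtt = 0.080  # 80 ms round-trip time
    for p in (1e-8, 1e-4):
        mbps = mathis_throughput_bps(mss, rtt, p) / 1e6
        print(f"loss rate {p:.0e}: at most about {mbps:,.1f} Mbps")
```

Because the bound scales with 1/sqrt(p), raising the loss rate from 0.000001% to 0.01% in this model cuts the achievable rate by a factor of 100, and the bound also shrinks in direct proportion to round-trip time, which is why loss on long paths is so damaging.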

Bashor: What makes up the Science DMZ model?

Dart: The Science DMZ itself is a portion of the network, at or near the site perimeter, which is specifically configured to support high-performance science applications. There are several key aspects to the Science DMZ.

First, it must be built with capable equipment that can handle high-rate flows without dropping packets. Typically, that means good equipment (not cheap wiring closet switches) with enough output buffer space to handle bursty high-rate long-distance TCP flows. The switches and routers need to be able to accurately account for packets (especially the ones they drop) so that packet loss can be accounted for and its cause fixed.

Second, data transfer should be done on dedicated servers – Data Transfer Nodes, or DTNs – that are designed and configured for the purpose. Their TCP stacks need to be tuned and they need access to high-speed storage. We have seen successful DTN implementations using high-speed local RAID as well as GPFS or Lustre filesystems; the parallel filesystem model is typically found at supercomputer centers.

Third, a Science DMZ needs test and measurement infrastructure, typically perfSONAR, that allows you to identify any problem that may be degrading performance. Many problems that are real performance killers are what we call “soft failures.” A soft failure causes performance degradation so that the network is not useful for data-intensive science but does not cause an outage that identifies the failing component. The only way to find these is to independently test the infrastructure to locate the problem – if perfSONAR is already deployed, this is much easier than if the first step of the process is to find and deploy a test machine and the second step is to get the site at the other end to find a spare box and deploy it.

Finally, the Science DMZ incorporates a security policy that is tailored to the science applications rather than to general-purpose business computing. You don’t need to scan 50TB of simulation output for email viruses, and you don’t run an email client on your Data Transfer Node. So, why conflate the security policies and enforcement mechanisms for the two, especially when doing so will effectively compromise the science mission? Firewalls and other security enforcement boxes are typically unable to handle the throughput needed for data-intensive science – and they essentially never support advanced science services such as virtual circuits or software-defined networking.
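
One way to see why DTN TCP stacks need tuning and why switches need deep output buffers is the bandwidth-delay product (BDP): the amount of data that must be in flight to keep a path full. The following is a minimal sketch, not ESnet's tuning procedure. The function name and the 10 Gbps and 80 ms figures are assumptions for illustration.

```python
def bandwidth_delay_product_bytes(bandwidth_bps: float, rtt_s: float) -> float:
    """Bytes that must be in flight to fill a path of the given rate and RTT."""
    if bandwidth_bps <= 0 or rtt_s <= 0:
        raise ValueError("bandwidth_bps and rtt_s must be positive")
    return bandwidth_bps * rtt_s / 8


if __name__ == "__main__":
    # Assumed example: a 10 Gbps path with an 80 ms round-trip time.
    bdp = bandwidth_delay_product_bytes(10e9, 0.080)
    print(f"BDP is about {bdp / 1e6:.0f} MB")
```

A single TCP flow can only fill such a path if the sender and receiver are allowed socket buffers of at least roughly this size, which is far larger than the defaults on many general-purpose systems.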

Bashor: Why does it matter?

Dart: The real reason all this matters is that the current and future generations of scientific instruments are producing data at a level we’ve never seen before. Based on our projections, ESnet is expected to carry over 100 petabytes of data per month by 2015. And there is the potential for stupendous scientific advancements in that data deluge. The challenge is to figure out how to get the science done without spending the bulk of your time doing data management. Scientists are physicists, chemists, biologists, geneticists and so on, but they are seldom network experts. They are scientists.

The data volumes are becoming large enough that the systems and networks are not capable of handling them if the equipment is configured to default settings or to accommodate business applications. There’s a need for an infrastructure that supports data-intensive science. That infrastructure needs to be designed for data mobility, which means you can get the data where you need it, when you need it. In some cases, the analysis code is on a system close to the data, while other times the scientist wants to analyze the data on local resources – we need to support it all. Data-intensive science is what we’re all going to be doing for the next decade or more.
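
To put the projected 100 petabytes per month in networking terms, the arithmetic below converts a monthly volume into an average sustained rate. It is a rough sketch that assumes decimal petabytes (10^15 bytes) and a 30-day month. The function name is hypothetical, and real traffic is bursty, so peak rates would be higher than this average.

```python
def average_rate_gbps(petabytes: float, days: float) -> float:
    """Average rate in Gbps needed to move `petabytes` in `days` days.

    Assumes decimal units: 1 PB = 10**15 bytes, 1 Gbps = 10**9 bits/s.
    """
    if days <= 0:
        raise ValueError("days must be positive")
    total_bits = petabytes * 1e15 * 8
    seconds = days * 86_400
    return total_bits / seconds / 1e9


if __name__ == "__main__":
    rate = average_rate_gbps(100, 30)
    print(f"100 PB over 30 days averages about {rate:.0f} Gbps")
```

Under these assumptions the average works out to roughly 309 Gbps, which gives a sense of why 100 Gbps links and loss-free paths matter at this scale.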

Bashor: Can you describe a typical user who would benefit from having a Science DMZ?

Dart: The main benefit of the Science DMZ is that the scientist who needs to move data doesn’t have to first troubleshoot the infrastructure in order to use it. Scientists should not have to fix the network, the data transfer servers, and so forth before they can get to work.

There really isn’t a typical user, but there are some basic commonalities. One example could be data taken from a beamline at DOE’s Advanced Light Source. A data transfer node has been set up and Globus Online installed for users who need to fetch the data. Then you have the well-known Large Hadron Collider, which has several primary Tier 1 centers feeding data to the Tier 2 centers. This requires significantly more infrastructure. In both cases, you need to make sure the network is designed correctly so that data transfer tools work correctly. These fundamental principles benefit all users.

Bashor: How does ESnet play into this equation?

Dart: ESnet is the high-performance network for DOE’s Office of Science. It’s the backbone network infrastructure for the national laboratory system, supporting science at those labs. Through our 25 years of experience serving the scientific community, we have become a central repository for the expertise to support high-performance networking. So, part of our job is to be available to support scientists at the labs and their collaborators, such as researchers at universities.

The assumption is that the high-performance network infrastructure exists to support all parts of these modern scientific collaborations. The services must be consistent from end to end – from scientist to scientist – no matter where they may be located and regardless of who owns the pieces of the infrastructure. For example, if scientists at the SLAC Linear Accelerator Center are sharing data with colleagues at a Max Planck Institute in Germany, the data moves from SLAC’s local network over ESnet to GEANT, the pan-European research network, then over Germany’s DFN network and onto the local network at the institute – crossing five different domains, owned and operated by five different organizations. ESnet has built partnerships with the global ecosystem of research and education networks so that if a network problem occurs, we can work collaboratively to quickly resolve it – wherever it is.

Bashor: The NSF recently cited Science DMZ as an upgrade that universities should consider as they work to enhance their overall IT infrastructure. Your thoughts on this?

Dart: We think it’s wonderful. The infrastructure that will be built with those funds will enable discoveries that otherwise would not be possible. It’s a critical investment in the scientific infrastructure of this country.

As I said, we’re all going to spend the next decade or more supporting data-intensive science, so we need to get the infrastructure right. It needs to be adaptable, flexible and expandable. We can see what’s coming in the next one to three years. In some fields, the cost of generating data has fallen to almost zero. In genome sequencing, the cost per genome has fallen off a cliff. The cost of a raw megabyte of DNA sequence is now less than 10 cents. In July 2001, it was about $4,500. What this means is that we are entering a world where scientific productivity is gated on data analysis, not data generation.

In physics, new detectors will capture data in the terabyte-per-second range, with data analysis and reduction built into the detectors, so that only the data the researchers are really interested in will be kept. This is already happening at the LHC. The ATLAS detector generates about a petabyte of data a second. It’s sent through a multi-stage trigger farm where it’s reduced to about 2.5 gigabits per second coming out. Now many other science domains are getting into this same situation.

Looking 10 years out is beyond the current planning and budget outlooks – and well outside the scope of a single procurement or a single technology. This puts the work into the architecture space, not the technology or device space. We do know that everything about the data is growing exponentially, but not the funding. So we need to design a system that works well in general and is adaptable.

If you want to do capability-class science, you need to have capability-class infrastructure. You have to have the resources appropriate to get the most return on your scientific investment.

Bashor: ESnet has a number of projects to improve end-to-end network performance through testing and measurement. Can you talk about those briefly?

Dart: Performance testing and measurement is absolutely critical. If we go back to the need to accommodate TCP because packet loss is the number one enemy of data-intensive science, we have to be able to find and fix any problems quickly. Because issues can arise anywhere on the network path, which can include multiple administrative domains, you need to have the means to individually test the paths, and take out or reconfigure the problem areas.

For this reason, ESnet – with Internet2 and several other collaborators – helped develop perfSONAR, an infrastructure for network performance monitoring, making it easier to solve end-to-end performance problems on paths crossing several networks. ESnet has test and measurement capabilities at every hub site and router on our network. You have to have this infrastructure in place before a problem occurs – this allows you to find and fix the problem in hours or days, not months.

Another service for improving end-to-end performance is OSCARS, ESnet’s On-Demand Secure Circuits and Advance Reservation System. OSCARS provides multi-domain, high-bandwidth virtual circuits that guarantee end-to-end network data transfer performance. With a Science DMZ, OSCARS can touch down at an institution, along with other science-specific services. This allows for capability-class services to be used without interfering with the enterprise system. The bottom line is that science opportunities have a better chance of not being missed.
