ESnet Launches Architecture to Help Researchers Deliver on Data-Intensive Science

By Nicole Hemsoth

April 26, 2012

The U.S. Department of Energy’s Energy Sciences Network, or ESnet, provides reliable high-bandwidth network services to thousands of researchers tackling some of the most pressing scientific and engineering problems, such as finding new sources of clean energy, increasing energy efficiency, understanding climate change, developing new materials for industry and discovering the nature of our universe. To support these research endeavors, ESnet connects scientists at more than 40 DOE sites with experimental and computing facilities in the U.S. and abroad, as well as with their collaborators around the world. ESnet is managed for DOE’s Office of Science by Lawrence Berkeley National Laboratory.

As science becomes increasingly data-intensive, the ESnet staff regularly meets with scientists to better understand their future networking needs, then develops and deploys the infrastructure and services to address those requirements before they become a reality. One example of this is the Advanced Networking Initiative, a prototype 100 gigabits-per-second networking connecting the DOE Office of Science’s top supercomputing centers in California, Illinois and Tennessee, and an international peering point in New York. This 100 Gbps prototype is now being transitioned to production and will be rolled out to all other connected DOE sites in the coming year.

In order to help these research institutions fully capitalize on this growing availability of bandwidth to manage their growing data sets, ESnet is now working with the scientific community to encourage the use of a network design model called the “Science DMZ.” The Science DMZ is a specially designed local networking infrastructure aimed at speeding the delivery of scientific data. In March 2012, the National Science Foundation supported the concept by issuing a solicitation for proposals from universities to develop Science DMZs as they upgrade their local network infrastructures.

Leading the development of the Science DMZ effort at ESnet is Eli Dart, a network engineer with previous experience at Sandia National Laboratories and the National Energy Research Scientific Computing Center. In this interview conducted by Jon Bashor of Berkeley Lab, Dart answers some basic questions about the nature of the project and its principle goals.

Jon Bashor: What is the Science DMZ and where did the Science DMZ idea come from?

Eli Dart: In its purest form, it’s an element of the overall network architecture, typically a dedicated portion of a site or campus network, located as close to the network perimeter as possible, that serves only high-performance science applications. The intent of the Science DMZ is to simplify the deployment and support of high-performance and data-intensive science applications that rely on high-speed networking for success. These applications have unique network requirements that typically cannot be met by networks that are optimized for normal business operations like web browsing, procurement and financial systems, and the like. The idea itself came from two places.

The concept of a DMZ network originated in the network security space where so-called network “demilitarized zones” or DMZs are used to provide a dedicated portion of the network near the site perimeter specifically configured to support services that interact with the outside world. These services often include authoritative DNS, incoming email, outward facing websites, etc. These services usually fall under a security policy that’s different than the one for the rest of the enterprise architecture.

You can extend that notion to build a dedicated piece of the network specifically for high performance scientific applications, again located at or near the perimeter, and with hardware you know can handle these applications. The Science DMZ is not configured to handle the standard enterprise or business functions, such as email and web servers, desktop applications, and so forth. These typically need a massive security infrastructure to protect them, and the security measures required to protect business servers and desktop applications typically cause problems for high-performance applications. The Science DMZ model explicitly separates the science traffic from general-purpose network traffic, and allows appropriate security policies and enforcement mechanisms to be applied to each.

The second source for the Science DMZ concept came from working with TCP, or the Transmission Control Protocol. While most science applications that need reliable data delivery use TCP-based tools for data movement, TCP’s interpretation of packet loss can cause performance issues. TCP interprets packet loss as network congestion, and so when loss is encountered TCP dramatically reduces its sending rate – slowing the data transfer. In practice even a tiny amount of loss (much less than 1%) can be enough to reduce TCP performance by over a factor of 100.

For years people have been trying to fix TCP (with some success), but packet loss combined with high latency is a serious performance killer. It’s easier to build an infrastructure to provide loss-free IP service and to accommodate TCP rather than change it – this is what the Science DMZ model aims to accomplish.

Bashor: What makes up the Science DMZ model?

Dart: The Science DMZ itself is a portion of the network, at or near the site perimeter, which is specifically configured to support high-performance science applications. There are several key aspects to the Science DMZ.

First, it must be built with capable equipment that can handle high-rate flows without dropping packets. Typically, that means good equipment (not cheap wiring closet switches) with enough output buffer space to handle bursty high-rate long-distance TCP flows. The switches and routers need to be able to accurately account for packets (especially the ones they drop) so that packet loss can be accounted for and its cause fixed.

Second, data transfer should be done on dedicated servers – Data Transfer Nodes, or DTNs – that are designed and configured for the purpose. Their TCP stacks need to be tuned and they need access to high-speed storage. We have seen successful DTN implementations using high-speed local RAID as well as GPFS or Lustre filesystems, the parallel filesystem model is typically found at supercomputer centers.

Third, a Science DMZ needs test and measurement infrastructure, typically perfSONAR that allows you to identify any issue that may be causing performance issues. Many problems that are real performance killers are what we call “soft failures.” A soft failure causes performance degradation so that the network is not useful for data-intensive science but does not cause an outage that identifies the failing component. The only way to find these is to independently test the infrastructure to locate the problem – if perfSONAR is already deployed, this is much easier than if the first step of the process is to find and deploy a test machine and the second step is to get the site at the other end to find a spare box and deploy it.

Finally, the Science DMZ incorporates a security policy that is tailored to the science applications rather than to general-purpose business computing. You don’t need to scan 50TB of simulation output for email viruses, and you don’t run an email client on your Data Transfer Node. So, why conflate the security policies and enforcement mechanisms for the two, especially when doing so will effectively compromise the science mission? Firewalls and other security enforcement boxes are typically unable to handle the throughput needed for data-intensive science – and they essentially never support advanced science services such as virtual circuits or software-defined networking.

Bashor: Why does it matter?

Dart: The real reason all this matters is that the current and future generations of scientific instruments are producing data at a level we’ve never seen before. Based on our projections, ESnet is expected to carry over 100 petabytes of data per month by 2015. And there is the potential for stupendous scientific advancements in that data deluge. The challenge is to figure out how to get the science done without spending the bulk of your time doing data management. Scientists are physicists, chemists, biologists, geneticists and so on, but they are seldom network experts. They are scientists.

The data volumes are becoming large enough that the systems and networks are not capable of handling them if the equipment is configured to default settings or to accommodate business applications. There’s a need for an infrastructure that supports data-intensive science. That infrastructure needs to be designed for data mobility, which means you can get the data where you need it, when you need it. In some cases, the analysis code is on a system close to the data, while other times the scientist wants to analyze the data on local resources – we need to support it all. Data-intensive science is what we’re all going to be doing for the next decade or more.

Bashor: Can you describe a typical user who would benefit from having a Science DMZ?

Dart: The main benefit of the Science DMZ is that the scientist who needs to move data doesn’t have to first troubleshoot the infrastructure in order to use it. Scientists should not have to fix the network, the data transfer servers, and so forth before they can get to work.

There really isn’t a typical user, but there are some basic commonalities. One example could be data taken from a beamline at DOE’s Advanced Light Source. A data transfer node has been set up and Globus Online installed for users who need to fetch the data. Then you have the well-known Large Hadron Collider, which has several primary Tier 1 centers feeding data to the Tier 2 centers. This requires significantly more infrastructure. In both cases, you need to make sure the network is designed correctly so that data transfer tools work correctly. These fundamental principles benefit all users.

Bashor: How does ESnet play into this equation?

Dart: ESnet is the high-performance network for DOE’s Office of Science. It’s the backbone network infrastructure for the national laboratory system, supporting science at those labs. Through our 25 years of experience serving the scientific community, we have become a central repository for the expertise to support high-performance networking. So, part of our job is to be available to support scientists at the labs and their collaborators, such as researchers at universities.

The assumption is that the high-performance network infrastructure exists to support all parts of these modern scientific collaborations. The services must be consistent from end to end – from scientist to scientist – now matter where they may located and regardless of who owns the pieces of the infrastructure. For example, if scientists at the SLAC Linear Accelerator Center are sharing data with colleagues at a Max Planck Institute in Germany, the data moves from SLAC’s local network over ESnet to GEANT, the pan-European research network, then over Germany’s DFN network and onto the local network at the institute – crossing five different domains, owned and operated by five different organizations. ESnet has built partnerships with the global ecosystem of research and education networks so that if a network problem occurs, we can work collaboratively to quickly resolve it – wherever it is.

Bashor: The NSF recently cited Science DMZ as an upgrade that universities should consider as they work to enhance their overall IT infrastructure. Your thoughts on this?

Dart: We think it’s wonderful. The infrastructure that will be built with those funds will enable discoveries that otherwise would not be possible. It’s a critical investment in the scientific infrastructure of this country.

As I said, we’re all going spend the next decade or more supporting data-intensive science, so we need to get the infrastructure right. It needs to be adaptable, flexible and expandable. We can see what’s coming in the next one to three years. In some fields, the cost of generating data has fallen to almost zero. In genome sequencing, the cost per genome has fallen off a cliff. The cost of a raw megabyte of DNA sequence is now less than 10 cents. In July 2001, it was about $4,500. What this means is that we are entering a world where scientific productivity is gated on data analysis, not data generation.

In physics, new detectors will capture data in the terabyte-per-second range, with data analysis and reduction built into the detectors, so that only the data the researchers are really interested in will be kept. This is already happening at the LHC. The ATLAS detector generates about a petabyte of data a second. It’s sent through a multi-stage trigger farm where it’s reduced to about 2.5 gigabits per second coming out. Now many other science domains are getting into this same situation.

Looking 10 years out is beyond the current planning and budget outlooks – and well outside the scope of a single procurement or a single technology. This puts the work into the architecture space, not the technology or device space. We do know that everything about the data is growing exponentially, but not the funding. So we need to design a system that works well in general and is adaptable.

If you want to do capability-class science, you need to have capability-class infrastructure. You have to have the resources appropriate to get the most return on your scientific investment.

Bashor: ESnet has a number of projects to improve end-to-end network performance through testing and measurement. Can you talk about those briefly?

Dart: Performance testing and measurement is absolutely critical. If we go back to the need to accommodate TCP because packet loss is the number one enemy of data-intensive science, we have to be able to find and fix any problems quickly. Because issues can arise anywhere on the network path which can include multiple administrative domains, you need to have the means to individually test the paths, and take out or reconfigure the problem areas.

For this reason, ESnet – with Internet2 and several other collaborators – helped develop perfSONAR, an infrastructure for network performance monitoring, making it easier to solve end-to-end performance problems on paths crossing several networks. ESnet has test and measurement capabilities at every hub site and router on our network. You have to have this infrastructure in place before a problem occurs – this allows you to find and fix the problem in hours or days, not months.

Another service for improving end-to-end performance is OSCARS, ESnet’s On-Demand Secure Circuits and Advance Reservation System. OSCARS provides multi-domain, high-bandwidth virtual circuits that guarantee end-to-end network data transfer performance. With a Science DMZ, OSCARS can touch down at an institution, along with other science-specific services. This allows for capability-class services to be used without interfering with the enterprise system. The bottom line is that science opportunities have a better chance of not being missed.

Subscribe to HPCwire's Weekly Update!

Be the most informed person in the room! Stay ahead of the tech trends with industy updates delivered to you every week!

How ‘Knights Mill’ Gets Its Deep Learning Flops

June 22, 2017

Intel, the subject of much speculation regarding the delayed, rewritten or potentially canceled “Aurora” contract (the Argonne Lab part of the CORAL “pre-exascale” award), parsed out additional information ab Read more…

By Tiffany Trader

Tsinghua Crowned Eight-Time Student Cluster Champions at ISC

June 22, 2017

Always a hard-fought competition, the Student Cluster Competition awards were announced Wednesday, June 21, at the ISC High Performance Conference 2017. Amid whoops and hollers from the crowd, Thomas Sterling presented t Read more…

By Kim McMahon

GPUs, Power9, Figure Prominently in IBM’s Bet on Weather Forecasting

June 22, 2017

IBM jumped into the weather forecasting business roughly a year and a half ago by purchasing The Weather Company. This week at ISC 2017, Big Blue rolled out plans to push deeper into climate science and develop more gran Read more…

By John Russell

Intersect 360 at ISC: HPC Industry at $44B by 2021

June 22, 2017

The care, feeding and sustained growth of the HPC industry increasingly is in the hands of the commercial market sector – in particular, it’s the hyperscale companies and their embrace of AI and deep learning – tha Read more…

By Doug Black

HPE Extreme Performance Solutions

Creating a Roadmap for HPC Innovation at ISC 2017

In an era where technological advancements are driving innovation to every sector, and powering major economic and scientific breakthroughs, high performance computing (HPC) is crucial to tackle the challenges of today and tomorrow. Read more…

At ISC – Goh on Go: Humans Can’t Scale, the Data-Centric Learning Machine Can

June 22, 2017

I've seen the future this week at ISC, it’s on display in prototype or Powerpoint form, and it’s going to dumbfound you. The future is an AI neural network designed to emulate and compete with the human brain. In thi Read more…

By Doug Black

Cray Brings AI and HPC Together on Flagship Supers

June 20, 2017

Cray took one more step toward the convergence of big data and high performance computing (HPC) today when it announced that it’s adding a full suite of big data and artificial intelligence software to its top-of-the-l Read more…

By Alex Woodie

AMD Charges Back into the Datacenter and HPC Workflows with EPYC Processor

June 20, 2017

AMD is charging back into the enterprise datacenter and select HPC workflows with its new EPYC 7000 processor line, code-named Naples, announced today at a “global” launch event in Austin TX. In many ways it was a fu Read more…

By John Russell

Hyperion: Deep Learning, AI Helping Drive Healthy HPC Industry Growth

June 20, 2017

To be at the ISC conference in Frankfurt this week is to experience deep immersion in deep learning. Users want to learn about it, vendors want to talk about it, analysts and journalists want to report on it. Deep learni Read more…

By Doug Black

How ‘Knights Mill’ Gets Its Deep Learning Flops

June 22, 2017

Intel, the subject of much speculation regarding the delayed, rewritten or potentially canceled “Aurora” contract (the Argonne Lab part of the CORAL “ Read more…

By Tiffany Trader

Tsinghua Crowned Eight-Time Student Cluster Champions at ISC

June 22, 2017

Always a hard-fought competition, the Student Cluster Competition awards were announced Wednesday, June 21, at the ISC High Performance Conference 2017. Amid wh Read more…

By Kim McMahon

GPUs, Power9, Figure Prominently in IBM’s Bet on Weather Forecasting

June 22, 2017

IBM jumped into the weather forecasting business roughly a year and a half ago by purchasing The Weather Company. This week at ISC 2017, Big Blue rolled out pla Read more…

By John Russell

Intersect 360 at ISC: HPC Industry at $44B by 2021

June 22, 2017

The care, feeding and sustained growth of the HPC industry increasingly is in the hands of the commercial market sector – in particular, it’s the hyperscale Read more…

By Doug Black

At ISC – Goh on Go: Humans Can’t Scale, the Data-Centric Learning Machine Can

June 22, 2017

I've seen the future this week at ISC, it’s on display in prototype or Powerpoint form, and it’s going to dumbfound you. The future is an AI neural network Read more…

By Doug Black

Cray Brings AI and HPC Together on Flagship Supers

June 20, 2017

Cray took one more step toward the convergence of big data and high performance computing (HPC) today when it announced that it’s adding a full suite of big d Read more…

By Alex Woodie

AMD Charges Back into the Datacenter and HPC Workflows with EPYC Processor

June 20, 2017

AMD is charging back into the enterprise datacenter and select HPC workflows with its new EPYC 7000 processor line, code-named Naples, announced today at a “g Read more…

By John Russell

Hyperion: Deep Learning, AI Helping Drive Healthy HPC Industry Growth

June 20, 2017

To be at the ISC conference in Frankfurt this week is to experience deep immersion in deep learning. Users want to learn about it, vendors want to talk about it Read more…

By Doug Black

Quantum Bits: D-Wave and VW; Google Quantum Lab; IBM Expands Access

March 21, 2017

For a technology that’s usually characterized as far off and in a distant galaxy, quantum computing has been steadily picking up steam. Just how close real-wo Read more…

By John Russell

Trump Budget Targets NIH, DOE, and EPA; No Mention of NSF

March 16, 2017

President Trump’s proposed U.S. fiscal 2018 budget issued today sharply cuts science spending while bolstering military spending as he promised during the cam Read more…

By John Russell

HPC Compiler Company PathScale Seeks Life Raft

March 23, 2017

HPCwire has learned that HPC compiler company PathScale has fallen on difficult times and is asking the community for help or actively seeking a buyer for its a Read more…

By Tiffany Trader

Google Pulls Back the Covers on Its First Machine Learning Chip

April 6, 2017

This week Google released a report detailing the design and performance characteristics of the Tensor Processing Unit (TPU), its custom ASIC for the inference Read more…

By Tiffany Trader

CPU-based Visualization Positions for Exascale Supercomputing

March 16, 2017

In this contributed perspective piece, Intel’s Jim Jeffers makes the case that CPU-based visualization is now widely adopted and as such is no longer a contrarian view, but is rather an exascale requirement. Read more…

By Jim Jeffers, Principal Engineer and Engineering Leader, Intel

Nvidia Responds to Google TPU Benchmarking

April 10, 2017

Nvidia highlights strengths of its newest GPU silicon in response to Google's report on the performance and energy advantages of its custom tensor processor. Read more…

By Tiffany Trader

Nvidia’s Mammoth Volta GPU Aims High for AI, HPC

May 10, 2017

At Nvidia's GPU Technology Conference (GTC17) in San Jose, Calif., this morning, CEO Jensen Huang announced the company's much-anticipated Volta architecture a Read more…

By Tiffany Trader

Facebook Open Sources Caffe2; Nvidia, Intel Rush to Optimize

April 18, 2017

From its F8 developer conference in San Jose, Calif., today, Facebook announced Caffe2, a new open-source, cross-platform framework for deep learning. Caffe2 is the successor to Caffe, the deep learning framework developed by Berkeley AI Research and community contributors. Read more…

By Tiffany Trader

Leading Solution Providers

MIT Mathematician Spins Up 220,000-Core Google Compute Cluster

April 21, 2017

On Thursday, Google announced that MIT math professor and computational number theorist Andrew V. Sutherland had set a record for the largest Google Compute Engine (GCE) job. Sutherland ran the massive mathematics workload on 220,000 GCE cores using preemptible virtual machine instances. Read more…

By Tiffany Trader

Google Debuts TPU v2 and will Add to Google Cloud

May 25, 2017

Not long after stirring attention in the deep learning/AI community by revealing the details of its Tensor Processing Unit (TPU), Google last week announced the Read more…

By John Russell

US Supercomputing Leaders Tackle the China Question

March 15, 2017

Joint DOE-NSA report responds to the increased global pressures impacting the competitiveness of U.S. supercomputing. Read more…

By Tiffany Trader

Russian Researchers Claim First Quantum-Safe Blockchain

May 25, 2017

The Russian Quantum Center today announced it has overcome the threat of quantum cryptography by creating the first quantum-safe blockchain, securing cryptocurrencies like Bitcoin, along with classified government communications and other sensitive digital transfers. Read more…

By Doug Black

Groq This: New AI Chips to Give GPUs a Run for Deep Learning Money

April 24, 2017

CPUs and GPUs, move over. Thanks to recent revelations surrounding Google’s new Tensor Processing Unit (TPU), the computing world appears to be on the cusp of Read more…

By Alex Woodie

DOE Supercomputer Achieves Record 45-Qubit Quantum Simulation

April 13, 2017

In order to simulate larger and larger quantum systems and usher in an age of “quantum supremacy,” researchers are stretching the limits of today’s most advanced supercomputers. Read more…

By Tiffany Trader

Messina Update: The US Path to Exascale in 16 Slides

April 26, 2017

Paul Messina, director of the U.S. Exascale Computing Project, provided a wide-ranging review of ECP’s evolving plans last week at the HPC User Forum. Read more…

By John Russell

Knights Landing Processor with Omni-Path Makes Cloud Debut

April 18, 2017

HPC cloud specialist Rescale is partnering with Intel and HPC resource provider R Systems to offer first-ever cloud access to Xeon Phi "Knights Landing" processors. The infrastructure is based on the 68-core Intel Knights Landing processor with integrated Omni-Path fabric (the 7250F Xeon Phi). Read more…

By Tiffany Trader

  • arrow
  • Click Here for More Headlines
  • arrow
Share This