HPC in the Land of 24/7

By Michael Feldman

November 23, 2007

More businesses than ever are employing high performance computing capabilities to fulfill their mission-critical needs. While many of these companies aren’t using traditional technical computing, they still require a level of processing power, networking performance or storage scale that necessitates HPC assets. In most cases, the systems are not being used to produce a single answer or model a specific problem, but rather to provide a continuous high performance capability for processing real-time transactions. In this type of environment, pure performance is not enough; marrying HPC with mission-critical computing is the real challenge.

Examples of such businesses include Wal-Mart, NASDAQ, and FedEx, three companies that shared their experiences with high performance computing at a Masterworks session at SC07 in Reno last week. The session was organized with the help of the Council on Competitiveness, an NGO that focuses on U.S. economic competitiveness opportunities and challenges.

NASDAQ — Speed, Cost and Reliability are Key

As executive vice president of Operations and Technology and chief information officer of NASDAQ since 2005, Anna Ewing has witnessed a rapid transformation of financial market exchanges. Although the industry is now extremely high-tech, it’s been slow to become globalized in the manner of most other industries. Here in the U.S., and even more so in other countries, the exchanges have been maintained and protected as near monopolies by their government benefactors. Today, though, the globalization of market exchanges is occurring in parallel with the rapid increase in electronic trading volume. In this environment, transaction speed, data throughput and low latency messaging are the technological features that give exchanges their competitive edge.

The most immediate challenge for NASDAQ is to keep up with the message data as electronic exchange traffic continues to skyrocket. Ewing says the exchange used to double its data traffic every year; now it doubles every six months. The interconnectedness of the global markets is also stressing the system. Thanks to the near instantaneous transfer of market data, disruptive financial events quickly ripple through the world’s markets. In this volatile environment, predictability becomes a real asset and users gravitate to those exchanges where they know the trades can be executed reliably.

According to Ewing, their target is Four Nines (99.99 percent) reliability and they’ve been tracking to Five Nines (99.999 percent). Immediately after 9/11, the NASDAQ systems remained operational, thanks to a virtualized model and computing resources that were distributed across the country. But a lot of their customers were not nearly so fortunate, either because they relied on New York assets or because the redundant systems they had in place had never been tested and didn’t perform as expected. Because of this and the general chaos of the financial environment, NASDAQ ended up voluntarily shutting down the exchange after 9/11. The lesson for NASDAQ was to include their customers in their business continuity planning and testing.
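
To put those reliability targets in concrete terms, the following sketch converts an availability percentage into a yearly downtime budget. The function name and the 365-day year are illustrative assumptions, not figures from NASDAQ, and the calculation assumes the target applies around the clock rather than only during trading hours.

```python
# Minutes in a non-leap year (assumption: 365 days, 24/7 operation).
MINUTES_PER_YEAR = 365 * 24 * 60


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the yearly downtime budget, in minutes, for a given availability."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    for label, pct in [("Four Nines", 99.99), ("Five Nines", 99.999)]:
        minutes = allowed_downtime_minutes(pct)
        print(f"{label} ({pct}%): {minutes:.2f} minutes of downtime per year")
```

Under these assumptions, Four Nines leaves a budget of roughly 52.6 minutes of downtime per year, while Five Nines leaves roughly 5.3 minutes.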

Because of the ubiquity of Internet applications and recent changes to the market regulatory framework, the barriers to automated trading have lowered dramatically. Achieving low latency market data messaging has become a critical feature for attracting traders. At NASDAQ, they’re constantly looking at ways to improve the messaging infrastructure to shave time off transactions. Ewing says they now can provide less than a 1 ms round-trip per message. In an effort to shave microseconds of latency from trades, some customers are collocating in NASDAQ facilities to get an edge over their competitors coming through the WAN.

“From a technology perspective, speed, reliability and low-cost are the life blood of our market,” says Ewing. “On any given day, we will process over two billion transactions at sub-millisecond speeds, at rates of over 200,000 transactions per second.”

Because of the rapidly increasing volumes of transactions, scaling their computing infrastructure becomes a continuous process, not something to be addressed every three or four years as equipment becomes obsolete. NASDAQ relies almost exclusively on commodity platforms, along with their own customized software. Using this model, over the last several years they’ve been able to reduce their cost base by 70 percent.

“There’s nothing fancy about our platforms,” explains Ewing. “It’s the software and network engineering that we perform that is, quite frankly, our core competence — our secret sauce, if you will.”

Wal-Mart — The Challenge of the 410 Billion Row Table

Nancy Stewart, senior vice president and chief technology officer of Wal-Mart Stores Inc., is in charge of the company’s infrastructure, operations and technology roadmap. That turns out to be quite a responsibility. Wal-Mart is the largest retailer in the world, a $370 billion company whose revenue is larger than that of IBM, Intel, Microsoft, HP and Dell combined. The company is on track to become the first $1 trillion company within the next few years.

Although Wal-Mart does not talk specifics about the scope of the computing and storage infrastructure it administers, in order to manage its inventory and supply chain, the company must process a 410 billion row table to figure out what is going to end up on its world-wide store shelves on any given day. The data has to be massaged very quickly, so that inventory control can react to real-time events, like disasters, man-made supply disruptions or seasonal demand spikes. While the stores themselves may close, the company’s IT infrastructure is up 24/7.

“The value for us in using high performance computing is related to the fact that we have one of the largest data stores in the world,” says Stewart. “In terms of using that data store, in any given two hour period we have to process over two petabytes of data.”
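
As a rough back-of-the-envelope check on what that figure implies, the sketch below converts two petabytes in two hours into a sustained data rate. It assumes decimal units (1 PB = 10^15 bytes) and an even spread of work across the two hours. Both are assumptions made for illustration, and since Stewart said “over two petabytes,” the result should be read as a lower bound.

```python
# Assumption: decimal units, 1 PB = 10**15 bytes and 1 GB = 10**9 bytes.
BYTES_PER_PETABYTE = 10**15
BYTES_PER_GIGABYTE = 10**9
SECONDS_PER_HOUR = 3600


def sustained_rate_gb_per_s(petabytes: float, hours: float) -> float:
    """Return the average rate in GB/s needed to process the data in the given time."""
    total_bytes = petabytes * BYTES_PER_PETABYTE
    total_seconds = hours * SECONDS_PER_HOUR
    return total_bytes / total_seconds / BYTES_PER_GIGABYTE


if __name__ == "__main__":
    rate = sustained_rate_gb_per_s(petabytes=2, hours=2)
    print(f"Sustained rate: {rate:.1f} GB/s")
```

Under these assumptions, the workload averages out to roughly 278 GB of data processed every second for the full two hours.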

Wal-Mart develops about 80 percent of their software in-house to maintain the level of reliability and availability that they require. When your company is netting $2 billion per hour on the day after Thanksgiving, downtime is not really an option. To work with Wal-Mart, suppliers and other partners have to match the retailer’s devotion to continuous availability. Because of the magnitude of transactions and the cash flow, Wal-Mart doesn’t maintain service level agreements (SLAs) with their computing partners. According to Stewart, none of them could afford the penalties involved with any downtime.

The ongoing problem for Wal-Mart is that their inventory management database has become so large that they’ve maxed out on their ability to handle it. The company’s application represents the “Grand Challenge” of real-time transaction processing. A trillion-row table, which they foresee in the next few years, is going to be difficult to process in real time. What they’re really looking for are predictable tools that can scale to their future needs. In truth, Stewart would prefer even faster turnaround on the inventory they currently manage.

“I really need to be able to mine the data much more quickly than I am now,” admits Stewart. “I’m not getting that today.”

FedEx — Logistics Planning on a Grand Scale

Kevin Humphries is the senior vice president of Technology Systems for FedEx Corporate Services and is responsible for setting technology direction as well as providing data center, network and field infrastructure support. The company’s computing technology orchestrates the delivery of millions of items each day around the world, using a fleet of over 600 aircraft and 75,000 motorized vehicles.

According to Humphries, the only way they’re able to pull off this global logistics puzzle is to employ HPC simulation and modeling to help plan the FedEx routes. Trucks and planes have to be continually shuffled from place to place in the most efficient manner possible to make timely deliveries and to optimize resources. It’s not just a mega-version of the traveling salesman problem. In addition to the complex routing, the company has to deal with unforeseen events like weather and equipment breakdowns. On top of that, FedEx has essentially no control over shipping demand at any given time. But it’s the scope of the problem that precipitates the need for HPC.

“We have to take everything that comes our way,” says Humphries. “That creates about 30 million origin-destination pairs that have to be planned 24/7 every hour of the day, over all the assets that we own.”

The initial logistics plan for using the assets is performed with traditional HPC cluster tools well in advance of the actual shipments. As the time winds down to the day of execution, the model is continuously refined (some on grid platforms) to support a real-time response. The refined model has to react to environmental conditions, like weather, mechanical breakdowns and infrastructure problems. An extremely high capacity computing environment is used to coalesce all the information in real time.

Humphries’ main frustration with high performance computing technology is its uniqueness. Businesses like FedEx would like to see their HPC assets seamlessly embedded into their overall enterprise infrastructure rather than have to be treated as an island of resources devoted to solving specialized problems. He thinks that transition is occurring, but they still struggle with some of the distinctive aspects of HPC, especially as it pertains to their cluster computing resources. The mainframes of the past were much easier to deal with compared to a system with thousands of nodes, where the job has to be split up into little pieces. Further constraining the use of these systems is the limited pool of talent that can manage those resources.

“I don’t know where that changes though,” says Humphries. “It’s not something that every kid is going to learn in college and it’s not something everybody is going to learn on the job.”
