Tales from a Trading Desk: Resiliency Made Easy

By Mike Stolz, Vice President of Architecture, GemStone Systems

November 5, 2007

To Keep It Running, Keep It Simple

Today’s electronic world has driven a major shift in how organizations think about resiliency. Firms of all sizes face the challenge of determining not only how resilient their mission-critical systems need to be, but also how to architect a “resilient” system efficiently and cost-effectively. With millions of dollars per minute running through electronic channels 24×7, traditional notions of high availability and disaster recovery are no longer good enough.

For the past five to 10 years, high availability meant the ability to recover from a server outage within about 15 minutes. Solutions like N+1 clustering and storage area network replication were perfectly acceptable. Today, however, the recovery time associated with these high-availability schemes can cost millions of dollars in lost revenue.

To avoid a potentially massive loss in revenue and efficiency in today’s fast-moving markets, firms must significantly improve their enterprise resiliency. Continuous availability is now the acceptable level of resiliency, and it is quite common in Web-based or other electronic channels to use load-balanced, hot/hot clusters of servers to serve up the business logic. These servers typically are stateless in design, so it is easy to add or remove servers and rebalance the workload. The difficult part is designing a resiliency architecture that makes the data behind those business services hot/hot.

Meeting the Resiliency Challenge: An EDF Approach

The best way to provide nearly 100 percent uptime for data and deliver maximum resiliency is by using data management middleware to ensure there are multiple consistent copies of the active business objects in-memory at all times. As firms strive to get ever closer to 100 percent uptime and ensure resiliency, distributed data caching is gaining in popularity.

Solutions such as an enterprise data fabric (EDF) are ideal for meeting those demands. The EDF programming paradigm, presented as a simple HashMap API, is familiar yet delivers maximum value behind the scenes: you simply “put” your state into the HashMap and, under the covers, the middleware takes care of replicating that business object to multiple additional servers.
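As a rough illustration of that put-and-replicate paradigm, here is a minimal Java sketch. The `ReplicatedMap` class and its methods are hypothetical, not an actual EDF product API: a real fabric handles membership, failure detection and network replication for you, but the programming surface it exposes is essentially this simple.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: a Map-fronted cache that, on every put, copies the
// business object to one or more replica maps "under the covers".
class ReplicatedMap<K, V> {
    private final Map<K, V> primary = new HashMap<>();
    private final List<Map<K, V>> replicas = new ArrayList<>();

    public void addReplica(Map<K, V> replica) {
        replicas.add(replica);
    }

    // A put updates the primary and synchronously mirrors to every replica,
    // so a consistent copy exists elsewhere before the call returns.
    public V put(K key, V value) {
        V previous = primary.put(key, value);
        for (Map<K, V> replica : replicas) {
            replica.put(key, value);
        }
        return previous;
    }

    public V get(K key) {
        return primary.get(key);
    }
}
```

The application code sees nothing but `put` and `get`; everything after the first line of `put` is the middleware’s concern.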

Sounds easy, right? It is — until you start to think about the various failure modes, guarantees around zero data loss, low latency and scalability. That’s what makes a product like an EDF worth its weight in gold. The most difficult parts of data management are resiliency, scalability, throughput, latency and dataset size — and you have to get it right. Every time.

By deploying an EDF, firms will benefit from a very fast, highly scalable distributed caching system. An EDF is designed for use in many diverse data management situations, but is especially useful for high-volume, latency-sensitive, mission-critical, transactional systems. There are several critical features to consider when evaluating an EDF, including:

  • Language neutrality. This is the ability to access the data natively from common programming languages like Java, C++ and C#.
  • Cache coherency, which is especially important in globally distributed systems.
  • Persistence/overflow so no data is ever lost regardless of circumstances.
  • Highly reliable business object replication, both synchronous and asynchronous, to multiple locations for safe, high-volume transactional environments.
  • Horizontal scalability to thousands of cache nodes.
  • A loosely coupled WAN gateway for long-haul distribution of data.
  • And all of this with continuous availability — never any unexpected downtime.

So how does it work? As soon as an application puts data into the cache it is replicated synchronously to at least one additional member of the cache. It also can be replicated to additional members or written to a persistent store, but this can be done on a low priority, asynchronous thread so it doesn’t hold up mainstream processing.
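A minimal Java sketch of that put path, assuming one synchronous in-memory backup plus a low-priority write-behind thread (the `WriteBehindCache` class and its members are illustrative names, not a real product API):

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical sketch: the object is mirrored synchronously to one backup
// map, while persistence happens later on a low-priority background thread
// so it never holds up mainstream processing.
class WriteBehindCache {
    private final ConcurrentHashMap<String, Object> primary = new ConcurrentHashMap<>();
    private final ConcurrentHashMap<String, Object> backup = new ConcurrentHashMap<>();
    private final BlockingQueue<String> persistQueue = new LinkedBlockingQueue<>();

    WriteBehindCache() {
        Thread writer = new Thread(this::drainToStore, "write-behind");
        writer.setPriority(Thread.MIN_PRIORITY); // low priority, off the hot path
        writer.setDaemon(true);
        writer.start();
    }

    public void put(String key, Object value) {
        primary.put(key, value);
        backup.put(key, value);   // synchronous: at least one in-memory copy
        persistQueue.offer(key);  // asynchronous: persisted when time permits
    }

    Object backupCopy(String key) {
        return backup.get(key);
    }

    private void drainToStore() {
        try {
            while (true) {
                String key = persistQueue.take();
                // In a real deployment this would write primary.get(key)
                // through to a database or other persistent store.
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

The caller is only ever charged for the synchronous mirror; the durable write happens as time permits.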

Leveraging Multiple Topologies to Deliver Maximum Value

A true EDF should use three topologies in order to achieve the highest levels of reliability, scalability and speed. The first — and the backbone of the system — is the peer-to-peer topology. In this configuration, everybody knows about everybody else. If a new node joins the distributed system, everybody gets notified, and if a node leaves, everybody gets notified. This enables users to dynamically adjust distributed systems. There is no notion of a “broker” and no single point of failure; fault tolerance is designed right in.

The trouble with peer-to-peer architectures is that the sheer volume of membership and consistency metadata flying around limits their scale. In most cases, this topology should be scaled to no more than about 100 to 200 nodes.

Scalability can be improved by using a second type of topology — client-server — where we elect some of the peers from the peer-to-peer backbone to be servers for client applications (your business logic servers). Each server should be able to manage as many as 100 clients. As there is much less metadata overhead in this topology, it can scale to thousands of nodes.

The third topology is a WAN gateway topology, which can glue together multiple client-server distributed systems. This is an ideal way of creating an enterprise data grid that is globally distributed and appears as one large distributed system, even though it is really many distributed systems glued together.

Appropriate use of these three topologies will enable you to achieve your business requirements around recovery point objective and recovery time objective. Data is replicated across the entire distributed cache, and replication is transactional and performed at the in-memory object level. As soon as an object is put into the cache, it is replicated in-memory to at least one additional node. The data can be replicated to additional nodes either synchronously or asynchronously depending on sensitivity to latency and tolerance for data loss in the event of a catastrophic failure. Write-through to a database or other persistent store is done asynchronously as time permits. In essence, the distributed cache behaves much like RAID for the enterprise.

Additionally, the data can be actively used in both the primary and secondary sites. In fact, the only thing that typically drives the notion of one site even being primary is the external connectivity to the exchanges or ECNs.

Another factor to consider when evaluating an EDF is what we’ll term a “shared nothing” architecture. Because the data in an EDF can be mirrored across multiple nodes in a distributed cache, it eliminates the need for any type of fancy shared storage. In fact, the local disks that are on the blades themselves are often sufficient. In the event that a disk fails, only one node is taken down in the distributed system and there are other nodes alive and ready to take over that workload. Finally, the workload itself is distributed across all the nodes in the distributed system. Exchanges may be split between the two sites and clients will likely be distributed across the two sites, as everything except external connectivity is in a hot/hot configuration.

Let’s walk through the simple H/A recovery process for a single node failure: detect the failure; reconnect the clients; recovery is complete. In total, it takes less than 1 second from detection of the failure to complete recovery. A little better than 15 minutes! Because the data is all in-memory in the form of business objects all the time, there is no rebooting, no re-fetching of data and no re-creation of objects.
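That recovery sequence can be sketched in Java as follows (`FailoverClient` and `Node` are hypothetical names): recovery amounts to nothing more than skipping the dead node and reading from a live replica that already holds the same objects in memory.

```java
import java.util.List;
import java.util.Map;

// Hypothetical sketch of sub-second recovery: each "node" is modeled as an
// in-memory map plus a liveness flag; the client's get() simply routes
// around dead nodes. No reboot, no re-fetch, no object re-creation.
class FailoverClient {
    static class Node {
        final Map<String, String> data;
        boolean alive = true;
        Node(Map<String, String> data) {
            this.data = data;
        }
    }

    private final List<Node> nodes;

    FailoverClient(List<Node> nodes) {
        this.nodes = nodes;
    }

    // Detect the failure and reconnect: read from the first live replica.
    public String get(String key) {
        for (Node node : nodes) {
            if (node.alive) {
                return node.data.get(key);
            }
        }
        throw new IllegalStateException("no live nodes");
    }
}
```

Because replication already put a consistent copy on the surviving node before the failure, "recovery" is just the act of choosing a new connection.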

But what about a catastrophic failure? EDF clusters are virtual, so the nodes needn’t be located close together within the datacenter — they can be on separate subnets, using separate routers, power sources, etc. In fact, some of the nodes actually can be physically located in a different site. Therefore, the notion of losing a “cluster” is non-existent; we’re actually talking about loss of an entire datacenter.

If a disaster occurs and the entire primary datacenter fails, the recovery process goes like this: detect the failure; reconnect the exchange at the alternate site; reconnect the clients; recovery is complete. The typical time to recover from point of detection is around 1 second. That’s a huge improvement over the 1-4 hour disaster recovery time common in business today!

Summary

As distributed computing deployments become the norm rather than the exception, resiliency will become one of the most critical issues facing global corporations. By using an EDF, firms can achieve nearly instantaneous recovery from outages — real business continuity — while simultaneously simplifying their architectures. This one product takes the place of an H/A solution, a shared-storage environment, storage-level replication and wide-area data distribution, removing the need to design a data resiliency architecture for mission-critical systems.

About Mike Stolz

Mike Stolz is vice president of architecture and strategy for financial services at GemStone Systems. In this role, Stolz leverages his expertise in targeting, developing and delivering innovative technology solutions to expand GemStone’s global financial services offering and cultivate its growing capital markets division. For the previous nine years, Stolz served as director and chief architect of Merrill Lynch’s global markets and investment banking debt division, where he was responsible for the design and development of trading systems and trading support systems for interest rate, credit and asset-backed derivatives, as well as FX, repos and fixed income products.
