Using an In-Memory Data Grid for Near Real-Time Data Analysis

By Nicole Hemsoth

August 6, 2012

by Dr. William Bain, ScaleOut Software, Inc.

Introduction

In today’s competitive world, businesses need to make fast decisions to respond to changing market conditions and to maintain a competitive edge. The explosion of data that must be analyzed to find trends or hidden insights intensifies this challenge. Both the private and public sectors are turning to parallel computing techniques, such as “map/reduce” to quickly sift through large data volumes.

In some cases, it is practical to analyze huge sets of historical, disk-based data over the course of minutes or hours using batch processing platforms such as Hadoop. For example, risk modeling to optimize the handling of insurance claims potentially needs to analyze billions of records and tens of terabytes of data. However, many applications need to continuously analyze relatively small but fast-changing data sets measured in the hundreds of gigabytes and reaching into terabytes.  Examples include clickstream data to optimize online promotions, stock trading data to implement trading strategies, machine log data to tune manufacturing processes, smart grid data, and many more.

Over the last several years, in-memory data grids (IMDGs) have proven their value in storing fast-changing application data and scaling application performance.  More recently, IMDGs have integrated map/reduce analytics into the grid to achieve powerful, easy-to-use analysis and enable near real-time decision making. For example, the following diagram illustrates an IMDG used to store and analyze incoming streams of market and news data to help generate alerts and strategies for optimizing financial operations. This article explains how using an IMDG with integrated map/reduce capabilities can simplify data analysis and provide important competitive advantages.

Real-Time Analytics Engine 

 

What is an In-Memory Data Grid?

By storing fast-changing data within a middleware software tier, IMDGs enable applications to seamlessly scale performance by adding servers that access and update a shared, memory-based data set.  To maximize scalability, IMDGs automatically load-balance data across servers on which the grid is hosted. They also redundantly store data on multiple servers to ensure high availability in case a server or network link fails. Additional capabilities, including eventing and distributed locking, make IMDGs a powerful data storage platform.

IMDGs typically integrate their data storage model with object-oriented programming languages, such as Java and C#. They store data as a collection of objects which are accessible either by specifying an identifying key or by querying object properties. The IMDG’s built-in parallel query mechanism can quickly scan a large data set for objects whose properties match a query specification. This provides an important tool for identifying data to be reviewed or analyzed. The following diagram illustrates the use of parallel query for selecting stock history data.

In Memory Data Grid 

Using an IMDG for Analytics

Without a doubt, the field of data analytics has gained a powerful new tool with the “map/reduce” analysis model, which has recently surged in popularity as open source solutions such as Hadoop have raised awareness. In fact, the roots of the map/reduce pattern date back to pioneering work in the 1980s which originally demonstrated the power of data-parallel computing.

Map/reduce implementations take many forms and are offered as components in several competing frameworks. Nearly all of these solutions are aimed at accelerating data analysis for disk-based data. With some data sets reaching petabytes in size, the benefits are often measured in reducing batch job processing times from hours to minutes for these “big data” analyses.

However, the overhead (and complexity) of disk-based map/reduce platforms is too high for applications which must quickly analyze fast-changing data sets measured in hundreds of gigabytes or terabytes. (Estimates by some analysts indicate that as much as sixty percent of data sets are smaller than ten terabytes.) In many situations, an answer in hours or minutes is not acceptable.  For example, an e-commerce Web site may need to monitor online shopping carts to see which products are selling. A financial services company might need to hone its equity trading strategy as it optimizes its response to fast-changing market conditions.

To address this challenge, leading-edge IMDGs have incorporated map/reduce analytics engines, transforming them from just scalable, memory-based data stores into parallel computing platforms for analyzing data and providing fast, near real-time results. IMDGs leverage the grid’s automatic load-balancing to minimize data motion and speed up analysis. Instead of migrating data into memory from disk, an IMDG analyzes data in place. Results also are stored and combined in memory, minimizing file I/O to calculate the final results. By eliminating these overheads, IMDGs dramatically reduce network usage and thereby shorten analysis time.

Moreover, by simplifying the programming model, IMDGs offer another advantage over popular, disk-based map/reduce platforms. Instead of requiring the application developer to create a key space for identifying objects to be analyzed, they make use of object-oriented query specifications to select objects.  Also, both the analysis (“map”) and merge (“reduce”) codes can be structured as straightforward, object-oriented methods written as if to be executed on a single workstation. These capabilities shorten design time and enable analysis applications to be quickly developed and revised.

The following diagram illustrates a map/reduce analysis of stock trading strategies across a set of stock histories held in the IMDG. A parallel query selects stocks for analysis, and the IMDG analyzes the stocks and merges the results using the supplied methods:

 Running Map/Reduce on an IMDG 

Running Map/Reduce on an IMDG

ScaleOut Grid Computing Edition (GCE) from ScaleOut Software is an example of an IMDG with an integrated data analytics engine. Using it as an example, the following steps demonstrate how an IMDG performs a map/reduce data analysis:

  • The data set to be analyzed in the IMDG originates from one of two sources. In many cases, especially those with tight latency requirements, the application continuously updates the grid as data flows through for processing. Alternatively, the application may stage the data set in the grid from persistent storage via a bulk loading operation. In either case, the IMDG holds the data, creates replicas for high availability, and load-balances it across servers to avoid hot spots.
  • ScaleOut GCE allows a query specification to be written either in Java using filter methods or in C# using the Microsoft language integrated query (LINQ) mechanism. This query specification selects the data to be analyzed, for example, ticker symbols, sales data, machine data, etc.
  • In ScaleOut GCE, the analysis and merge methods can be written either in Java or C#. Since GCE holds the objects to be analyzed or merged in memory, these methods are written without the need to use grid APIs. The analysis method specifies the analysis logic for a single data object selected by the query specification. For example, it might calculate stock trading profits for one company’s recent history of stock prices. The merge method combines the results of analyzing multiple objects and is repeatedly executed as necessary to merge all results. In the above example, it might calculate the average return for stock trades spanning many companies.
  • Using a special API in GCE called “invoke” and supplying the query specification and both the analysis and merge methods, the application starts a map/reduce computation called a “parallel method invocation” (PMI). GCE automatically performs the query, analysis, and merge steps in parallel across all grid servers using a multi-threaded computation engine and then returns the final, merged result back to the application. PMI operations can be performed repeatedly to provide a continuous stream of results. Because GCE avoids batch scheduling and keeps the overhead for starting and running the analysis low, it returns results with minimum latency for near real-time performance.

When using an IMDG, all computations are performed “in-place,” reducing data motion which is the enemy of high performance for map/reduce. Also, the IMDG leverages its features for maximizing scalability and high availability, such as partitioning, peer-to-peer architecture, and load-balancing. In addition, GCE implements special features for ensuring the high availability of map/reduce computations.

Lowering the Complexity Barrier

The map/reduce programming model has generated widespread interest in large part due to the popularity of the Hadoop open source software stack. However, Hadoop introduces a complex programming model and deployment architecture which must be thoroughly understood for Hadoop to be used effectively. For example, applications need to be written to fit Hadoop’s specific parallel execution model, incorporating several specialized elements such as record readers, mappers, combiners, and reducers. The number and interaction of these elements impact performance and require tuning. Beyond this, Hadoop’s execution environment, including the HDFS file system, job tracker (that is, the batch scheduler), and task trackers on each execution node must be deployed and managed. It may take a seasoned Java developer with knowledge of parallel programming weeks to become proficient with Hadoop. These complexities create a steep learning curve which impedes rapid adoption.

In contrast, the IMDG-based approach to map/reduce data analysis eliminates much of Hadoop’s complexity. Its object-oriented approach offers a simpler parallel execution model that reduces development time and eliminates the need for tuning. The user invests much less time in learning the model and focuses more on the analytical challenges of the business problem. Learning curves are flattened, and productivity is increased.

Delivering High Performance

To see the performance benefits of using an IMDG with integrated map/reduce, consider a real-world financial analysis application that  compares various stock trading strategies based on historical market data stored in the IMDG. This application makes use of the IMDG’s analytics engine to perform a map/reduce analysis across all grid servers and merge the results. Each stock history is stored as a separate object within the IMDG, and specific stock histories are selected for analysis using a parallel query. The analysis method evaluates a set of trading strategies across a single stock history, and the merge method combines the results across two stocks. The analytics engine repeatedly executes these methods to analyze all selected stocks and merge the results.

Performance measurements were made for this application using ScaleOut GCE’s IMDG to evaluate throughput scaling as the number of stock histories and grid servers was proportionally increased. As the graph below illustrates, the IMDG delivers linearly scalable throughput (shown as the red line in the graph). An alternative implementation of this application was measured using Hadoop’s map/reduce environment. Hadoop provided linear scaling with about 16X lower throughput (shown as the blue line in the graph) due to significant overhead introduced by file I/O and batch scheduling. By staging the stock history data in the IMDG instead of the Hadoop file system (HDFS), Hadoop’s throughput was increased by about 6X (shown as the green line), although it was still significantly below the IMDG’s throughput due to file I/O between the map and reduce phases.

 Throughput Comparison 

In Summary

With the ever increasing explosion in data for analysis and the need for fast insights on emerging trends, IMDGs offer a highly attractive platform for hosting map/reduce analysis. By simplifying the development model, IMDGs shorten the learning curve in developing analysis codes and eliminate the tuning steps required by more complex platforms. Because IMDGs run the analysis on data already staged in memory and load-balanced across grid servers, file I/O is eliminated and data motion is minimized. IMDGs also provide the infrastructure needed to automatically run analysis code on all grid servers in parallel and then combine the results with minimum latency. The net result is that by using an IMDG, application developers can easily analyze fast-changing, memory-based data and discover data patterns and trends that are vital to a company’s success.

 

Dr. William L. Bain is founder and CEO of ScaleOut Software, Inc. Bill has a Ph.D. in electrical engineering/parallel computing from Rice University, and he has worked at Bell Labs research, Intel, and Microsoft. Bill founded and ran three start-up companies prior to joining Microsoft. In the most recent company (Valence Research), he developed a distributed Web load-balancing software solution that was acquired by Microsoft and is now called Network Load Balanc­ing within the Windows Server operating system. Dr. Bain holds several patents in computer architecture and distributed computing. As a member of the Seattle-based Alliance of Angels, Dr. Bain is actively involved in entrepreneurship and the angel community.

www.scaleoutsoftware.com

Subscribe to HPCwire's Weekly Update!

Be the most informed person in the room! Stay ahead of the tech trends with industy updates delivered to you every week!

US Exascale Computing Update with Paul Messina

December 8, 2016

Around the world, efforts are ramping up to cross the next major computing threshold with machines that are 50-100x more performant than today’s fastest number crunchers.  Read more…

By Tiffany Trader

Weekly Twitter Roundup (Dec. 8, 2016)

December 8, 2016

Here at HPCwire, we aim to keep the HPC community apprised of the most relevant and interesting news items that get tweeted throughout the week. Read more…

By Thomas Ayres

Qualcomm Targets Intel Datacenter Dominance with 10nm ARM-based Server Chip

December 8, 2016

Claiming no less than a reshaping of the future of Intel-dominated datacenter computing, Qualcomm Technologies, the market leader in smartphone chips, announced the forthcoming availability of what it says is the world’s first 10nm processor for servers, based on ARM Holding’s chip designs. Read more…

By Doug Black

Which Schools Produce the Top Coders in the World?

December 8, 2016

Ever wonder which universities worldwide produce the best coders? The answers may surprise you, at least as judged by the results of a competition posted yesterday on the HackerRank blog. Read more…

By John Russell

Enlisting Deep Learning in the War on Cancer

December 7, 2016

Sometime in Q2 2017 the first ‘results’ of the Joint Design of Advanced Computing Solutions for Cancer (JDACS4C) will become publicly available according to Rick Stevens. He leads one of three JDACS4C pilot projects pressing deep learning (DL) into service in the War on Cancer. The pilots, supported in part by DOE exascale funding, not only seek to do good by advancing cancer research and therapy but also to advance deep learning capabilities and infrastructure with an eye towards eventual use on exascale machines. Read more…

By John Russell

DDN Enables 50TB/Day Trans-Pacific Data Transfer for Yahoo Japan

December 6, 2016

Transferring data from one data center to another in search of lower regional energy costs isn’t a new concept, but Yahoo Japan is putting the idea into transcontinental effect with a system that transfers 50TB of data a day from Japan to the U.S., where electricity costs a quarter of the rates in Japan. Read more…

By Doug Black

Infographic Highlights Career of Admiral Grace Murray Hopper

December 5, 2016

Dr. Grace Murray Hopper (December 9, 1906 – January 1, 1992) was an early pioneer of computer science and one of the most famous women achievers in a field dominated by men. Read more…

By Staff

Ganthier, Turkel on the Dell EMC Road Ahead

December 5, 2016

Who is Dell EMC and why should you care? Glad you asked is Jim Ganthier’s quick response. Ganthier is SVP for validated solutions and high performance computing for the new (even bigger) technology giant Dell EMC following Dell’s acquisition of EMC in September. In this case, says Ganthier, the blending of the two companies is a 1+1 = 5 proposition. Not bad math if you can pull it off. Read more…

By John Russell

US Exascale Computing Update with Paul Messina

December 8, 2016

Around the world, efforts are ramping up to cross the next major computing threshold with machines that are 50-100x more performant than today’s fastest number crunchers.  Read more…

By Tiffany Trader

Enlisting Deep Learning in the War on Cancer

December 7, 2016

Sometime in Q2 2017 the first ‘results’ of the Joint Design of Advanced Computing Solutions for Cancer (JDACS4C) will become publicly available according to Rick Stevens. He leads one of three JDACS4C pilot projects pressing deep learning (DL) into service in the War on Cancer. The pilots, supported in part by DOE exascale funding, not only seek to do good by advancing cancer research and therapy but also to advance deep learning capabilities and infrastructure with an eye towards eventual use on exascale machines. Read more…

By John Russell

Ganthier, Turkel on the Dell EMC Road Ahead

December 5, 2016

Who is Dell EMC and why should you care? Glad you asked is Jim Ganthier’s quick response. Ganthier is SVP for validated solutions and high performance computing for the new (even bigger) technology giant Dell EMC following Dell’s acquisition of EMC in September. In this case, says Ganthier, the blending of the two companies is a 1+1 = 5 proposition. Not bad math if you can pull it off. Read more…

By John Russell

AWS Launches Massive 100 Petabyte ‘Sneakernet’

December 1, 2016

Amazon Web Services now offers a way to move data into its cloud by the truckload. Read more…

By Tiffany Trader

Lighting up Aurora: Behind the Scenes at the Creation of the DOE’s Upcoming 200 Petaflops Supercomputer

December 1, 2016

In April 2015, U.S. Department of Energy Undersecretary Franklin Orr announced that Intel would be the prime contractor for Aurora: Read more…

By Jan Rowell

Seagate-led SAGE Project Delivers Update on Exascale Goals

November 29, 2016

Roughly a year and a half after its launch, the SAGE exascale storage project led by Seagate has delivered a substantive interim report – Data Storage for Extreme Scale. Read more…

By John Russell

Nvidia Sees Bright Future for AI Supercomputing

November 23, 2016

Graphics chipmaker Nvidia made a strong showing at SC16 in Salt Lake City last week. Read more…

By Tiffany Trader

HPE-SGI to Tackle Exascale and Enterprise Targets

November 22, 2016

At first blush, and maybe second blush too, Hewlett Packard Enterprise’s (HPE) purchase of SGI seems like an unambiguous win-win. SGI’s advanced shared memory technology, its popular UV product line (Hanna), deep vertical market expertise, and services-led go-to-market capability all give HPE a leg up in its drive to remake itself. Bear in mind HPE came into existence just a year ago with the split of Hewlett-Packard. The computer landscape, including HPC, is shifting with still unclear consequences. One wonders who’s next on the deal block following Dell’s recent merger with EMC. Read more…

By John Russell

Why 2016 Is the Most Important Year in HPC in Over Two Decades

August 23, 2016

In 1994, two NASA employees connected 16 commodity workstations together using a standard Ethernet LAN and installed open-source message passing software that allowed their number-crunching scientific application to run on the whole “cluster” of machines as if it were a single entity. Read more…

By Vincent Natoli, Stone Ridge Technology

IBM Advances Against x86 with Power9

August 30, 2016

After offering OpenPower Summit attendees a limited preview in April, IBM is unveiling further details of its next-gen CPU, Power9, which the tech mainstay is counting on to regain market share ceded to rival Intel. Read more…

By Tiffany Trader

AWS Beats Azure to K80 General Availability

September 30, 2016

Amazon Web Services has seeded its cloud with Nvidia Tesla K80 GPUs to meet the growing demand for accelerated computing across an increasingly-diverse range of workloads. The P2 instance family is a welcome addition for compute- and data-focused users who were growing frustrated with the performance limitations of Amazon's G2 instances, which are backed by three-year-old Nvidia GRID K520 graphics cards. Read more…

By Tiffany Trader

Think Fast – Is Neuromorphic Computing Set to Leap Forward?

August 15, 2016

Steadily advancing neuromorphic computing technology has created high expectations for this fundamentally different approach to computing. Read more…

By John Russell

The Exascale Computing Project Awards $39.8M to 22 Projects

September 7, 2016

The Department of Energy’s Exascale Computing Project (ECP) hit an important milestone today with the announcement of its first round of funding, moving the nation closer to its goal of reaching capable exascale computing by 2023. Read more…

By Tiffany Trader

ARM Unveils Scalable Vector Extension for HPC at Hot Chips

August 22, 2016

ARM and Fujitsu today announced a scalable vector extension (SVE) to the ARMv8-A architecture intended to enhance ARM capabilities in HPC workloads. Fujitsu is the lead silicon partner in the effort (so far) and will use ARM with SVE technology in its post K computer, Japan’s next flagship supercomputer planned for the 2020 timeframe. This is an important incremental step for ARM, which seeks to push more aggressively into mainstream and HPC server markets. Read more…

By John Russell

IBM Debuts Power8 Chip with NVLink and Three New Systems

September 8, 2016

Not long after revealing more details about its next-gen Power9 chip due in 2017, IBM today rolled out three new Power8-based Linux servers and a new version of its Power8 chip featuring Nvidia’s NVLink interconnect. Read more…

By John Russell

Vectors: How the Old Became New Again in Supercomputing

September 26, 2016

Vector instructions, once a powerful performance innovation of supercomputing in the 1970s and 1980s became an obsolete technology in the 1990s. But like the mythical phoenix bird, vector instructions have arisen from the ashes. Here is the history of a technology that went from new to old then back to new. Read more…

By Lynd Stringer

Leading Solution Providers

US, China Vie for Supercomputing Supremacy

November 14, 2016

The 48th edition of the TOP500 list is fresh off the presses and while there is no new number one system, as previously teased by China, there are a number of notable entrants from the US and around the world and significant trends to report on. Read more…

By Tiffany Trader

HPE Gobbles SGI for Larger Slice of $11B HPC Pie

August 11, 2016

Hewlett Packard Enterprise (HPE) announced today that it will acquire rival HPC server maker SGI for $7.75 per share, or about $275 million, inclusive of cash and debt. The deal ends the seven-year reprieve that kept the SGI banner flying after Rackable Systems purchased the bankrupt Silicon Graphics Inc. for $25 million in 2009 and assumed the SGI brand. Bringing SGI into its fold bolsters HPE's high-performance computing and data analytics capabilities and expands its position... Read more…

By Tiffany Trader

Intel Launches Silicon Photonics Chip, Previews Next-Gen Phi for AI

August 18, 2016

At the Intel Developer Forum, held in San Francisco this week, Intel Senior Vice President and General Manager Diane Bryant announced the launch of Intel's Silicon Photonics product line and teased a brand-new Phi product, codenamed "Knights Mill," aimed at machine learning workloads. Read more…

By Tiffany Trader

CPU Benchmarking: Haswell Versus POWER8

June 2, 2015

With OpenPOWER activity ramping up and IBM’s prominent role in the upcoming DOE machines Summit and Sierra, it’s a good time to look at how the IBM POWER CPU stacks up against the x86 Xeon Haswell CPU from Intel. Read more…

By Tiffany Trader

Beyond von Neumann, Neuromorphic Computing Steadily Advances

March 21, 2016

Neuromorphic computing – brain inspired computing – has long been a tantalizing goal. The human brain does with around 20 watts what supercomputers do with megawatts. And power consumption isn’t the only difference. Fundamentally, brains ‘think differently’ than the von Neumann architecture-based computers. While neuromorphic computing progress has been intriguing, it has still not proven very practical. Read more…

By John Russell

Dell EMC Engineers Strategy to Democratize HPC

September 29, 2016

The freshly minted Dell EMC division of Dell Technologies is on a mission to take HPC mainstream with a strategy that hinges on engineered solutions, beginning with a focus on three industry verticals: manufacturing, research and life sciences. "Unlike traditional HPC where everybody bought parts, assembled parts and ran the workloads and did iterative engineering, we want folks to focus on time to innovation and let us worry about the infrastructure," said Jim Ganthier, senior vice president, validated solutions organization at Dell EMC Converged Platforms Solution Division. Read more…

By Tiffany Trader

Container App ‘Singularity’ Eases Scientific Computing

October 20, 2016

HPC container platform Singularity is just six months out from its 1.0 release but already is making inroads across the HPC research landscape. It's in use at Lawrence Berkeley National Laboratory (LBNL), where Singularity founder Gregory Kurtzer has worked in the High Performance Computing Services (HPCS) group for 16 years. Read more…

By Tiffany Trader

Micron, Intel Prepare to Launch 3D XPoint Memory

August 16, 2016

Micron Technology used last week’s Flash Memory Summit to roll out its new line of 3D XPoint memory technology jointly developed with Intel while demonstrating the technology in solid-state drives. Micron claimed its Quantx line delivers PCI Express (PCIe) SSD performance with read latencies at less than 10 microseconds and writes at less than 20 microseconds. Read more…

By George Leopold

  • arrow
  • Click Here for More Headlines
  • arrow
Share This