Why Big Data Needs InfiniBand to Continue Evolving

By Nicole Hemsoth

April 1, 2013

Increasingly, it’s a Big Data world we live in.  Just in case you’ve been living under a rock and need proof of that, a major retailer can use an unimaginable number of data points to predict the pregnancy of a teenage girl outside Minneapolis before she gets a chance to tell her family.  That’s just one example, but there are countless others that point to the idea that mining huge data volumes can uncover gold nuggets of actionable proportions (although sometimes they freak people out – for example that girl’s father). 

We’re still at the dawn of this Big Data era and as the market is showing, one-size-fits-all data processing is no longer adequate.  To take the next step in this evolution, specialized Big Data software can improve not only by using cloud computing, but also by utilizing specialized networking infrastructure, InfiniBand, from the supercomputing community.  Before understanding why, though, you need to understand the history of how we got to this Big Data world in the first place.

How Did We Get Here? The Birth of the Relational Database

1970 isn’t just the year of the Unix Epoch; it’s also the year that the granddaddy of all Relational Database (RDB) papers was written.  IBM researcher E. F. Codd wrote “A Relational Model of Data for Large Shared Data Banks” for Communications of the ACM in June of that year, and it became the defining work on data layouts for decades.  Codd’s model would be refined over the next 40 years, but what he proposed evolved into a generic toolset for structuring and manipulating data that was used for everything from managing bank assets to storing food recipes.

This general-purpose data analysis software also ran exceptionally well on general-purpose computing hardware.  The two got along great, actually, since all you really needed was a disk big enough to handle the structured data and enough CPU and RAM to perform the queries.  In fact, some hardware manufacturers such as Hewlett-Packard would give away database software when you purchased the hardware to run it on.  For the Enterprise especially, the Relational Database was the killer app of the data center hardware business.

At this point, everybody was happily solving problems and making money.  Then something happened that changed everything and completely disrupted this ecosystem forever.  It was called Google.

Then Google Happened

During the Nixon Administration, copying the entire Internet was not a difficult problem given its diminutive size.  But this was not so by the late 1990s, when the first wave of search engines like Lycos and AltaVista had supposedly solved the problem of finding information online.  Shortly thereafter, Google happened and disrupted not only the online search industry but also data processing.

It turns out that if you can keep a copy of the modern Internet at all times, you can do some amazing things in determining relevance and, therefore, return better search results.  However, you can’t use a traditional RDB to tackle that problem for several reasons.  First of all, to solve this problem you need to store a lot of data.  So much so that it becomes impractical to rely solely on vertical scaling by adding more disk/CPU/RAM to a single system, and an RDB does not scale horizontally very well.  Adding more machines to an RDB does not by itself improve its execution or its ability to store more data.  That disk/CPU/RAM marriage has been around for 40 years and it’s not easy to break apart.

Further, as the size of the data set in an RDB gets larger, the query speed generally degrades.  For a financial services company querying trends on stock prices, that may be acceptable, since it affects only the time of a handful of analysts who can do something else while the processing runs.  But for an Internet search company trying to deliver sub-3-second responses to millions of customers simultaneously, that just won’t fly.

Finally, given the large data volumes and the query speed required for Internet searches, data redundancy becomes a necessity, since the data is needed at all times.  As such, the simple master-slave model employed by most RDB deployments over the last four decades is a lot less bulletproof than what is needed when you are trying to constantly copy the entire Internet.  One big mirror simply won’t cut it.

Distributed File Systems and Map/Reduce Change Everything

If Codd’s seminal RDB paper had grandchildren, they would be a pair of papers released by Google that described how the company conquered its data problem.  Published in 2003, “The Google File System” by Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung described how storing data across many, many different machines provided a mechanism for dealing with huge volumes in a much more economical way than the traditional RDB.

The follow-up paper from 2004, entitled “MapReduce: Simplified Data Processing on Large Clusters” by Jeffrey Dean and Sanjay Ghemawat, further revealed that Google performs queries across its large, distributed data set by breaking up the problem into smaller parts, sending those smaller parts to nodes out on the distributed system (the Map step), and finally assembling the results of those smaller parts (the Reduce step) into a whole.
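To make the Map and Reduce steps concrete, here is a minimal single-process sketch in Python that counts words.  It is an illustration of the idea only, not Google’s or Hadoop’s implementation: the function names (split_input, map_chunk, shuffle, reduce_group, word_count) are hypothetical, and plain lists stand in for the nodes of a cluster.

```python
from collections import defaultdict


def split_input(documents, num_chunks):
    """Break the input into smaller parts, one per simulated node."""
    chunks = [[] for _ in range(num_chunks)]
    for i, doc in enumerate(documents):
        chunks[i % num_chunks].append(doc)
    return chunks


def map_chunk(chunk):
    """Map step: emit a (word, 1) pair for every word in the chunk."""
    return [(word.lower(), 1) for doc in chunk for word in doc.split()]


def shuffle(mapped_outputs):
    """Group intermediate pairs by key so each key is reduced exactly once."""
    grouped = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped


def reduce_group(key, values):
    """Reduce step: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents, num_chunks=3):
    chunks = split_input(documents, num_chunks)
    # On a real cluster each chunk would be mapped on a different machine.
    mapped = [map_chunk(chunk) for chunk in chunks]
    grouped = shuffle(mapped)
    return dict(reduce_group(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    docs = ["big data needs fast networks", "big clusters move big data"]
    print(word_count(docs))
```

In this toy version the shuffle is just a dictionary update.  On a distributed system it is the point at which intermediate results have to travel between machines, which is why the network matters so much to this style of processing.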

Together, these two papers created a data processing renaissance.  While RDBs still have their place, they are no longer the single solution to all problems in the data processing world.  For problems involving large data volumes in particular, solutions derived from these two papers have emerged over the past decade to give developers and architects far more choice than they had in the RDB-exclusive world that existed previously.

Hadoop Democratizes Big Data; Now Where Are You Going to Run It?

The next logical step in this evolution in an era of Open Source programming was for somebody to take the theories laid out in these Google papers and transform them into a reality that everyone could use.  This is precisely what Doug Cutting and Michael J. Cafarella did, and they called the result Hadoop.  With Hadoop, anyone now had the software to tackle huge data volumes and perform sophisticated queries.  What not everybody could afford, however, was the hardware to run it on.

Enter cloud computing, specifically Infrastructure as a Service (IaaS).  With IaaS, pioneered largely by Amazon with its Amazon Web Services offering, anyone could lease the hundreds if not thousands of compute nodes needed to run big Hadoop jobs instead of purchasing the physical machines.  Combine that idea with orchestration software from folks like Opscode or Puppet Labs and you could automate the creation of your virtualized hardware, the installation and configuration of the Hadoop software, and the loading of large data volumes to minimize the costs of performing these queries.

Again, everybody is happily solving problems and making money.  But we aren’t done.  There’s another step to this evolution, and it’s happening now.

InfiniBand: Making Hadoop Faster and More Economical

Running Hadoop and other Big Data queries on IaaS produces results, but slowly: the combination is praised for the answers it can find, and those answers come at the cost of speed.  We saw a data processing revolution sparked by software approaches different from those pioneered in the 1970s.  Hadoop clusters, with all the network traffic they produce in their Map and Reduce steps, can be made to perform better by taking a similarly different approach to the network infrastructure.

Ethernet, the most widely used network infrastructure technology today, has followed a path similar to that of RDBs.  Invented in the 1970s and first published as a commercial standard in 1980, Ethernet is typically deployed as a hierarchical structure of subnets that strings computers together on a network.  It is so common that, like RDBs 10 years ago, most people don’t think they have a choice of something different.

The performance problem with Ethernet comes from its basic structure.  With hierarchies of subnets connected by routers, network packets typically have exactly one path they can traverse between any two points on the network.  You can increase the size of the pipe between those two points slightly, but fundamentally you still just have the one path.

Born in the supercomputing community at the turn of the 21st century, InfiniBand instead uses a grid-like switched fabric, which enables multiple paths for network packets to traverse between two points.  Smart routing that knows what part of the grid is currently busy, akin to the automobile traffic reporting found in smartphone map apps, keeps the flow of traffic throughout the system working optimally.  A typical Ethernet-based network runs at 1 Gigabit per second (Gb/s), and a fast one runs at 10 Gb/s.  A dual-channel InfiniBand network runs at 80 Gb/s, making it a great complement to Map/Reduce steps on a Hadoop cluster.
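To put those line rates in perspective, the following back-of-the-envelope sketch computes how long it would take to move a fixed amount of data at each speed.  It assumes ideal conditions (the full rated bandwidth, no protocol overhead, no contention), and the 1 TB data volume and the function name transfer_seconds are hypothetical choices made for illustration, so real-world figures will be worse on every link.

```python
def transfer_seconds(data_bytes, link_gbps):
    """Ideal time to move data_bytes over a link rated at link_gbps (10^9 bits/s)."""
    bits = data_bytes * 8
    return bits / (link_gbps * 1e9)


if __name__ == "__main__":
    one_terabyte = 10**12  # hypothetical volume of data to move
    links = [
        ("1 Gb/s Ethernet", 1),
        ("10 Gb/s Ethernet", 10),
        ("80 Gb/s InfiniBand", 80),
    ]
    for label, gbps in links:
        print(f"{label}: {transfer_seconds(one_terabyte, gbps):,.0f} s")
```

Under these idealized assumptions, 1 TB takes 8,000 seconds at 1 Gb/s, 800 seconds at 10 Gb/s, and 100 seconds at 80 Gb/s.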

We’ve seen how a software revolution that moved us past the exclusive use of RDBs has enabled data mining that was previously unimaginable.  Open Source and cloud computing have made Big Data approachable to a wider audience.  Better speed, which means shorter query times and less time spent leasing IaaS capacity, is achievable using public cloud providers that offer InfiniBand.  This is the next step in the data processing revolution, and the next generation of cloud computing services (also known as Cloud Computing 2.0) brings InfiniBand to the public cloud.  ProfitBricks is the first provider to offer supercomputing-like performance in the public cloud at an affordable price.  Data is becoming democratized, and now High Performance Computing is as well.
