Big Data Is HPC – Let’s Embrace It

By Gary Johnson

October 25, 2012

Big data is all the rage these days. It is the subject of a recent Presidential Initiative, has its own news portal, and, in the guise of Watson, is a game show celebrity. Big data has also caused concern in some circles that it might sap interest and funding from the exascale computing initiative. So, is big data distinct from HPC – or is it just a new aspect of our evolving world of high-performance computing? Should we care? Can we all get along together?

Distinct or Different Aspects of the Same Thing?

The distinction made between big data and HPC may arguably be attributed to events that transpired in the days when computers were people – that is, people using mechanical calculating engines. Statistics, probability, differential equations, numerical analysis, number theory and discrete mathematics all have different roots. While we think of them collectively as simply part of math, their practitioners see them, at best, in distinct clusters. These clusters have distinct traditions, lore, professional societies, meetings, journals and, sometimes, distinct academic departments.

At this point in time, digital computation as we know it is deeply entrenched in the number- and data-crunching applications based on this math. Science and engineering rely heavily on differential equations and numerical analysis; code making and breaking depend on number theory and discrete mathematics; data-intensive applications use statistics and (perhaps) probability.

However, not all applications look the same computationally. Science & engineering applications deal mostly with continuous mathematics and focus on solving partial differential equations. Cryptography & cryptanalysis deal mostly with discrete mathematics. While both application areas place serious demands on HPC architectures, those demands have been viewed as distinct enough to merit separate development paths. Both areas have been continuously at the forefront of HPC since the Second World War.

Until recently, “data-intensive” applications have been viewed from a computational perspective as not being all that intense. Much of the computation in this application area has played out on workstations or clusters, using spreadsheets or relational databases. From an HPC perspective, data was a backwater – important, to be sure, but uninteresting computationally.

Now, things appear to be changing. Contemporary processor technologies, on the one hand, and the great expense of developing and fielding trans-petascale computers, on the other, seem to be blurring the boundaries between the “continuous” and “discrete” mathematics camps. Distinctions remain, but cooperation prevails.

At the same time, data has become “big” and its growth rates indicate that it will get much bigger very soon. Interest in this area has also increased dramatically. Traditional data applications have grown very large and are straining the limits of their technology. Meanwhile, a whole new class of applications for which relational databases don’t work well has sprung up – think social networking, counterterrorism, eScience. So, now big data is both important and interesting.

Now all of our HPC applications areas are interesting, two of them are pushing the limits of computer architectures, and the third – big data – is rapidly catching up. So maybe we should think of them as not so distinct, but simply as different aspects of the same thing. Let’s take a closer look at big data to see if this view is justified.

Advent of Big Data

Since the creation of the Web twenty years ago, data-intensive applications have slowly but inexorably become important. These early phases of digital data and data-intensive applications developed outside the sphere of interest of those concerned with advancing numerically intensive applications of science and engineering, except, perhaps, for a small group in the intelligence community. By and large, data-intensive applications were for transaction processing and customer relationship management – important to commerce but not challenging to the intellects of the science community.

Ten years ago, the age of social networks began with Friendster. The table below, based on Wikipedia data, shows the current sizes of a few of the more popular social networks (note recent news reports that Facebook has now passed 1,000,000,000 users).

 

Social Network    Year Launched    Number of Active Users        Date of User Count
LinkedIn          2003             161,000,000                   February 2012
Facebook          2004             901,000,000                   April 2012
Twitter           2006             500,000,000                   April 2012
Google+           2011             250,000,000 (registered)      June 2012

The collection of all social networks and related services, such as cloud-based email and photo and video sharing, is sometimes called the geosocial universe. The rise of social networks and the geosocial universe is significant for many reasons. To name just a few:

  • It marks a transition from a world with a few data/information providers to one where virtually anyone can be a provider;
  • It exploits cognitive surplus and allows large numbers of people to collaborate on, interact with, exchange, and analyze data, and publish their outcomes; and
  • It has enhanced interest in, and facilitated the development of, data-intensive tools and applications.

The advent of digital information from traditional sources, combined with that flowing from the geosocial universe, leads to predictions of enormous future data volumes in our digital universe. A recent CSC study cites a 4,300 percent increase in annual data generation by 2020 – by which time the global data volume is predicted to reach 35 zettabytes (or 35 billion terabytes). The claim is also made that, by 2020, more than 70 percent of the digital universe will be generated by individuals. But enterprises will have responsibility for storing, protecting and managing 80 percent of it.
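These projections can be sanity-checked with a back-of-envelope calculation. A 4,300 percent increase is a factor of 44 over the baseline; the sketch below assumes a 2012 baseline year (the study's exact baseline is not stated here) and derives the implied compound annual growth rate and starting volume:

```python
# Back-of-envelope check of the CSC projection cited above.
# Assumption (not from the study): baseline year is 2012, target year 2020.
growth_factor = 1 + 4300 / 100           # a 4,300% increase means 44x the baseline
years = 2020 - 2012
cagr = growth_factor ** (1 / years) - 1  # implied compound annual growth rate

target_zb = 35                           # predicted 2020 volume, zettabytes
baseline_zb = target_zb / growth_factor  # implied 2012 baseline volume

print(f"growth factor: {growth_factor:.0f}x")     # 44x
print(f"implied CAGR: {cagr:.1%}")                # roughly 60% per year
print(f"implied 2012 baseline: {baseline_zb:.2f} ZB")
```

In other words, reaching 35 zettabytes by 2020 from this assumed baseline requires data generation to grow by roughly 60 percent every year.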

So, the world of data-intensive computing has become intellectually rich, is poised to grow explosively, and needs all the help it can get.


Big Data Challenges

The short version of the “challenges” story is: How do we design, develop and field an infrastructure to capture, curate, analyze, visualize, and use all of this data?

First, we distinguish among three different kinds of data:

  • Observational Data – uncontrolled events happen and we record data about them.
    • Examples include astronomy, earth observation, geophysics, medicine, commerce, social data, the internet of things.
  • Experimental Data – we design controlled events for the purpose of recording data about them.
    • Examples include particle physics, photon sources, neutron sources, bioinformatics, product development.
  • Simulation Data – we create a model, simulate something, and record the resulting data.
    • Examples include weather & climate, nuclear & fusion energy, high-energy physics, materials, chemistry, biology, fluid dynamics.

A useful summary of the current state of the “data deluge” has been provided by Fox, Hey and Trefethen and is drawn upon here. Since most data is yet to be collected, we focus here on data rates rather than absolute amounts. A very high-level summary of some of the current or expected data rates in the three data categories is given in the table below.

Data Type                                     Data Rate          Timing
Observational
  Astronomy: Square Kilometer Array           >100 Tb/sec        2016-2022
  Medicine: Imaging                           >1 EB/year         now
  Earth Observation                           4 PB/year          now
  Facebook                                    >180 PB/year       now
Experimental
  Particle Physics: Large Hadron Collider     15 PB/year         now
  Photon Sources: Advanced Light Sources      7 TB/hour          2015
  Bioinformatics: Human Genome Sequencing     700 Pb/year        now
  Bioinformatics: Human Genome Sequencing     10 Eb/year         future
Simulation
  Fusion Energy                               2 PB/time step     now
  Fusion Energy                               200 PB/time step   2020
  Climate Modeling                            400 PB/year        now

One immediately notices that the data are hard to compare. The rates for observational data are probably the clearest. For example, if we assume that the Square Kilometer Array was to operate continuously at its full capability, then in the 2022 time frame it would be generating just under 400 exabytes per year. This would appear to make it the world’s largest single data generator.
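The Square Kilometer Array estimate works out as follows (an illustrative back-of-envelope, assuming continuous operation at the >100 terabit-per-second rate and decimal units):

```python
# Convert the SKA's projected data rate of >100 Tb/sec (terabits per second)
# into exabytes per year, assuming continuous operation at full capability.
bits_per_second = 100e12               # 100 terabits/second
seconds_per_year = 365 * 24 * 3600     # 31,536,000 seconds

bytes_per_year = bits_per_second / 8 * seconds_per_year
exabytes_per_year = bytes_per_year / 1e18

print(f"{exabytes_per_year:.0f} EB/year")  # just under 400 EB/year
```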

But medical imaging, social data, or the Internet of Things might be larger by 2022. As for the Internet of Things, it is interesting to note the recent publicity about project TrapWire, which is purported to be networking a very large nationwide collection of security cameras and combining this with a predictive software system designed to detect patterns indicative of terrorist attacks or criminal operations. While data rates for this project are not available, it is reasonable to assume that they are very high.

Big Data Computing Platforms

Two HPC companies have been very visible in big data: IBM and YarcData (a Cray company). IBM has captured the big data high ground through its thrust to bring these applications to the enterprise and by using Watson to cleverly exploit large datasets and bring analytics to the foreground.

With its recent creation of YarcData, Cray has clearly stated its intention to focus on big data and to provide platforms (e.g. uRiKA, “a big data appliance for real time graph analytics”) and graph analytics solutions to the world. YarcData has also gained substantial visibility in the data analytics community through its Graph Analytics Challenge.

While specialized computers clearly have a role, for the immediate future most big data can probably be exploited on existing hardware. For example, note that the latest Graph 500 List includes five Blue Gene/Q systems in its top 10, shown below. (There are actually 11 systems in the “top 10,” since Mira and Sequoia, the Blue Gene/Q systems at Argonne and Livermore, are tied for first place.)

The Graph 500 List

Also recall that the current TOP500 list contains four Blue Gene/Q systems in its top 10, including the number one machine, Sequoia. Furthermore, none of the machines listed in the top 10 on the Graph 500 list are specialized, data-crunching engines. That could, of course, change. A new Graph 500 list should be published next month, so we’ll have an opportunity to review the situation.
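For context, the kernel that the Graph 500 benchmark times is a breadth-first search over a very large graph, with results reported in traversed edges per second (TEPS). A minimal, single-node sketch of that kernel (illustrative only; the benchmark itself runs distributed, at billions of vertices) looks like this:

```python
from collections import deque

def bfs(adj, root):
    """Breadth-first search over an adjacency-list graph -- the kernel
    the Graph 500 benchmark times. Returns the BFS parent of each
    vertex reached from the root."""
    parent = {root: root}
    frontier = deque([root])
    while frontier:
        v = frontier.popleft()
        for w in adj.get(v, ()):
            if w not in parent:      # first visit: record a BFS tree edge
                parent[w] = v
                frontier.append(w)
    return parent

# Toy undirected graph with edges 0-1, 0-2, 1-3
adj = {0: [1, 2], 1: [0, 3], 2: [0], 3: [1]}
print(bfs(adj, 0))  # {0: 0, 1: 0, 2: 0, 3: 1}
```

The irregular, data-dependent memory accesses in the inner loop are exactly what make this workload stress machines so differently from dense number-crunching.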

As big data applications succeed and grow, they will eventually need petascale and exascale computing resources. Thus, it would be useful to explore key big data applications in depth and extract an understanding of those attributes that place unique demands on system architectures. The benefits of doing this would be twofold:

  • It would provide a firm basis for tuning or adapting system architectures to big data at the exascale; and
  • It would provide the means to clarify the similarities and differences of number- and data-crunching at extreme scales to the broader HPC community.



Big Data Analytics

The current key to success in big data is analytics. Data archiving, provenance, curation, protection and movement are serious issues, but they are currently known, under active study, and will probably be addressed in a more or less similar fashion across the globe. The discriminator for big data will be the hardware and software architecture and tools for analyzing the data efficiently and effectively. Note, in particular, that:

  • Big data will live in the cloud – either the cloud as we currently see it, or an evolved cloud, shaped to meet the needs of big data.
  • eScience will become a dominant mode of science and it will be a significant big data producer and consumer.
  • Visual analytics will be a must for big data.
  • While structured searches will remain a staple, unstructured searches and graph analytics applications may come to swamp them.
  • Although software like MapReduce and Hadoop and their embellishments are probably here to stay, new approaches and programming models for big data analytics will need to be developed and implemented for many applications – especially those involving unstructured queries and graph analytics.
  • Since big data will be impractical to move, the analytics may need to be pushed to the data, rather than pulling the data to the analytics, as is currently common practice.
  • Compute engines may need to live inside the data (and thus inside the cloud). In fact, depending on the nature of the end-user application, this could turn some big number-crunching computers into big, dedicated data-crunchers as well, using in situ analytics.
  • While big data applications and exascale number-crunching applications may have some common requirements, like large memories and high bandwidth communications, computer architectures for big data will also need to accommodate unique requirements, like efficient non-local memory references and code execution that is difficult to predict.
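The MapReduce model mentioned in the list above can be sketched in a few lines of plain Python – an illustrative toy, not the Hadoop API. Map emits key-value pairs, a shuffle step groups them by key (the part a real framework distributes across machines), and reduce aggregates each group:

```python
from collections import defaultdict
from functools import reduce

def map_phase(documents):
    # Map: each document emits (word, 1) pairs.
    for doc in documents:
        for word in doc.split():
            yield word, 1

def shuffle(pairs):
    # Shuffle: group values by key, as the framework does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate each key's values (here, a sum -> word counts).
    return {key: reduce(lambda a, b: a + b, values)
            for key, values in groups.items()}

docs = ["big data is hpc", "hpc is big"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts)  # {'big': 2, 'data': 1, 'is': 2, 'hpc': 2}
```

Structured aggregations like this fit the model naturally; the graph analytics and unstructured queries discussed above do not decompose so cleanly, which is why new programming models are needed.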

Big Data Solutions for eScience

As defined by Wikipedia, eScience is computationally intensive science that is carried out in highly distributed network environments, or science that uses immense data sets that require grid or cloud computing. Current eSciences include particle physics, bioinformatics, earth sciences and social simulations.

In particular, particle physics has a well-developed eScience infrastructure because of its need for adequate computing facilities for the analysis of results and storage of data originating from the CERN Large Hadron Collider. 

An excellent example of a big data solution for eScience is NASA’s new NASA Earth Exchange (NEX).

This new facility is a virtual laboratory that will allow scientists to tackle global Earth science challenges with global high-resolution satellite observations. NEX combines Earth-system modeling, remote-sensing data from NASA and other agencies, and a scientific social networking platform to deliver a complete research environment. Users can explore and analyze large Earth science data sets, run and share modeling algorithms, collaborate on new or existing projects and exchange workflows and results within and among other science communities.

NEX will be based around the NASA Ames Pleiades system, the world’s largest SGI Altix ICE cluster.

As science applications continue to evolve, it is to be expected that other disciplines will acquire significant eScience aspects. Thus, eScience is a key area for providing big data solutions.

Big Data Solutions for Dispersed Data

As mentioned previously, 70 percent of future big data is expected to be generated by individuals. Some currently known sources for such data include:

  • Citizen Science – where individuals use their cognitive surplus to carry out science activities at home.
  • Quantified Self – where individuals gather and analyze extensive amounts of data about their bodies and well-being.
  • Aging at Home – where individuals use advanced sensor technologies to collect, analyze, and make available to others the data that empowers them to remain in their homes rather than move to an assisted living facility.

The whole area of dispersed data applications seems ripe for growth through the introduction of more intelligent nodes (in many varieties) and the local and remote big data computing, analytics and visualization to back them up.

Big Data is HPC

If your eyes haven’t glazed over by now, hopefully you’ve been persuaded that big data is another aspect of our rich and evolving world of HPC. Holding it at arm’s length makes a distinction that is increasingly without a difference. Furthermore, big data is rich in challenges that complement those posed by our usual science & engineering and cryptography & cryptanalysis applications. At the same time, HPC’s big iron is becoming very big and heterogeneous at many levels. Surely there’s room in there for big data.

About the Author

Gary M. Johnson is the founder of Computational Science Solutions, LLC, whose mission is to develop, advocate, and implement solutions for the global computational science and engineering community.

Dr. Johnson specializes in management of high performance computing, applied mathematics, and computational science research activities; advocacy, development, and management of high performance computing centers; development of national science and technology policy; and creation of education and research programs in computational engineering and science.

He has worked in Academia, Industry and Government. He has held full professorships at Colorado State University and George Mason University, been a researcher at United Technologies Research Center, and worked for the Department of Defense, NASA, and the Department of Energy.

He is a graduate of the U.S. Air Force Academy; holds advanced degrees from Caltech and the von Karman Institute; and has a Ph.D. in applied sciences from the University of Brussels.
