September 09, 2011
The observer influences the events he observes by the mere act of observing them or by being there to observe them.
--Isaac Asimov, Foundation's Edge
Elements of science fiction have helped us venture guesses about what the future might look like—at least in terms of the technologies some suspect might be pervasive one day. Flying cars, robot housekeepers, and of course, supercomputers that can predict the future and answer humanity’s most pressing questions, are all staples.
This week, news emerged that might bring the all-knowing "supercomputer as fortuneteller" trope into reality or, if nothing quite so dramatic, at least help us better understand the connections between the news and its tone in geographical context.
A recent project called “Culturomics 2.0: Forecasting Large-Scale Human Behavior Using Global News Media Tone in Time and Space” set out to use tone and geographic analysis methods to yield new insights about global society. If the lead researcher behind the project is correct, this could not only provide opportunities for societal research at global scale but could also act as a warning bell before crises occur.
Kalev H. Leetaru, Assistant Director for Text and Digital Media Analytics at the Institute for Computing in the Humanities, Arts and Social Science at the University of Illinois and Center Affiliate at NCSA, spearheaded the Culturomics 2.0 project. He claims that his analytics experiment has already allowed him to successfully forecast recent revolutions in Tunisia, Egypt, and Libya. Leetaru also says that he has been able to foresee stability in Saudi Arabia (at least through May 2011), and retroactively estimate Osama Bin Laden’s likely hiding place within a 200-kilometer radius.
Whereas initial Culturomics (1.0) studies focused on the frequency of a particular set of words in digitized books, he says that mere frequency isn’t enough to yield real-time, immediately useful information that reflects the modern world.
Shedding the word-frequency element that defined version 1.0 of Culturomics, Leetaru set out to take deep analytics to a new level, sharpening the focus instead on tone, geography, and the associations these two factors produce.
The project received funding from the National Science Foundation and was managed in part by the University of Tennessee’s Remote Data Analysis and Visualization Center (RDAV) and the National Institute for Computational Science (NICS). Leetaru was granted time on the large shared memory supercomputer Nautilus as part of the Extreme Science and Engineering Discovery Environment (XSEDE) program.
Leetaru says using a large shared memory system like Nautilus was the key to achieving his research goals. The 1,024-core Intel Nehalem system, rated at 8.2 teraflops with 4 terabytes of memory available for big data workloads, was manufactured by SGI as part of its UltraViolet product line. A system like this gives researchers more flexibility as they seek to take advantage of vast computing power to analyze “big data” in innovative ways.
Leetaru’s goals with this project represent a perfect example of a data-intensive problem in research. To arrive at his results, Leetaru needed to gather 100 million news articles stretching back half a century. From this point, the process required a staged approach, which began with a data mining algorithm that extracted important terms—people, places and events—to create a base network of 10 billion “nodes” in the network of news history.
With a mere 10 billion elements left following extraction, Leetaru next set about seeking out relationships that connected these nodes to begin building a second network. He said that when this was complete, he was left with a total of 100 trillion relationships, yielding a network that was about 2.4 petabytes in size.
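The relationship-building step described above can be illustrated with a minimal sketch. It assumes that two terms are related when they appear in the same article; the article does not state the actual linking rule, and the function name and toy data here are hypothetical.

```python
from collections import Counter
from itertools import combinations


def build_cooccurrence(articles):
    """Count how often each pair of entities appears in the same article.

    `articles` is an iterable of entity lists, one list per article.
    Returns a Counter keyed by alphabetically sorted (entity_a, entity_b) pairs.
    """
    edges = Counter()
    for entities in articles:
        # Deduplicate so one article adds at most 1 to each pair's count.
        unique = sorted(set(entities))
        for a, b in combinations(unique, 2):
            edges[(a, b)] += 1
    return edges


# Toy data, invented for illustration only.
sample_articles = [
    ["Cairo", "Egypt", "Mubarak"],
    ["Cairo", "Egypt"],
    ["Tunis", "Tunisia"],
]
edges = build_cooccurrence(sample_articles)
print(edges[("Cairo", "Egypt")])
```

Because the number of pairs grows roughly with the square of the number of terms per article, the relationship count quickly outpaces the node count. For scale, 2.4 petabytes spread over 100 trillion relationships works out to about 24 bytes per relationship in decimal units.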
Few machines have that kind of disk space, let alone memory, so he found that he needed to break the project into pieces to process the data. He would look carefully at key pieces and generate the corresponding part of the network on the fly using the shared memory system, then begin the process of refining, a task he said wouldn’t be possible without Nautilus or another large shared memory system.
With the connections established, Leetaru then ran tools to seek out patterns and find interesting differences in tone across countries or regions. Using 1,500 dimensions of analysis that fall under the banner of “tone mining,” which assigns positive or negative scores to text based on dictionaries of words drawn from existing sources, Leetaru was able to build a profile of more profound connections.
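As a rough illustration of dictionary-based tone scoring, the sketch below averages per-word scores over a text. The lexicon, its values, and the function name are hypothetical, and it covers a single dimension rather than the 1,500 mentioned above.

```python
import re

# Hypothetical tone dictionary: word -> score (positive > 0, negative < 0).
TONE_LEXICON = {
    "peace": 1.0,
    "growth": 1.0,
    "crisis": -1.0,
    "violence": -1.0,
    "protest": -0.5,
}


def tone_score(text, lexicon=TONE_LEXICON):
    """Return the mean tone of `text`; words missing from the lexicon score 0."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    total = sum(lexicon.get(word, 0.0) for word in words)
    return total / len(words)


print(tone_score("Crisis and violence"))  # negative overall
print(tone_score("Peace and growth"))     # positive overall
```

Dividing by the total word count keeps long and short articles comparable, though a real system might normalize differently.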
These variances in the tone of global news were matched with geographic mining efforts, which place the nodes and tones using an algorithm that seeks to determine which locations the news sources are talking about. Leetaru explains that this is not a simple algorithm, since there are many cities called “Cairo” in the world. The algorithm must mine for contextual references to nearby places or other elements to assign the correct coordinates.
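One simple way to picture the “Cairo” problem is to pick the candidate that lies closest to other, unambiguous places mentioned in the same text. The sketch below does only that; it is not the project's algorithm, the tiny gazetteer is hypothetical, and its coordinates are approximate.

```python
from math import asin, cos, radians, sin, sqrt

# Hypothetical mini-gazetteer: name -> list of (label, latitude, longitude).
GAZETTEER = {
    "Cairo": [("Cairo, Egypt", 30.04, 31.24),
              ("Cairo, Illinois, USA", 37.01, -89.18)],
    "Giza": [("Giza, Egypt", 30.01, 31.21)],
    "Paducah": [("Paducah, Kentucky, USA", 37.08, -88.60)],
}


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two points."""
    p1, p2 = radians(lat1), radians(lat2)
    dphi = p2 - p1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(p1) * cos(p2) * sin(dlmb / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))


def disambiguate(name, context_names, gazetteer=GAZETTEER):
    """Choose the candidate for `name` nearest to unambiguous context places."""
    candidates = gazetteer[name]
    # Only context places with exactly one gazetteer entry serve as anchors.
    anchors = [gazetteer[c][0] for c in context_names
               if c != name and len(gazetteer.get(c, [])) == 1]
    if not anchors:
        return candidates[0]  # no usable context: fall back to first entry

    def mean_distance(candidate):
        return sum(haversine_km(candidate[1], candidate[2], a[1], a[2])
                   for a in anchors) / len(anchors)

    return min(candidates, key=mean_distance)


print(disambiguate("Cairo", ["Giza"])[0])
print(disambiguate("Cairo", ["Paducah"])[0])
```

A production geocoder would also weigh factors such as population and the publication's home region, which this sketch ignores.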
The final element is the network analysis, or modularity finding, step. Leetaru takes his network and looks for nodes that are more tightly connected to each other than to the rest of the network to find out how nations are related, an analysis that yields a well-defined set of seven civilizations on Earth. Building this kind of network requires taking every city and every article that has ever referenced it; each city then becomes a node with its own complex network of tones, meanings and potential for new findings.
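The article does not say which modularity-finding algorithm was used. The sketch below shows only the standard modularity score that such methods try to maximize, computed for a given grouping of an undirected, unweighted graph without self-loops; the toy graph and all names are hypothetical.

```python
from collections import defaultdict


def modularity(edges, community_of):
    """Modularity Q of a partition of an undirected, unweighted graph.

    Q = sum over communities c of (L_c / m) - (d_c / (2 * m)) ** 2, where
    m is the number of edges, L_c the number of edges inside c, and d_c
    the sum of the degrees of the nodes in c.
    """
    m = len(edges)
    internal = defaultdict(int)
    degree_sum = defaultdict(int)
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        degree_sum[cu] += 1
        degree_sum[cv] += 1
        if cu == cv:
            internal[cu] += 1
    return sum(internal[c] / m - (degree_sum[c] / (2 * m)) ** 2
               for c in degree_sum)


# Toy graph: two triangles joined by a single bridge edge (C, D).
toy_edges = [("A", "B"), ("B", "C"), ("A", "C"),
             ("D", "E"), ("E", "F"), ("D", "F"),
             ("C", "D")]
two_groups = {"A": 0, "B": 0, "C": 0, "D": 1, "E": 1, "F": 1}
one_group = {node: 0 for node in "ABCDEF"}

q_two = modularity(toy_edges, two_groups)
q_one = modularity(toy_edges, one_group)
print(q_two > q_one)
```

Splitting the toy graph into its two triangles scores higher than lumping all six nodes together, which is the sense in which the nodes in each group are “more tightly connected to each other than to the rest of the network.”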
With all of these stages in place, Leetaru says the possibilities are endless. One can watch change over time and create reproducible models—or even go back to look at past events to see how closely one can predict the end result. In the full paper, Leetaru hits on some of his successes showing how major crises have played out in a particular set of ways—offering a chance for researchers to predict the future.
Leetaru pointed to the benefits of using the shared memory system Nautilus with the example that has generated a lot of buzz this week—that his methods led to a retroactive map that pinpointed Bin Laden’s location within 200 km.
"One of the beauties of using a large shared memory machine is that for example I could see an interesting pattern (like the Bin Laden portion where I was assuming there was enough information to pinpoint where he was hiding) and then begin exploring different techniques, including writing quick little Perl scripts that would wrap a small network on the fly actually and process that material and basically make a quick chart or table or map."
He went on to note:
"With a large shared memory machine, you don’t have to worry about memory—I never had to worry about writing MPI code to distribute memory across nodes; it’s like it was infinite--with a quick script I could grab all locations that mentioned “Bin Laden” since he first started to appear in the news around 10 years ago, and map it over time or in different ways. It boiled down to writing easy Perl scripts, running in a matter of minutes—if I didn’t have all that memory it would have taken weeks or months with each iteration so one benefit is that leveraging that much hardware allows you to do simple things.”
Leetaru says that since his days as an undergraduate at NCSA, working with some of the first iterations of web-scale mining, he has been fascinated with the possibilities of deep analytics. His goal with the Culturomics 2.0 project was to forecast large-scale human behavior using global news media tone in time and space, but along the way he stumbled upon a few unexpected findings, including that the news is indeed becoming “more negative” in general tone and that the United States tends to favor itself in its own news filings.
In this era of deep analytics that harness real-time news and sentiment, the Foundation series from Isaac Asimov is never far from the mind. For those who haven’t read the books, in a very small nutshell, mathematical formulas allow civilization to predict the future course of history…and madness ensues.
All arguments about potential for chaos or leaps forward for civilization aside, advances in analytics and high-performance computing like those produced on the Nautilus supercomputer have brought this series of classic science fiction tales into the realm of possibility.