From the Editor | Main Blog Index
October 21, 2010
In supercomputing these days, it's usually the big science applications (astrophysics, climate simulations, earthquake predictions and so on) that seem to garner the most attention. But a new area is quickly emerging onto the HPC scene under the general category of informatics or data-intensive computing. To be sure, informatics is not new at all, but its significance to the HPC realm is growing, mainly due to emerging application areas like cybersecurity, bioinformatics, and social networking.
The rise of social media, in particular, is injecting enormous amounts of data into the global information stream. Making sense of it with conventional computers and software is nearly impossible. With that in mind, a story in MIT Technology Review about using a supercomputer to analyze Twitter data caught my attention. In this case, the supercomputer was a Cray XMT machine operated by the DOE at Pacific Northwest National Lab (PNNL) as part of their CASS-MT infrastructure.
The application software used to drive this analysis was GraphCT, developed by researchers at Georgia Tech in collaboration with the PNNL folks. GraphCT is short for Graph Characterization Toolkit, and is designed to analyze really massive graph structures, like for example, the type of data that makes up social networks such as Twitter.
For those of you who have been hiding under a rock for the last few years, Twitter is a social media site for exchanging 140-character microblogs, aka tweets. As of April 2010, there were over 105 million registered users, generating an average of 55 million tweets a day. The purpose of Twitter is, of course... well, nobody knows for sure. But it does represent an amazing snapshot of what is capturing the attention of Web-connected humans on any given day. If only one could make sense of it.
Counting tweets or even searching them is a pretty simple task for a computer, but sifting out the Twitter leaders from the followers and figuring out the access patterns is a lot trickier. That's where GraphCT and Cray supercomputing comes in.
GraphCT is able to map the Twitter network data to a graph, and make use of certain metrics to assign importance to the user interactions. It measures something called "betweenness centrality," to rank the significance of tweeters.
Because of the size of the Twitter data and the highly multithreaded nature of the GraphCT software, the researchers couldn't rely on the vanilla Web servers that make up the Internet itself, or even conventional HPC computing gear. Fine-grained parallelism plus sparse memory access patterns necessitated a large-scale, global address space machine, built to tolerate high memory latency.
The Cray XMT, a proprietary SMP-type supercomputer is such a machine, and is in fact specifically designed for this application profile. I suspect the reason you don't hear more about the XMT is because most of them are probably deployed at those top secret three-letter government agencies, where data mining and analysis are job one.
The XMT at PNNL is a 128-processor system with 1 terabyte of memory. The distinguishing characteristic of this architecture is that each custom "Threadstorm" processor is capable of managing up to 128 threads simultaneously. Tolerance for high memory latencies is supported by efficient management of thread context at the hardware level.
The system's 1 TB of global RAM is enough to hold more than 4 billion vertices and 34 billion edges of a graph. To put that in perspective, one of the Twitter datasets from September 2009 was encapsulated in 735 thousand vertices and 1 million edges, requiring only about 30 MB of memory. Applying the GraphCT analysis, the data required less than 10 seconds to process. The researchers estimated that a much larger Twitter dataset of 61.6 million vertices and 1.47 billion edges would require only 105 minutes.
When the Georgia Tech and PNNL researchers ran the numbers, they found that relatively few Twitter accounts were responsible for a disproportionate amount of the traffic, at least on the particular datasets they analyzed. The largest dataset was made up of all public tweets from September 20th to 25th in 2009, containing the hashtag #atlflood (to capture tweets about the Atlanta flood event). In this case, at least, the most influential tweets originated with a few major media and government outlets.
We're likely to be hearing more about the graph applications in HPC in the near future. Data sets and data streams are outpacing the capabilities of conventional computers, and demand for digesting all these random bytes is building rapidly. Since the optimal architectures for this scale of data-intensive processing is apt to be quite different than that of conventional HPC platforms (which tend to be optimized for compute-intensive science codes), this could spur a lot more diversity in supercomputer designs.
To that end, a new group called the Graph 500 has developed a benchmark aimed at this category of applications, and intends to maintain a list of the top 500 most performant graph-capable systems. The first Graph 500 list is scheduled to be released at the upcoming Supercomputing Conference (SC10) in New Orleans next month.
In the meantime, if you're interested in giving GraphCT a whirl, a pre-1.0 release of the software can be downloaded for free from the Georgia Tech website. You'll just need a spare Cray XMT or POSIX-compliant machine to run it on.
Posted by Michael Feldman - October 21, 2010 @ 7:50 PM, Pacific Daylight Time
![]()
Michael Feldman is the editor of HPCwire.
No Recent Blog Comments
Large-scale, worldwide scientific initiatives rely on some cloud-based system to both coordinate efforts and manage computational efforts at peak times that cannot be contained within the combined in-house HPC resources. Last week at Google I/O, Brookhaven National Lab’s Sergey Panitkin discussed the role of the Google Compute Engine in providing computational support to ATLAS, a detector of high-energy particles at the Large Hadron Collider (LHC).
Read more...
The Xeon Phi coprocessor might be the new kid on the high performance block, but out of all first-rate kickers of the Intel tires, the Texas Advanced Computing Center (TACC) got the first real jab with its new top ten Stampede system.We talk with the center's Karl Schultz about the challenges of programming for Phi--but more specifically, the optimization...
Read more...
Although Horst Simon was named Deputy Director of Lawrence Berkeley National Laboratory, he maintains his strong ties to the scientific computing community as an editor of the TOP500 list and as an invited speaker at conferences.
Read more...
May 16, 2013 |
When it comes to cloud, long distances mean unacceptably high latencies. Researchers from the University of Bonn in Germany examined those latency issues of doing CFD modeling in the cloud by utilizing a common CFD and its utilization in HPC instance types including both CPU and GPU cores of Amazon EC2.
Read more...
May 15, 2013 |
Supercomputers at the Department of Energy’s National Energy Research Scientific Computing Center (NERSC) have worked on important computational problems such as collapse of the atomic state, the optimization of chemical catalysts, and now modeling popping bubbles.
Read more...
May 10, 2013 |
Program provides cash awards up to $10,000 for the best open-source end-user applications deployed on 100G network.
Read more...
May 09, 2013 |
The Japanese government has revealed its plans to best its previous K Computer efforts with what they hope will be the first exascale system...
Read more...
05/10/2013 | Cleversafe, Cray, DDN, NetApp, & Panasas | From Wall Street to Hollywood, drug discovery to homeland security, companies and organizations of all sizes and stripes are coming face to face with the challenges – and opportunities – afforded by Big Data. Before anyone can utilize these extraordinary data repositories, however, they must first harness and manage their data stores, and do so utilizing technologies that underscore affordability, security, and scalability.
04/15/2013 | Bull | “50% of HPC users say their largest jobs scale to 120 cores or less.” How about yours? Are your codes ready to take advantage of today’s and tomorrow’s ultra-parallel HPC systems? Download this White Paper by Analysts Intersect360 Research to see what Bull and Intel’s Center for Excellence in Parallel Programming can do for your codes.
In this demonstration of SGI DMF ZeroWatt disk solution, Dr. Eng Lim Goh, SGI CTO, discusses a function of SGI DMF software to reduce costs and power consumption in an exascale (Big Data) storage datacenter.
The Cray CS300-AC cluster supercomputer offers energy efficient, air-cooled design based on modular, industry-standard platforms featuring the latest processor and network technologies and a wide range of datacenter cooling requirements.