April 16, 2009
Late last week Yahoo announced that it had expanded the circle of universities with access to M45, the 4,000-core cluster the company made available for "internet-scale" research in November 2007. The announcement added the University of California at Berkeley, Cornell University, and the University of Massachusetts at Amherst to Carnegie Mellon on the list of universities using the cluster for what Yahoo characterizes as investigations of real problems at real scale.
Yahoo unveiled the M45 cluster and its partnership with Carnegie Mellon in conjunction with SC07. The cluster provides 4,000 cores, 3 TB of RAM, and about 1.5 petabytes of disk to researchers who are "pushing the boundaries of large-scale systems software research." I spoke with Ron Brachman, the head of academic relations at Yahoo, who explained that the company isn't interested in simply providing hours to users who need to run bigger versions of existing jobs. M45 is unusual among large-scale computing resources in that users are specifically encouraged to consider the entire spectrum of system and application software, and how that stack can be improved for more effective computation. M45 users aren't just running big applications; they are experimenting with system and support software to learn about the fundamental aspects of effectively managing large-scale computation.
A primary feature of M45 is its support for Hadoop and Pig. Hadoop is an open source distributed file system and parallel execution environment (based on the MapReduce framework) targeted at data-intensive computing tasks. Yahoo programmers are primary contributors to the project -- which is hosted at the Apache Software Foundation and is free to all -- and the company uses the software to power much of its everyday production computing. Pig is a dataflow programming language developed at Yahoo and built on top of the Hadoop core; it is specifically targeted at analyzing large data sets in parallel.
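For readers unfamiliar with the MapReduce model that Hadoop implements (and that Pig scripts ultimately compile down to), the core idea can be sketched in a few lines. The following is a toy, single-process illustration of the three phases -- map, shuffle, and reduce -- applied to the canonical word-count problem; it is not Hadoop's actual API (which is Java-based and distributes these phases across cluster nodes), just a sketch of the programming model.

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit an intermediate (word, 1) pair for every word."""
    for doc in documents:
        for word in doc.split():
            yield word.lower(), 1

def shuffle(pairs):
    """Shuffle: group intermediate values by key (done by the framework)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: combine each key's values -- here, sum the counts."""
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["the quick brown fox", "the lazy dog", "the fox"]
counts = reduce_phase(shuffle(map_phase(docs)))
```

Because the map and reduce functions are independent per key, the framework can run them on thousands of cores at once -- which is exactly the kind of data-intensive parallelism M45 was built to explore.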
Brachman comes to Yahoo after a career at AT&T Labs and a tour as director of the Information Processing Technology Office at DARPA (the office with responsibility for the HPCS project). He clearly knows what it takes to build a research environment around large-scale computing, and that experience is evident when he speaks about his efforts as head of academic relations at Yahoo. Time and again he comes back to the idea of openness as he describes how Yahoo and the universities work together. You might expect that Yahoo, as a company with real bottom-line concerns, might just be putting window dressing on a way to get universities to work for (nearly) free on its own problems. Not so, says Brachman. When I asked about specific university/Yahoo collaborations, he said that, while collaborations would be great if they grow organically out of research that the universities are interested in, "we wouldn't consider the project a failure if no direct collaborations develop as long as we and the rest of the world learn something new about computing at scale." Good stuff. He goes on to explain that the universities are free to publish their results in the open literature, and that they retain ownership of the intellectual property they develop during research efforts on the cluster.
University participants in the program are selected through a competitive process, and Brachman explains that in evaluating proposals Yahoo was especially interested in people who wanted to work on systems issues, not just on novel applications. Yahoo provides training and technical support to the universities awarded time on the system, and users affiliated with those universities are welcomed onto the cluster after going through a screening process designed to ensure that each project complies with export control guidelines put in place by the US government. Interestingly, for those of us used to running large-scale production computing resources, M45 doesn't provide a batch interface to its users. Brachman says this was intentional, as resource allocation is one of the research issues being addressed, but he agrees that as the three new universities come on board, some additional steps may have to be taken to encourage cooperative behavior among the participants.
Research planned for M45 includes a significant focus on data-intensive applications, as you would probably expect not only from Yahoo's business focus but from the emphasis on Pig and Hadoop. Randal Bryant, dean of the School of Computer Science at Carnegie Mellon, described the research conducted over the past year on the cluster as something that was not possible before:
"Our researchers were able to extract and process documents from the Web in a way that was not possible before, changing the way we think about research problems. We were also able to conduct research over a corpus of 200 million Web pages, processing two orders of magnitude more data. We conducted systems software research, comparing, for example, the performance of the Hadoop file system and other parallel file systems. The simultaneous access to applications and systems software has been a real benefit and we look forward to our continued partnership with Yahoo and joint contributions to the cloud computing community."
The recently added partners cite analysis of "vast amounts of societal-scale information available on the Web, such as voting records, online news sources and polling data," large-scale biodiversity studies, and research on the 8.5 terabytes of scanned book text available in the Internet Archive as targets of activity in the coming months.
Brachman says that when Yahoo visits universities and asks what they want from the company, the answers always focus on developing an understanding of the real problems Yahoo faces every day. "Often, academia works on small approximations of real-world problems," he explains, "or even more typically on small sets of artificial data that is only representative of a given problem. Access to the M45 cluster brings the real world to campus."
S. Shankar Sastry, dean of engineering at the University of California at Berkeley, echoes this perspective. "There is a sense in academia that the quality of work you can produce is dependent on the equipments and instruments that you have in store. The Yahoo cluster is just a wonderful instrument that can transform the ability to work on various issues, because it is much larger than the kind of clusters that one has access to in universities. To build, maintain, use and operate a system like the Yahoo cluster is just not possible for us."
The M45 program is just one facet of Yahoo's recent portfolio of investments in high performance computing. Yahoo is also partnering with HP and Intel on the Open Cirrus project, and is supporting other efforts as well. Open Cirrus brings together scale-out computing resources ("cloud computing" resources, thus the cirrus reference...get it?) hosted in six datacenters on three continents, owned by IDA, UIUC, the Karlsruhe Institute of Technology, HP Labs, Intel Research, and Yahoo.
This level of commitment puts Yahoo's reliance on supercomputing for its core business operations in stark relief. But from Brachman's point of view, it also exposes Yahoo's commitment to advancing the state of the practice in the field so that everyone can benefit from what it and its partners learn. "There are many sources of funding," he explains. "Where there is excitement is in providing resources for research that only a handful of organizations in the world can provide."