HPC User Forum Tackles Leadership Computing And Storage

By Nicole Hemsoth

April 21, 2006

At its eighteenth meeting last week in Richmond, Virginia, the HPC User Forum welcomed a record 133 participants from the U.S., Europe and Japan, and treated them to examples of leadership computing and a technical discussion of HPC storage systems and data management.

The HPC community is invited to attend upcoming HPC User Forum meetings: May 29-30 in Bologna, Italy (hosted by CINECA); June 1-2 at ETH in Zurich; and September 18-20 in Denver, Colorado. The European meetings follow a different format and are free of charge. The meetings are co-sponsored by HPCwire.

IDC's Earl Joseph, who serves as executive director of the HPC User Forum, thanked the users for attending and providing 28 excellent end-user presentations, and the vendors who sponsored the meals, including HP, Intel, IBM, Linux Networx, The Portland Group and Microsoft. He said the HPC purchasing power of the users at the meeting exceeded $1 billion. For the past three years, the HPC market has been the fastest-growing IT market IDC tracks, achieving 94 percent aggregate growth since 2002 and increasing 24 percent in 2005 alone to reach $9.2 billion in revenues. HP and IBM were virtually tied for first place, with Dell in third position. Clusters have been a disruptive force (71 percent growth in 2005) and now represent nearly half the market, while the capability- and enterprise-class segments continue their modest decline. The workgroup segment, for systems priced at $50,000 and under, has grown 200 percent since 2002.

IDC is interested in hearing about users' impressions of multi-core processors and accelerators. Email Earl Joseph, [email protected].

Steering Committee Chairman Paul Muzio, VP-Government Programs for Network Computing Services, Inc. and Support Infrastructure Director of the Army High Performance Computing Research Center, welcomed the Richmond participants. He noted that as the HPC industry approaches petascale computing, many users feel storage has lagged behind HPC system capabilities. The storage topic will be continued at the September meeting. Regarding the AHPCRC, he said research areas include light combat vehicles, vehicle armor, bunker busters, penetrators and fuzes. Body armor is a serious concern: it's heavy, warm and needs to be more flexible. The goal is to develop lighter, equally survivable armor.

Simon Szykman, director of the National Coordination Office for Networking and Information Technology Research and Development (NITRD), noted the Administration's strong support for supercomputing, evidenced by increases in President Bush's 2007 proposed budgets for NSF (12 percent), DOE-Science (35 percent) and NIST (10 percent), and calls for these budgets to be doubled over the next decade. Szykman applauded the HPC User Forum's focus on storage performance as well as cost, showing data that Moore's Law, Linpack performance, and hard-drive capacity and cost have all been improving at much faster rates than storage performance.

Doug Ball said Boeing uses CFD to design major portions of planes today, inside and out. Simulation saves time and money, and test data don't provide as much insight. HPC-based simulation has given Boeing competitive advantages in many areas, for example by reducing the spacing needed between planes at take-off. With reduced spacing, more takeoffs and landings are possible, and flights are safer. Among the remaining challenges: to “fly” the Navier-Stokes equations rather than a database; to computationally determine the acoustic signature of an airplane in one day; and to achieve true multidisciplinary design optimization.

Takeshi Yamaguchi said the main products of Aisin AW Co., Ltd. are automatic transmissions and GPS navigation systems. Aisin had 50 percent global market share for automatic transmissions in 2005 (5 million units) and developed the first six-speed automatic transmission for Lexus, VW and Porsche, as well as hybrid systems for Ford. Aisin had 40 percent global market share in 2005 for GPS systems. Aisin began using CFD analysis for torque converters in 1996 and most recently acquired a 48-processor Linux Networx system. Models with more than 10 million elements will be needed for detailed interior design of automatic transmissions.

Dolores Shaffer of Science and Technology Associates (STA) reviewed the DARPA HPCS Program, noting that Phase II vendors will soon be submitting their proposals for Phase III. Some vendors are now trying to achieve petaflop computing in 2008. Users want petascale capability to perform more detailed simulations, with more data, and to tackle multi-physics, multi-scale problems. The memory wall is growing, and system size is an increasing issue. She noted that more users are employing the HPC Challenge benchmark suite in their procurements. Summer 2006 is the estimated start timeframe for Phase III, in which one or two vendors will be selected to finish their designs and build prototype systems.

Keith Gray, British Petroleum (BP), reported that exploring the Thunder Horse field in the Gulf of Mexico (potential reserves: 1 billion barrels) will require the same order of investment as a new Intel processor plant. HPC has reduced Thunder Horse migration from 3 weeks to 1 day and made BP the energy-industry leader in 64-bit computing. BP's strategy is to lease computers and remain agile enough to take advantage of platform breakthroughs.

Jeff Nichols, Oak Ridge National Laboratory, summarized two important global collaborations involving HPC: the Intergovernmental Panel on Climate Change (IPCC), and ITER, which is exploring plasma energy as a virtually unlimited source of clean electricity. ORNL is the lead U.S. organization for ITER and has done a substantial portion of the U.S. computational work for IPCC. ORNL plans to upgrade its HPC systems to 100 TF in 2006, 250 TF in 2007, and 1 PF perhaps as soon as 2008.

Mike Kemp of Liquid Computing, a new HPC vendor, gave a company update. The company plans its product launch in June 2006, with general availability in August 2006. The product initially will be sold with up to 12 chassis but is designed to scale to a petaflop. It can include co-processors (FPGAs and others) to complement best-in-class microprocessors, provides up to 16 GB of interconnect per chassis, and offers latency of about 2 microseconds between chassis across the network. Each compute module contains four Opterons with RAM.

Paul Muzio introduced the technical sessions on storage, noting that they would address shared file systems software, hierarchical storage systems software, and hardware media, and that users and vendors would be asked, “If this were 2008, what would you need for petascale HPC systems to be delivered in 2011?” He thanked Brad Blasing of AHPCRC-NSCI for organizing the storage sessions.

Henry Newman, Instrumental, said storage requirements for the 2010-11 timeframe were collected from NRO, NSA, DOE-SC, NNSA, HPCMP and NASA, leading to 12 different I/O scenarios. High-speed I/O will be the Achilles heel of petascale computing. Why not use some of the same technologies as earlier, such as vectors and multithreading, to address some of these I/O issues? If it's true for memory, it's true for I/O.

Brad Blasing said the AHPCRC has three different storage systems today (direct attach disk, home directory space, infrastructure space) and wants a single system by 2011, with 3 to 4 PBs, speed configurable to 100 GB/s, and able to retain up to 1 billion files for years.  Since 1985, the price for 1 GB of storage has dropped from $100,000 to about $3. Extrapolating to 2011, this would mean 2 to 3 PBs would cost about $1 million.
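As a back-of-the-envelope check on that extrapolation, here is a minimal sketch assuming a constant exponential price decline and decimal petabytes (assumptions of this sketch, not figures stated in the session):

```python
# Rough check of the storage-price extrapolation quoted above,
# assuming a constant exponential decline in $/GB (illustrative only).

price_1985 = 100_000.0   # $/GB in 1985 (from the talk)
price_2006 = 3.0         # $/GB in ~2006 (from the talk)
years = 2006 - 1985

annual_factor = (price_1985 / price_2006) ** (1 / years)   # ~1.64x cheaper per year

price_2011 = price_2006 / annual_factor ** (2011 - 2006)   # ~$0.25/GB

for pb in (2, 3):
    gigabytes = pb * 1_000_000                 # decimal PB -> GB
    print(f"{pb} PB in 2011: ~${price_2011 * gigabytes:,.0f}")

# Prints figures on the order of $0.5M-$0.8M, broadly consistent with
# the "about $1 million" estimate cited above.
```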

Michael Knowles said the Army Research Laboratory was initially using a failover system with Veritas software and now has standalone active servers. On the disk side, ARL is looking at going to a fourth generation of EMC disks. For the 2011 file system, ARL would want to move HSM straight out to the center fabric and get away from the multiple copies. This implies a scalable, redundant switched network.

According to Hank Heeb, Boeing uses an archival file service (HSM) for long-term retention and is trying to centralize on a single type of file system. By 2011, Boeing expects to grow to at least 500 TBs and have a high-speed, object-based parallel file system with HSM, with the ability to expand dynamically at any time.

Sharan Kalwani, General Motors Corporation, said that today it is hard to meld all the technologies together and achieve 99.97 percent or better uptime. He expects to continue to work with many vendors. For CAE, crash is the biggest application and the data volume is exploding. GM ran 100 simulations per month three years ago and runs 3,000 per month today; this may grow to 10,000 per month in a few years. Five years ago, 12 to 16 GB/s of I/O performance was enough; today it's 160 GB/s. Scalability, security, robustness, data access and data content are all bigger issues today than five years ago.

Bill Thigpen, NASA Ames Research Center, said moving data to and from Columbia has been a challenge. The center is testing a new process for unattended file transfer, and the file system is evolving from direct-connect disks to a shared file system. In 2011, Ames expects to store 125 TB/day and retrieve 25 TB/day, with a 25 PB archive RAID cache, 200 Gb/s across the network, a 200 PB tape media archive, and 1 billion files and directories.
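For context, a minimal sketch of the arithmetic those projections imply, assuming decimal units and a uniform daily load (neither of which the presentation specified):

```python
# Illustrative arithmetic only: what sustained rate do the Ames 2011
# projections imply, compared with the stated 200 Gb/s network target?
# (Decimal units assumed: 1 TB = 1e12 bytes.)

store_tb_per_day = 125
retrieve_tb_per_day = 25
seconds_per_day = 86_400

total_bytes = (store_tb_per_day + retrieve_tb_per_day) * 1e12
avg_gbytes_per_s = total_bytes / seconds_per_day / 1e9    # ~1.7 GB/s
avg_gbits_per_s = avg_gbytes_per_s * 8                    # ~14 Gb/s

print(f"average sustained load: ~{avg_gbytes_per_s:.1f} GB/s "
      f"(~{avg_gbits_per_s:.0f} Gb/s)")
print(f"headroom vs. 200 Gb/s network: ~{200 / avg_gbits_per_s:.0f}x")
```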

Carol Pasti said NOAA/NCEP needs to deliver more than 200,000 products per day (weather forecasts and analyses) on time. They are a large IBM shop and have 3 PBs of tape storage. Future requirements include improved write performance on smaller file systems; easier disk management; improved metadata access (GPFS multi-clustering will replace NFS); and an archive that looks, acts and feels like a file system from all nodes in the cluster.

According to Paul Buerger, the Ohio Supercomputer Center's current environment includes a 400 TB SAN, plus Ext3/NFS, PVFS2 (scratch space), SAN FS (special purpose projects) and a database server. The number of files has tripled in the past three years (tens of millions today). “If we extrapolate to 2011, we're up around 1 petabyte of storage need and about 200 million files.  Flexibility is the huge overriding future requirement, because our users don't ask us before doing what they do.”
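A rough check of that extrapolation: the starting file count below is an assumption ("tens of millions today," with 30 million used purely for illustration), while the tripling-every-three-years rate comes from the talk.

```python
# Rough check of the OSC file-count extrapolation quoted above.

files_2006 = 30e6            # assumed starting point ("tens of millions")
growth_per_3_years = 3.0     # "tripled in the past three years"
years = 2011 - 2006

files_2011 = files_2006 * growth_per_3_years ** (years / 3)
print(f"projected files in 2011: ~{files_2011 / 1e6:.0f} million")
# ~190 million with these assumptions, in line with the
# "about 200 million files" estimate.
```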

Nick Nystrom noted that the Pittsburgh Supercomputing Center heads user support and security for the Teragrid. PSC has already begun to plan as far out as 2011, driven by requirements for efficient application I/O, checkpoint/restart and data movement across the Teragrid. By that timeframe, PSC expects to require 1 GB/s of bandwidth per TF of computation.
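A minimal sketch of what that rule of thumb implies, using the ORNL system sizes mentioned earlier as example points (the pairing is illustrative, not something PSC stated):

```python
# The PSC rule of thumb quoted above: ~1 GB/s of I/O bandwidth per TF
# of compute, applied to a few example system sizes.

gb_per_s_per_tf = 1.0

for teraflops in (100, 250, 1_000):    # e.g., the ORNL upgrade path cited earlier
    bandwidth_gb_s = teraflops * gb_per_s_per_tf
    print(f"{teraflops:>5} TF system -> ~{bandwidth_gb_s:,.0f} GB/s "
          f"(~{bandwidth_gb_s / 1000:.2f} TB/s) aggregate I/O")
```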

Bryan Banister, San Diego Supercomputer Center, said the center is part of the Teragrid and, as an academic center, runs many types of applications. Earthquake simulations alone generate 47 TBs of data per week. The Storage Resource Broker is a metadata resource. By 2011, he expects archival data to grow to more than 100 PBs.

Priya Vashishta, University of Southern California, uses HPC to simulate the atomic and molecular behavior of materials, modeling millions of atoms at a time. For nanophase highly energetic materials (a.k.a. explosives), the goal is for the explosive to be stable yet provide maximum energy when ignited. For hypervelocity impact damage, HPC is used to study the cascade of processes that occur following impact. USC's Chevron-supported Center for Interactive Smart Oilfield Technologies uses HPC for reservoir management and model validation.

Major Kevin Benedict explained that the Maui High Performance Computing Center is part of the Air Force Research Laboratory's Directed Energy Directorate and a distributed center for the DoD HPCMP. Customers also include the Navy, Army, Marine Corps and others. Focus areas: image and signal processing (enhancing electro-optical images from telescopes; detecting very small satellites through telemetry and other methods; database fusion and management technologies); battlefield modeling and simulation; and system and software integration.

Kevin Regimbal reported that DOE's Pacific Northwest National Laboratory uses HPC, including its 11.8 TF HP Itanium system, for the NNSA-funded Energy-Smart Data Center (ESDC) program, which is trying to save energy (and improve performance) by cooling supercomputers more efficiently. PNNL is working with SprayCool on adaptive spray cooling, which could make computers much smaller and denser. PNNL has converted HP RX2600 nodes to SprayCool, including a full rack of HP servers.

Brad Blasing moderated part 2 of the storage sessions, this time asking the storage vendors the same question the users had weighed in on earlier.

Sun/StorageTek's Harriet Coverston reviewed Sun's StorEdge Shared QFS, which is fully integrated with Sun's HSM, SAM-FS. Sun is moving to object-based storage, which decouples the physical storage technology from applications and file systems. She said this is a paradigm shift.

Yung Yip reported that in 2011, Imation will be shipping a 4 TB cartridge and will move to 12 TBs in 2016. Transfer rate advances will not be as rosy: probably 300 MB/s in 5 years and 800 MB/s in 10. There are no insurmountable challenges in implementing this roadmap; immediate challenges include mechanical scaling, media noise and head-to-tape spacing. Imation is also working on tape/HDD hybrid cartridges.
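A minimal sketch of what that roadmap implies for full-cartridge transfer times, assuming decimal units and sustained streaming at the quoted rates (the comparison is illustrative, not from the talk):

```python
# Illustrative arithmetic: how long would it take to stream a full tape
# cartridge at the projected transfer rates? (Decimal units assumed.)

roadmap = {
    2011: (4e12, 300e6),    # 4 TB cartridge, ~300 MB/s
    2016: (12e12, 800e6),   # 12 TB cartridge, ~800 MB/s
}

for year, (capacity_bytes, rate_bytes_per_s) in roadmap.items():
    hours = capacity_bytes / rate_bytes_per_s / 3600
    print(f"{year}: full-cartridge read/write takes ~{hours:.1f} hours")

# Capacity grows faster than transfer rate, so filling or restoring a
# single cartridge takes longer -- the point above about transfer-rate
# advances "not being as rosy."
```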

IBM's Bob Curran said that today, cluster file systems are used well beyond big science. In 2011, disk drive capacities will move into the TB range, but with no massive increase in rotational speeds (10 GB/s). General-purpose networks will operate at 30 GB/s or more, over greater distances. This will enable larger and faster file systems. New applications will probably drive where we need to go.

Dave Ellis said that Engenio is part of LSI Logic today. Data is growing, making huge demands on the data center. IT costs are constrained. I/O interfaces and storage protocols are the foundation for the future. Multicore processors are also an important factor.

David Fellinger, DataDirect Networks, said drives have effectively become worse as areal density has increased: their true performance is limited not just by seek time but by error recovery. DDN addresses this with state machine logic, so data is never handled by a CPU, and can do parallel processing of data and error recovery on the fly.

COPAN Systems' Aloke Guha said bandwidth is improving but not latency, so the only way to go is tiering (tiered options such as SSD-Disk-MAID-tape). HSM and tiered storage will evolve to become more content-aware. For large-scale backup, restore is the challenge. For long-term preservation, format is an issue. It's hard to read older tapes in non-current formats.

Jeffrey Denworth, Cluster File Systems, argued that Lustre can deploy storage in almost any way you want. Two key technologies for petascale systems are clustered metadata and client write-back metadata cache. These technologies also enable more enterprise-style functionality. He expects no completely analytical file system within the next 5 years.

Rich Altmaier said SGI's approach has been to provide various storage solutions. In the future, storage architectures should be cleanly layered; you need to assume indexing and tracking databases will contain bugs and data corruption; managing failures of the hardware elements and connectivity will be vital; and notions of virtualization are very valid.

HP's Scott Misage sees three dimensions: (1) users are driving requirements for file sizes and latencies; (2) there is geographical dispersion via grids, multinationals and utilities; and (3) system complexity is growing enormously. Exascale challenges: How do you manage a storage system of 100K nodes? How do you share it across 100K nodes? HP Labs is doing early work on self-managing storage systems.

IDC's Addison Snell reviewed dates and locations for HPC User Forum meetings for the rest of 2006 and early 2007.

Earl Joseph moderated a session on HPC collaborations. The initial speaker, Suzy Tichenor, Council on Competitiveness, said HPC is a key ingredient in America's innovation capacity and reviewed key findings from recent Council studies that showed HPC is essential to business survival. Lack of scalable, production-quality HPC application software is holding industry back from using HPC more aggressively for competitive advantage. There is an opportunity to accelerate application software development through partnerships.

Jeff Candy discussed General Atomics' work on plasma reactions in a tokamak (General Atomics is a participant in DOE's INCITE program). The ultimate goal is to obtain clean energy from fusion, and the largest project related to this is ITER.  The company's GYRO code calculates the turbulent flow of heat and particles in reactor plasmas.

John West, Engineer Research and Development Center (ERDC), said continuing to use the command line as the primary interface to most supercomputers “will cost us competitively in the U.S.”  What's needed to make HPC usage more pervasive, especially among new graduates, is a friendlier, more flexible GUI for which Google might serve as a model.

Kent Misegades said CEI's EnSight visualization tools and Harpoon meshers prove that an ISV dedicated exclusively to the HPC market can be very successful. What ISVs need are early adopters to share risk; access to HPC systems and support from hardware vendors; funded development in the absence of adequate market demand; IP rights; no open source requirement; and government acting to help, not compete with, ISVs.

Dolores Shaffer described opportunities for involvement and partnerships within the DARPA HPCS program. She invited people to make proposals to DARPA. If you have a really strong idea, join DARPA to make it happen.

Sukumar Chakravarthy, Metacomp Technologies, reviewed the American Rocket Car Project, whose goal is to recapture the land-speed record for the U.S. The American vehicle is 47 feet long and has 120,000 horsepower (powered by a hybrid rocket engine). Metacomp is using its own meshing technology and generated an 8.2 million-cell prism/tetrahedral mesh; its job is to analyze the design.

In the final session, Chevron, BP and NOAA discussed dealing with disasters, especially experiences related to Hurricane Katrina.

Mike Netzband, Chevron, stressed the importance of detailed advance preparation. You need to recognize that information is a corporate asset and treat it accordingly; understand what information really is critical; create actionable plans to replicate, mirror and back up data; test the plans; and, most important, act at an appropriate time rather than wait.

Keith Gray, British Petroleum, said the question becomes: what can you do on a reasonable budget? Based on BP's Katrina experience, the company will plan for staff needs (hotel reservations, evacuation routes, how to stay in touch); clarify expectations and plans; understand options for short-term alternative sites; replicate source code; and work to improve near-line capabilities.

Gary Wohl said NOAA/NCEP thinks in terms of continuity of operations rather than disaster recovery. NCEP's mission is all weather, all the time. NCEP needs to assure its whole customer base, which includes first responders, that they can get the information they need to complete their missions. The center runs drills and has established two continuity of operations (COOP) sites. Future plans include increasing the physical separation between primary and backup sites.

Earl Joseph thanked everyone for attending and invited them to join future HPC User Forum meetings, which are posted at www.hpcuserforum.com.
