HPCwire

The Emergence of Interconnect Driven Server Architectures


Mike Kemp, CTO, Liquid Computing

Back to the future

Technology convergence and consolidation are a natural part of the industry's evolution. The storage arena provides a very recent example. Distributed computing storage used to be based primarily on direct attached storage: a disk physically attached to the server bus accounted for the vast majority of storage. As distributed environments began to grow, this configuration became a real challenge. Multiple discrete disks could not meet the performance and image sizes needed for big applications, the necessary software adjuncts added significant ownership costs, and the management burden became overbearing because backups and security auditing were very time consuming. This drove the emergence of Storage Area Networks (SAN) and Network Attached Storage (NAS) devices, which allowed commodity disks, robust communications gear and management functionality to be delivered to the market as an entire system by vendors such as EMC, Hitachi, DataDirect Networks and Network Appliance.

Computing and communications convergence

Distributed computing servers are on the march towards being highly centralized. In an effort to increase computing density and simplify administration, consolidated rack mounted servers became a logical alternative. Over the past 18 months, the industry has moved towards even denser configurations with blade servers.

A blade server is architecturally identical to a rack mount or a tower server but with a different form factor. The main differentiator for a blade is the shared power infrastructure and common management software across blade elements. Blades are interesting to some enterprise customers because they reduce data center footprint and provide more homogeneous configurations.

High performance computing (HPC) users view tower servers, rack mount servers and blades as essentially the same offering from a compute power perspective. All three of these server architectures can be used as a clustered resource. All that is required is a job scheduler to distribute MPI, UPC or other parallelized codes over the compute nodes.

New processors from vendors such as AMD offer a tremendous increase in compute power and flexibility by allowing a more scalable compute infrastructure to be built. Powerful technologies such as AMD's HyperTransport provide the foundation to build high bandwidth interprocessor communications, low latency clusters and extended coherent memory configurations. The challenge facing many HPC users is unlocking this power and eliminating the underlying communications bottlenecks and control limitations imposed by legacy server architectures.

Standards based high performance interfaces such as HyperTransport have created a convergence opportunity to directly connect multi-core processor interfaces into a high performance, fault tolerant interconnect. This convergence eliminates an entire set of performance bottlenecks, consolidates many functions and redefines existing boundaries related to scalability, system control and total cost of system ownership.

Adjunct interconnects have big limitations

While clusters hold the promise of improving distributed scalable computing, their biggest challenge is also the characteristic that defines them. To get the maximum output from a large set of independent compute resources, the communications between them must be very low latency and must never create a performance bottleneck. Unfortunately, current adjunct interconnect mechanisms are limited in bandwidth and force programmers to cope with limited connectivity inside a cluster. IBM and the Department of Energy recognized these limitations and embarked on the Blue Gene project to overcome performance and scalability issues with adjunct interconnects.

General purpose compute nodes don't cut it
 
The average cluster node is simply a general purpose server that must satisfy many different design goals. Today's server incumbents must ensure that their server designs perform well as a file/print, web and database server, all at a reasonable price. The traditional server design is a compromise across all of these potential roles. This results in mediocre server performance that does not serve a high performance application well at all.

The traditional server PCI bus is a single bus that must service the needs of the interconnect host channel adapter, the video card and any other card needed inside the server. Even 4-way high performance servers are throttled by the general purpose bus through which they are forced to view the network. The performance bottlenecks within traditional clusters are a constant frustration to MPI programmers, as the resulting applications are significantly throttled by the compute infrastructure they are forced to run on. Developers typically spend as much time tuning their codes to alleviate communications congestion around hot-spots within the cluster network as they do solving scientific or engineering problems. Besides being a distraction from core HPC research, this tuning further reduces code portability to other standards based computing environments.

The solution

High performance computing users need a native interconnect driven server architecture that merges computing and communications resources into a powerful, yet standards based system platform. This converged system is optimized to bring the full power of standards based interfaces, like HyperTransport and leading edge processors, directly into an integrated communications network. Once the compute and communications characteristics are manageable as integral elements of the system, today's HPC bottlenecks are removed and a brand new set of computing characteristics emerge.

Copious affordable bandwidth

Copious amounts of affordable inter-processor communications bandwidth remove the need to code to restrictive low bandwidth levels. Bandwidths of over 10 GB/s are enough to remove the programmatic restrictions currently experienced by many HPC users. The bandwidth must be non-blocking, with a Constant Bi-sectional Bandwidth (CBB) of at least one. This means that any processor is guaranteed full bandwidth to any other processor regardless of any traffic occurring at the time between other processors and, more importantly, that communications performance is independent of processor locality in the network. This is rarely achieved in traditional clustered environments, and its absence causes unpredictable performance results, especially when collective calls are made over a production cluster.

Many interconnects deliver a CBB of between 0.1 and 0.3 due to the exorbitant cost of the layered switch setups required to achieve a CBB of one for bandwidths greater than 2 GB/s. This means that any two processing nodes located at either end of the network may be given only 10 to 30 percent of the maximum node-to-node bandwidth available due to traffic overhead. The result is congestion and associated congestion based latency increases. This kind of architecture again forces the computational scientist to develop algorithms with processor locality in mind. The key to eliminating communications bottlenecks and achieving high bandwidth, low latency communications is a high performance, fault tolerant interconnect with a CBB of one and a routing pattern based on an intelligent load balancing algorithm rather than a source routing algorithm. An interconnect driven server architecture makes this intelligent system balancing transparent to the user.
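
The arithmetic behind these percentages can be sketched in a few lines. This is a minimal illustration rather than a model of any particular interconnect. It assumes that CBB is the ratio of the bandwidth crossing the network bisection to the bandwidth that the nodes on one side of it could inject; the function names, the 64-node cluster and the 2 GB/s link figure are hypothetical.

```python
def cbb_ratio(bisection_bandwidth_gbs, node_count, link_bandwidth_gbs):
    """Assumed definition: bisection bandwidth divided by the bandwidth
    that half of the nodes could inject at full link speed."""
    if node_count < 2 or link_bandwidth_gbs <= 0:
        raise ValueError("need at least two nodes and a positive link bandwidth")
    injection_gbs = (node_count / 2) * link_bandwidth_gbs
    return bisection_bandwidth_gbs / injection_gbs


def worst_case_pair_bandwidth(link_bandwidth_gbs, cbb):
    """Bandwidth a pair of nodes on opposite sides of the bisection can
    rely on when every node is sending across it at the same time."""
    if cbb <= 0:
        raise ValueError("CBB must be positive")
    # A CBB above one cannot give a pair more than its own link speed.
    return link_bandwidth_gbs * min(cbb, 1.0)


if __name__ == "__main__":
    # Hypothetical cluster: 64 nodes, 2 GB/s links, 12.8 GB/s across the bisection.
    ratio = cbb_ratio(12.8, 64, 2.0)
    print(f"CBB = {ratio:.2f}")
    for cbb in (0.1, 0.3, 1.0):
        guaranteed = worst_case_pair_bandwidth(2.0, cbb)
        print(f"CBB {cbb:.1f}: {guaranteed:.1f} GB/s guaranteed per node pair")
```

Under these assumptions, a CBB of 0.1 to 0.3 leaves a distant node pair with only 10 to 30 percent of its link speed under load, which is why locality-aware algorithms become necessary.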

The end of I/O anarchy

Traditional server nodes have not had any ability to set Input/Output (I/O) Quality of Service (QoS) for specific applications. Port-based configurations are possible on SAN switches, but they are statically defined and very inflexible. Converged platforms allow QoS to be defined at the I/O layer, because I/O capacity can be pooled and then allocated to specific applications based on priority needs. For example, seventy percent of the available I/O capacity can be assigned to one application, with three other applications each assigned 10 percent to account for the remaining capacity. As more I/O capacity is added, it can be assigned in the corresponding ratios through software control. This I/O QoS can be defined from the I/O Gateway to the individual compute nodes or to a virtual machine. The QoS parameters can be defined in software and can be changed on a dynamic basis. Policy driven resource management enables system administrators to properly govern the shared resources under heavy load or when component faults occur. This is a central design characteristic of an interconnect driven server architecture. The result is a balanced system where I/O utilization can be engineered, rather than left to vary at random with the characteristics of the applications.
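
The 70/10/10/10 example can be expressed as a simple proportional-share calculation. The sketch below is only an illustration of the ratio arithmetic described above: the application names, the capacity figures in GB/s and the `allocate_io` function are hypothetical and are not taken from any real product's QoS interface.

```python
def allocate_io(total_capacity_gbs, shares):
    """Split pooled I/O capacity among applications in proportion to
    their configured shares (hypothetical policy representation)."""
    if total_capacity_gbs < 0:
        raise ValueError("capacity cannot be negative")
    if any(share < 0 for share in shares.values()):
        raise ValueError("shares cannot be negative")
    total_shares = sum(shares.values())
    if total_shares <= 0:
        raise ValueError("at least one application needs a positive share")
    return {
        app: total_capacity_gbs * share / total_shares
        for app, share in shares.items()
    }


# Hypothetical policy matching the example in the text.
POLICY = {"app_a": 70, "app_b": 10, "app_c": 10, "app_d": 10}

if __name__ == "__main__":
    # The same policy is reapplied as pooled capacity grows from 10 to 20 GB/s.
    for capacity in (10.0, 20.0):
        print(capacity, allocate_io(capacity, POLICY))
```

Because the policy stores ratios rather than absolute figures, adding I/O capacity or changing a priority only requires recomputing the allocation in software.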

Let the processors work

Another challenge facing clusters today is the significant loss of memory to the protocol overhead needed to support large cluster node counts, and the loss of CPU power to the processor cycles needed to manage the interconnect. The inefficient memory usage stems from dedicated cluster node buffers, which are traditionally allocated to each node-to-node connection. Converged systems can use shared buffer communication architectures to create one pool of buffer resources for interconnected nodes, which yields much more efficient memory usage than traditional systems. In addition, the communications component of the converged system is optimized to minimize the CPU load required to service the interconnect. The result is that more memory and CPU cycles are available to service the HPC application.
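
A rough calculation shows why per-connection buffers scale poorly. The sketch below assumes that each node dedicates one fixed-size buffer to every other node in the traditional case, and that a shared pool holds a fixed number of buffers regardless of cluster size. The buffer size, pool size and node counts are hypothetical figures chosen for illustration only.

```python
def per_connection_buffer_bytes(node_count, buffer_bytes):
    """Memory one node spends when it dedicates a buffer to each peer."""
    if node_count < 1 or buffer_bytes < 0:
        raise ValueError("invalid node count or buffer size")
    return (node_count - 1) * buffer_bytes


def shared_pool_buffer_bytes(pool_buffers, buffer_bytes):
    """Memory one node spends on a shared pool of fixed size."""
    if pool_buffers < 0 or buffer_bytes < 0:
        raise ValueError("invalid pool size or buffer size")
    return pool_buffers * buffer_bytes


if __name__ == "__main__":
    KIB = 1024
    MIB = 1024 * KIB
    buffer_size = 256 * KIB   # hypothetical buffer size
    pool_size = 64            # hypothetical number of pooled buffers
    for nodes in (64, 1024, 4096):
        dedicated = per_connection_buffer_bytes(nodes, buffer_size) / MIB
        pooled = shared_pool_buffer_bytes(pool_size, buffer_size) / MIB
        print(f"{nodes} nodes: {dedicated:.2f} MiB dedicated, {pooled:.2f} MiB pooled")
```

With dedicated buffers, the per-node cost grows linearly with the node count, while the pooled cost in this sketch stays constant. In practice a pool would have to be sized for the expected number of simultaneous transfers.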

Tools and control

Clusters are a loosely coupled distributed set of systems that are renowned for a lack of management capabilities. Most of the nodes are standalone operating systems with very little management software in place. This is due to the high cost of third-party management tools, the lack of specialized cluster management tools and the difficulties in managing heterogeneous environments. An interconnect driven server architecture provides a 'big brother' function that watches for problems in both the compute and interconnect components and provides continuous real-time diagnostics. This preemptive management and control watchdog can only be present in a converged system where the compute and communication resources operate in optimized harmony with known operational measurements.

Summary

HPC users are embracing commodity processors and memory, but they lack the means to achieve the performance, scalability and control needed for today's HPC challenges. Traditional SMPs and general-purpose compute nodes are yielding to the groundbreaking capabilities of the interconnect driven server architecture being heralded by companies such as Liquid Computing. This new architecture will deliver on the inherent benefits of computing and communications convergence - it can truly exploit the expanding capabilities of today's microprocessors. The converged system will extend the speed and low latency of HyperTransport to other compute nodes through a dedicated, embedded interconnect. It will also ease the scalability issues associated with traditional hardware based cache-coherent memory management. Developers can also expect MPI, UPC and other parallel languages to be highly optimized for their interconnected application requirements. The embedded management of an interconnect driven server system will yield a homogeneity and a lower cost of ownership model that have not been seen before in computing.

-----

Mike Kemp is the CTO and co-founder of Liquid Computing. Mike holds a joint honors degree in the disciplines of electrical engineering and computing science from the University of Newcastle Upon Tyne, United Kingdom, and has over 24 years of practical industry experience building large scale computer systems with Nortel and DARPA. He has a track record of innovation and bringing technology to market and he has published several papers on large scale switching and interconnect technologies. During his career, he has been responsible for the generation of intellectual property and associated patents in a variety of areas including multiprocessor systems, scalable communications and scalable high availability switching. Mike has developed many highly efficient interconnect and communications systems for a variety of products.

