Pure Vectors/Massive SIMDs: Providing High Bytes/Flops for Efficient and Balanced Supercomputing for All Applications

February 10, 2020

Supercomputing today has turned its vision toward more sustainable directions:

1. Power efficiency, or eco-friendliness
2. Affordable computing for all
3. Heterogeneous system architecture
4. Free or open-source software
5. Balanced systems, or in other words, low total cost of ownership (TCO)

In a single sentence: the goal is to achieve high bytes per flop, not just flops.

Achieving a balanced system was the original goal of the best supercomputer designers of the past (Seymour Cray: “Anyone can build a fast CPU. The trick is to build a fast system.”). This stands in contrast to the more recent past, when the aim was simply fast TFLOPS, with no shyness about building power-hungry systems.

Various presentations from the weather, oil and gas, DoD, and DoE communities at conferences and workshops have revealed that earth system prediction requires balanced computing, or high bytes per flop. Vendors, however, have been focusing on petaFLOPS while neglecting the other important parameters.

This new direction matters both for business and for our environment, because today's generation grows up as scientists from an early age. Huge amounts of information and open data are available on the internet for everyone to experiment with, to invent something new, and to work for the betterment of the human race and our planet. To perform these experiments, modern supercomputers are available to all, from clouds and data centers to accelerator cards in desktops and laptops. However, it is the responsibility of system designers to provide them at a cost, power, form factor, and speed, with accuracy, that do not defeat one of the very reasons they are used: to help prevent global warming.

For example, on a typical modern day we pull out our mobile phones every morning to check the weather, the price of oil, or market trends, and fast, accurate results from these compute- and data-intensive applications are the user's basic expectation. In another news section, we read that the use of fossil fuels, for generating power, running supercomputers, and moving cars, is degrading the climate health of our planet: “Weather supercomputer used to predict climate change is one of Britain’s worst polluters” [1].

Achieving the best performance, that is, accuracy, speed, and efficiency, with availability to all, is the need of the hour. The challenge is taken up by the two major architectures existing today: scalar processors and SIMT processors, or GPUs. Both are capable of becoming massively parallel, fast systems that could overcome the challenges above. Ironically, the software and algorithms that run on these architectures do not follow the rules of parallelism strictly: a typical program is a mix of sequential and parallel code, which keeps either architecture from working at its full potential. But scalar processors and GPUs alike have evolved for exactly this situation by adopting SIMD, pipelined, or vector units on their chips. Vectors are the oldest and best-proven way of achieving parallelism, and therefore a fast system, in an environment where algorithms mix sequential and parallel operations. In other words, it all started with vectors, and it will continue to be vectors that deliver fast systems.
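Amdahl's law quantifies why that sequential/parallel mix caps the benefit of any parallel hardware. Here is a minimal sketch; the parallel fractions below are illustrative assumptions, not measurements:

```python
def amdahl_speedup(p: float, n: int) -> float:
    # Fraction p of the work vectorizes/parallelizes across n lanes;
    # the remaining (1 - p) stays sequential and limits the speedup.
    return 1.0 / ((1.0 - p) + p / n)

# Illustrative parallel fractions (assumptions, not measured workloads):
for p in (0.50, 0.90, 0.99):
    print(f"p={p:.2f}: speedup on 256 lanes = {amdahl_speedup(p, 256):.1f}x")
```

Even with 256 parallel lanes, a 50 percent sequential share caps the speedup near 2x, which is why hardware that also executes the sequential portion well matters so much.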

Vector processors, or SIMD, have ruled the supercomputing and HPC domain since the very first thoughts about parallelism and fast, accurate computing. Every hardware architecture today uses SIMD or vectors in its own way: ARM, Intel, AMD, Nvidia, IBM, and even RISC-V. However, none of them is pure SIMD or pure vector. Processor architectures today must trade off scalar (SISD) against vector (SIMD) operations in pursuit of the ultimate goal: a system capable of running all applications at full compute and power efficiency. Lately, everybody has realized one golden truth: a single hardware architecture, or a single chip, cannot run all software applications and algorithms with the same full efficiency and performance. We now see a big trend toward heavy investment in interconnects, power-efficient processor architectures, and heterogeneous system architectures such as CPU+GPU. The major reason is that the amount of data required by software and algorithms ranges from small to huge. For small and moderately data-intensive but highly compute-intensive applications, CPU+GPU has shown promising results. Extrapolating those results to highly data-intensive applications, however, is not possible with CPU+GPU alone. No wonder processor designers are now investing heavily in interconnects. Looking at the history of supercomputing, the missing piece of this puzzle is a massive SIMD on a single chip: a pure vector processor, the Vector Engine (VE), or SX-Aurora TSUBASA (where one instruction controls operations on 256 elements), delivering high bytes per flop. Pure vectors, or massive SIMD, complete the trinity, CPU+GPU+VE, that is required to solve the supercomputing needs of today.
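To make "one instruction controls operations on 256 elements" concrete, here is a conceptual Python sketch of strip-mining, where a long loop is cut into maximum-vector-length chunks. It illustrates the idea only; it is not VE assembly or any NEC API:

```python
import numpy as np

MAX_VLEN = 256  # maximum vector length: one instruction covers 256 elements

def vector_axpy(a: float, x: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Compute y + a*x in 256-element strips, mimicking how a vector
    unit strip-mines a long loop into vector-length chunks."""
    out = y.copy()
    for start in range(0, len(x), MAX_VLEN):
        end = min(start + MAX_VLEN, len(x))
        # On a vector processor, this entire slice is one vector instruction.
        out[start:end] += a * x[start:end]
    return out

x = np.ones(1000)
y = np.arange(1000.0)
print(vector_axpy(2.0, x, y)[:4])  # [2. 3. 4. 5.]
```

The longer the vector, the fewer instructions the front end must issue per unit of work, which is the core efficiency argument for massive SIMD.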

Figure 1 Mapping software applications to processor architectures

The graph in the figure above shows that applications requiring massive data processing but only moderate compute belong to the top-left area. A system built on this trinity would cover the entire application space, from traditional HPC simulations to HPDA and AI/machine learning. Pure vector architectures require large memory close to the processing cores, and this requirement was the major reason they were phased out of the market almost two decades ago. NEC introduced the Vector Engine (VE), or SX-Aurora TSUBASA, a dedicated, modular (PCIe add-in-card form factor) vector processor with the world's highest memory bandwidth, to overcome the earlier roadblocks of old vector processors and bring back the era of pure vector processing.
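The mapping in Figure 1 is essentially the roofline model: attainable performance is the lesser of the compute roof and the memory roof (bandwidth times arithmetic intensity). Here is a minimal sketch using the per-card VE figures quoted below; the sample arithmetic intensities are illustrative:

```python
def roofline_gflops(peak_gflops: float, bw_gbs: float, flops_per_byte: float) -> float:
    # Attainable performance = min(compute roof, memory roof)
    return min(peak_gflops, bw_gbs * flops_per_byte)

PEAK = 8 * 307.0   # GFLOPS per VE card (8 cores x 307 GFLOPS DP, from the text)
BW   = 1350.0      # GB/s of HBM2 bandwidth per card (from the text)

# Illustrative arithmetic intensities (flops per byte), low to high:
for ai in (0.1, 1.0, 10.0):
    print(f"AI={ai:>4}: attainable = {roofline_gflops(PEAK, BW, ai):7.1f} GFLOPS")
```

Data-intensive codes sit at low arithmetic intensity, where the memory roof (bandwidth) decides performance; that is exactly the top-left region of Figure 1 that high-bytes-per-flop hardware targets.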

Figure 2 A deep dive inside the NEC SX-Aurora TSUBASA

The NEC VE has multiple big vector processor cores (8 cores per card at 307 GFLOPS double precision each) along with a scalar processor core on the same chip, high-bandwidth, large-capacity memory in the package (48 GB of HBM2 at 1.35 TB/s), and a huge 16 MB last-level cache (with 3 TB/s of bandwidth). The result is a fully equipped processor with reduced memory latency and no bottlenecks, leading to industry-leading sustained, balanced supercomputing performance. For example, 64 VE nodes, or cards, in a 42U rack deliver more than 157 TFLOPS in under 30 kW. The VE's application-offload model allows an entire application to run on the card. Additionally, the models shown in the figure below allow applications to execute alongside a CPU and GPU to meet the challenges above.
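Those per-card figures let us check the bytes-per-flop ratio and the rack numbers directly; a quick sanity-check sketch:

```python
# Per-card figures taken from the text above
cores, gflops_per_core = 8, 307.0
hbm2_tbs = 1.35                                  # TB/s of HBM2 bandwidth

peak_tflops = cores * gflops_per_core / 1000.0   # ~2.46 TFLOPS per card
bytes_per_flop = hbm2_tbs / peak_tflops          # ~0.55 B/F
rack_tflops = 64 * peak_tflops                   # 64 cards per 42U rack

print(f"peak per card:  {peak_tflops:.2f} TFLOPS")
print(f"bytes per flop: {bytes_per_flop:.2f}")
print(f"rack peak:      {rack_tflops:.0f} TFLOPS")  # > 157 TFLOPS, as stated
```

A ratio of roughly 0.55 bytes per flop is the "high Bytes/Flops" balance the article argues for; accelerators tuned purely for peak flops typically sit well below this.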

Figure 3 NEC's open-source VE operating system, VEOS, allows different execution models for different application requirements [2]
Additionally, NEC's VE software supports hybrid MPI and direct I/O (VE to VE and VE to I/O), allowing applications to run across the CPU and VE and to transfer files independently of the processing underway on the VE. NEC supercomputers have been famous for over 35 years for delivering the highest efficiency, that is, high bytes per flop. Continuing that legacy, the SX-Aurora TSUBASA, or Vector Engine (VE), delivered the highest efficiency on the November 2019 HPCG list [3], sustaining almost 6 percent of peak performance and showcasing the capability required to live up to the challenges of supercomputing today. Verticals like oil and gas, weather, and statistical machine learning are the areas requiring heavy data processing. The VE card, the pure vector processor, is the missing piece of the trinity (CPU+GPU+VE) for achieving a balanced system with high bytes per flop and low TCO. More than 12,000 VE cards have been chosen by more than 100 customers across the globe within two years of this relaunch, or resurrection, of the massive SIMD vector processor, the NEC SX-Aurora TSUBASA: pure vectors, the missing piece in the trinity achieving the most efficient and balanced supercomputing systems for all applications and users.
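For context, "fraction of peak" on HPCG is simply the measured HPCG GFLOPS divided by the theoretical peak; here is a sketch with an illustrative per-card HPCG number (an assumption for demonstration, not a value taken from the list):

```python
peak_gflops = 8 * 307.0   # VE card peak from the text (8 cores x 307 GFLOPS)
hpcg_gflops = 145.0       # illustrative HPCG result per card (assumption)

fraction = hpcg_gflops / peak_gflops
print(f"HPCG fraction of peak: {fraction:.1%}")  # ~5.9%, near the ~6% cited
```

Because HPCG is dominated by sparse, memory-bound kernels, even single-digit fractions of peak are high by industry standards, and they track bytes per flop far more closely than LINPACK does.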

For more details, or to try the system, please visit our website [4] and join our forum today [5]. To conclude: the challenge is to achieve the best performance, meaning accuracy, speed, and efficiency, with availability to all users and applications. The current trend has the latest processors, from CPUs to GPUs, moving toward power-efficient, workload-specific solutions rather than the all-in-one-chip approach, which was tried and did not succeed. Adding vector processing, the VE, to this class of processors, forming the trinity of CPU+GPU+VE with fast interconnects, will meet and overcome the challenge. At NEC, we have already tested CPU+VE, and our next step is the ultimate trinity of CPU+GPU+VE. The VE, a vector processor or massive SIMD, delivers the missing high bytes per flop, which is the need of the hour.


References:

 
