Pure Vectors/Massive SIMD Providing High Bytes/Flop for Efficient and Balanced Supercomputing for All Applications

February 10, 2020

Supercomputing today has turned its vision toward more reliable directions, such as: 1. power efficiency, or eco-friendliness; 2. affordable computing for all; 3. heterogeneous system architecture; 4. free or open-source software; and 5. balanced systems, or in other words, low TCO. In a single sentence, the aim is “to achieve high bytes/flop” in addition to raw flops.

Achieving a balanced system was the original goal of the best supercomputer designers of the past (“Anyone can build a fast CPU. The trick is to build a fast system.”: Seymour Cray). This contrasts with the more recent focus on simply achieving high TFLOPS, with little concern about building power-hungry systems.

Various presentations from the weather, oil and gas, DoD and DoE communities at conferences and workshops have shown that earth system prediction requires balanced computing, that is, high bytes per flop. However, vendors have been focusing on petaFLOPS while neglecting the other important parameters.

This new direction matters for both business and our environment because the current generation grows up as scientists from an early age. Huge amounts of information and data are openly available on the internet for all to experiment with, to invent something new, and to better the human race and our planet. To perform these experiments, supercomputers in their modern form are available to all, from clouds and data centers to accelerator cards in desktops and laptops. However, it is the responsibility of system designers to provide them at a cost, power, form factor, speed and accuracy that do not undermine the very reason they are used: to prevent global warming.

For example, on a typical day we pull out our mobile phones every morning to check the weather, the price of oil, market trends and so on. Users of these compute-intensive and data-intensive applications expect fast and accurate results as a basic requirement, which creates huge demand. Elsewhere in the news, we also read that the use of fossil fuels for generating power, running supercomputers and moving cars produces emissions that degrade the climate health of our planet, as in “Weather supercomputer used to predict climate change is one of Britain’s worst polluters” [1].

The need of the hour is to achieve the best performance, meaning accuracy, speed and efficiency, with availability to all. The challenge has been taken up by the two major architectures existing today: scalar processors and SIMT processors, or GPUs. Both have huge capability to become massively parallel, fast systems that overcome these challenges. Ironically, the software and algorithms that run on these architectures do not strictly follow the rules of parallelism. A typical algorithm or program is a mix of sequential and parallel code, which prevents the architectures from working at their full potential. But scalar processors as well as GPUs have evolved for exactly this situation by adopting SIMD, pipelined or vector units on their chips. This is the oldest and best-proven way of achieving parallelism, or a fast system, in an environment where algorithms mix sequential and parallel operations. In other words, it all started with vectors, and vectors will continue to be the way to achieve fast systems.
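The effect of mixing sequential and parallel code can be illustrated with Amdahl's law, a standard model that the article does not name but that matches the argument. The sketch below is only illustrative: the function name, the 90 percent parallel fraction and the choice of 256 parallel units are assumptions, not figures from any measured application.

```python
def amdahl_speedup(parallel_fraction: float, n_units: int) -> float:
    """Ideal speedup under Amdahl's law.

    parallel_fraction: share of the runtime that can run in parallel (0..1).
    n_units: number of parallel units (lanes, cores or vector elements).
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if n_units < 1:
        raise ValueError("n_units must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_units)


if __name__ == "__main__":
    # Hypothetical code that is 90% parallel, run on 256 parallel units.
    print(f"speedup: {amdahl_speedup(0.9, 256):.2f}x")
```

Even with 256 parallel units, the 10 percent of sequential code in this hypothetical case keeps the speedup below 10x, which is why the sequential part of an application still matters on a massively parallel chip.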

Vector processors, or SIMD, have ruled the supercomputing or HPC domain since the earliest thinking about parallelism and fast, accurate supercomputing. Every computing hardware architecture today uses SIMD or vectors in its own way, from ARM, Intel, AMD, Nvidia and IBM to RISC-V. However, none of them is pure SIMD or pure vector. Processor architectures today have to trade off between scalar (SISD) and vector (SIMD) operations to reach the ultimate goal: a system capable of solving all applications with full compute and power efficiency. Lately everybody has realized one golden truth: one hardware architecture, or a single chip, cannot run all software applications or algorithms with the same full efficiency and performance. We now see a big trend toward huge investments in interconnects, power-efficient processor architectures and heterogeneous system architectures such as CPU+GPU. The major reason is that the amount of data required by software and algorithms ranges from small to huge. For applications with small to moderate data intensity and high compute intensity, CPU+GPU has shown promising results. Extrapolating those results to highly data-intensive applications is not possible with CPU+GPU. No wonder processor designers are now investing heavily in interconnects. But looking at the history of supercomputing, the missing piece of this puzzle is massive SIMD on a single chip: a pure vector processor, the Vector Engine (VE) of SX-Aurora TSUBASA, in which one instruction controls operations on 256 elements and which delivers high bytes per flop. Pure vectors, or massive SIMD, complete the trinity of CPU+GPU+VE that is required to meet the supercomputing needs of today.
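To make "one instruction controls operations for 256 elements" concrete, the following sketch models strip-mining, the usual way a long loop is cut into vector-length pieces. It is plain Python that only counts how many 256-element vector operations a loop would need; it does not use the NEC toolchain or any real vector intrinsics, and the function names are hypothetical.

```python
VECTOR_LENGTH = 256  # elements handled by one vector instruction, per the article


def strip_mine(n_elements, vector_length=VECTOR_LENGTH):
    """Yield (start, stop) index ranges, each small enough for one vector operation."""
    for start in range(0, n_elements, vector_length):
        yield start, min(start + vector_length, n_elements)


def vector_add(a, b, vector_length=VECTOR_LENGTH):
    """Add two equal-length lists strip by strip.

    Returns (result, issued), where issued is the number of modeled
    vector add operations needed to cover the whole loop.
    """
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    result = [0.0] * len(a)
    issued = 0
    for start, stop in strip_mine(len(a), vector_length):
        # One modeled vector instruction processes the whole strip.
        result[start:stop] = [x + y for x, y in zip(a[start:stop], b[start:stop])]
        issued += 1
    return result, issued


if __name__ == "__main__":
    a = [float(i) for i in range(1000)]
    b = [1.0] * 1000
    _, issued = vector_add(a, b)
    print(f"1000 element additions covered by {issued} vector operations")
```

In this model a 1000-element loop needs 4 vector operations instead of 1000 scalar additions; the last strip is simply shorter than the full vector length.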

Figure 1 Mapping software applications to processor architectures

The graph in the figure above maps applications to architectures: applications that require massive data processing but only moderate compute belong to the top-left area. The overall system achieved with this trinity will cover the entire application space, from traditional HPC or simulations to HPDA or AI/machine learning. Pure vector architectures require large memory close to the processing cores, which was the major reason they were phased out of the market almost two decades ago. The Vector Engine (VE) of SX-Aurora TSUBASA, a dedicated, modular (PCIe-based AIC form factor) processor billed as the vector processor with the world’s highest memory bandwidth, has been introduced by NEC to overcome the roadblocks of older vector processors and bring back the era of pure vector processors.

Figure 2 Inside the NEC SX-Aurora TSUBASA

The NEC VE has multiple large vector processor cores (8 cores per card, with 307 Gflops DP per core) along with a scalar processor core on the same chip, 48 GB of on-package high-bandwidth memory (HBM2 at 1.35 TB/s) and a large 16 MB last-level cache (with 3 TB/s bandwidth). The result is a fully equipped processor with reduced memory latency and fewer bottlenecks, leading to industry-leading sustained, balanced supercomputing performance. For example, 64 VE cards in a 42U rack can deliver more than 157 TF while drawing under 30 kW. The application offload model of the VE allows the entire application to run on the card. Additionally, the models shown in the figure below allow an application to execute alongside the CPU and GPU to meet these challenges.
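The figures quoted above can be cross-checked with simple arithmetic. The sketch below derives per-card peak, bytes per flop and rack-level numbers from the stated values only; it assumes decimal units (1 TB/s = 1000 GB/s), treats 307 Gflops per core as peak rather than sustained, and uses 30 kW as an upper bound on rack power, so the efficiency figure is a lower bound. The function names are illustrative.

```python
CORES_PER_CARD = 8           # from the article
GFLOPS_PER_CORE = 307.0      # double precision, per core, from the article
MEM_BANDWIDTH_GBS = 1350.0   # 1.35 TB/s HBM2, assuming 1 TB = 1000 GB
CARDS_PER_RACK = 64          # from the article
RACK_POWER_KW = 30.0         # "under 30 kW", used here as an upper bound


def card_peak_gflops(cores=CORES_PER_CARD, per_core=GFLOPS_PER_CORE):
    """Peak double-precision Gflops of one VE card."""
    return cores * per_core


def bytes_per_flop(bandwidth_gbs=MEM_BANDWIDTH_GBS, peak_gflops=None):
    """Memory bandwidth divided by peak compute, in bytes per flop."""
    if peak_gflops is None:
        peak_gflops = card_peak_gflops()
    return bandwidth_gbs / peak_gflops


def rack_peak_tflops(cards=CARDS_PER_RACK):
    """Peak double-precision Tflops of a rack of VE cards."""
    return cards * card_peak_gflops() / 1000.0


def rack_gflops_per_watt(power_kw=RACK_POWER_KW):
    """Peak Gflops per watt at the given rack power."""
    return rack_peak_tflops() * 1000.0 / (power_kw * 1000.0)


if __name__ == "__main__":
    print(f"card peak:      {card_peak_gflops():.0f} Gflops")
    print(f"bytes per flop: {bytes_per_flop():.2f}")
    print(f"rack peak:      {rack_peak_tflops():.1f} Tflops")
    print(f"rack Gflops/W:  at least {rack_gflops_per_watt():.2f}")
```

Under these assumptions one card peaks at 2456 Gflops with roughly 0.55 bytes per flop, and 64 cards give about 157.2 Tflops, which is consistent with the "more than 157 TF" quoted for a rack.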

Figure 3 NEC VE OSS, called VEOS, allows different execution models for different application requirements [2]

Additionally, NEC VE software supports hybrid MPI and direct I/O (VE to VE and VE to I/O), so that parts of an application can run on the CPU and file transfers can proceed independently of the application processing on the VE. NEC supercomputers have been known for more than 35 years for delivering the highest efficiency, that is, high bytes per flop. Continuing that legacy, the SX-Aurora TSUBASA, or Vector Engine (VE), delivers the highest efficiency in the November 2019 HPCG results [3], reaching almost 6 percent of peak performance. This showcases the capability required to live up to the challenges of supercomputing today. Verticals such as oil and gas, weather and statistical machine learning are the areas that require heavy data processing. The VE card, a pure vector processor, is the missing piece in the trinity (CPU+GPU+VE) for achieving a balanced system with high bytes per flop and low TCO. More than 12,000 VE cards have been chosen by more than 100 customers across the globe within two years of the relaunch, or resurrection, of massive SIMD or vector processing as the NEC SX-Aurora TSUBASA: pure vectors, the missing piece in the “Trinity achieving the most efficient and balanced supercomputing systems for all applications and users”.

For more details, and to try the system, please visit our website [4] and join our forum today [5]. In conclusion, the challenge is to achieve the best performance, meaning accuracy, speed and efficiency, with availability to all users and applications. The current trend is for the latest processors, from CPUs to GPUs, to move toward power-efficient, specific scalar solutions rather than an all-in-one chip, an approach that was tried and did not succeed. Adding vector processing, the VE, to this class of processors to form the trinity of CPU+GPU+VE, along with fast interconnects, will meet and overcome the challenge. At NEC, we have already tested CPU+VE, and our next step is toward the ultimate trinity of CPU+GPU+VE. The VE, a vector processor with massive SIMD, will thus deliver the missing high bytes per flop, which is the need of the hour.


References:

 
