Supercomputing today has turned its vision toward more sustainable directions:
1. Power efficiency, or eco-friendly computing
2. Affordable computing for all
3. Heterogeneous system architecture
4. Free or open-source software
5. Balanced systems, in other words low TCO
In a single sentence: the goal is to achieve high Bytes/FLOP, not just FLOPS alone.
Achieving a balanced system was the original goal of the best supercomputer designers of the past ("Anyone can build a fast CPU. The trick is to build a fast system." - Seymour Cray). This contrasts with the more recent past of chasing fast TFLOPS while not shying away from building power-hungry systems.
Various presentations from the weather, oil and gas, DoD, and DoE communities at conferences and workshops have revealed that earth system prediction requires balanced computing, that is, high bytes per FLOP. Vendors, however, have been focusing on petaFLOPS while neglecting these other important parameters.
The reasons for this new direction are both business and environmental: today's generation grows up doing science from an early age. Huge amounts of open data and open-source software are available on the internet for everyone to experiment with, to invent something new, and to work for the betterment of the human race and our planet. For performing these experiments, supercomputers in their modern forms are available to all, from clouds and data centers to accelerator cards in desktops and laptops. However, it is the responsibility of system designers to provide them at a cost, power, form factor, and speed with accuracy that does not undermine the very reason they are used: to help prevent global warming.
For example, on a typical modern day, as we pull out our mobile phones every morning to check the weather, the price of oil, or market trends, fast and accurate results from these compute- and data-intensive applications are the user's basic expectation. In another news section, we also read how the use of fossil fuels for generating power, running supercomputers, and moving cars is degrading the climate health of our planet, with headlines such as "Weather supercomputer used to predict climate change is one of Britain's worst polluters".
The goal of achieving the best performance, accuracy, speed, and efficiency, with availability to all, is the need of the hour. The challenge is taken up by the two major architectures existing today: scalar CPUs and SIMT GPUs. Both have a huge capability of becoming massively parallel, fast systems that overcome the challenges mentioned above. Ironically, the software and algorithms that run on these architectures do not follow the rules of parallelism strictly. A typical algorithm or program is a mix of sequential and parallel code, which prevents the hardware from working at its full potential. But scalar CPUs as well as SIMT GPUs have evolved for exactly this situation by adopting SIMD, pipelined, or vector units on their chips. Vectors are the oldest and best-proven form of achieving parallelism, and thus fast systems, in an environment where algorithms mix sequential and parallel operations. In other words, it all started with vectors, and it will continue to be vectors for achieving fast systems.
Vector processors, or SIMD, have ruled the supercomputing and HPC domain since the first thoughts about parallelism and fast, accurate supercomputing. Every computing hardware architecture uses SIMD or vectors today: ARM, Intel, AMD, NVIDIA, IBM, and even RISC-V each do so in their own way. However, none of them are pure SIMD or pure vector designs. Processor architectures today have to trade off between scalar (SISD) and vector (SIMD) operations to reach the ultimate goal of a system capable of running all applications with full compute and power efficiency. Lately, everybody has realized one golden truth: a single hardware architecture or chip cannot run all software applications and algorithms with the same full efficiency and performance. We now see a big trend toward huge investments in interconnects, power-efficient processor architectures, and heterogeneous system architectures like CPU+GPU. The major reason is that the amount of data required by software and algorithms ranges from small to huge. For small and moderately data-intensive, highly compute-intensive applications, CPU+GPU has shown promising results. Extrapolating those results to highly data-intensive applications, however, is not possible with CPU+GPU alone; no wonder processor designers are now investing heavily in interconnects. Looking at the history of supercomputing, the missing piece of this puzzle is massive SIMD on a single chip: a pure vector processor, the Vector Engine (VE) or SX-Aurora TSUBASA (where one instruction controls operations on 256 elements), delivering high bytes per FLOP. Pure vectors, or massive SIMD, complete the trinity of CPU+GPU+VE required for solving the supercomputing needs of today.
The graph in the figure above shows that applications requiring massive data processing belong to the top-left area of moderate compute but massive data processing. The overall system achieved with this trinity will cover the entire application space, from traditional HPC simulations to HPDA and AI/machine learning. Pure vector architectures require large memory close to the processing cores, which was the major reason they were phased out of the market almost two decades ago. The Vector Engine (VE), SX-Aurora TSUBASA, a dedicated modular (PCIe add-in-card form factor) vector processor with the world's highest memory bandwidth, was introduced by NEC to overcome the roadblocks of the earlier vector processors and bring back the era of pure vector processors.
The NEC VE has multiple big vector processor cores (8 cores per card at 307 GFLOPS double precision each) along with a scalar processor core on the same chip, high-bandwidth, large-capacity memory in the same package (48 GB of HBM2 at 1.35 TB/s), and a large 16 MB last-level cache with 3 TB/s of bandwidth. The result is a fully equipped processor with reduced memory latency and no bottlenecks, leading to industry-leading sustained, balanced supercomputing performance. For example, 64 VE nodes (cards) in a 42U rack can deliver more than 157 TFLOPS at under 30 kW. The application-offload model of the VE allows the entire application to run on the card. Additionally, the models proposed in the figure below allow application execution together with CPU and GPU to cater to these challenges.
Additionally, the NEC VE software stack supports hybrid MPI and direct I/O (VE to VE and VE to I/O), so applications can run across CPU and VE, and file transfers can proceed independently of the processing on the VE. NEC supercomputers have been famous for over 35 years for delivering the most efficient systems, that is, high bytes per FLOP. Continuing that legacy, the SX-Aurora TSUBASA Vector Engine delivered the highest efficiency in the November 2019 HPCG results, with almost 6 percent of peak performance, showcasing the capability required to live up to the challenges of supercomputing today. Verticals like oil and gas, weather, and statistical machine learning are the areas requiring heavy data processing. The VE card, a pure vector processor, is the missing piece of the trinity (CPU+GPU+VE), achieving a balanced system with high bytes per FLOP and low TCO. More than 12,000 VE cards have been chosen by more than 100 customers across the globe within 2 years of this relaunch, or resurrection, of massive SIMD vector processors: the NEC SX-Aurora TSUBASA, pure vectors, the missing piece in the trinity achieving the most efficient and balanced supercomputing systems for all applications and users.
For more details and to try the system, please visit our website and join our forum today. To conclude: the challenge is achieving the best performance, accuracy, speed, and efficiency, with availability to all users and applications. The current trend is for the latest processors, from CPUs to GPUs, to move toward power-efficient, specialized solutions rather than the all-in-one-chip solution, which was tried and did not succeed. Adding vector processing, the VE, to this class of processors in the trinity of CPU+GPU+VE, along with fast interconnects, will meet and overcome the challenge. At NEC, we have already tested CPU+VE, and our next step is toward the ultimate trinity of CPU+GPU+VE. Hence, the VE, a massive-SIMD vector processor, will deliver the missing high bytes per FLOP, which is the need of the hour.
- HPCG November 2019 results, entry numbers 158, 149, and 49: https://www.hpcg-benchmark.org/custom/index.html?lid=155&slid=302
- Please contact us at: https://www.nec.com/en/global/solutions/hpc/sx/vector_engine.html
- Join our forum at: https://www.hpc.nec/