PGAS Use Will Rise on New H/W Trends, Says Reinders

By James Reinders

May 25, 2017

In this contributed feature, James Reinders explores how modern hardware designs are unlocking potential for the PGAS programming model.

If you have not already tried using PGAS, it is time to consider adding PGAS to the programming techniques you know. Partitioned Global Address Space, commonly known as PGAS, has been around for decades in academic circles but has seen extremely limited use in production applications. PGAS methods include UPC, UPC++, Coarray Fortran, OpenSHMEM and the latest MPI standard.

Developments in hardware design are giving a boost to PGAS performance that will lead to more widespread usage in the next few years. How much more, of course, remains to be seen. In this article, I’ll explain why PGAS is attracting increased interest and support, show some sample code to illustrate PGAS approaches, and explain why Intel Xeon Phi processors offer an easy way to explore PGAS with performance at a scale not previously available.

PGAS defined

PGAS programming models offer a partitioned global address space, via a programming language or API, whether special support exists in hardware or not. There are four key phrases in this definition:

  • Global address space – any thread can read or write remote data.
  • Partitioned – data is designated as local or global, and this is NOT hidden from us. This is critical: it lets us write our code for locality, which is what enables scaling. (The short sketch after this list makes the local/global distinction concrete.)
  • via a programming language or API – PGAS does not fake fully shared memory via techniques such as copies on page faults. Instead, PGAS always has an interface that a programmer uses to access this “shared memory” capability. A compiler (with a language interface) or a library (with an API) does whatever magic is needed.
  • whether special support exists in hardware or not – as a programmer, I do not care whether there is hardware support, other than my craving for performance!
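
To make the partitioned/global distinction concrete, here is a minimal sketch in UPC (covered in more detail below). The variable names are mine for illustration: local_count is private, with one independent copy per thread, while hits is globally addressable, yet each element still has affinity to a particular thread.

#include <upc.h>
#include <stdio.h>

int local_count;            /* private: one independent copy per thread    */
shared int hits[THREADS];   /* global: element i has affinity to thread i  */

int main(void) {
    local_count = MYTHREAD;         /* touches only this thread's copy     */
    hits[MYTHREAD] = local_count;   /* each thread writes the element it owns */
    upc_barrier;                    /* make all writes globally visible    */
    if (MYTHREAD == 0)              /* thread 0 reads every element, some remote */
        for (int i = 0; i < THREADS; i++)
            printf("hits[%d]=%d\n", i, hits[i]);
    return 0;
}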

PGAS rising

Discussion of PGAS has been around for decades. It has been steadily growing in practicality for more and more of us, and it is ripe for a fresh look by all of us programmers. I see at least three factors coming together that will drive more widespread use in the coming years.

Factor 1: Hardware support for more and more cores connected coherently. In the 1990s, hardware support for the distributed shared memory model emerged in research projects such as Stanford DASH and MIT Alewife, and in commercial products including the SGI Origin, Cray T3D/E and Sequent NUMA-Q. Today’s Intel Xeon Phi processor has many architectural similarities to these early efforts, redesigned for a single-chip implementation. The number of threads of execution is nearly identical, while the performance is much higher, owing largely to a couple of decades of technological advances. This trend not only empowers PGAS, it also lets us explore PGAS today at a scale and performance level never before possible.

Factor 2: Low latency interconnects. Many disadvantages of PGAS are being addressed by low latency interconnects, partly driven by exascale development. The Cray Aries interconnect has driven latencies low enough that PGAS is quite popular in some circles, and Cray’s support for UPC, Coarray Fortran, Coarray C++, SHMEM and Chapel reflects its continued investment in PGAS. Other interconnects, including the Intel Omni-Path Architecture, stand to extend this trend. A key to lower latency is driving functionality out of the software stack and into the interconnect hardware, where it can be performed more quickly and independently. This is a trend that greatly empowers PGAS.

Factor 3: Software support growing. The old adage “where there’s smoke there’s fire” might be enough to convince us that PGAS is on the rise, because software support for PGAS is leading the way. When the U.S. government ran a competition for proposals for highly productive “next generation” programming languages for high performance computing (the High Productivity Computing Systems, or HPCS, program), three competitors were awarded contracts to develop their proposals: Fortress, Chapel, and X10. It is interesting to note that all three included some form of PGAS support. Today, we also see considerable interest and activity in SHMEM (notably OpenSHMEM), UPC, UPC++ and Coarray Fortran (the latter a part of the Fortran standard since Fortran 2008). Even MPI 3.0 offers PGAS capabilities. Any software engineer will tell you that “hardware is nothing without software.” It appears that the hardware support for PGAS will not go unsupported, making this a key factor in empowering PGAS.

OpenSHMEM

OpenSHMEM is both an effort to standardize an API and a reference implementation as a library. This means that reads and writes of globally addressable data are performed with functions rather than the simple assignments we will see in the language-based implementations below. Library calls may not be as elegant, but they leave us free to use any compiler we like. OpenSHMEM is a relatively restrictive programming model because of the desire to map its functionality directly to hardware. One important limitation is that all globally addressable data must be symmetric, which means that the same global variables or data buffers are allocated by all threads. Any static data is also guaranteed to be symmetric. This ensures that the layout of remotely accessible memory is the same for all threads, and enables efficient implementations.

#include <shmem.h>

int main(void) {
    shmem_init();
    if (shmem_n_pes() < 2) shmem_global_exit(1);

    /* allocate from the global heap */
    int *A = shmem_malloc(sizeof(int));
    int B = 134;

    /* store local B at PE 0 into A at processing element (PE) 1 */
    if (shmem_my_pe() == 0) shmem_int_put(A, &B, 1, 1);

    /* global synchronization of execution and data */
    shmem_barrier_all();

    /* observe the result of the store */
    if (shmem_my_pe() == 1) printf("A@1=%d\n", *A);

    /* global synchronization to make sure the print is done */
    shmem_barrier_all();

    shmem_free(A);
    shmem_finalize();
    return 0;
}

A simple OpenSHMEM program, written according to the OpenSHMEM 1.2 specification.  C standard library headers are omitted.
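
The symmetry guarantee for static data deserves a quick illustration. In the minimal sketch below (my own illustration, assuming the OpenSHMEM 1.2 atomic shmem_int_add), every PE atomically increments a statically declared counter that lives on PE 0; no shmem_malloc call is needed because static data is already symmetric.

#include <shmem.h>
#include <stdio.h>

static int counter = 0;  /* static data is symmetric: remotely addressable on every PE */

int main(void) {
    shmem_init();
    /* every PE atomically adds 1 to the counter instance on PE 0 */
    shmem_int_add(&counter, 1, 0);
    /* wait until all increments have landed */
    shmem_barrier_all();
    if (shmem_my_pe() == 0)
        printf("counter@0=%d (expected %d)\n", counter, shmem_n_pes());
    shmem_finalize();
    return 0;
}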

UPC 

Unified Parallel C (UPC) is an extension to C99. The key language extension is the shared type qualifier. Data objects that are declared with the shared qualifier are accessible by all threads, even if those threads are running on different hosts. An optional layout qualifier can also be provided as part of the shared array type to indicate how the elements of the array are distributed across threads. Because UPC is a language and has compiler support, the assignment operator (=) can be used to perform remote memory access. Pointers to shared data can also themselves be shared, allowing us to create distributed, shared linked data structures (e.g., lists, trees, or graphs). Because compilers may not always recognize bulk data transfers, UPC provides functions (upc_memput, upc_memget, upc_memcpy) that explicitly copy data into and out of globally addressable memory. UPC can allocate globally addressable data in a non-symmetric and non-collective manner, which increases the flexibility of the model and can help to enable alternatives to the conventional bulk-synchronization style of parallelism.

#include <upc.h>

int main(void) {
    if (THREADS < 2) upc_global_exit(1);

    /* allocate from the shared heap */
    shared int *A = upc_all_alloc(THREADS, sizeof(int));
    int B = 134;

    /* store local B at thread 0 into A at thread 1 */
    if (MYTHREAD == 0) A[1] = B;

    /* global synchronization of execution and data */
    upc_barrier;

    /* observe the result of the store */
    if (MYTHREAD == 1) printf("A@1=%d\n", A[1]);

    /* global synchronization to make sure the print completes
       before the shared allocation is freed */
    upc_barrier;

    upc_all_free(A);
    return 0;
}

A simple UPC program, written according to the version 1.3 specification.  C standard library headers are omitted.
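
The layout qualifier and the bulk-transfer functions mentioned above can be combined. The sketch below is my own illustration, not from the book chapter: the shared array data is distributed in blocks of four elements per thread, and upc_memput copies a private buffer directly into the block that has affinity to thread 1.

#include <upc.h>
#include <stdio.h>

#define BLK 4

/* layout qualifier [BLK]: blocks of BLK consecutive elements are dealt
   round-robin across threads, so elements 0..3 live on thread 0,
   4..7 on thread 1, and so on */
shared [BLK] int data[BLK*THREADS];

int main(void) {
    int buf[BLK] = {10, 11, 12, 13};  /* private source buffer */

    if (THREADS < 2) upc_global_exit(1);

    /* bulk copy: thread 0 pushes buf into the block owned by thread 1 */
    if (MYTHREAD == 0)
        upc_memput(&data[BLK], buf, BLK * sizeof(int));

    upc_barrier;
    if (MYTHREAD == 1)
        printf("data[%d]=%d on thread 1\n", BLK, data[BLK]);
    return 0;
}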

Fortran Coarrays

The concept of Fortran coarrays, developed as an extension to Fortran 95, was standardized in Fortran 2008. An optional codimension attribute can be added to Fortran arrays, allowing remote access to instances of the array across all images (the Fortran term for what OpenSHMEM calls PEs and UPC calls threads). When using a coarray, an additional codimension is specified using square brackets to indicate the image on which the array elements will be accessed. Note that images are numbered starting from one.

program main
    implicit none
    integer, allocatable :: A(:)[:]
    integer :: B

    if (num_images() < 2) error stop 1

    ! allocate from the shared heap
    allocate(A(1)[*])
    B = 134

    ! store local B at image 1 into A at image 2 (images are numbered from 1)
    if (this_image() == 1) A(1)[2] = B

    ! global synchronization of execution and data
    sync all

    ! observe the result of the store
    if (this_image() == 2) print *, 'A@2=', A(1)[2]

    ! make sure the print is done
    sync all

    deallocate(A)
end program main

A simple Fortran program, written according to the 2008 specification.
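
Coindexed references work in both directions. In the example above, the coindex appears on the left-hand side of an assignment, performing a remote store (a "put"); a coindex on the right-hand side performs a remote read (a "get"). Below is a minimal sketch of a get, of my own construction, assuming at least two images:

program coarray_get
    implicit none
    integer :: x[*]   ! scalar coarray: one instance per image
    integer :: y

    if (num_images() < 2) error stop 1

    x = 100 * this_image()   ! each image initializes its own instance
    sync all                 ! make every instance globally visible

    ! image 1 reads ("gets") the instance that lives on image 2
    if (this_image() == 1) then
        y = x[2]
        print *, 'x on image 2 =', y
    end if
end program coarray_get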

MPI‑3 RMA

The MPI community first introduced one-sided communication, also known as Remote Memory Access (RMA), in the MPI 2.0 standard.  MPI RMA defines library functions for exposing memory for remote access through RMA windows. Experiences with the limitations of MPI 2.0 RMA led to the introduction in MPI 3.0 of new atomic operations, synchronization methods, methods for allocating and exposing remotely accessible memory, a new memory model for cache-coherent architectures, plus several other features. The MPI‑3 RMA interface remains large and complex partly because it aims to support a wider range of usages than most PGAS models. MPI RMA may end up most used as an implementation layer for other PGAS models such as Global Arrays, OpenSHMEM, or Fortran coarrays, as there is at least one implementation of each of these using MPI‑3 RMA under the hood.

#include <mpi.h>

int main(void) {
    MPI_Init(NULL, NULL);

    int me, np;
    MPI_Comm_rank(MPI_COMM_WORLD, &me);
    MPI_Comm_size(MPI_COMM_WORLD, &np);
    if (np < 2) MPI_Abort(MPI_COMM_WORLD, 1);

    /* allocate from the shared heap */
    int *Abuf;
    MPI_Win Awin;
    MPI_Win_allocate(sizeof(int), sizeof(int), MPI_INFO_NULL,
                     MPI_COMM_WORLD, &Abuf, &Awin);
    MPI_Win_lock_all(MPI_MODE_NOCHECK, Awin);

    int B = 134;

    /* store local B at processing element (PE) 0 into A at PE 1 */
    if (me == 0) {
        MPI_Put(&B, 1, MPI_INT, 1, 0, 1, MPI_INT, Awin);
        MPI_Win_flush_local(1, Awin);
    }

    /* global synchronization of execution and data */
    MPI_Win_flush_all(Awin);
    MPI_Barrier(MPI_COMM_WORLD);

    /* observe the result of the store */
    if (me == 1) printf("A@1=%d\n", *Abuf);

    MPI_Win_unlock_all(Awin);
    MPI_Win_free(&Awin);
    MPI_Finalize();
    return 0;
}

A simple MPI RMA program, written according to the version 3.1 specification.  C standard library headers are omitted.
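
Among the MPI 3.0 additions mentioned above are atomic read-modify-write operations. The sketch below is my own illustration, not taken from the book chapter: MPI_Fetch_and_op atomically increments a shared counter on rank 0, a common building block for distributed work queues and ticket locks.

#include <mpi.h>
#include <stdio.h>

int main(void) {
    MPI_Init(NULL, NULL);

    int me;
    MPI_Comm_rank(MPI_COMM_WORLD, &me);

    /* allocate one int of remotely accessible memory per rank */
    int *cbuf;
    MPI_Win cwin;
    MPI_Win_allocate(sizeof(int), sizeof(int), MPI_INFO_NULL,
                     MPI_COMM_WORLD, &cbuf, &cwin);
    *cbuf = 0;                    /* initialize the local counter */
    MPI_Barrier(MPI_COMM_WORLD);  /* counters initialized everywhere */
    MPI_Win_lock_all(MPI_MODE_NOCHECK, cwin);

    /* every rank atomically fetches the old value of the counter
       on rank 0 and adds 1 to it */
    int one = 1, ticket;
    MPI_Fetch_and_op(&one, &ticket, MPI_INT, 0, 0, MPI_SUM, cwin);
    MPI_Win_flush(0, cwin);       /* complete the operation at rank 0 */
    printf("rank %d drew ticket %d\n", me, ticket);

    MPI_Win_unlock_all(cwin);
    MPI_Win_free(&cwin);
    MPI_Finalize();
    return 0;
}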

PGAS and Intel Xeon Phi processors

With up to 72 cores that share memory, an Intel Xeon Phi processor is a perfect device on which to explore PGAS at a performance and scale not previously so widely available. Since we do care about performance, running PGAS on a shared memory device with so many cores is a fantastic proxy for future machines that will offer increased support for PGAS performance across larger and larger systems.


Code examples and figures are adapted from Chapter 16 (PGAS Programming Models) of the book Intel Xeon Phi Processor High Performance Programming – Intel Xeon Phi processor edition, used with permission. Jeff Hammond and James Dinan were the primary contributors to the book chapter and to the examples used in the chapter and in this article. I owe both of them a great deal of gratitude for all their help.

About the Author

James Reinders likes fast computers and the software tools to make them speedy. Last year, James concluded a 10,001 day career at Intel where he contributed to projects including the world’s first TeraFLOPS supercomputer (ASCI Red), compilers and architecture work for a number of Intel processors and parallel systems. James is the founding editor of The Parallel Universe magazine and has been the driving force behind books on VTune (2005), TBB (2007), Structured Parallel Programming (2012), Intel Xeon Phi coprocessor programming (2013), Multithreading for Visual Effects (2014), High Performance Parallelism Pearls Volume One (2014) and Volume Two (2015), and Intel Xeon Phi processor (2016). James resides in Oregon, where he enjoys both gardening and HPC and HPDA consulting.
