If you have not already tried using PGAS, it is time to consider adding it to the programming techniques you know. Partitioned Global Address Space, commonly known as PGAS, has been around for decades in academic circles but has seen extremely limited use in production applications. PGAS methods include UPC, UPC++, Coarray Fortran, OpenSHMEM, and the latest MPI standard.
Developments in hardware design are giving PGAS a performance boost that will lead to more widespread usage in the next few years. How much more, of course, remains to be seen. In this article, I’ll explain why interest in and support for PGAS is growing, show some sample code to illustrate PGAS approaches, and explain why Intel Xeon Phi processors offer an easy way to explore PGAS with performance at a scale not previously available.
PGAS defined
PGAS programming models offer a partitioned global shared memory capability, via a programming language or API, whether special support exists in hardware or not. There are four keys in this definition:
- Global address space – any thread can read/write remote data
- Partitioned – data is designated as local or global, and this is NOT hidden from us – this is critical because it lets us write our code for locality, which enables scaling
- via a programming language or API – PGAS does not fake that all memory is shared via techniques such as copying pages on page faults. Instead, PGAS always has an interface that a programmer uses to access this “shared memory” capability; a compiler (with a language interface) or a library (with an API) does whatever magic is needed.
- whether special support exists in hardware or not – as a programmer, I do not care if there is hardware support other than my craving for performance!
PGAS rising
Discussion of PGAS has been around for decades. It has been steadily growing in practicality for more and more of us, and it is ripe for a fresh look by all of us programmers. I see at least three factors coming together that will lead to more widespread usage in the upcoming years.
Factor 1: Hardware support for more and more cores connected coherently. In the 1990s, hardware support for the distributed shared memory model emerged with research projects including Stanford DASH and MIT Alewife, and commercial products including the SGI Origin, Cray T3D/E, and Sequent NUMA-Q. Today’s Intel Xeon Phi processor, designed as a single-chip implementation, has many architectural similarities to these early efforts. The number of threads of execution is nearly identical, and the performance is much higher, owing largely to a couple of decades of technological advances. This trend not only empowers PGAS, it also enables exploring PGAS today at a scale and performance level never before possible.
Factor 2: Low latency interconnects. Many disadvantages of PGAS are being addressed by low latency interconnects, driven in part by exascale development. The Cray Aries interconnect has driven latencies low enough that PGAS is quite popular in some circles, and Cray’s support for UPC, Fortran coarrays, Coarray C++, SHMEM, and Chapel reflects its continued investment in PGAS. Other interconnects, including Intel Omni-Path Architecture, stand to extend this trend. A key to lower latency is moving functionality out of the software stack and into the interconnect hardware, where it can be performed more quickly and independently. This is a trend that greatly empowers PGAS.
Factor 3: Software support growing. The old adage “where there’s smoke there’s fire” might be enough to convince us PGAS is on the rise, because software support for PGAS is leading the way. When the U.S. government ran a competition for proposals for highly productive “next generation” programming languages for high performance computing (the High Productivity Computing Systems, or HPCS, program), three competitors were awarded contracts to develop their proposals: Fortress, Chapel, and X10. It is interesting to note that all three included some form of PGAS support. Today, we also see considerable interest and activity in SHMEM (notably OpenSHMEM), UPC, UPC++, and Coarray Fortran (the latter being part of the Fortran standard since Fortran 2008). Even MPI 3.0 offers PGAS capabilities. Any software engineer will tell you that “hardware is nothing without software.” It appears that the hardware support for PGAS will not lack software to exploit it, making this a key factor in empowering PGAS.
OpenSHMEM
OpenSHMEM is both an effort to standardize an API and a reference implementation as a library. This means that reads and writes of globally addressable data are performed with functions rather than the simple assignments we will see in the language-based implementations. Library calls may not be as elegant, but they leave us free to use any compiler we like. OpenSHMEM is a relatively restrictive programming model because of the desire to map its functionality directly to hardware. One important limitation is that all globally addressable data must be symmetric, which means that the same global variables or data buffers are allocated by all threads. Any static data is also guaranteed to be symmetric. This ensures that the layout of remotely accessible memory is the same for all threads, and enables efficient implementations.
#include <stdio.h>
#include <shmem.h>

int main(void) {
  shmem_init();
  if (shmem_n_pes() < 2) shmem_global_exit(1);
  /* allocate from the global (symmetric) heap */
  int *A = shmem_malloc(sizeof(int));
  int B = 134;
  /* store local B at PE 0 into A at processing element (PE) 1 */
  if (shmem_my_pe() == 0) shmem_int_put(A, &B, 1, 1);
  /* global synchronization of execution and data */
  shmem_barrier_all();
  /* observe the result of the store */
  if (shmem_my_pe() == 1) printf("A@1=%d\n", *A);
  /* global synchronization to make sure the print is done */
  shmem_barrier_all();
  shmem_free(A);
  shmem_finalize();
  return 0;
}
A simple OpenSHMEM program, written according to the OpenSHMEM 1.2 specification.
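The guarantee that static data is symmetric means remotely accessible variables do not have to come from the symmetric heap at all. The short sketch below is my own illustration, not from the book chapter; the variable name counter and the use of shmem_int_add are just one way to show the idea. Every PE atomically increments a statically allocated counter that lives on PE 0:

#include <stdio.h>
#include <shmem.h>

/* statically allocated data is symmetric, so 'counter' is remotely
   addressable on every PE without any call to shmem_malloc */
static int counter = 0;

int main(void) {
  shmem_init();
  /* every PE atomically adds 1 to the copy of 'counter' on PE 0 */
  shmem_int_add(&counter, 1, 0);
  /* wait until all increments are complete */
  shmem_barrier_all();
  if (shmem_my_pe() == 0)
    printf("counter on PE 0 = %d (expected %d)\n", counter, shmem_n_pes());
  shmem_finalize();
  return 0;
}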
UPC
Unified Parallel C (UPC) is an extension to C99. The key language extension is the shared type qualifier. Data objects that are declared with the shared qualifier are accessible by all threads, even if those threads are running on different hosts. An optional layout qualifier can also be provided as part of the shared array type to indicate how the elements of the array are distributed across threads. Because UPC is a language and has compiler support, the assignment operator (=) can be used to perform remote memory access. Pointers to shared data can also themselves be shared, allowing us to create distributed, shared linked data structures (e.g., lists, trees, or graphs). Because compilers may not always recognize bulk data transfers, UPC provides functions (upc_memput, upc_memget, upc_memcpy) that explicitly copy data into and out of globally addressable memory. UPC can allocate globally addressable data in a non-symmetric and non-collective manner, which increases the flexibility of the model and can help to enable alternatives to the conventional bulk-synchronization style of parallelism.
#include <stdio.h>
#include <upc.h>

int main(void) {
  if (THREADS < 2) upc_global_exit(1);
  /* allocate from the shared heap */
  shared int *A = upc_all_alloc(THREADS, sizeof(int));
  int B = 134;
  /* store local B at thread 0 into A at thread 1 */
  if (MYTHREAD == 0) A[1] = B;
  /* global synchronization of execution and data */
  upc_barrier;
  /* observe the result of the store */
  if (MYTHREAD == 1) printf("A@1=%d\n", A[1]);
  /* make sure the print is done before the collective free */
  upc_barrier;
  upc_all_free(A);
  return 0;
}
A simple UPC program, written according to the UPC version 1.3 specification.
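To illustrate the layout qualifier and the bulk-transfer functions mentioned above, here is a small sketch of my own (not from the book chapter; the array name, block size, and values are illustrative). It declares a blocked shared array and uses upc_memput to copy a private buffer into the block that has affinity to thread 1:

#include <stdio.h>
#include <upc.h>

#define BLOCK 4

/* layout qualifier [BLOCK]: elements are distributed in blocks of 4,
   so A[0..3] has affinity to thread 0, A[4..7] to thread 1, and so on */
shared [BLOCK] int A[BLOCK*THREADS];

int main(void) {
  int local[BLOCK] = {1, 2, 3, 4};
  if (THREADS < 2) upc_global_exit(1);
  /* thread 0 copies its private buffer into the block owned by thread 1 */
  if (MYTHREAD == 0)
    upc_memput(&A[BLOCK], local, BLOCK * sizeof(int));
  upc_barrier;
  /* observe the first element of thread 1's block */
  if (MYTHREAD == 1)
    printf("A[%d]=%d\n", BLOCK, A[BLOCK]);
  return 0;
}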
Fortran Coarrays
The concept of Fortran Coarrays, developed as an extension to Fortran 95, was standardized in Fortran 2008. An optional codimension attribute can be added to Fortran arrays, allowing remote access to the array instances across all threads. When using a Coarray, an additional codimension is specified using square brackets to indicate the image in which the array locations will be accessed.
program main
  implicit none
  integer, allocatable :: A(:)[:]
  integer :: B
  ! Fortran images are numbered starting at 1
  if (num_images() < 2) error stop 1
  ! allocate from the shared heap
  allocate(A(1)[*])
  B = 134
  ! store local B at image 1 into A at image 2
  if (this_image() == 1) A(1)[2] = B
  ! global synchronization of execution and data
  sync all
  ! observe the result of the store
  if (this_image() == 2) print *, 'A@2=', A(1)[2]
  ! make sure the print is done
  sync all
  deallocate(A)
end program main
A simple Fortran coarray program, written according to the Fortran 2008 specification. Note that Fortran images are numbered starting at 1, so the first two images play the roles of PE 0 and PE 1 in the other examples.
MPI‑3 RMA
The MPI community first introduced one-sided communication, also known as Remote Memory Access (RMA), in the MPI 2.0 standard. MPI RMA defines library functions for exposing memory for remote access through RMA windows. Experiences with the limitations of MPI 2.0 RMA led to the introduction in MPI 3.0 of new atomic operations, synchronization methods, methods for allocating and exposing remotely accessible memory, a new memory model for cache-coherent architectures, plus several other features. The MPI‑3 RMA interface remains large and complex partly because it aims to support a wider range of usages than most PGAS models. MPI RMA may end up most used as an implementation layer for other PGAS models such as Global Arrays, OpenSHMEM, or Fortran coarrays, as there is at least one implementation of each of these using MPI‑3 RMA under the hood.
#include <stdio.h>
#include <mpi.h>

int main(void) {
  MPI_Init(NULL, NULL);
  int me, np;
  MPI_Comm_rank(MPI_COMM_WORLD, &me);
  MPI_Comm_size(MPI_COMM_WORLD, &np);
  if (np < 2) MPI_Abort(MPI_COMM_WORLD, 1);
  /* allocate from the shared heap */
  int *Abuf;
  MPI_Win Awin;
  MPI_Win_allocate(sizeof(int), sizeof(int), MPI_INFO_NULL,
                   MPI_COMM_WORLD, &Abuf, &Awin);
  MPI_Win_lock_all(MPI_MODE_NOCHECK, Awin);
  int B = 134;
  /* store local B at processing element (PE) 0 into A at PE 1 */
  if (me == 0) {
    MPI_Put(&B, 1, MPI_INT, 1, 0, 1, MPI_INT, Awin);
    MPI_Win_flush_local(1, Awin);
  }
  /* global synchronization of execution and data */
  MPI_Win_flush_all(Awin);
  MPI_Barrier(MPI_COMM_WORLD);
  /* observe the result of the store */
  if (me == 1) printf("A@1=%d\n", *Abuf);
  MPI_Win_unlock_all(Awin);
  MPI_Win_free(&Awin);
  MPI_Finalize();
  return 0;
}
A simple MPI RMA program, written according to the MPI version 3.1 specification.
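To give a feel for the atomic operations added in MPI 3.0, here is a small sketch of my own (not from the book chapter; the counter, the MPI_SUM operation, and the synchronization choices are illustrative). Every rank uses MPI_Fetch_and_op to atomically increment a counter exposed on rank 0 within a passive-target (lock_all) epoch, in the style of the example above:

#include <stdio.h>
#include <mpi.h>

int main(void) {
  MPI_Init(NULL, NULL);
  int me, np;
  MPI_Comm_rank(MPI_COMM_WORLD, &me);
  MPI_Comm_size(MPI_COMM_WORLD, &np);
  /* expose one int per rank; only the copy on rank 0 is used as the counter */
  int *counter;
  MPI_Win win;
  MPI_Win_allocate(sizeof(int), sizeof(int), MPI_INFO_NULL,
                   MPI_COMM_WORLD, &counter, &win);
  *counter = 0;
  MPI_Win_lock_all(MPI_MODE_NOCHECK, win);
  MPI_Win_sync(win);                /* make the local initialization visible to RMA */
  MPI_Barrier(MPI_COMM_WORLD);      /* everyone is initialized before any update */
  /* atomically add 1 to the counter on rank 0 and fetch the prior value */
  int one = 1, before;
  MPI_Fetch_and_op(&one, &before, MPI_INT, 0, 0, MPI_SUM, win);
  MPI_Win_flush(0, win);
  printf("rank %d saw %d before its increment\n", me, before);
  MPI_Win_unlock_all(win);
  MPI_Win_free(&win);
  MPI_Finalize();
  return 0;
}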
PGAS and Intel Xeon Phi processors
With up to 72 cores that share memory, an Intel Xeon Phi processor is a perfect device for exploring PGAS with performance at a scale not previously so widely available. Since we do care about performance, running PGAS on a shared memory device with so many cores is a fantastic proxy for future machines that will offer increased support for PGAS performance across larger and larger systems.
Code examples and figures are adapted from Chapter 16 (PGAS Programming Models) of the book Intel Xeon Phi Processor High Performance Programming – Intel Xeon Phi processor Edition, used with permission. Jeff Hammond and James Dinan were the primary contributors to the book chapter and the examples used in the chapter and in this article. I owe both of them a great deal of gratitude for all their help.
About the Author
James Reinders likes fast computers and the software tools to make them speedy. Last year, James concluded a 10,001 day career at Intel where he contributed to projects including the world’s first TeraFLOPS supercomputer (ASCI Red), compilers and architecture work for a number of Intel processors and parallel systems. James is the founding editor of The Parallel Universe magazine and has been the driving force behind books on VTune (2005), TBB (2007), Structured Parallel Programming (2012), Intel Xeon Phi coprocessor programming (2013), Multithreading for Visual Effects (2014), High Performance Parallelism Pearls Volume One (2014) and Volume Two (2015), and Intel Xeon Phi processor (2016). James resides in Oregon, where he enjoys both gardening and HPC and HPDA consulting.