PGAS Use Will Rise on New H/W Trends, Says Reinders

By James Reinders

May 25, 2017

In this contributed feature, James Reinders explores how modern hardware designs are unlocking potential for PGAS programming models.

If you have not already tried using PGAS, it is time to consider adding PGAS to the programming techniques you know. Partitioned Global Address Space, commonly known as PGAS, has been around for decades in academic circles but has seen extremely limited use in production applications. PGAS methods include UPC, UPC++, Coarray Fortran, OpenSHMEM and the latest MPI standard.

Developments in hardware design are giving a boost to PGAS performance that will lead to more widespread usage in the next few years. How much more, of course, remains to be seen. In this article, I’ll explain why interest in and support for PGAS are increasing, show some sample code to illustrate PGAS approaches, and explain why Intel Xeon Phi processors offer an easy way to explore PGAS with performance at a scale not previously available.

PGAS defined

PGAS programming models offer a partitioned global shared memory capability, via a programming language or API, whether special support exists in hardware or not. Four keys in this definition:

  • Global address space – any thread can read/write remote data
  • Partitioned – data is designated as local or global; this is NOT hidden from us, and that is critical, because it lets us write our code for locality to enable scaling (see the sketch after this list)
  • via a programming language or API – PGAS does not pretend that all memory is shared by resorting to tricks such as copying pages on page faults. Instead, PGAS always has an interface that a programmer uses to access this “shared memory” capability. A compiler (with a language interface) or a library (with an API) does whatever magic is needed.
  • whether special support exists in hardware or not – as a programmer, I do not care if there is hardware support other than my craving for performance!
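
To make the local/global partitioning concrete, here is a minimal illustrative sketch in UPC (a PGAS extension of C covered later in this article). The array name G and the values stored are purely hypothetical; the point is that G[MYTHREAD] is local to each thread, while a neighbor’s element is a remote access expressed with ordinary array syntax:

#include <stdio.h>
#include <upc.h>

/* one element per thread; G[i] has affinity to (is local to) thread i */
shared [1] int G[THREADS];

int main(void) {
  G[MYTHREAD] = 100 + MYTHREAD;            /* local write */
  upc_barrier;                             /* make all writes visible */
  /* remote read from the right-hand neighbor, wrapping around */
  int right = G[(MYTHREAD + 1) % THREADS];
  printf("thread %d read %d from thread %d\n",
         MYTHREAD, right, (MYTHREAD + 1) % THREADS);
  return 0;
}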

PGAS rising

PGAS has been discussed for decades. It has been growing steadily more practical, and it is ripe for a fresh look by programmers. I see at least three factors coming together that will lead to more widespread usage in the upcoming years.

Factor 1: Hardware support for more and more cores connected coherently. In the 1990s, hardware support for the distributed shared memory model emerged with research projects including Stanford DASH and MIT Alewife, and commercial products including the SGI Origin, Cray T3D/E and Sequent NUMA-Q. Today’s Intel Xeon Phi processor has many architectural similarities to these early efforts, reworked as a single-chip implementation. The number of threads of execution is nearly identical, and the performance is much higher, owing largely to a couple of decades of technological advances. This trend not only empowers PGAS, it also makes it possible to explore PGAS today at a scale and performance level never before possible.

Factor 2: Low latency interconnects. Many disadvantages of PGAS are being addressed by low latency interconnects, partly driven by exascale development. The Cray Aries interconnect has driven latencies low enough that PGAS is quite popular in some circles, and Cray’s support for UPC, Fortran, Coarray C++, SHMEM and Chapel reflects its continued investment in PGAS. Other interconnects, including the Intel Omni-Path Architecture, stand to extend this trend. A key to lower latency is driving functionality out of the software stack and into the interconnect hardware, where it can be performed more quickly and independently. This is a trend that greatly empowers PGAS.

Factor 3: Software support growing. The old adage “where there’s smoke, there’s fire” might be enough to convince us that PGAS is on the rise, because software support for PGAS is leading the way. When the U.S. government ran a competition for proposals for highly productive “next generation” programming languages for high performance computing (the High Productivity Computing Systems, or HPCS, program), three competitors were awarded contracts to develop their proposals: Fortress, Chapel, and X10. It is interesting to note that all three included some form of PGAS support. Today, we also see considerable interest and activity in SHMEM (notably OpenSHMEM), UPC, UPC++ and Coarray Fortran (the latter a part of the Fortran standard since Fortran 2008). Even MPI 3.0 offers PGAS capabilities. Any software engineer will tell you that “hardware is nothing without software.” It appears that the hardware support for PGAS will not go unsupported, making this a key factor in empowering PGAS.

OpenSHMEM

OpenSHMEM is both an effort to standardize an API and a reference implementation of that API as a library. This means that reads and writes of globally addressable data are performed with function calls rather than the simple assignments we will see in the language-based implementations. Library calls may not be as elegant, but they leave us free to use any compiler we like. OpenSHMEM is a relatively restrictive programming model because of the desire to map its functionality directly to hardware. One important limitation is that all globally addressable data must be symmetric, which means that the same global variables or data buffers are allocated by all threads. Any static data is also guaranteed to be symmetric. This ensures that the layout of remotely accessible memory is the same for all threads, and enables efficient implementations.

#include <stdio.h>
#include <shmem.h>

int main(void) {
  shmem_init();
  if (shmem_n_pes() < 2) shmem_global_exit(1);
  /* allocate from the symmetric (global) heap */
  int *A = shmem_malloc(sizeof(int));
  int B = 134;
  /* store local B at PE 0 into A at processing element (PE) 1 */
  if (shmem_my_pe() == 0) shmem_int_put(A, &B, 1, 1);
  /* global synchronization of execution and data */
  shmem_barrier_all();
  /* observe the result of the store */
  if (shmem_my_pe() == 1) printf("A@1=%d\n", *A);
  /* global synchronization to make sure the print is done */
  shmem_barrier_all();
  shmem_free(A);
  shmem_finalize();
  return 0;
}

A simple OpenSHMEM program, written according to the OpenSHMEM 1.2 specification.
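
The program above allocates A from the symmetric heap, but, as noted, statically allocated data is symmetric too. Here is a minimal sketch of that second option; the variable name C and the neighbor-exchange pattern are just for illustration:

#include <stdio.h>
#include <shmem.h>

int C = 0;  /* global (static) data is symmetric: remotely accessible without shmem_malloc */

int main(void) {
  shmem_init();
  int me = shmem_my_pe();
  int np = shmem_n_pes();
  int val = 100 + me;
  /* each PE stores into C on its right-hand neighbor, wrapping around */
  shmem_int_put(&C, &val, 1, (me + 1) % np);
  shmem_barrier_all();
  printf("PE %d: C=%d\n", me, C);
  shmem_finalize();
  return 0;
}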

UPC 

Unified Parallel C (UPC) is an extension to C99. The key language extension is the shared type qualifier. Data objects that are declared with the shared qualifier are accessible by all threads, even if those threads are running on different hosts. An optional layout qualifier can also be provided as part of the shared array type to indicate how the elements of the array are distributed across threads. Because UPC is a language and has compiler support, the assignment operator (=) can be used to perform remote memory access. Pointers to shared data can also themselves be shared, allowing us to create distributed, shared linked data structures (e.g., lists, trees, or graphs). Because compilers may not always recognize bulk data transfers, UPC provides functions (upc_memput, upc_memget, upc_memcpy) that explicitly copy data into and out of globally addressable memory. UPC can allocate globally addressable data in a non-symmetric and non-collective manner, which increases the flexibility of the model and can help to enable alternatives to the conventional bulk-synchronization style of parallelism.

#include <stdio.h>
#include <upc.h>

int main(void) {
  if (THREADS < 2) upc_global_exit(1);
  /* allocate from the shared heap: one int with affinity to each thread */
  shared int *A = upc_all_alloc(THREADS, sizeof(int));
  int B = 134;
  /* store local B at thread 0 into the element of A with affinity to thread 1 */
  if (MYTHREAD == 0) A[1] = B;
  /* global synchronization of execution and data */
  upc_barrier;
  /* observe the result of the store */
  if (MYTHREAD == 1) printf("A@1=%d\n", A[1]);
  /* make sure the print is done before the collective free */
  upc_barrier;
  upc_all_free(A);
  return 0;
}

A simple UPC program, written according to the UPC 1.3 specification.
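
The example above moves a single int with an assignment. As mentioned earlier, UPC also offers layout qualifiers and explicit bulk-copy functions. Here is a minimal sketch combining the two; the block size N and the array name D are hypothetical choices for illustration:

#include <stdio.h>
#include <upc.h>

#define N 4

/* the [N] layout qualifier places N consecutive elements per thread,
   distributing blocks round-robin across threads */
shared [N] int D[N*THREADS];

int main(void) {
  int local[N];
  for (int i = 0; i < N; i++) local[i] = MYTHREAD * 100 + i;
  /* bulk-copy the whole local buffer into the block owned by the next thread */
  upc_memput(&D[((MYTHREAD + 1) % THREADS) * N], local, N * sizeof(int));
  upc_barrier;
  printf("thread %d: D[%d]=%d\n", MYTHREAD, MYTHREAD * N, D[MYTHREAD * N]);
  return 0;
}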

Fortran Coarrays

The concept of Fortran coarrays, developed as an extension to Fortran 95, was standardized in Fortran 2008. An optional codimension attribute can be added to Fortran arrays, allowing remote access to the array instances across all images (Fortran’s term for the cooperating copies of the program). When using a coarray, an additional codimension is specified using square brackets to indicate the image on which the array locations will be accessed.

program main
  implicit none
  integer, allocatable :: A(:)[:]
  integer :: B
  if (num_images() < 2) error stop 1
  ! allocate from the shared heap
  allocate(A(1)[*])
  B = 134
  ! store local B at image 1 into A at image 2 (images are numbered from 1)
  if (this_image() == 1) A(1)[2] = B
  ! global synchronization of execution and data
  sync all
  ! observe the result of the store
  if (this_image() == 2) print *, 'A@2=', A(1)
  ! make sure the print is done
  sync all
  deallocate(A)
end program main

A simple Fortran coarray program, written according to the Fortran 2008 standard.

MPI‑3 RMA

The MPI community first introduced one-sided communication, also known as Remote Memory Access (RMA), in the MPI 2.0 standard.  MPI RMA defines library functions for exposing memory for remote access through RMA windows. Experiences with the limitations of MPI 2.0 RMA led to the introduction in MPI 3.0 of new atomic operations, synchronization methods, methods for allocating and exposing remotely accessible memory, a new memory model for cache-coherent architectures, plus several other features. The MPI‑3 RMA interface remains large and complex partly because it aims to support a wider range of usages than most PGAS models. MPI RMA may end up most used as an implementation layer for other PGAS models such as Global Arrays, OpenSHMEM, or Fortran coarrays, as there is at least one implementation of each of these using MPI‑3 RMA under the hood.

#include <stdio.h>
#include <mpi.h>

int main(void) {
  MPI_Init(NULL, NULL);
  int me, np;
  MPI_Comm_rank(MPI_COMM_WORLD, &me);
  MPI_Comm_size(MPI_COMM_WORLD, &np);
  if (np < 2) MPI_Abort(MPI_COMM_WORLD, 1);
  /* allocate remotely accessible memory: one int exposed through a window */
  int *Abuf;
  MPI_Win Awin;
  MPI_Win_allocate(sizeof(int), sizeof(int), MPI_INFO_NULL,
                   MPI_COMM_WORLD, &Abuf, &Awin);
  MPI_Win_lock_all(MPI_MODE_NOCHECK, Awin);
  int B = 134;
  /* store local B at rank 0 into the window at rank 1 */
  if (me == 0) {
    MPI_Put(&B, 1, MPI_INT, 1, 0, 1, MPI_INT, Awin);
    MPI_Win_flush_local(1, Awin);
  }
  /* global synchronization of execution and data */
  MPI_Win_flush_all(Awin);
  MPI_Barrier(MPI_COMM_WORLD);
  /* observe the result of the store */
  if (me == 1) printf("A@1=%d\n", *Abuf);
  MPI_Win_unlock_all(Awin);
  MPI_Win_free(&Awin);
  MPI_Finalize();
  return 0;
}

A simple MPI RMA program, written according to the MPI 3.1 specification.
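
Among the MPI 3.0 additions mentioned above are atomic read-modify-write operations. Here is a minimal sketch of a shared “ticket counter” built with MPI_Fetch_and_op; the window layout (one counter at rank 0) and the variable names are illustrative choices, not prescribed by the standard:

#include <stdio.h>
#include <mpi.h>

int main(void) {
  MPI_Init(NULL, NULL);
  int me;
  MPI_Comm_rank(MPI_COMM_WORLD, &me);
  long *cbuf;
  MPI_Win cwin;
  /* expose one long at rank 0 as the shared counter; other ranks expose nothing */
  MPI_Win_allocate(me == 0 ? sizeof(long) : 0, sizeof(long),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &cbuf, &cwin);
  if (me == 0) *cbuf = 0;              /* initialize the counter */
  MPI_Barrier(MPI_COMM_WORLD);         /* everyone waits for the initialization */
  MPI_Win_lock_all(MPI_MODE_NOCHECK, cwin);
  long one = 1, prev;
  /* atomically read the counter at rank 0 and add one to it */
  MPI_Fetch_and_op(&one, &prev, MPI_LONG, 0, 0, MPI_SUM, cwin);
  MPI_Win_flush(0, cwin);              /* complete the atomic at the target */
  printf("rank %d got ticket %ld\n", me, prev);
  MPI_Win_unlock_all(cwin);
  MPI_Win_free(&cwin);
  MPI_Finalize();
  return 0;
}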

PGAS and Intel Xeon Phi processors

With up to 72 cores that share memory, an Intel Xeon Phi processor is a perfect device on which to explore PGAS with performance at a scale not previously so widely available. Since we do care about performance, running PGAS on a shared memory device with so many cores is also a fantastic proxy for future machines that will offer increasing support for PGAS performance across larger and larger systems.


Code examples and figures are adapted from Chapter 16 (PGAS Programming Models) of the book Intel Xeon Phi Processor High Performance Programming – Intel Xeon Phi processor Edition, used with permission. Jeff Hammond and James Dinan were the primary contributors to the book chapter and to the examples used in the chapter and in this article. I owe both of them a great deal of gratitude for all their help.

About the Author

James Reinders likes fast computers and the software tools to make them speedy. Last year, James concluded a 10,001 day career at Intel where he contributed to projects including the world’s first TeraFLOPS supercomputer (ASCI Red), and compiler and architecture work for a number of Intel processors and parallel systems. James is the founding editor of The Parallel Universe magazine and has been the driving force behind books on VTune (2005), TBB (2007), Structured Parallel Programming (2012), Intel Xeon Phi coprocessor programming (2013), Multithreading for Visual Effects (2014), High Performance Parallelism Pearls Volume One (2014) and Volume Two (2015), and Intel Xeon Phi processor (2016). James resides in Oregon, where he enjoys gardening as well as HPC and HPDA consulting.
