PGAS Use Will Rise on New H/W Trends, Says Reinders

By James Reinders

May 25, 2017

In this contributed feature, James Reinders explores how modern hardware designs are unlocking potential for PGAS programming models.

If you have not already tried using PGAS, it is time to consider adding PGAS to the programming techniques you know. Partitioned Global Address Space, commonly known as PGAS, has been around for decades in academic circles but has seen extremely limited use in production applications. PGAS methods include UPC, UPC++, Coarray Fortran, OpenSHMEM and the latest MPI standard.

Developments in hardware design are giving a boost to PGAS performance that will lead to more widespread usage in the next few years. How much more, of course, remains to be seen. In this article, I’ll explain why interest in and support for PGAS are growing, show some sample code to illustrate PGAS approaches, and explain why Intel Xeon Phi processors offer an easy way to explore PGAS with performance at a scale not previously available.

PGAS defined

PGAS programming models offer a partitioned global shared memory capability, via a programming language or API, whether special support exists in hardware or not. Four keys in this definition:

  • Global address space – any thread can read/write remote data
  • Partitioned – data is designated as local or global; this is NOT hidden from us, which is critical because it lets us write our code for locality and thereby enables scaling
  • via a programming language or API – PGAS does not fake that all memory is shared via techniques such as copies on page faults, etc. Instead, PGAS always has an interface that a programmer uses to access this “shared memory” capability. A compiler (with a language interface) or a library (with an API) does whatever magic is needed.
  • whether special support exists in hardware or not – as a programmer, I do not care if there is hardware support other than my craving for performance!

PGAS rising

Discussion of PGAS has been around for decades. It has been steadily growing in practicality for more and more of us, and it is ripe for a fresh look by all of us programmers. I see at least three factors coming together that will lead to more widespread usage in the coming years.

Factor 1: Hardware support for more and more cores connected coherently. In the 1990s, hardware support for the distributed shared memory model emerged with research projects including Stanford DASH and MIT Alewife, and commercial products including the SGI Origin, Cray T3D/E and Sequent NUMA-Q. Today’s Intel Xeon Phi processor, designed as a single-chip implementation, has many architectural similarities to these early efforts. The number of threads of execution is nearly identical, and the performance is much higher owing largely to a couple of decades of technological advances. This trend not only empowers PGAS, it also enables exploring PGAS today at a scale and performance level never before possible.

Factor 2: Low latency interconnects. Many disadvantages of PGAS are being addressed by low latency interconnects, partly driven by exascale development. The Cray Aries interconnect has driven latencies low enough that PGAS is quite popular in some circles, and Cray’s work on UPC, Fortran, Coarray C++, SHMEM and Chapel reflects its continued investment in PGAS. Other interconnects, including Intel Omni-Path Architecture, stand to extend this trend. A key to lower latency is driving functionality out of the software stack and into the interconnect hardware, where it can be performed more quickly and independently. This is a trend that greatly empowers PGAS.

Factor 3: Software support growing. The old adage “where there’s smoke there’s fire” might be enough to convince us PGAS is on the rise, because software support for PGAS is leading the way. When the U.S. government ran a competition for proposals for highly productive “next generation” programming languages for high performance computing (the High Productivity Computing Systems, HPCS, program), three competitors were awarded contracts to develop their proposals: Fortress, Chapel, and X10. It is interesting to note that all three included some form of PGAS support. Today, we also see considerable interest and activity in SHMEM (notably OpenSHMEM), UPC, UPC++ and Coarray Fortran (the latter being part of the Fortran standard since Fortran 2008). Even MPI 3.0 offers PGAS capabilities. Any software engineer will tell you that “hardware is nothing without software.” It appears that the hardware support for PGAS will not go unsupported, making this a key factor in empowering PGAS.

OpenSHMEM

OpenSHMEM is both an effort to standardize an API and a reference implementation as a library. This means that reads and writes of globally addressable data are performed with functions rather than the simple assignments we will see in the language-based models. Library calls may not be as elegant, but they leave us free to use any compiler we like. OpenSHMEM is a relatively restrictive programming model because of the desire to map its functionality directly to hardware. One important limitation is that all globally addressable data must be symmetric, which means that the same global variables or data buffers are allocated by all threads. Any static data is also guaranteed to be symmetric. This ensures that the layout of remotely accessible memory is the same for all threads, and enables efficient implementations.

#include <shmem.h>

int main(void) {
  shmem_init();
  if (shmem_n_pes() < 2) shmem_global_exit(1);
  /* allocate from the symmetric (global) heap */
  int *A = shmem_malloc(sizeof(int));
  int B = 134;
  /* store local B at PE 0 into A at processing element (PE) 1 */
  if (shmem_my_pe() == 0) shmem_int_put(A, &B, 1, 1);
  /* global synchronization of execution and data */
  shmem_barrier_all();
  /* observe the result of the store */
  if (shmem_my_pe() == 1) printf("A@1=%d\n", *A);
  /* global synchronization to make sure the print is done */
  shmem_barrier_all();
  shmem_free(A);
  shmem_finalize();
  return 0;
}

A simple OpenSHMEM program, written according to the OpenSHMEM 1.2 specification.  C standard library headers are omitted.
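
The example above allocates its remotely accessible data from the symmetric heap. As noted, statically allocated data is symmetric as well, so it can be the target of remote operations directly. Here is a minimal sketch of that alternative using an OpenSHMEM 1.2 atomic increment; the counter variable and the use of shmem_int_inc are my illustrative choices, not part of the original example.

#include <shmem.h>
#include <stdio.h>

/* statically allocated data is symmetric: every PE has its own copy
   of counter at the same remotely addressable location */
static int counter = 0;

int main(void) {
  shmem_init();
  /* every PE atomically increments the copy of counter on PE 0 */
  shmem_int_inc(&counter, 0);
  /* global synchronization of execution and data */
  shmem_barrier_all();
  if (shmem_my_pe() == 0)
    printf("counter@0=%d\n", counter); /* expect shmem_n_pes() */
  shmem_finalize();
  return 0;
}

Because the layout of symmetric data is identical on every PE, the library can translate &counter into a remote address without any lookup, which is exactly the kind of operation a low latency interconnect can perform in hardware.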

UPC 

Unified Parallel C (UPC) is an extension to C99. The key language extension is the shared type qualifier. Data objects that are declared with the shared qualifier are accessible by all threads, even if those threads are running on different hosts. An optional layout qualifier can also be provided as part of the shared array type to indicate how the elements of the array are distributed across threads. Because UPC is a language and has compiler support, the assignment operator (=) can be used to perform remote memory access. Pointers to shared data can also themselves be shared, allowing us to create distributed, shared linked data structures (e.g., lists, trees, or graphs). Because compilers may not always recognize bulk data transfers, UPC provides functions (upc_memput, upc_memget, upc_memcpy) that explicitly copy data into and out of globally addressable memory. UPC can allocate globally addressable data in a non-symmetric and non-collective manner, which increases the flexibility of the model and can help to enable alternatives to the conventional bulk-synchronization style of parallelism.

#include <upc.h>

int main(void) {
  if (THREADS < 2) upc_global_exit(1);
  /* allocate from the shared heap */
  shared int *A = upc_all_alloc(THREADS, sizeof(int));
  int B = 134;
  /* store local B at thread 0 into A at thread 1 */
  if (MYTHREAD == 0) A[1] = B;
  /* global synchronization of execution and data */
  upc_barrier;
  /* observe the result of the store */
  if (MYTHREAD == 1) printf("A@1=%d\n", A[1]);
  /* make sure the print completes before the collective free */
  upc_barrier;
  upc_all_free(A);
  return 0;
}

A simple UPC program, written according to the version 1.3 specification.  C standard library headers are omitted.
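
The example above uses the default layout and a single-element assignment. Here is a hedged sketch of the layout qualifier and explicit bulk transfer described above; the block size N, the array A, and the fill pattern are my illustrative choices.

#include <upc.h>

#define N 4

/* the [N] layout qualifier distributes A in blocks of N elements:
   A[0..N-1] has affinity to thread 0, A[N..2N-1] to thread 1, ... */
shared [N] int A[N*THREADS];

int main(void) {
  int local[N];
  for (int i = 0; i < N; i++) local[i] = MYTHREAD;
  /* explicit bulk copy of N ints into this thread's own block of A;
     upc_memput is to shared memory what memcpy is to local memory */
  upc_memput(&A[MYTHREAD*N], local, N * sizeof(int));
  upc_barrier;
  return 0;
}

The explicit upc_memput guarantees a single bulk transfer even when the compiler cannot prove that one is safe, which matters for performance on a real interconnect.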

Fortran Coarrays

The concept of Fortran coarrays, developed as an extension to Fortran 95, was standardized in Fortran 2008. An optional codimension attribute can be added to Fortran arrays, allowing remote access to instances of the array across all images (Fortran’s term for the cooperating threads of execution, numbered starting from 1). When using a coarray, an additional codimension is specified using square brackets to indicate the image on which the array locations will be accessed.

program main
  implicit none
  integer, allocatable :: A(:)[:]
  integer :: B
  if (num_images() < 2) error stop 1
  ! allocate from the shared heap
  allocate(A(1)[*])
  B = 134
  ! store local B at image 1 into A at image 2 (images number from 1)
  if (this_image() == 1) A(1)[2] = B
  ! global synchronization of execution and data
  sync all
  ! observe the result of the store
  if (this_image() == 2) print *, 'A@2=', A(1)[2]
  ! make sure the print is done
  sync all
  deallocate(A)
end program main

A simple Fortran coarray program, written according to the Fortran 2008 specification. Because images number from 1, the transfer here runs from image 1 to image 2.

MPI‑3 RMA

The MPI community first introduced one-sided communication, also known as Remote Memory Access (RMA), in the MPI 2.0 standard.  MPI RMA defines library functions for exposing memory for remote access through RMA windows. Experiences with the limitations of MPI 2.0 RMA led to the introduction in MPI 3.0 of new atomic operations, synchronization methods, methods for allocating and exposing remotely accessible memory, a new memory model for cache-coherent architectures, plus several other features. The MPI‑3 RMA interface remains large and complex partly because it aims to support a wider range of usages than most PGAS models. MPI RMA may end up most used as an implementation layer for other PGAS models such as Global Arrays, OpenSHMEM, or Fortran coarrays, as there is at least one implementation of each of these using MPI‑3 RMA under the hood.

#include <mpi.h>

int main(void) {
  MPI_Init(NULL, NULL);
  int me, np;
  MPI_Comm_rank(MPI_COMM_WORLD, &me);
  MPI_Comm_size(MPI_COMM_WORLD, &np);
  if (np < 2) MPI_Abort(MPI_COMM_WORLD, 1);
  /* allocate from the shared heap */
  int *Abuf;
  MPI_Win Awin;
  MPI_Win_allocate(sizeof(int), sizeof(int), MPI_INFO_NULL,
                   MPI_COMM_WORLD, &Abuf, &Awin);
  MPI_Win_lock_all(MPI_MODE_NOCHECK, Awin);
  int B = 134;
  /* store local B at processing element (PE) 0 into A at PE 1 */
  if (me == 0) {
    MPI_Put(&B, 1, MPI_INT, 1, 0, 1, MPI_INT, Awin);
    MPI_Win_flush_local(1, Awin);
  }
  /* global synchronization of execution and data */
  MPI_Win_flush_all(Awin);
  MPI_Barrier(MPI_COMM_WORLD);
  /* observe the result of the store */
  if (me == 1) printf("A@1=%d\n", *Abuf);
  MPI_Win_unlock_all(Awin);
  MPI_Win_free(&Awin);
  MPI_Finalize();
  return 0;
}

A simple MPI RMA program, written according to the version 3.1 specification.  C standard library headers are omitted.
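
Among the MPI 3.0 additions mentioned above are atomic operations. Here is a minimal, hedged sketch of a remote atomic counter built with MPI_Fetch_and_op, assuming the unified (cache-coherent) memory model; the counter window and all names are my illustrative choices.

#include <mpi.h>
#include <stdio.h>

int main(void) {
  MPI_Init(NULL, NULL);
  int me, np;
  MPI_Comm_rank(MPI_COMM_WORLD, &me);
  MPI_Comm_size(MPI_COMM_WORLD, &np);

  /* one int of remotely accessible memory per rank, used as a counter */
  int *cbuf;
  MPI_Win cwin;
  MPI_Win_allocate(sizeof(int), sizeof(int), MPI_INFO_NULL,
                   MPI_COMM_WORLD, &cbuf, &cwin);
  *cbuf = 0;
  MPI_Win_lock_all(MPI_MODE_NOCHECK, cwin);
  MPI_Barrier(MPI_COMM_WORLD); /* ensure all counters are initialized */

  /* every rank atomically adds 1 to the counter on rank 0 */
  int one = 1, old;
  MPI_Fetch_and_op(&one, &old, MPI_INT, 0, 0, MPI_SUM, cwin);
  MPI_Win_flush(0, cwin);
  MPI_Barrier(MPI_COMM_WORLD);

  /* atomically read the final value; MPI_NO_OP leaves it unchanged */
  int final;
  MPI_Fetch_and_op(&one, &final, MPI_INT, 0, 0, MPI_NO_OP, cwin);
  MPI_Win_flush(0, cwin);
  if (me == 0) printf("counter@0=%d\n", final); /* expect np */

  MPI_Win_unlock_all(cwin);
  MPI_Win_free(&cwin);
  MPI_Finalize();
  return 0;
}

Reading the result with MPI_NO_OP rather than a plain load keeps every access to the counter atomic, which sidesteps the memory-model subtleties of mixing local loads with remote atomics.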

PGAS and Intel Xeon Phi processors

With up to 72 cores that share memory, an Intel Xeon Phi processor is a perfect device to explore PGAS with performance at a scale not previously so widely available. Since we do care about performance, running PGAS on a shared memory device with so many cores is a fantastic proxy for future machines that will offer increased support for performance using PGAS across larger and larger systems.


Code examples and figures are adapted from Chapter 16 (PGAS Programming Models) of the book Intel Xeon Phi Processor High Performance Programming – Intel Xeon Phi processor Edition, used with permission. Jeff Hammond and James Dinan were the primary contributors to the book chapter and the examples used in the chapter and in this article. I owe both of them a great deal of gratitude for all their help.

About the Author

James Reinders likes fast computers and the software tools to make them speedy. Last year, James concluded a 10,001 day career at Intel where he contributed to projects including the world’s first TeraFLOPS supercomputer (ASCI Red), and compiler and architecture work for a number of Intel processors and parallel systems. James is the founding editor of The Parallel Universe magazine and has been the driving force behind books on VTune (2005), TBB (2007), Structured Parallel Programming (2012), Intel Xeon Phi coprocessor programming (2013), Multithreading for Visual Effects (2014), High Performance Parallelism Pearls Volume One (2014) and Volume Two (2015), and Intel Xeon Phi processor (2016). James resides in Oregon, where he enjoys gardening as well as consulting on HPC and HPDA.
