PGAS replacing MPI in supercomputer lang's

7.26: news.adda/concurrency/PGAS (partitioned global address space):

. PGAS is a parallel programming model
(partitioned global address space) [7.29:
where each processor
has its own local memory
and the sharable portion of local space
can be reached from other processors
by pointer rather than by
the slower MPI (message passing interface) .
. since a shared space has an
"(affinity for a particular processor),
things can be arranged so that
local shares can be
accessed quicker than remote shares,
thereby "(exploiting
locality of reference).]

The PGAS model is the basis of
Unified Parallel C,
and all the lang's funded by DARPA`HPCS
(High Productivity Computing Systems)
{ Sun`Fortress, Cray`Chapel, IBM`X10 }.
[@] adda/concurrency/NUCC/lang's for supercomputer concurrency
The PGAS model -- also known as
the distributed shared address space model -- [7.29:
provides more of both performance
and ease-of-programming
than MPI (Message Passing Interface)
which uses function calls
to communicate across clustered processors .]

As in the shared-memory model,
one thread may directly read and write
memory allocated to another.
At the same time, [7.29:
the concept of local yet sharable]
is essential for performance .
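
. a minimal sketch of the idea in plain C (not UPC), assuming 4 threads with 8 sharable words each; the names gref, g_read, g_write and the two counters are hypothetical, and the counters only stand in for the real cost gap between local and remote access:

```c
#define NTHREADS 4   /* assumed number of threads */
#define SLICE    8   /* assumed sharable words per thread */

/* One sharable slice per thread: slice[t] has affinity to thread t. */
static int slice[NTHREADS][SLICE];

/* A global reference names the owning thread and an offset in its slice. */
typedef struct { int owner; int offset; } gref;

/* Counters standing in for the cost difference between the two paths. */
static long local_hits, remote_hits;

/* Read through a global reference on behalf of thread `me`. */
int g_read(int me, gref r)
{
    if (r.owner == me) local_hits++;    /* own slice: the cheap path */
    else               remote_hits++;   /* another slice: the costly path */
    return slice[r.owner][r.offset];
}

/* Write through a global reference on behalf of thread `me`. */
void g_write(int me, gref r, int value)
{
    if (r.owner == me) local_hits++;
    else               remote_hits++;
    slice[r.owner][r.offset] = value;
}
```

. any thread can follow any gref, as in shared memory; keeping remote_hits low is what "local yet sharable" buys .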

7.26: news.adda/lang"upc (Unified Parallel C):
The UPC language evolved from experiences with
three other earlier languages
that proposed parallel extensions to ISO C 99:
AC, Split-C, and Parallel C Preprocessor (PCP). [7.29:

AC (Distributed Data Access):
. AC modifies C to support
a shared address space
with physically distributed memory.
. the nodes of a massively parallel processor
can access remote memory
without message passing.
AC provides support for distributed arrays
as well as pointers to distributed data.
Simple array references
and pointer dereferencing
are sufficient to generate
low-overhead remote reads and writes .
Split-C:
. supports efficient access to a
global address space
on distributed memory multiprocessors.
It retains the "small language" character of C
and supports careful engineering
and optimization of programs
by providing a simple, predictable
cost model -- in stark contrast to
languages that rely on extensive
compile-time transformations
to obtain performance on parallel machines.
Split-C programs do what the
programmer specifies;
the compiler takes care of
addressing and communication,
as well as code generation.
Thus, the ability to exploit
parallelism or locality
is not limited by the compiler's
recognition capability,
nor is there need to second guess
the compiler transformations
while optimizing the program.
The language provides a small set of
global access primitives
and parallel storage layout declarations.
These seem to capture
most of the useful elements
of shared memory, message passing,
and data parallel programming
in a common, familiar context.
Parallel C Preprocessor (pcp):
. a parallel extension of C for multiprocessors
(eg, the BBN TC2000, a scalable massively parallel machine)
for sharing memory between processors.
The programming model is split-join
rather than fork-join.
Concurrency is exploited to use a
fixed number of processors more efficiently
rather than to exploit more processors
as in the fork-join model.
Team splitting, a mechanism to
split the processors into subteams
to handle parallel subtasks,
provides an efficient mechanism
for exploiting nested concurrency.
We have found the split-join model
to have an inherent
implementation advantage,
compared to the fork-join model,
when the number of processors becomes large .]
GCC Unified Parallel C (GCC UPC):
. UPC 1.2 specification compliant .
. based on GNU GCC 4.3.2 .
. fast bit-packed shared pointer support .
. configurable shared pointer representation .
. Pthreads support .
. GASP support, a performance tool interface
for Global Address Space Languages .
. run-time support for UPC collectives .
. support for uniprocessor
and symmetric multiprocessor systems .
. support for UPC thread affinity
via Linux scheduling affinity and NUMA package .
. compatible with Berkeley UPC run-time version 2.8 and up .
. support for many large scale machines and clusters
in conjunction with Berkeley UPC run-time .
. binary packages for x86_64, ia64, x86, mips .
. binary packages for Linux Fedora, SuSe, CentOS, Mac OS X, IRIX .
. for uniprocessor and symmetric multiprocessor systems:
Intel x86_64 Linux (Fedora Core 11)
Intel ia64 (Itanium) Linux (SuSe SEL 11)
Intel x86 Linux (CentOS 5.3)
Intel x86 Apple Mac OS X (Leopard 10.5.7+ and Snow Leopard 10.6)
. Programming in the pgas Model at SC2003 (pdf) .

Programming With the Distributed Shared-Memory Model at SC2001 (pdf):
. Recent developments have resulted in
viable distributed shared-memory languages
for a balance between ease-of-programming
and performance.
As in the shared-memory model,
programmers need not explicitly specify
data accesses.
Meanwhile, programmers can exploit data locality
using a model that enables the placement of data
close to the threads that process them,
to reduce remote memory accesses.
. fundamental concepts associated with
this programming model include
execution models, synchronization,
workload distribution,
and memory consistency.
We then introduce the syntax and semantics
of three parallel programming language
instances with growing interest:
Cray's CAF (Co-Array FORTRAN),
Berkeley's Titanium JAVA
and (IDA, LLNL, UCB) 1999` UPC (Unified Parallel C)
-- upc`history (pdf):
. IDA Center for Computing Sciences,
University of California at Berkeley,
Lawrence Livermore National Lab,
... and the consortium refined the design:
Academia: GWU, MTU, UCB
Vendors: Compaq, CSC, Cray, Etnus, HP, IBM, Intrepid, SGI, Sun .
It will be shown through
experimental case studies
that optimized distributed shared memory
can be competitive with
message passing codes,
without significant departure from the
ease of programming
provided by the shared memory model .
. the OpenMP model is
all the threads on shared mem';
it doesn't allow locality exploitation;
modifying shared data
may require synchronization
(locks, semaphores) .
. in contrast,
the upc model is
distributed shared mem';
it differs from threads sharing mem' in that
each thread has its own partition slice;
and within that slice, there's a
{shared, private} division:
slice#i has affinity to thread#i
-- that means, a thread tries to
keep most of its own obj's
within its own slice;
but, its share.pointers can target
any other thread's sharable slice .
"(exploit locality of references) .
. message-passing as a
sharing mechanism
isn't a good fit for many app's in
math, science, and data mining .
. a dimension of {value, pointer} types is
{ shared -- can point in shared mem' .
, private -- points only to thread's private mem' .
} -- both can access either {dynamic, static } mem .
. all scalar (non-array) shared objects
have affinity with thread#0 .
. pointers have a self and a target,
both of which can
optionally be in shared mem';
but it's unwise
to have a shared pointer
accessing a private value;
upc disallows casting of
private pointer to shared .
. casting of shared to private
is defined only if the shared pointer
has affinity with
the thread performing the cast,
since the cast doesn't preserve
the pointer`thread.identifier .
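
. a minimal sketch of that casting rule in plain C (not UPC); the shared_ptr struct and to_private() are hypothetical stand-ins, and returning NULL is this sketch's choice, since upc itself leaves the mismatched case undefined:

```c
#include <stddef.h>

/* Hypothetical model of a pointer-to-shared:
   the owning thread plus an address inside that thread's slice. */
typedef struct {
    int   thread;   /* thread the target has affinity to */
    void *addr;     /* meaningful only within that thread's slice */
} shared_ptr;

/* Model of casting pointer-to-shared to a private pointer.
   The private pointer keeps only the address, so the cast is
   honoured only when the target belongs to the casting thread. */
void *to_private(shared_ptr p, int mythread)
{
    if (p.thread != mythread)
        return NULL;    /* thread identity would be lost: refuse */
    return p.addr;      /* same thread: the plain address suffices */
}
```

. the reverse direction has no such sketch: a bare address carries no thread identity to build a shared pointer from .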

attributes of shared pointers:
. size_t upc_threadof(shared void *ptr);
-- thread that the pointer's target has affinity to
size_t upc_phaseof(shared void *ptr);
-- pointer's phase (pos' within the block)
size_t upc_addrfield(shared void *ptr);
-- implementation-defined value reflecting
the local address of the target
other upc-specific attributes:
upc_localsizeof(type-name or expression);
-- size of the local portion of a shared object.
upc_blocksizeof(type-name or expression);
-- the blocking factor associated with the argument.
upc_elemsizeof(type-name or expression);
-- size (in bytes) of the left-most type
that is not an array.
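
. one way to picture the three pointer attributes is a bit-packed word; the sketch below is plain C with an assumed layout (10-bit thread, 6-bit phase, 48-bit address field) and hypothetical sptr_* names, not the layout of any real upc compiler:

```c
#include <stdint.h>

/* Assumed layout, for illustration only:
   bits 63..54 thread, bits 53..48 phase, bits 47..0 address field. */
#define ADDR_BITS   48
#define PHASE_BITS  6
#define THREAD_BITS 10
#define MASK(bits)  ((UINT64_C(1) << (bits)) - 1)

typedef uint64_t packed_sptr;

/* Pack the three fields into one 64-bit word, truncating each to its width. */
packed_sptr sptr_pack(uint64_t thread, uint64_t phase, uint64_t addr)
{
    return ((thread & MASK(THREAD_BITS)) << (ADDR_BITS + PHASE_BITS))
         | ((phase  & MASK(PHASE_BITS))  << ADDR_BITS)
         |  (addr   & MASK(ADDR_BITS));
}

/* Accessors mirroring the roles of the three attribute queries. */
uint64_t sptr_threadof(packed_sptr p)  { return p >> (ADDR_BITS + PHASE_BITS); }
uint64_t sptr_phaseof(packed_sptr p)   { return (p >> ADDR_BITS) & MASK(PHASE_BITS); }
uint64_t sptr_addrfield(packed_sptr p) { return p & MASK(ADDR_BITS); }
```

. packing keeps a shared pointer in one machine word, at the price of capping the thread count and block size by the field widths .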
Berkeley UPC intro:
array` affinity granularities:
# cyclic (per element)
- successive elements of the array
have affinity with successive threads.
# blocked-cyclic (user-defined)
- the array is divided into
user-defined size blocks
and the blocks are cyclically distributed
among threads.
# blocked (run-time)
- each thread has affinity to a
tile of the array.
The size of the contiguous part
is determined in such a way
that the array is
"evenly" distributed among threads.

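. the three granularities reduce to one formula with different block sizes; a minimal sketch in plain C (not UPC), where owner_of, phase_of and even_block are hypothetical helper names:

```c
#include <stddef.h>

/* Thread that element i has affinity to, for block size B (B >= 1).
   B == 1 gives the cyclic layout; larger B gives blocked-cyclic. */
size_t owner_of(size_t i, size_t B, size_t nthreads)
{
    return (i / B) % nthreads;
}

/* Phase: position of element i inside its block. */
size_t phase_of(size_t i, size_t B)
{
    return i % B;
}

/* Block size for the "evenly" blocked layout: ceil(n / nthreads),
   so that each thread holds at most one contiguous tile. */
size_t even_block(size_t n, size_t nthreads)
{
    return (n + nthreads - 1) / nthreads;
}
```

. eg, with 4 threads and B = 3, element 7 sits in block 2 and so belongs to thread 2, at phase 1 .
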
To define the interaction between
memory accesses to shared data,
UPC provides two user-controlled
consistency models { strict, relaxed }:
# "strict" model:
the program executes in a
Lamport`sequential consistency model .
This means that
it appears to all threads that the
strict references within the same thread
appear in the program order,
relative to all other accesses.
# "relaxed" model:
it appears to the issuing thread
that all shared references within the thread
appear in the program order.
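
. C11 atomics give a loose analogy, not an exact match: memory_order_seq_cst plays the part of a strict reference and memory_order_relaxed that of a relaxed one; the sketch below is plain C, and publish / try_consume are hypothetical names:

```c
#include <stdatomic.h>

static atomic_int data;    /* payload, accessed in the "relaxed" style */
static atomic_int ready;   /* flag, accessed in the "strict" style */

/* Producer: the payload store may be relaxed because the
   sequentially consistent flag store after it orders the two. */
void publish(int value)
{
    atomic_store_explicit(&data, value, memory_order_relaxed);
    atomic_store_explicit(&ready, 1, memory_order_seq_cst);
}

/* Consumer: returns 1 and fills *out once the flag has been seen,
   otherwise returns 0 and leaves *out untouched. */
int try_consume(int *out)
{
    if (!atomic_load_explicit(&ready, memory_order_seq_cst))
        return 0;
    *out = atomic_load_explicit(&data, memory_order_relaxed);
    return 1;
}
```

. the usual pattern is the same in both worlds: bulk data moves with the cheap ordering, and only the few synchronizing accesses pay for the strong one .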

The UPC execution model
is similar to the SPMD
(Single Program Multiple Data) model used by the
message passing style (MPI or PVM)
-- an explicitly parallel model.
In UPC terms, the execution vehicle
for a program is called a thread.
The language defines a private variable
- MYTHREAD - to distinguish between
the threads of a UPC program.
The language does not define any
correspondence between
a UPC thread and its OS-level counterpart,
nor does it define any
mapping to physical CPU's.
Because of this,
UPC threads can be implemented
either as full-fledged OS processes
or as threads (user or kernel level).
On a parallel system,
the UPC program running with shared data
will contain at least
one UPC thread per physical processor .
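
. a minimal sketch of this execution style in plain C (not UPC): every thread runs the same function and uses its own id to pick its share of the work; the parameters mythread and threads stand in for MYTHREAD and the thread count, and the sequential loop in spmd_sum is only a stand-in for launching real threads:

```c
#include <stddef.h>

/* Body that every thread runs. Iterations are dealt out cyclically:
   thread t handles elements t, t + threads, t + 2*threads, ... */
long partial_sum(const int *a, size_t n, size_t mythread, size_t threads)
{
    long sum = 0;
    for (size_t i = mythread; i < n; i += threads)
        sum += a[i];
    return sum;
}

/* Sequential stand-in for running all threads and reducing their results. */
long spmd_sum(const int *a, size_t n, size_t threads)
{
    long total = 0;
    for (size_t t = 0; t < threads; t++)
        total += partial_sum(a, n, t, threads);
    return total;
}
```

. since nothing in partial_sum depends on how a thread is implemented, the same body fits OS processes as well as user or kernel threads .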