
2010-07-29

supercomputer power promised in the CHaPeL (Cascade Hi'Productivity Lang)

7.26: news.adda/lang"Chapel(Cascade Hi'Productivity Lang'):

2005.9 PGAS Programming Models Conference/Chapel (pdf):
. the final session of the
Parallel Global Address Space(PGAS)
Programming Models Conference was
devoted to DARPA's HPCS program
(High Productivity Computing Systems):
Cascade[chapel]
, X10[verbose java]
, Fortress[greek]
, StarP[slow matlab].
Locality Control Through Domains:
[7.29:
. domains are array subscript objects,
specifying the size and shape of arrays;
they represent a set of subscripts,
and so, applying a domain to an array
selects a set of array elements .
(recall that term"domain is part of
function: domain-->codomain terminology)
. domains make it easy to work with
sparse arrays, hash tables, graphs,
and interior slices of arrays .]
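. e.g., a rough sketch of these ideas in Chapel
(recent syntax; the names and sizes are mine, not from the talk):

    const D = {1..8, 1..8};          // a 2-d domain: the index set {1..8} x {1..8}
    var A: [D] real;                 // an array declared over that domain
    const Inner = {2..7, 2..7};      // a sub-domain: the interior indices
    A[Inner] = 1.0;                  // applying a domain selects those elements
    var SpsD: sparse subdomain(D);   // a sparse subdomain of D, initially empty
    SpsD += (3, 4);                  // add one index to it
    var S: [SpsD] real;              // a sparse array over the sparse domain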
. domains can be distributed across locales,
which generally correspond to CPUs in Chapel.
This gives Chapel its fundamentally
global character, as contrasted to
the process-centric nature of MPI or CAF,
for example.
When operations are performed on
an array whose domain is distributed,
any needed IPC is implicitly carried out,
-- (inter-processor communication) --
without the need for function calls.
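. e.g., a minimal sketch of a distributed domain in Chapel
(the dmapped/Block syntax varies somewhat across Chapel versions;
n and the array names are mine):

    use BlockDist;                    // the standard Block distribution
    config const n = 8;
    const D = {1..n, 1..n} dmapped Block(boundingBox={1..n, 1..n});
    var A, B: [D] real;               // elements are spread across the locales
    B = 1.0;
    A = B + 1.0;                      // whole-array operation; any needed
                                      // inter-locale communication is implicit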
Chapel provides much more generality
in the distribution of domains
than High Performance Fortran (HPF) or UPC;
but it will not take the place of
the complex domain decomposition tools
required to distribute data for
optimum load balance and communication
in most practical parallel programs.
[7.29:
. HPC systems today are overwhelmingly
distributed memory systems,
and the applications tend to require
highly irregular communication
between the CPUs.
This means that good performance
depends on effective locality management,
which minimizes the IPC costs .
whereas,
productive levels of abstraction
will degrade performance.
Chapel simply allows mixing
a variety of abstraction levels .]

Cray's Chapel intro (pdf):
Chapel (Cascade High Productivity Language)
is Cray's programming language for
supercomputers, like the Cascade system;
part of the Cray Cascade project,
a participant in DARPA's HPCS program
(High Productivity Computing Systems ) .
influences:
. iterators:
CLU, Ruby, Python
. latent types:
ML, Scala, Matlab, Perl, Python, C#
. OOP, type safety:
Java, C#:
. generic programming/templates:
C++
. data parallelism, index sets, distributed arrays:
ZPL (Z-level Programming Language)
HPF (High-Performance Fortran),
. task parallelism, synchronization:
Cray's MTA (Multi-Threaded Architecture)
extensions to Fortran and C .

Global-view vs Fragmented models:
. in Fragmented programming models,
the Programmer's point-of-view
is a single processor/thread;
. the PGAS lang" UPC (unified parallel C),
has a fragmented compute model
and a Global-View data model .
. the shared mem' systems" OpenMP & pThreads
have a trivially Global View of everything .
. the HPCS lang's, including Chapel,
have a Global View of everything .

too-low- & too-high-level abstractions
vs multiple levels of design:

. openMP, pthreads, MPI are low-level & difficult;
HPF, ZPL are high-level & inefficient .
Chapel has a mix of abstractions for:
# task scheduling levels:
. work stealing; suspendable tasks;
task pool; thread per task .
# lang' concept levels:
. data parallelism; distributions;
task parallelism;
base lang; locality control .
# mem'mgt levels:
. gc; region-based;
manual(malloc,free) .
. chapel`downloads for linux and mac .
readme for Chapel 1.1
The highlights of this release include:
parallel execution of all
data parallel operations
on arithmetic domains and arrays;
improved control over the
degree and granularity of parallelism
for data parallel language constructs;
feature-complete
Block and Cyclic distributions;
simplified constructor calls for
Block and Cyclic distributions;
support for assignments between,
and removal of indices from,
sparse domains;
more robust performance optimizations on
aligned arithmetic domains and arrays;
many programmability
and correctness improvements;
new example programs demonstrating
task parallel concepts and distributions;
wide-ranging improvements
to the content and organization of the
language specification.
This release of Chapel contains
stable support for the base language,
and for task and regular data parallelism
using one or multiple nodes.
Data parallel features
on irregular domains and arrays
are supported via a single-threaded,
single-node reference implementation.
impl' status:
No support for inheritance from
multiple or generic classes
Incomplete support for user-defined constructors
Incomplete support for sparse arrays and domains
Unchecked support for index types and sub-domains
No support for skyline arrays
No constant checking for domains, arrays, fields
Several internal memory leaks
Task Parallelism
No support for atomic statements
Memory consistency model is not guaranteed
Locality and Affinity
String assignment across locales is by reference
Data Parallelism
Promoted functions/operators do not preserve shape
User-defined reductions are undocumented and in flux
No partial scans or reductions
Some data parallel statements are serialized
Distributions and Layouts
Distributions are limited to Block and Cyclic
User-defined domain maps are undocumented and in flux
7.27 ... 28:
IJHPCA` High Productivity Languages and Models
(Internat' J. HPC App's, 2007, 21(3))


Diaconescu, Zima 2007`An Approach to Data Distributions in Chapel (pdf):
--. same paper is ref'd here:
Chapel Publications from Collaborators:
. they note:
"( This paper presents early exploratory work
in developing a philosophy and foundation for
Chapel's user-defined distributions ).
User-defined distributions
are first-class objects:
placed in a library,
passed to functions,
and reused in array declarations.
In the simplest case,
the specification of a
new distribution
can consist of just a few lines
of code to define mappings between
the global indices of a data structure
and memory;
in contrast, a sophisticated user
(or distribution writer)
can control the internal
representation and layout of data
to an almost arbitrary degree,
allowing even the expression of
auxiliary structures
typically used for distributed
sparse matrix data.
Specifically,
our distribution framework is designed to
support:
• The mapping of arbitrary data collections
to units of locality,
• the specification of user-defined mappings
exploiting knowledge of data structures
and their access patterns,
• the capability to control the
layout (allocation) of data
within units of locality,
• orthogonality between
distributions and algorithms,
• the uniform expression of computation
for both dense and sparse data structures,
• reusability and extensibility
of the data mapping machinery itself,
as well as of the common data mapping patterns
occurring in various application domains.
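. a hypothetical sketch of the "few lines of code" case,
in Chapel-like code (the class and method names are mine,
not the paper's actual interface) -- a distribution as an object
that maps a global index to a locale and a position within it:

    class CyclicMap {
      const nLocales: int;
      proc idxToLocale(i: int): int {    // which locale owns global index i
        return i % nLocales;
      }
      proc idxToOffset(i: int): int {    // where index i sits within that locale
        return i / nLocales;
      }
    }
    // a real distribution class would additionally control
    // local layout (allocation) and iteration over its indices .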

Our approach is the first
that addresses these issues
completely and consistently
at a high level of abstraction;
in contrast to the current programming paradigm
that explicitly manages data locality
and the related aspects of synchronization,
communication, and thread management
at a level close to what assembly programming
was for sequential languages.
The challenge is to allow the programmer
high-level control of data locality
based on the knowledge of the problem
without unnecessarily burdening the
expression of the algorithm
with low-level detail,
and achieving target code performance
similar to that of
manually parallelized programs.

Data locality is expressed via
first-class objects called distributions.
Distributions apply to collections
of indices represented by domains,
which determine how arrays
associated with a domain
are to be mapped and allocated across
abstract units of uniform memory access
called locales.
Chapel offers an open concept of distributions,
defined by a set of classes
which establish the interface between
the programmer and the compiler.
Components of distributions
are overridable by the user,
at different levels of abstraction,
with varying degrees of difficulty.
Well-known regular standard distributions
can be specified along with
arbitrary irregular distributions
using the same uniform framework.
There are no built-in distributions
in our approach.
Instead, the vision is that
Chapel will be an open source language,
with an open distribution interface,
which allows experts and non-experts
to design new distribution classes
and support the construction of
distribution libraries that can be
further reused, extended, and optimized.
Data parallel computations
are expressed via forall loops,
which concurrently iterate over domains.
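. e.g., a sketch (assuming the Block-distributed
domain D and array A from the earlier sketch):

    forall (i, j) in D do          // iterations run concurrently, each one
      A[i, j] = i + j;             // near the locale that owns index (i, j)
    const total = + reduce A;      // reductions over distributed arrays also work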

. the class of PGAS languages
including Unified Parallel C (UPC)
provide a reasonable improvement
over lower-level communications with MPI.
. UPC`threads support block-cyclic
distributions of one-dimensional arrays
over a one-dimensional set of processors,
and a stylized upc_forall loop
that supports an affinity expression
to map iterations to threads.
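. e.g., a sketch in UPC (array size, block size, and names are mine):

    #include <upc_relaxed.h>
    #define N 1000
    shared [10] double a[N], b[N];   /* block-cyclic: dealt out to the threads
                                        in blocks of 10 elements               */

    void scale(double k)
    {
        int i;
        upc_forall (i = 0; i < N; i++; &a[i])  /* affinity expression: the thread   */
            a[i] = k * b[i];                   /* with affinity to a[i] runs step i */
    }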

. the other DARPA`HPCS lang's
provide built-in distributions
as well as the possibility to create
new distributions from existing ones.
However,
they do not contain features
for specifying user-defined
distributions and layouts.
Furthermore,
X10’s locality rule requires
an explicit distinction between
local and remote accesses
to be made by the programmer
at the source language level.
The key differences between
existing work and our approach
can be summarized as follows.
First,
we provide a general oop framework
for the specification of
user-defined distributions,
integrated into an advanced
high-productivity parallel language.
Secondly,
our framework allows the
flexible formulation
of data distributions,
locale-internal data arrangements,
and associated control mechanisms
at a high level of abstraction,
tuned to the properties of
architectures and applications.
This ensures
target code performance
that is otherwise achievable only via
low-level control.
. Chapel Publications and Papers .
. Chapel Specification [current version (pdf)] .


Parallel Programmability and the Chapel Language (pdf)
bradc@cray.com, d.callahan@microsoft.com, zima@jpl.nasa.gov`
Int.J. High Performance Computing App's, 21(3) 2007

This paper serves as a good introduction
to Chapel's themes and main language concepts.
7.28: adda/concurrency/chapel/Common Component Architecture:
Common Component Architecture (CCA) Forum

2005 CCA Whitepaper (pdf):
. reusable scientific components
and the tools with which to use them.
In addition to developing
simple component examples
and hands-on exercises as part of
CCA tutorial materials,
we are growing a CCA toolkit of components
that is based on widely used software packages,
including:
ARMCI (one-sided messaging),
CUMULVS (visualization and parallel data redistribution),
CVODE (integrators), DRA (parallel I/O),
Epetra (sparse linear solvers),
Global Arrays (parallel programming),
GrACE (structured adaptive meshes),
netCDF and parallel netCDF (input/output),
TAO (optimization), TAU (performance measurement),
and TOPS (linear and nonlinear solvers).
Babel (inter-lang communication):
Babel is a compiler
that generates glue code from
SIDL interface descriptions.
(Scientific Interface Description Language)
SIDL features support for
complex numbers, structs,
and dynamic multidimensional arrays.
SIDL provides a modern oop model,
with automatic ref'counting
and resource (de)allocation.
-- even on top of traditional
procedural languages.
Code written in one language
can be called from any of the
supported languages.
Full support for
Remote Method Invocation (RMI)
allows for parallel distributed applications.

Babel focuses on high-performance
language interoperability
within a single address space;
It won a prestigious R&D 100 award in 2006
for "The world's most
rapid communication among
many languages in a single application."

. Babel currently fully supports
C, C++, Fortran, Python, and Java.
-- Chapel is coming soon:
CCA working with chapel 2009:
Babel migration path for chapel:
Collaboration Status: Active
TASCS Contact: Brad Chamberlain, Cray (bradc@cray.com)
Collaboration Summary:
Cray is developing a Chapel language binding
to the Babel interoperability tool.
The work is purely exploratory
(source is not publicly available yet)
and Babel is providing
whatever consulting and training services
are needed to facilitate it.
doc's:
Common Component Architecture Core Specification
Babel Manuals:
User's Guide (tar.gz)
Implement a Protocol for BabelRMI (pdf)
Understanding the CCA Specification Using Decaf (pdf)
CCA toolkit
CCA tut's directory
CCA Hands-On Guide 0.7.0 (tar.gz)
Our language interoperability tool, Babel,
makes CCA components interoperable
across languages and CCA frameworks.
Numerous studies have demonstrated
that the overheads of the CCA environment
are small and easily amortized
in typical scientific applications.

specification of components:
. using SIDL,
the current syntactic specification
can be extended to capture
more of the semantics
of component behavior.
For example,
increasing the expressiveness
of component specifications
(the “metadata” available about them)
makes it possible to catch
certain types of errors automatically
. we must leverage the
unique capabilities of
component technology
to inspire new CS research directions.
For example,
the CCA provides a dynamic model
for components,
allowing them to be
reconnected during execution.
This model allows an application to
monitor and adapt itself
by swapping components for others.
This approach, called
computational quality of service,
can benefit numerical, performance,
and other aspects of software.
Enhanced component specifications
can provide copious information
that parallel runtime environments
could exploit to provide
the utmost performance.
. the development and use of
new runtime environments
could be simplified by integrating them
with component frameworks.

PGAS replacing MPI in supercomputer lang's

7.26: news.adda/concurrency/PGAS (partitioned global address space):

. PGAS is a parallel programming model
(partitioned global address space) [7.29:
where each processor
has its own local memory
and the sharable portion of local space
can be reached from other processors
by pointer rather than by
the slower MPI (message passing interface) .
. since a shared space has an
"(affinity for a particular processor),
things can be arranged so that
local shares can be
accessed quicker than remote shares,
thereby "(exploiting
locality of reference).]

The PGAS model is the basis of
Unified Parallel C,
and all the lang's funded by DARPA`HPCS
(High Productivity Computing Systems)
{ Sun`Fortress, Cray`Chapel, IBM`X10 }.
[@] adda/concurrency/NUCC/lang's for supercomputer concurrency
The pgas model -- also known as
the distributed shared address space model -- [7.29:
provides more of both performance
and ease-of-programming
than MPI (Message Passing Interface)
which uses function calls
to communicate across clustered processors .]

As in the shared-memory model,
one thread may directly read and write
memory allocated to another.
At the same time, [7.29:
the concept of local yet sharable]
is essential for performance .

7.26: news.adda/lang"upc (Unified Parallel C):
The UPC language evolved from experiences with
three earlier languages
that proposed parallel extensions to ISO C 99:
AC, Split-C, and Parallel C Preprocessor (PCP). [7.29:

AC (Distributed Data Access):
. AC modifies C to support
a shared address space
with physically distributed memory.
. the nodes of a massively parallel processor
can access remote memory
without message passing.
AC provides support for distributed arrays
as well as pointers to distributed data.
Simple array references
and pointer dereferencing
are sufficient to generate
low-overhead remote reads and writes .
Split-C
. supports efficient access to a
global address space
on distributed memory multiprocessors.
It retains the "small language" character of C
and supports careful engineering
and optimization of programs
by providing a simple, predictable
cost model. -- in stark contrast to
languages that rely on extensive
compile-time transformations
to obtain performance on parallel machines.
Split-C programs do what the
programmer specifies;
the compiler takes care of
addressing and communication,
as well as code generation.
Thus, the ability to exploit
parallelism or locality
is not limited by the compiler's
recognition capability,
nor is there need to second guess
the compiler transformations
while optimizing the program.
The language provides a small set of
global access primitives
and parallel storage layout declarations.
These seem to capture
most of the useful elements
of shared memory, message passing,
and data parallel programming
in a common, familiar context.
Parallel C Preprocessor (pcp):
. a parallel extension of C for multiprocessors
(e.g., a scalable massively parallel machine, the BBN TC2000)
for sharing memory between processors.
The programming model is split-join
rather than fork-join.
Concurrency is exploited to use a
fixed number of processors more efficiently
rather than to exploit more processors
as in the fork-join model.
Team splitting, a mechanism to
split the processors into subteams
to handle parallel subtasks,
provides an efficient mechanism
for exploiting nested concurrency.
We have found the split-join model
to have an inherent
implementation advantage,
compared to the fork-join model,
when the number of processors becomes large .]
GCC Unified Parallel C (GCC UPC):
UPC 1.2 specification compliant, based on GNU GCC 4.3.2
Fast bit packed shared pointer support
Configurable shared pointer representation
Pthreads support
GASP support, a performance tool interface
for Global Address Space Languages
Run-time support for UPC collectives
Support for uniprocessor
and symmetric multiprocessor systems
Support for UPC thread affinity
via linux scheduling affinity and NUMA package
Compatible with Berkeley UPC run-time version 2.8 and up
Support for many large scale machines and clusters
in conjunction with Berkeley UPC run-time
Binary packages for x86_64, ia64, x86, mips
Binary packages for Linux Fedora, SuSe, CentOS, Mac OS X, IRIX
. for uniprocessor and symmetric multiprocessor systems:
Intel x86_64 Linux (Fedora Core 11)
Intel ia64 (Itanium) Linux (SuSe SEL 11)
Intel x86 Linux (CentOS 5.3)
Intel x86 Apple Mac OS X (Leopard 10.5.7+ and Snow Leopard 10.6)
. GCC-UPC@HERMES.GWU.EDU archives .
. Programming in the pgas Model at SC2003 (pdf) .

Programming With the Distributed Shared-Memory Model at SC2001 (pdf):
. Recent developments have resulted in
viable distributed shared-memory languages
for a balance between ease-of-programming
and performance.
As in the shared-memory model,
programmers need not explicitly specify
data accesses.
Meanwhile, programmers can exploit data locality
using a model that enables the placement of data
close to the threads that process them,
to reduce remote memory accesses.
. fundamental concepts associated with
this programming model include
execution models, synchronization,
workload distribution,
and memory consistency.
We then introduce the syntax and semantics
of three parallel programming language
instances with growing interest:
Cray's CAF (Co-Array Fortran),
Berkeley's Titanium (a Java dialect),
and (IDA, LLNL, UCB) 1999` UPC (Unified Parallel C)
-- upc`history (pdf):
. IDA Center for Computing Sciences
University of California at Berkeley,
Lawrence Livermore National Lab,
... and the consortium refined the design:
Government: ARSC, IDA CSC, LBNL, LLNL, NSA, US DOD
Academia: GWU, MTU, UCB
Vendors: Compaq, CSC, Cray, Etnus, HP, IBM, Intrepid, SGI, Sun .
It will be shown through
experimental case studies
that optimized distributed shared memory
can be competitive with
message passing codes,
without significant departure from the
ease of programming
provided by the shared memory model .
. the openMP model is
all the threads on shared mem';
it doesn't allow locality exploitation;
modifying shared data
may require synchronization
(locks, semaphores) .
. in contrast,
the upc model is
distributed shared mem';
it differs from threads sharing mem' in that
each thread has its own partition slice;
and within that slice, there's a
{shared, private} division:
slice#i has affinity to thread#i
-- that means, a thread tries to
keep most of its own obj's
within its own slice;
but, its shared pointers can target
any other thread's sharable slice .
"(exploit locality of references) .
. message-passing as a
sharing mechanism
isn't a good fit for many app's in
math, science, and data mining .
. a dimension of {value, pointer} types is
{ shared -- can point in shared mem' .
, private -- points only to thread's private mem .
} -- both can access either {dynamic, static } mem .
. all scalar (non-array) shared objects
have affinity with thread#0 .
. pointers have a self and a target,
both of which can
optionally be in shared mem';
but it's unwise
to have a shared pointer
accessing a private value;
upc disallows casting of
private pointer to shared .
. casting of shared to private
is defined only if the shared pointer
has affinity with
the thread performing the cast,
since the cast doesn't preserve
the pointer`thread.identifier .
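. e.g., the {shared, private} combinations as UPC declarations
(a sketch; the names are mine):

    #include <upc.h>

    shared int x;              /* shared scalar: affinity to thread#0             */
    shared int *p;             /* private pointer to shared data                  */
    int *q;                    /* private pointer to private data                 */
    int * shared ps;           /* shared pointer to private data -- legal, unwise */
    shared int * shared pss;   /* shared pointer to shared data                   */

    void f(void)
    {
        int *lp;
        p = &x;                          /* point into the shared space           */
        if (upc_threadof(p) == MYTHREAD)
            lp = (int *) p;              /* cast shared->private: defined only when
                                            p has affinity to the casting thread  */
    }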

attributes of shared pointers:
. int upc_threadof(shared void *ptr);
-- thread that has affinity to pointer
int upc_phaseof(shared void *ptr);
-- pointer's index (pos' within the block)
void* upc_addrfield(shared void *ptr);
-- address of the targeted block
other upc-specific attributes:
upc_localsizeof(type-name or expression);
-- size of the local portion of a shared object.
upc_blocksizeof(type-name or expression);
-- the blocking factor associated with the argument.
upc_elemsizeof(type-name or expression);
-- size (in bytes) of the left-most type
that is not an array.
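. e.g., a sketch of querying them (the block size and index are mine):

    #include <upc.h>
    #include <stdio.h>

    shared [4] int a[64];      /* blocks of 4 elements, dealt out round-robin */

    int main(void)
    {
        if (MYTHREAD == 0) {
            shared [4] int *p = &a[9];
            printf("a[9] lives on thread %d at phase %d\n",
                   (int) upc_threadof(p),    /* owning thread             */
                   (int) upc_phaseof(p));    /* position within its block */
        }
        return 0;
    }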
Berkeley UPC intro:
array` affinity granularities:
# cyclic (per element)
- successive elements of the array
have affinity with successive threads.
# blocked-cyclic (user-defined)
- the array is divided into
user-defined size blocks
and the blocks are cyclically distributed
among threads.
# blocked (run-time)
- each thread has affinity to a
tile of the array.
The size of the contiguous part
is determined in such a way
that the array is
"evenly" distributed among threads.

To define the interaction between
memory accesses to shared data,
UPC provides two user-controlled
consistency models { strict, relaxed }:
# "strict" model:
the program executes in a
Lamport`sequential consistency model .
This means that
it appears to all threads that the
strict references within the same thread
appear in the program order,
relative to all other accesses.
# "relaxed" model:
it appears to the issuing thread
that all shared references within the thread
appear in the program order.
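. e.g., the canonical flag/data sketch (variable names are mine);
strict accesses also order the surrounding relaxed accesses,
so the consumer sees the data once it sees the flag:

    #include <upc_relaxed.h>       /* make "relaxed" the file-wide default       */

    strict shared int flag;        /* per-object override: strict ordering;
                                      statically zero-initialized, thread#0      */
    shared int data;               /* relaxed (the default here)                 */

    void producer(void) {
        data = 42;                 /* relaxed write                              */
        flag = 1;                  /* strict write: issued only after data lands */
    }
    void consumer(void) {
        while (flag == 0) ;        /* strict read                                */
        /* data is guaranteed to read as 42 here */
    }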

The UPC execution model
is similar to the SPMD model used by the
(Single Program Multiple Data)
message passing style (MPI or PVM).
-- an explicitly parallel model.
In UPC terms, the execution vehicle
for a program is called a thread.
The language defines a private variable
- MYTHREAD - to distinguish between
the threads of a UPC program.
The language does not define any
correspondence between
a UPC thread and its OS-level
counterparts,
nor does it define any
mapping to physical CPU's.
Because of this,
UPC threads can be implemented
either as full-fledged OS processes
or as threads (user or kernel level).
On a parallel system,
the UPC program running with shared data
will contain at least
one UPC thread per physical processor .
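. e.g., a minimal UPC program in this style; with Berkeley UPC it would
typically be compiled and run with something like
upcc hello.upc and upcrun -n 4 ./a.out
(command names from Berkeley UPC; details vary by installation):

    #include <upc.h>
    #include <stdio.h>

    int main(void)
    {
        printf("hello from UPC thread %d of %d\n", MYTHREAD, THREADS);
        upc_barrier;                       /* all threads synchronize here */
        if (MYTHREAD == 0)
            printf("done\n");
        return 0;
    }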