2010-08-31

the MathLink ABI

2019.10.9: summary:
. how I was introduced to the term
ABI (application binary interface),
the binary-level convention by which
programs and processes communicate with each other;
within addx I give it the meaning of
a communication that involves
etrees (expression trees) rather than ascii text.

2010-08-30

lang's for supercomputer concurrency

adda/concurrency/lang's for supercomputer concurrency
7.27:
. the DARPA`HPCS program
(High Productivity Computing Systems)
is meant to tame the costs of HPC
(High Performance Computing)
-- HPC is the use of supercomputers for
simulations to solve problems in
chemistry, physics, ecology, sociology,
and esp'ly warfare .
. programmer productivity
means making it easier to develop
code that can make full use of
a supercomputer's concurrency .
. the main source of the cost
is a lack of smart software tools
that can turn science experts
into computer coders .
. toward this end, they are funding
the design of new concurrency lang's .
7.28:
. the DoD and DARPA already made
quite an investment in the
Ada concurrency lang',
a language designed for expressing
the sort of concurrency needed by
embedded system engineers; 7.31:
but, the software developer community
spurned Ada as soon as there was
something better ...
. pascal was popular before '85,
and when Ada arrived in '83,
it was popular too;
then came the Visual version
of Basic (familiar and slick).
. the top demanded langs by year were:
'85: C, lisp, Ada, basic;
'95: v.basic, c++, C, lisp, Ada;
'05: Java, C, c++, perl, php .

. the only currently popular lang's
meant for exploiting multi-core cpu's
if not other forms of concurrency, are:
(3.0% rating) obj'c 2.0 (with blocks)
(0.6% rating) Go
(0.4% rating) Ada
7.29:
. whether or not Ada's concurrency model
is well-suited for supercomputers
as well as embedded systems,
it is not increasing coder'productivity .
. while Ada boosted productivity beyond
that offered by C,
it was nevertheless proven to do
less for productivity than Haskell .

HPC Productivity 2004/Kepner{ 2003(pdf), 2004(pdf) }:
. lang's were compared for
expressiveness vs performance:
. the goal of a high-performance lang'
is to have the expressiveness
of Matlab and Python,
with the performance of VHDL
(VHDL is a hardware description language
with Ada-like syntax, used to design ASICs ).
. UPC (Unified Parallel C) and Co-array Fortran
are halfway to high productivity
merely by using PGAS
(Partitioned global address space)
rather than MPI
(Message Passing Interface).
. the older tool set: C, MPI, and openMP
is both slow and difficult to use .

. the 2 lang's that DARPA is banking on now
are Cray`Chapel, and IBM`X10 (a Java-based language) .
. they were also funding Sun`Fortress
until 2006,
which features a syntax like advanced math
-- the sort of greek that physics experts
are expected to appreciate .

LLVM concurrency representations

Mac OS X 10.6 (and later):
. The OpenCL GPGPU implementation is built on
Clang and LLVM compiler technology.
This requires parsing an extended dialect of C at runtime
and JIT compiling it to run on the
CPU, GPU, or both at the same time.
OpenCL (Open Computing Language)
is a framework for writing programs
that execute across heterogeneous platforms
consisting of CPUs, GPUs, and other processors.
OpenCL includes a language (based on C99)
for writing kernels (functions that execute on OpenCL devices),
plus APIs that are used to define and then control the platforms.
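. to make that concrete, here is a minimal sketch of a plain-C
host program driving the OpenCL API (an illustration, not Apple's
implementation; the kernel and variable names are made up):
the kernel source is a string of the C99-based OpenCL C dialect
that gets parsed and JIT-compiled at run time
for whatever device is picked .

/* a minimal OpenCL host sketch in C:
   the kernel source below is OpenCL C, compiled at run time
   by the driver for the chosen device -- CPU or GPU.
   on Mac OS X build with:  cc ocl.c -framework OpenCL
   (elsewhere include <CL/cl.h> and link with -lOpenCL)  */
#include <OpenCL/opencl.h>
#include <stdio.h>

static const char *src =
    "__kernel void scale(__global float *v, const float k) {"
    "    size_t i = get_global_id(0);"
    "    v[i] = v[i] * k;"
    "}";

int main(void)
{
    float data[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    size_t n = 8;
    cl_platform_id plat;
    cl_device_id dev;
    cl_int err;

    clGetPlatformIDs(1, &plat, NULL);
    /* CL_DEVICE_TYPE_DEFAULT may pick the CPU or the GPU */
    clGetDeviceIDs(plat, CL_DEVICE_TYPE_DEFAULT, 1, &dev, NULL);

    cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, &err);
    cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, &err);

    /* run-time compilation of the OpenCL C source string */
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, &err);
    clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);
    cl_kernel kern = clCreateKernel(prog, "scale", &err);

    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                                sizeof data, data, &err);
    float factor = 10.0f;
    clSetKernelArg(kern, 0, sizeof buf, &buf);
    clSetKernelArg(kern, 1, sizeof factor, &factor);

    /* one work-item per array element */
    clEnqueueNDRangeKernel(q, kern, 1, NULL, &n, NULL, 0, NULL, NULL);
    clEnqueueReadBuffer(q, buf, CL_TRUE, 0, sizeof data, data, 0, NULL, NULL);

    for (int i = 0; i < 8; i++)
        printf("%g ", data[i]);
    printf("\n");

    clReleaseMemObject(buf); clReleaseKernel(kern); clReleaseProgram(prog);
    clReleaseCommandQueue(q); clReleaseContext(ctx);
    return 0;
}
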
Open for Improvement:
. With features like OpenCL and Grand Central Dispatch,
Snow Leopard will be better equipped
to manage parallelism across processors
and push optimized code to the GPU's cores,
as described in WWDC 2008: New in Mac OS X Snow Leopard.
However, in order for the OS to
efficiently schedule parallel tasks,
the code needs to be explicitly optimized
for parallelism by the compiler.
. LLVM will be a key tool in prepping code for
high performance scheduling.
LLVM-CHiMPS (pdf)
LLVM for the CHiMPS 
(Compiling high-level languages to a Massively Pipelined System)
National Center for Supercomputing Applications/
Reconfigurable Systems Summer Institute July 8, 2008/
Compilation Environment for FPGAs:
. Using LLVM Compiler Infrastructure and
CHiMPS Computational Model
. A computational model and architecture for
FPGA computing by Xilinx, Inc.
- Standard software development model (ANSI C)
    Trade performance for convenience
- Virtualized hardware architecture
    CHiMPS Target Language (CTL) instructions
- Cycle accurate simulator
- Runs on BEE2
Implementation of high level representations:

# Limitations in optimization
- CTL code is generated at compile time
    LLVM does no optimization for source code in which
    no expressions can be optimized at compile time
- LLVM does not get a chance to dynamically optimize
    the source code at run time
- LLVM is not almighty
    Floating-point math is still difficult for LLVM
Cray Opteron Compiler: Brief History of Time (pdf)
Cray has a long tradition of high performance compilers
Vectorization
Parallelization
Code transformation
...
Began internal investigation leveraging LLVM
Decided to move forward with Cray X86 compiler
First release December 2008

Fully optimized and integrated into the compiler
No preprocessor involved
Target the network appropriately:
.  GASNet with Portals . DMAPP with Gemini & Aries .
Why a Cray X86 Compiler?
Standard conforming languages and programming models
Fortran 2003
UPC & CoArray Fortran
. Ability and motivation to provide
high-quality support for
custom Cray network hardware
. Cray technology focused on scientific applications
Takes advantage of Cray’s extensive knowledge of
automatic vectorization and
automatic shared memory parallelization
Supplements, rather than replaces, the available compiler choices

. cray has added parallelization and fortran support .
. ported to cray x2 .
. generating code for upc and caf (pgas langs) .
. supports openmp 2.0 std and nesting .
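. as a small illustration of what nested openmp looks like
(a generic C sketch, not Cray-specific; build with any
openmp compiler, e.g. cc -fopenmp nest.c):

/* nested parallelism: the outer region splits into 2 threads,
   and each of those opens an inner region of 2 more
   once nesting is enabled.  */
#include <omp.h>
#include <stdio.h>

int main(void)
{
    omp_set_nested(1);                 /* allow nested parallel regions */

    #pragma omp parallel num_threads(2)
    {
        int outer = omp_get_thread_num();

        #pragma omp parallel num_threads(2)
        {
            int inner = omp_get_thread_num();
            printf("outer thread %d, inner thread %d\n", outer, inner);
        }
    }
    return 0;
}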

. Cray compiler supports a full and growing set of
directives and pragmas:
!dir$ concurrent
!dir$ ivdep
!dir$ interchange
!dir$ unroll
!dir$ loop_info [max_trips] [cache_na] ... Many more
!dir$ blockable
man directives
man loop_info
weaknesses:
- Tuned performance:
    Vectorization
    Non-temporal caching
    Blocking
- Many end-cases:
    Scheduling, Spilling
- No C++, very young X86 compiler
future:
optimized PGAS -- requires Gemini network for speed
Improved Vectorization
Automatic Parallelization:
. Modernized version of Cray X1 streaming capability
. Interacts with OMP directives
[OpenMP -- Multi-Processing]

DCT (discrete control theory) for avoiding deadlock

[8.30:
. exciting claims I haven't researched yet ...]
8.5: news.adda/concurrency/dct/Gadara`Discrete Control Theory:

Eliminating Concurrency Bugs with Control Engineering (pdf)
Concurrent programming is notoriously difficult
and is becoming increasingly prevalent
as multicore hardware compels performance-conscious developers
to parallelize software.
If we cannot enable the average programmer
to write correct and efficient parallel software
at reasonable cost,
the computer industry's rate of value creation
may decline substantially.
Our research addresses the
challenges of concurrent programming
by leveraging control engineering,
a body of technique that can
constrain the behavior of complex systems,
prevent runtime failures,
and relieve human designers and operators
of onerous responsibilities.
In past decades,
control theory made industrial processes
-- complex and potentially dangerous --
safe and manageable
and relieved human operators
of tedious and error-prone chores.
Today, Discrete Control Theory promises
similar benefits for concurrent software.
This talk describes an application of the
control engineering paradigm to concurrent software:
Gadara, which uses Discrete Control Theory
to eliminate deadlocks in
shared-memory multithreaded software.

promise pipelining

8.21: news.adda/co/promises/wiki brings understanding:
. yahoo!, this wiki page finally made sense of promises
as exemplified by e-lang's tutorial
whose diagram showed things incorrectly,
so that unless you ignored the diagram,
you couldn't possibly make sense of the tutorial .
[8.30: ### the following is just my
version of that page, not a working tutorial ###]

t1 := x`a();
t2 := y`b();
t3 := t1`c(t2);
. "( x`a() ) means to send the message a()
asynchronously to x.
If x, y, t1, and t2
are all located on the same remote machine,
a pipelined implementation can compute t3 with
one round-trip instead of three.
[. the original diagram showed all involved objects
existing on the client's (caller's) node,
not the remote server's;
so, you'd be left wondering
how the claimed pipelining of t1`c(t2)
is possible
if the temp's t1 and t2
are back at the caller's?! ]
Because all three messages are destined for
objects which are on the same remote machine,
only one request need be sent
and only one response
need be received containing the result.
. the actual message looks like:
do (remote`x`a) and save as t1;
do (remote`y`b) and save as t2;
do (t1`c(t2)) using previous saves;
and send it back .
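. as a toy sketch (plain C, not e-lang, and not a real RPC layer),
that batched request can be thought of as a little program of
slot-numbered calls that the server runs in one round trip:

/* the client names its results with local slot numbers (promises),
   and later ops refer to those slots instead of waiting for
   real values to come back; the "server" here is just a function
   call standing in for the single round trip.  */
#include <stdio.h>

enum target { OBJ_X, OBJ_Y, SLOT };   /* a remote obj, or a prior result */

struct op {                /* one pipelined call in the batch */
    enum target recv;      /* receiver: a remote object or an earlier slot */
    int  recv_slot;        /* which slot, when recv == SLOT                */
    char method;           /* 'a', 'b', or 'c'                             */
    int  arg_slot;         /* slot holding the argument, or -1 for none    */
    int  result_slot;      /* where the server stores the result           */
};

/* stand-in remote behaviors, purely for the demo */
static int call(char method, int self, int arg)
{
    switch (method) {
    case 'a': return 10;          /* x`a()    */
    case 'b': return 20;          /* y`b()    */
    case 'c': return self + arg;  /* t1`c(t2) */
    default:  return 0;
    }
}

/* the whole batch is executed server-side in one round trip */
static int serve(const struct op *batch, int n, int *slots)
{
    for (int i = 0; i < n; i++) {
        const struct op *o = &batch[i];
        int self = (o->recv == SLOT) ? slots[o->recv_slot] : 0;
        int arg  = (o->arg_slot >= 0) ? slots[o->arg_slot] : 0;
        slots[o->result_slot] = call(o->method, self, arg);
    }
    return slots[batch[n - 1].result_slot];   /* the final answer, t3 */
}

int main(void)
{
    /* t1 := x`a();  t2 := y`b();  t3 := t1`c(t2);  -- one message */
    struct op batch[3] = {
        { OBJ_X, -1, 'a', -1, 0 },   /* slot 0 <- x`a()          */
        { OBJ_Y, -1, 'b', -1, 1 },   /* slot 1 <- y`b()          */
        { SLOT,   0, 'c',  1, 2 },   /* slot 2 <- slot0`c(slot1) */
    };
    int slots[3];
    printf("t3 = %d\n", serve(batch, 3, slots));   /* prints t3 = 30 */
    return 0;
}
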
Promise pipelining should be distinguished from
parallel asynchronous message passing.
In a system supporting parallel message passing
but not pipelining,
the messages x`a() and y`b()
in the above example could proceed in parallel,
but the send of t1`c(t2) would have to wait until
both t1 and t2 had been received,
even when x, y, t1, and t2 are on the same remote machine.
. Promise pipelining vs
pipelined message processing:
. in Actor systems,
it is possible for an actor to begin processing a message
before having completed
processing of the previous message.
[. this is the usual behavior for Ada tasks;
tasks are very general, and the designer of one
can make a task that does nothing more than
collect and sort all the messages that get queued;
and then even when it accepts a job,
it can subsequently requeue it .]

the co operator

8.20: adda/co/the co operator:
. as shown in chapel,
instead of declaring a block to be a coprogram,
any stmt can be run as a coprogram with:
co stmt;
-- the stmt can be a procedure literal;
[8.28: I'm wondering why this couldn't be
obvious to the compiler
or to an llvm-type runtime .]
. co { set of stmt's}
-- runs each element in a set in parallel;
var's can be declared sets of stmts .
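. for comparison, the closest thing in an existing C toolchain is
an openmp sections block, where each section may run on its own
thread (the functions f and g below are just placeholders for the stmts):

/* "co { f(); g(); }" rendered as an OpenMP sections block;
   each section may run on a different thread, and the block
   joins implicitly when both have finished.  */
#include <omp.h>
#include <stdio.h>

static void f(void) { printf("f ran on thread %d\n", omp_get_thread_num()); }
static void g(void) { printf("g ran on thread %d\n", omp_get_thread_num()); }

int main(void)
{
    #pragma omp parallel sections
    {
        #pragma omp section
        f();
        #pragma omp section
        g();
    }   /* implicit join: both f and g have finished here */
    return 0;
}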

the Concurrency and Coordination Runtime (CCR)

8.19: adda/co/the ms`ccr model:
. the new view I got from reading about
ms`robotics`lang and the CCR
was that task entries are separate from tasks,
and can be shared among tasks .
. entries can be dynamically assigned to tasks;
agents can dynamically change .
. integrating this with OOP,
any msg can be sent to any task
or any shared queue;
and it can look at both the
name and the arg list`types,
to decide how to handle it .
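. a rough pthreads sketch of that shared-queue idea
(an illustration only; the names post and worker are made up,
and this is not the actual CCR API):

/* one shared queue of messages, served by whichever worker thread
   is free; the handler is chosen by looking at the message name
   (a richer version would also look at the argument types).
   toy fixed-size queue: no overflow check.  */
#include <pthread.h>
#include <stdio.h>
#include <string.h>

#define QMAX 16

struct msg { const char *name; int arg; };

static struct msg queue[QMAX];
static int head = 0, tail = 0, done = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  nonempty = PTHREAD_COND_INITIALIZER;

static void post(const char *name, int arg)     /* any task may post */
{
    pthread_mutex_lock(&lock);
    queue[tail % QMAX] = (struct msg){ name, arg };
    tail++;
    pthread_cond_signal(&nonempty);
    pthread_mutex_unlock(&lock);
}

static void *worker(void *id)    /* entries are shared: any worker may take any msg */
{
    for (;;) {
        pthread_mutex_lock(&lock);
        while (head == tail && !done)
            pthread_cond_wait(&nonempty, &lock);
        if (head == tail && done) {
            pthread_mutex_unlock(&lock);
            return NULL;
        }
        struct msg m = queue[head % QMAX];
        head++;
        pthread_mutex_unlock(&lock);

        /* dispatch on the message name */
        if (strcmp(m.name, "print") == 0)
            printf("worker %ld: print %d\n", (long)id, m.arg);
        else
            printf("worker %ld: no handler for %s\n", (long)id, m.name);
    }
}

int main(void)
{
    pthread_t w[2];
    for (long i = 0; i < 2; i++)
        pthread_create(&w[i], NULL, worker, (void *)i);
    for (int i = 0; i < 6; i++)
        post("print", i);

    pthread_mutex_lock(&lock);   /* signal shutdown once the queue drains */
    done = 1;
    pthread_cond_broadcast(&nonempty);
    pthread_mutex_unlock(&lock);

    for (int i = 0; i < 2; i++)
        pthread_join(w[i], NULL);
    return 0;
}
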
thoughts on Ada's Tasking:
. an entry corresponds to an obj`msg;
each entry of an Ada task has its own queue
on which incoming calls are posted .
. Ada protected types save time by
skipping the queue posting,
going straight to atomic access
or else waiting for another task to finish .
. the queueing then must be occurring
with the scheduler:
tasks are suspended waiting for a resource,
and if each resource had its own list,
then every time that resource is finished,
the scheduler checks that resource's queue
for the next task to wake .

why isn't oop's modularity thread-safe?

8.6: adda/concurrency/oop`modular not threadsafe:

. assuming the oop lang's were very modular;
why wouldn't they be easy to distribute?
some have said it's the central heap model .
. in fact,
java was not really that modular:
whether it was from obj' vs value semantics
or letting other objects share in the
modification of locals,
or simply not being able to
finish one self change
before being read by another thread,
java was not always threadsafe .
. to be {concurrently modular, threadsafe},
class methods need to be atomic
as are Ada's protected types .
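. a minimal sketch in C of that idea (pthreads standing in for
Ada's protected types; the account object is just an example):
every method takes the object's lock, so a half-finished self-change
can never be read by another thread .

/* an object whose methods are all atomic: the lock is part of
   the object, and every accessor locks around the state change.  */
#include <pthread.h>
#include <stdio.h>

struct account {
    pthread_mutex_t lock;
    long balance;
};

static void account_init(struct account *a)
{
    pthread_mutex_init(&a->lock, NULL);
    a->balance = 0;
}

static void account_deposit(struct account *a, long amount)
{
    pthread_mutex_lock(&a->lock);     /* the method is atomic ...          */
    a->balance += amount;
    pthread_mutex_unlock(&a->lock);   /* ... no one sees a partial update  */
}

static long account_read(struct account *a)
{
    pthread_mutex_lock(&a->lock);
    long b = a->balance;
    pthread_mutex_unlock(&a->lock);
    return b;
}

static struct account shared;

static void *depositor(void *arg)
{
    (void)arg;
    for (int i = 0; i < 1000; i++)
        account_deposit(&shared, 1);
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    account_init(&shared);
    pthread_create(&t1, NULL, depositor, NULL);
    pthread_create(&t2, NULL, depositor, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("balance = %ld\n", account_read(&shared));  /* always 2000 */
    return 0;
}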

. would it help if they didn't share?
[8.17: ie,
why is the default assignment a pointer share;
or, why have sharing pointers without
distinguishing sharers from owners ? ]

[8.17: can't we find concurrency elsewhere?]
. the basic structure hasn't changed
(programs = dstrs + algor's):
obj`methods are serving up only brief procedures;
the bulk of processing comes from the
algorithm that employs the obj's .
[8.17: no:
. even if the methods are brief,
and even if there are many subroutine calls;
all the leaf calls are to obj`methods
-- and most time is actually spent in leaf calls;
however,
at the algorithm level,
we can identify many concurrable calls
for which the compiler can verify
that what we are calling concurrent
in fact involves no object var'sharing .]